Ghost char that is not in the dataset

carlfm01 · March 19, 2019, 12:29am

Same result without lm, no clue from where is coming the # symbol.

Also manually I have checked that all the .csv and .txt files are utf-8.

carlfm01 · March 19, 2019, 5:11am

If I print the transcriptions column all seems OK.

./train.sh
Preprocessing ['/data/home/neoxz/Desktop/deepspeech/data/train/train.csv']
0        en el estadio hidalgo se develan reconocimient...
1                                       llámame alguna vez
2        sorteando los promontorios de los respaldos lo...
3                                   dónde están mis libros
4                          el interior es de una sola nave
5                                   lo vi la semana pasada
6        flotaba sobre las plantas más allá de estos co...

carlfm01 · March 19, 2019, 7:55am

I’ve run a score test with the windows speech recognition and all the audios scored over 40% of confidence, that means there’s no faulty audios.

lissyx · March 19, 2019, 9:52am

@carlfm01 Last time I had similar failure, it was because there was a bug in my alphabet file, some missing comment. Please make sure you reproduce starting from scratch. Check if any of your # could be a different unicode same-looking

reuben · March 19, 2019, 10:34am

Can you share the output of the xxd tool on your alphabet file?

carlfm01 · March 19, 2019, 6:11pm

@reuben

 xxd alphabet.txt
00000000: efbb bf23 2045 6163 6820 6c69 6e65 2069  ...# Each line i
00000010: 6e20 7468 6973 2066 696c 6520 7265 7072  n this file repr
00000020: 6573 656e 7473 2074 6865 2055 6e69 636f  esents the Unico
00000030: 6465 2063 6f64 6570 6f69 6e74 2028 5554  de codepoint (UT
00000040: 462d 3820 656e 636f 6465 6429 0a23 2061  F-8 encoded).# a
00000050: 7373 6f63 6961 7465 6420 7769 7468 2061  ssociated with a
00000060: 206e 756d 6572 6963 206c 6162 656c 2e0a   numeric label..
00000070: 2320 4120 6c69 6e65 2074 6861 7420 7374  # A line that st
00000080: 6172 7473 2077 6974 6820 2320 6973 2061  arts with # is a
00000090: 2063 6f6d 6d65 6e74 2e20 596f 7520 6361   comment. You ca
000000a0: 6e20 6573 6361 7065 2069 7420 7769 7468  n escape it with
000000b0: 205c 2320 6966 2079 6f75 2077 6973 680a   \# if you wish.
000000c0: 2320 746f 2075 7365 2027 2327 2061 7320  # to use '#' as
000000d0: 6120 6c61 6265 6c2e 0a20 0a61 0a62 0a63  a label.. .a.b.c
000000e0: 0a64 0a65 0a66 0a67 0a68 0a69 0a6a 0a6b  .d.e.f.g.h.i.j.k
000000f0: 0a6c 0a6d 0a6e 0a6f 0a70 0a71 0a72 0a73  .l.m.n.o.p.q.r.s
00000100: 0a74 0a75 0a76 0a77 0a78 0a79 0a7a 0ac3  .t.u.v.w.x.y.z..
00000110: bc0a c3a1 0ac3 a90a c3ad 0ac3 b30a c3ba  ................
00000120: 0ac3 b10a 2320 5468 6520 6c61 7374 2028  ....# The last (
00000130: 6e6f 6e2d 636f 6d6d 656e 7429 206c 696e  non-comment) lin
00000140: 6520 6e65 6564 7320 746f 2065 6e64 2077  e needs to end w
00000150: 6974 6820 6120 6e65 776c 696e 652e 0a    ith a newline..

carlfm01 · March 19, 2019, 6:19pm

Is showing the following chars as dots:

ü
á
é
í
ó
ú
ñ

carlfm01 · March 19, 2019, 7:01pm

Saved the .txt as ASCII and reading it with latin-1 instead of utf-8 seems to work, now is printing the chars correctly, running 1 epoch…

carlfm01 · March 19, 2019, 7:56pm

It works!

Test - WER: 0.785235, CER: 0.373803, loss: 53.251236
--------------------------------------------------------------------------------
WER: 5.333333, CER: 59.000000, loss: 349.425049
 - src: "llámame cuando puedas"
 - res: "no tante los ingas petion extend us on indios por la rima a ate e pura"
--------------------------------------------------------------------------------
WER: 3.000000, CER: 5.000000, loss: 21.311464
 - src: "voltéala"
 - res: "note a la"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 8.000000, loss: 16.086720
 - src: "tiene sobrepeso"
 - res: "y e e s ordres"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 7.000000, loss: 27.520256
 - src: "tosí sangre"
 - res: "to s i am be"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 8.000000, loss: 32.984531
 - src: "quién ayuna"
 - res: "i m a i n"
--------------------------------------------------------------------------------
WER: 2.250000, CER: 10.000000, loss: 40.845211
 - src: "era alrededor del mediodía"
 - res: "e a l e e or de me ilia"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 2.000000, loss: 3.153902
 - src: "oremos"
 - res: "pore mos"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 4.859083
 - src: "tienes"
 - res: "ye es"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 6.062123
 - src: "desafina"
 - res: "de asia"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 6.154467
 - src: "dormí"
 - res: "do i"
--------------------------------------------------------------------------------
I Exporting the model...

lissyx · March 19, 2019, 7:58pm

So, what was the culprit ? We should have no problem with using UTF-8.

carlfm01 · March 19, 2019, 8:09pm

Not working with UTF-8, chars like á and ñ fails, if I print when it reads the alphabet prints white spaces on á,é,í,ó,ú and ñ using UTF-8.

lissyx · March 19, 2019, 8:29pm

Then there is something wrong somewhere, that’s not right.

carlfm01 · March 19, 2019, 11:08pm

The decoding, to decode the alphabet latin-1 is required.

carlfm01 · March 20, 2019, 5:41am

Now the native client prints:

As UTF-8:

CHAR:  
CHAR: a
CHAR: b
CHAR: c
CHAR: d
CHAR: e
CHAR: f
CHAR: g
CHAR: h
CHAR: i
CHAR: j
CHAR: k
CHAR: l
CHAR: m
CHAR: n
CHAR: o
CHAR: p
CHAR: q
CHAR: r
CHAR: s
CHAR: t
CHAR: u
CHAR: v
CHAR: w
CHAR: x
CHAR: y
CHAR: z
CHAR: Ã¼
CHAR: Ã¡
CHAR: Ã©
CHAR: Ã
CHAR: Ã³
CHAR: Ãº
CHAR: Ã±

As ANSI :

CHAR:  
CHAR: a
CHAR: b
CHAR: c
CHAR: d
CHAR: e
CHAR: f
CHAR: g
CHAR: h
CHAR: i
CHAR: j
CHAR: k
CHAR: l
CHAR: m
CHAR: n
CHAR: o
CHAR: p
CHAR: q
CHAR: r
CHAR: s
CHAR: t
CHAR: u
CHAR: v
CHAR: w
CHAR: x
CHAR: y
CHAR: z
CHAR: ü
CHAR: á
CHAR: é
CHAR: í
CHAR: ó
CHAR: ú
CHAR: ñ

But with ANSI prints accent chars as weird symbols. (This is maybe on C# side)

lissyx · March 20, 2019, 5:49am

We have no problem with UTF-8 on other platforms

carlfm01 · March 20, 2019, 5:55am

Even using chars like á or ñ?

lissyx · March 20, 2019, 6:14am

Yes, we have people hacking spanish, russian and many other with utf-8, and no problem like that

reuben · March 20, 2019, 9:29am

This looks like a case of Mojibake: treating UTF-8 encoded data as ISO 8859-1 (Latin-1) or Windows-1252. The xxd output you shared shows proper UTF-8 encoded data, it even has a UTF-8 Byte Order Marker. And the Ã characters are a classic mark of this type of mojibake, see for example https://www.i18nqa.com/debug/bug-utf-8-latin1.html

So I think the problem could be external: for example, DeepSpeech is doing the right thing and outputting proper UTF-8, but then your terminal, your text editor, or something else is interpreting that output with the wrong encoding.

carlfm01 · March 20, 2019, 10:07pm

Hardcoded the alphabet in the client for now, thanks @reuben @lissyx, I’ll hack on the issue later.

carlfm01 · April 4, 2019, 5:14am

@lissyx @reuben any suggestion on this? The first model that I trained was only 80h, then was missing all the spaces using the LM, as I read in one issue that the problem may be the weak prediction of the acoustic model, then I trained one with 230h with the same result, missing spaces with 99% WER. My LM was built from wikipedia text + the transcriptions of the audios, the text only contains from a-z.

Already tried changing the weights of the LM but nothing.

The audios are from librivox, voxforge, usma and few other sources.

Computing acoustic model predictions...
100% (447 of 447) |##############################################################| Elapsed Time: 0:16:20 Time:  0:16:20
Decoding predictions...
100% (447 of 447) |##############################################################| Elapsed Time: 0:22:37 Time:  0:22:37
Test - WER: 0.999991, CER: 0.483454, loss: 57.265446
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 1.557328
 - src: "no es mía"
 - res: "nomia"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 1.772164
 - src: "no no tengo"
 - res: "nanotango"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 1.954995
 - src: "espera por mí"
 - res: "espeapamí"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 2.138840
 - src: "pon la mesa"
 - res: "polama"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 2.865121
 - src: "no tenía prisa"
 - res: "notaníapisa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 3.000000, loss: 3.313169
 - src: "esto nos encanta"
 - res: "estanosaencanta"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 3.449407
 - src: "me lo comentaste"
 - res: "medacamentaste"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 3.533856
 - src: "ve ahora mismo"
 - res: "veaoramisma"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 3.720070
 - src: "yo no supe"
 - res: "yanoup"