Ghost char that is not in the dataset

@carlfm01 Last time I had a similar failure, it was because of a bug in my alphabet file, a missing comment marker. Please make sure you reproduce starting from scratch, and check whether any of your # characters could be a different, same-looking Unicode codepoint.
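
For example, a quick sanity check could look like this (just a sketch; it assumes the file is named alphabet.txt). It prints each line's first codepoint, so a lookalike '#' or a stray BOM would stand out immediately:

```python
# Sketch: print the first codepoint of every alphabet line; a lookalike
# '#' (e.g. U+FF03 FULLWIDTH NUMBER SIGN) or a stray BOM (U+FEFF) will
# show up here immediately.
import unicodedata

with open("alphabet.txt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        if line.rstrip("\n"):
            ch = line[0]
            print(lineno, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
```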

Can you share the output of the xxd tool on your alphabet file?

@reuben

 xxd alphabet.txt
00000000: efbb bf23 2045 6163 6820 6c69 6e65 2069  ...# Each line i
00000010: 6e20 7468 6973 2066 696c 6520 7265 7072  n this file repr
00000020: 6573 656e 7473 2074 6865 2055 6e69 636f  esents the Unico
00000030: 6465 2063 6f64 6570 6f69 6e74 2028 5554  de codepoint (UT
00000040: 462d 3820 656e 636f 6465 6429 0a23 2061  F-8 encoded).# a
00000050: 7373 6f63 6961 7465 6420 7769 7468 2061  ssociated with a
00000060: 206e 756d 6572 6963 206c 6162 656c 2e0a   numeric label..
00000070: 2320 4120 6c69 6e65 2074 6861 7420 7374  # A line that st
00000080: 6172 7473 2077 6974 6820 2320 6973 2061  arts with # is a
00000090: 2063 6f6d 6d65 6e74 2e20 596f 7520 6361   comment. You ca
000000a0: 6e20 6573 6361 7065 2069 7420 7769 7468  n escape it with
000000b0: 205c 2320 6966 2079 6f75 2077 6973 680a   \# if you wish.
000000c0: 2320 746f 2075 7365 2027 2327 2061 7320  # to use '#' as
000000d0: 6120 6c61 6265 6c2e 0a20 0a61 0a62 0a63  a label.. .a.b.c
000000e0: 0a64 0a65 0a66 0a67 0a68 0a69 0a6a 0a6b  .d.e.f.g.h.i.j.k
000000f0: 0a6c 0a6d 0a6e 0a6f 0a70 0a71 0a72 0a73  .l.m.n.o.p.q.r.s
00000100: 0a74 0a75 0a76 0a77 0a78 0a79 0a7a 0ac3  .t.u.v.w.x.y.z..
00000110: bc0a c3a1 0ac3 a90a c3ad 0ac3 b30a c3ba  ................
00000120: 0ac3 b10a 2320 5468 6520 6c61 7374 2028  ....# The last (
00000130: 6e6f 6e2d 636f 6d6d 656e 7429 206c 696e  non-comment) lin
00000140: 6520 6e65 6564 7320 746f 2065 6e64 2077  e needs to end w
00000150: 6974 6820 6120 6e65 776c 696e 652e 0a    ith a newline..

It is showing the following chars as dots:

ü
á
é
í
ó
ú
ñ

I saved the .txt as ASCII, and reading it with latin-1 instead of utf-8 seems to work; now it prints the chars correctly. Running 1 epoch…
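
(For reference, a minimal sketch of the encoding angle, assuming the file is still named alphabet.txt: the xxd dump above shows the file actually starts with the UTF-8 BOM ef bb bf, and Python's utf-8-sig codec strips it while decoding the rest as UTF-8.)

```python
# Sketch: read the alphabet as UTF-8 while stripping the BOM (ef bb bf)
# that the xxd dump shows at the start of the file.
with open("alphabet.txt", encoding="utf-8-sig") as f:
    labels = [l.rstrip("\n") for l in f if not l.startswith("#")]
print(labels[-7:])  # expected: ['ü', 'á', 'é', 'í', 'ó', 'ú', 'ñ']
```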

It works! :slight_smile:

Test - WER: 0.785235, CER: 0.373803, loss: 53.251236
--------------------------------------------------------------------------------
WER: 5.333333, CER: 59.000000, loss: 349.425049
 - src: "llámame cuando puedas"
 - res: "no tante los ingas petion extend us on indios por la rima a ate e pura"
--------------------------------------------------------------------------------
WER: 3.000000, CER: 5.000000, loss: 21.311464
 - src: "voltéala"
 - res: "note a la"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 8.000000, loss: 16.086720
 - src: "tiene sobrepeso"
 - res: "y e e s ordres"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 7.000000, loss: 27.520256
 - src: "tosí sangre"
 - res: "to s i am be"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 8.000000, loss: 32.984531
 - src: "quién ayuna"
 - res: "i m a i n"
--------------------------------------------------------------------------------
WER: 2.250000, CER: 10.000000, loss: 40.845211
 - src: "era alrededor del mediodía"
 - res: "e a l e e or de me ilia"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 2.000000, loss: 3.153902
 - src: "oremos"
 - res: "pore mos"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 4.859083
 - src: "tienes"
 - res: "ye es"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 6.062123
 - src: "desafina"
 - res: "de asia"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 6.154467
 - src: "dormí"
 - res: "do i"
--------------------------------------------------------------------------------
I Exporting the model...

So, what was the culprit? We should have no problem using UTF-8.

It's not working with UTF-8; chars like á and ñ fail. If I print the characters while it reads the alphabet, it prints white spaces for á, é, í, ó, ú and ñ when using UTF-8.

Then there is something wrong somewhere; that’s not right.

It's the decoding: to decode the alphabet, latin-1 is required.
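
What seems to be happening there (a sketch, independent of DeepSpeech): latin-1 maps every byte 0x00–0xff to some codepoint, so decoding never fails, but each two-byte UTF-8 sequence comes out as two characters:

```python
# Sketch: latin-1 decoding never raises -- but each two-byte UTF-8
# sequence becomes two characters instead of one.
data = bytes([0xC3, 0xB1])      # UTF-8 bytes for 'ñ', as seen in the xxd dump
print(data.decode("utf-8"))     # 'ñ'  -- one character
print(data.decode("latin-1"))   # 'Ã±' -- two characters, no error raised
```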

Now the native client prints:

As UTF-8:

CHAR:  
CHAR: a
CHAR: b
CHAR: c
CHAR: d
CHAR: e
CHAR: f
CHAR: g
CHAR: h
CHAR: i
CHAR: j
CHAR: k
CHAR: l
CHAR: m
CHAR: n
CHAR: o
CHAR: p
CHAR: q
CHAR: r
CHAR: s
CHAR: t
CHAR: u
CHAR: v
CHAR: w
CHAR: x
CHAR: y
CHAR: z
CHAR: ü
CHAR: á
CHAR: é
CHAR: í
CHAR: ó
CHAR: ú
CHAR: ñ

As ANSI:

CHAR:  
CHAR: a
CHAR: b
CHAR: c
CHAR: d
CHAR: e
CHAR: f
CHAR: g
CHAR: h
CHAR: i
CHAR: j
CHAR: k
CHAR: l
CHAR: m
CHAR: n
CHAR: o
CHAR: p
CHAR: q
CHAR: r
CHAR: s
CHAR: t
CHAR: u
CHAR: v
CHAR: w
CHAR: x
CHAR: y
CHAR: z
CHAR: ü
CHAR: á
CHAR: é
CHAR: í
CHAR: ó
CHAR: ú
CHAR: ñ

But with ANSI it prints the accented chars as weird symbols. (This may be on the C# side.)

We have no problem with UTF-8 on other platforms :confused:

Even using chars like á or ñ?

Yes, we have people hacking on Spanish, Russian and many other languages with UTF-8, and no problems like that.

This looks like a case of mojibake: treating UTF-8 encoded data as ISO 8859-1 (Latin-1) or Windows-1252. The xxd output you shared shows properly UTF-8 encoded data; it even has a UTF-8 byte order mark. And the Ã characters are a classic mark of this type of mojibake, see for example https://www.i18nqa.com/debug/bug-utf-8-latin1.html

So I think the problem could be external: for example, DeepSpeech is doing the right thing and outputting proper UTF-8, but then your terminal, your text editor, or something else is interpreting that output with the wrong encoding.
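
A short illustration of that failure mode (just a sketch):

```python
# Sketch: UTF-8 bytes misread as Latin-1 produce the telltale Ã sequences;
# re-encoding as Latin-1 and decoding as UTF-8 recovers the original text.
garbled = "ñáéóúü".encode("utf-8").decode("latin-1")
print(garbled)                                    # Ã±Ã¡Ã©Ã³ÃºÃ¼
print(garbled.encode("latin-1").decode("utf-8"))  # ñáéóúü
```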

I hardcoded the alphabet in the client for now. Thanks @reuben @lissyx, I’ll hack on the issue later.

@lissyx @reuben any suggestions on this? The first model I trained used only 80h, and it was missing all the spaces when using the LM. I read in one issue that the problem may be weak predictions from the acoustic model, so I then trained one with 230h, with the same result: missing spaces and 99% WER. My LM was built from Wikipedia text plus the transcriptions of the audios; the text only contains a-z.

I already tried changing the weights of the LM, but nothing changed.

The audio is from LibriVox, VoxForge, USMA and a few other sources.

Computing acoustic model predictions...
100% (447 of 447) |##############################################################| Elapsed Time: 0:16:20 Time:  0:16:20
Decoding predictions...
100% (447 of 447) |##############################################################| Elapsed Time: 0:22:37 Time:  0:22:37
Test - WER: 0.999991, CER: 0.483454, loss: 57.265446
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 1.557328
 - src: "no es mía"
 - res: "nomia"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 1.772164
 - src: "no no tengo"
 - res: "nanotango"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 1.954995
 - src: "espera por mí"
 - res: "espeapamí"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 2.138840
 - src: "pon la mesa"
 - res: "polama"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 2.865121
 - src: "no tenía prisa"
 - res: "notaníapisa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 3.000000, loss: 3.313169
 - src: "esto nos encanta"
 - res: "estanosaencanta"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 3.449407
 - src: "me lo comentaste"
 - res: "medacamentaste"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 3.533856
 - src: "ve ahora mismo"
 - res: "veaoramisma"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 3.720070
 - src: "yo no supe"
 - res: "yanoup"

Hm, I’m not sure this is a bug; I just think you need to train for longer / with more data.

For the last one I was using transfer learning; this new one was trained with master.

Test - WER: 0.202060, CER: 0.060539, loss: 17.951622
--------------------------------------------------------------------------------
WER: 2.400000, CER: 13.000000, loss: 51.652328
 - src: "pareció el tiempo largoalosquetenían deseosdevolverasupaís"
 - res: "parece el tiempo largo a los que tenian deseos de volver a su pais"
--------------------------------------------------------------------------------
WER: 2.166667, CER: 51.000000, loss: 422.390808
 - src: "como acostumbra si yo tengo sed"
 - res: "como la que seis en dos sombras desnudas y pulidas que corran mordiendo se"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 11.000000, loss: 31.198582
 - src: "quién escribió hamlet"
 - res: "quien es que bio and let"
--------------------------------------------------------------------------------
WER: 1.800000, CER: 31.000000, loss: 198.952866
 - src: "bien está contestó la zorra"
 - res: "bien lo has pensado contestar a las otras vanas ella"
--------------------------------------------------------------------------------
WER: 1.777778, CER: 20.000000, loss: 64.840401
 - src: "sihubierasabidoqueloestabamirandoaquélladequiensentíalleno su corazón hubiera sido para él grande alegría"
 - res: "si hubiera sabido que lo estaba mirando aquella de quien sentia llenos corazon hubiera sido para el gran alegria"
--------------------------------------------------------------------------------
WER: 1.750000, CER: 10.000000, loss: 39.020081
 - src: "estoy aún viviendo todavía"
 - res: "esto a un bien de toda la"
--------------------------------------------------------------------------------
WER: 1.714286, CER: 42.000000, loss: 291.003967
 - src: "yo el mejor día me iré también"
 - res: "yo el mejor dia tambien me ire y no quiero que a la hora de morir"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 11.000000, loss: 30.106270
 - src: "lasangrebrotabademisuñascreíque me haría perder la vida"
 - res: "la sangre brotaba de mis unas crei que me haya perder la vida"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 41.000000, loss: 271.126648
 - src: "he aquí nuestras razones primera vd"
 - res: "era que nuestras razones primera ute la convertido esta expedicion en un"
--------------------------------------------------------------------------------
WER: 1.400000, CER: 23.000000, loss: 129.272705
 - src: "cuando apareció su tío quinientos"
 - res: "cuando parecio su tio don andres saliendo de la"
--------------------------------------------------------------------------------

Clearly my data needs to be cleaned; those long joined words may be the source of the issue with the LM. Without the LM I’m getting decent transcriptions.

I’ll clean the data, then train again and share what happens.
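
In case it helps, a rough first pass to find those joined words could look like this (a sketch; it assumes the usual DeepSpeech CSV columns wav_filename,wav_filesize,transcript and a file named train.csv):

```python
# Sketch: flag transcripts containing suspiciously long "words", which in
# this data usually means the source text lost its spaces.
import csv

MAX_WORD_LEN = 20  # heuristic: few Spanish words are longer than this

with open("train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        joined = [w for w in row["transcript"].split() if len(w) > MAX_WORD_LEN]
        if joined:
            print(row["wav_filename"], joined)
```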

I’ve finally found the issue with the wrong transcriptions without spaces: it was generate_trie. I used my alphabet.h with the hardcoded alphabet, then recompiled generate_trie, and now the transcriptions are correct. I saw that @reuben is working on a utf-8 branch; this may be helpful. Of course this is not the solution, but now we know where the issue is coming from.
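
For anyone hitting the same mismatch, a trivial sanity check (the paths here are hypothetical) is to confirm that the alphabet used for training is byte-identical to the one generate_trie and the native client were built with:

```python
# Sketch: the trie, the LM decoder and the acoustic model must be built from
# byte-identical alphabets, or the label indices won't line up.
import hashlib

def digest(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

print(digest("training/alphabet.txt"))       # hypothetical path
print(digest("native_client/alphabet.txt"))  # hypothetical path
```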