Ghost char that is not in the dataset

@lissyx @reuben any suggestions on this? The first model I trained used only 80h of audio, and with the LM it was missing all the spaces. Since I read in one issue that the problem might be weak predictions from the acoustic model, I then trained a model on 230h and got the same result: missing spaces and ~99% WER. My LM was built from Wikipedia text plus the transcriptions of the audios; the text only contains the characters a-z.
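For reference, this is the kind of normalization I applied to the corpus before building the LM. A minimal sketch (the file names are illustrative, not my actual paths); it lowercases, strips accents, and keeps only a-z and spaces:

```python
import re
import unicodedata

def normalize_line(line):
    """Lowercase, strip accents, and keep only a-z and spaces."""
    line = line.lower()
    # Decompose accented characters (á -> a + combining mark) and drop the marks.
    line = unicodedata.normalize("NFD", line)
    line = "".join(ch for ch in line if unicodedata.category(ch) != "Mn")
    # Replace anything outside a-z with a space, then collapse runs of spaces.
    line = re.sub(r"[^a-z]+", " ", line)
    return re.sub(r" +", " ", line).strip()

with open("corpus_raw.txt", encoding="utf-8") as src, \
     open("corpus_clean.txt", "w", encoding="utf-8") as dst:
    for raw in src:
        cleaned = normalize_line(raw)
        if cleaned:
            dst.write(cleaned + "\n")
```

The cleaned file is then what goes into the LM toolkit (KenLM's lmplz in my case).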

I already tried changing the LM weights, but nothing changed.

The audio comes from LibriVox, VoxForge, USMA, and a few other sources.

Computing acoustic model predictions...
100% (447 of 447) |##############################################################| Elapsed Time: 0:16:20 Time:  0:16:20
Decoding predictions...
100% (447 of 447) |##############################################################| Elapsed Time: 0:22:37 Time:  0:22:37
Test - WER: 0.999991, CER: 0.483454, loss: 57.265446
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 1.557328
 - src: "no es mía"
 - res: "nomia"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 1.772164
 - src: "no no tengo"
 - res: "nanotango"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 1.954995
 - src: "espera por mí"
 - res: "espeapamí"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 2.138840
 - src: "pon la mesa"
 - res: "polama"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 2.865121
 - src: "no tenía prisa"
 - res: "notaníapisa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 3.000000, loss: 3.313169
 - src: "esto nos encanta"
 - res: "estanosaencanta"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 3.449407
 - src: "me lo comentaste"
 - res: "medacamentaste"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 3.533856
 - src: "ve ahora mismo"
 - res: "veaoramisma"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 3.720070
 - src: "yo no supe"
 - res: "yanoup"

Hm, I’m not sure this is a bug; I just think you need to train for longer / with more data.

For the last run I was using transfer learning; this new one was trained with master.

Test - WER: 0.202060, CER: 0.060539, loss: 17.951622
--------------------------------------------------------------------------------
WER: 2.400000, CER: 13.000000, loss: 51.652328
 - src: "pareció el tiempo largoalosquetenían deseosdevolverasupaís"
 - res: "parece el tiempo largo a los que tenian deseos de volver a su pais"
--------------------------------------------------------------------------------
WER: 2.166667, CER: 51.000000, loss: 422.390808
 - src: "como acostumbra si yo tengo sed"
 - res: "como la que seis en dos sombras desnudas y pulidas que corran mordiendo se"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 11.000000, loss: 31.198582
 - src: "quién escribió hamlet"
 - res: "quien es que bio and let"
--------------------------------------------------------------------------------
WER: 1.800000, CER: 31.000000, loss: 198.952866
 - src: "bien está contestó la zorra"
 - res: "bien lo has pensado contestar a las otras vanas ella"
--------------------------------------------------------------------------------
WER: 1.777778, CER: 20.000000, loss: 64.840401
 - src: "sihubierasabidoqueloestabamirandoaquélladequiensentíalleno su corazón hubiera sido para él grande alegría"
 - res: "si hubiera sabido que lo estaba mirando aquella de quien sentia llenos corazon hubiera sido para el gran alegria"
--------------------------------------------------------------------------------
WER: 1.750000, CER: 10.000000, loss: 39.020081
 - src: "estoy aún viviendo todavía"
 - res: "esto a un bien de toda la"
--------------------------------------------------------------------------------
WER: 1.714286, CER: 42.000000, loss: 291.003967
 - src: "yo el mejor día me iré también"
 - res: "yo el mejor dia tambien me ire y no quiero que a la hora de morir"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 11.000000, loss: 30.106270
 - src: "lasangrebrotabademisuñascreíque me haría perder la vida"
 - res: "la sangre brotaba de mis unas crei que me haya perder la vida"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 41.000000, loss: 271.126648
 - src: "he aquí nuestras razones primera vd"
 - res: "era que nuestras razones primera ute la convertido esta expedicion en un"
--------------------------------------------------------------------------------
WER: 1.400000, CER: 23.000000, loss: 129.272705
 - src: "cuando apareció su tío quinientos"
 - res: "cuando parecio su tio don andres saliendo de la"
--------------------------------------------------------------------------------

Clearly my data needs to be cleaned; those long joined words may well be the source of the issue with the LM, since without the LM I’m getting decent transcriptions.
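A quick way to flag those fused transcriptions is to scan the training CSVs for implausibly long tokens. A rough sketch, assuming the usual DeepSpeech CSV layout with wav_filename and transcript columns (the length threshold is just a guess):

```python
import csv

MAX_WORD_LEN = 20  # heuristic: Spanish words longer than this are almost certainly fused

def fused_tokens(transcript):
    """Return the tokens that look like several words joined together."""
    return [w for w in transcript.split() if len(w) > MAX_WORD_LEN]

with open("train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        bad = fused_tokens(row["transcript"])
        if bad:
            print(row["wav_filename"], bad)
```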

I’ll clean the data, train again, and share what happens.

Finally I’ve found the issue with the wrong transcriptions without spaces: it was generate_trie. I used my alphabet.h with the hardcoded alphabet, recompiled generate_trie, and now the transcriptions are correct. I saw that @reuben is working on a utf-8 branch, which may be helpful here. Of course this is not the real solution, but now we know where the issue is coming from.
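In case anyone hits something similar: since the symptom was a mismatch between the alphabet baked into generate_trie and the one used for training, a cheap sanity check is to verify that every character in your transcripts actually appears in alphabet.txt. This is only a sketch; it assumes DeepSpeech's one-character-per-line alphabet format and the standard CSV columns:

```python
import csv

# DeepSpeech's alphabet.txt has one character per line; lines starting
# with '#' are comments, and a lone space is a valid entry.
with open("alphabet.txt", encoding="utf-8") as f:
    alphabet = {line.rstrip("\n") for line in f if not line.startswith("#")}

ghosts = set()
with open("train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        ghosts |= set(row["transcript"]) - alphabet

if ghosts:
    print("Characters in transcripts but missing from alphabet.txt:", sorted(ghosts))
else:
    print("All transcript characters are covered by alphabet.txt")
```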