Ghost char that is not in the dataset

@lissyx @reuben any suggestions on this? The first model I trained used only 80h of audio, and with the LM it was missing all the spaces. Since I read in one issue that the problem might be weak predictions from the acoustic model, I then trained a model on 230h and got the same result: missing spaces and ~99% WER. My LM was built from Wikipedia text plus the transcriptions of the audios; the text only contains the characters a-z.
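For reference, this is the kind of normalization I applied to the corpus before building the LM. A minimal sketch (the file names are illustrative, not my actual paths); it lowercases, strips accents, and keeps only a-z and spaces:

```python
import re
import unicodedata

def normalize_line(line):
    """Lowercase, strip accents, and keep only a-z and spaces."""
    line = line.lower()
    # Decompose accented characters (á -> a + combining mark) and drop the marks.
    line = unicodedata.normalize("NFD", line)
    line = "".join(ch for ch in line if unicodedata.category(ch) != "Mn")
    # Replace anything outside a-z with a space, then collapse runs of spaces.
    line = re.sub(r"[^a-z]+", " ", line)
    return re.sub(r" +", " ", line).strip()

with open("corpus_raw.txt", encoding="utf-8") as src, \
     open("corpus_clean.txt", "w", encoding="utf-8") as dst:
    for raw in src:
        cleaned = normalize_line(raw)
        if cleaned:
            dst.write(cleaned + "\n")
```

The cleaned file is then what goes into the LM toolkit (KenLM's lmplz in my case).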

I already tried changing the LM weights, but nothing changed.

The audio comes from LibriVox, VoxForge, USMA, and a few other sources.

Computing acoustic model predictions...
100% (447 of 447) |##############################################################| Elapsed Time: 0:16:20 Time:  0:16:20
Decoding predictions...
100% (447 of 447) |##############################################################| Elapsed Time: 0:22:37 Time:  0:22:37
Test - WER: 0.999991, CER: 0.483454, loss: 57.265446
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 1.557328
 - src: "no es mía"
 - res: "nomia"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 1.772164
 - src: "no no tengo"
 - res: "nanotango"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 1.954995
 - src: "espera por mí"
 - res: "espeapamí"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 2.138840
 - src: "pon la mesa"
 - res: "polama"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 2.865121
 - src: "no tenía prisa"
 - res: "notaníapisa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 3.000000, loss: 3.313169
 - src: "esto nos encanta"
 - res: "estanosaencanta"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 3.449407
 - src: "me lo comentaste"
 - res: "medacamentaste"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 3.533856
 - src: "ve ahora mismo"
 - res: "veaoramisma"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 3.720070
 - src: "yo no supe"
 - res: "yanoup"

Hm, I’m not sure this is a bug; I just think you need to train for longer / with more data.

For the last run I was using transfer learning; this new one was trained with master.

Test - WER: 0.202060, CER: 0.060539, loss: 17.951622
--------------------------------------------------------------------------------
WER: 2.400000, CER: 13.000000, loss: 51.652328
 - src: "pareció el tiempo largoalosquetenían deseosdevolverasupaís"
 - res: "parece el tiempo largo a los que tenian deseos de volver a su pais"
--------------------------------------------------------------------------------
WER: 2.166667, CER: 51.000000, loss: 422.390808
 - src: "como acostumbra si yo tengo sed"
 - res: "como la que seis en dos sombras desnudas y pulidas que corran mordiendo se"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 11.000000, loss: 31.198582
 - src: "quién escribió hamlet"
 - res: "quien es que bio and let"
--------------------------------------------------------------------------------
WER: 1.800000, CER: 31.000000, loss: 198.952866
 - src: "bien está contestó la zorra"
 - res: "bien lo has pensado contestar a las otras vanas ella"
--------------------------------------------------------------------------------
WER: 1.777778, CER: 20.000000, loss: 64.840401
 - src: "sihubierasabidoqueloestabamirandoaquélladequiensentíalleno su corazón hubiera sido para él grande alegría"
 - res: "si hubiera sabido que lo estaba mirando aquella de quien sentia llenos corazon hubiera sido para el gran alegria"
--------------------------------------------------------------------------------
WER: 1.750000, CER: 10.000000, loss: 39.020081
 - src: "estoy aún viviendo todavía"
 - res: "esto a un bien de toda la"
--------------------------------------------------------------------------------
WER: 1.714286, CER: 42.000000, loss: 291.003967
 - src: "yo el mejor día me iré también"
 - res: "yo el mejor dia tambien me ire y no quiero que a la hora de morir"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 11.000000, loss: 30.106270
 - src: "lasangrebrotabademisuñascreíque me haría perder la vida"
 - res: "la sangre brotaba de mis unas crei que me haya perder la vida"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 41.000000, loss: 271.126648
 - src: "he aquí nuestras razones primera vd"
 - res: "era que nuestras razones primera ute la convertido esta expedicion en un"
--------------------------------------------------------------------------------
WER: 1.400000, CER: 23.000000, loss: 129.272705
 - src: "cuando apareció su tío quinientos"
 - res: "cuando parecio su tio don andres saliendo de la"
--------------------------------------------------------------------------------

Clearly my data needs to be cleaned; those long joined words may well be the source of the issue with the LM, since without the LM I’m getting decent transcriptions.
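A quick way to flag those fused transcriptions is to scan the training CSVs for implausibly long tokens. A rough sketch, assuming the usual DeepSpeech CSV layout with wav_filename and transcript columns (the length threshold is just a guess):

```python
import csv

MAX_WORD_LEN = 20  # heuristic: Spanish words longer than this are almost certainly fused

def fused_tokens(transcript):
    """Return the tokens that look like several words joined together."""
    return [w for w in transcript.split() if len(w) > MAX_WORD_LEN]

with open("train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        bad = fused_tokens(row["transcript"])
        if bad:
            print(row["wav_filename"], bad)
```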

I’ll clean the data, train again, and share what happens.

Finally I’ve found the issue with the wrong transcriptions without spaces: it was generate_trie. I used my alphabet.h with the hardcoded alphabet, recompiled generate_trie, and now the transcriptions are correct. I saw that @reuben is working on a utf-8 branch, which may be helpful here. Of course this is not the real solution, but now we know where the issue is coming from.
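In case anyone hits something similar: since the symptom was a mismatch between the alphabet baked into generate_trie and the one used for training, a cheap sanity check is to verify that every character in your transcripts actually appears in alphabet.txt. This is only a sketch; it assumes DeepSpeech's one-character-per-line alphabet format and the standard CSV columns:

```python
import csv

# DeepSpeech's alphabet.txt has one character per line; lines starting
# with '#' are comments, and a lone space is a valid entry.
with open("alphabet.txt", encoding="utf-8") as f:
    alphabet = {line.rstrip("\n") for line in f if not line.startswith("#")}

ghosts = set()
with open("train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        ghosts |= set(row["transcript"]) - alphabet

if ghosts:
    print("Characters in transcripts but missing from alphabet.txt:", sorted(ghosts))
else:
    print("All transcript characters are covered by alphabet.txt")
```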