Same result without the LM; no clue where the # symbol is coming from.
I have also manually checked that all the .csv
and .txt
files are UTF-8.
If I print the transcriptions column, everything looks OK.
./train.sh
Preprocessing ['/data/home/neoxz/Desktop/deepspeech/data/train/train.csv']
0 en el estadio hidalgo se develan reconocimient...
1 llámame alguna vez
2 sorteando los promontorios de los respaldos lo...
3 dónde están mis libros
4 el interior es de una sola nave
5 lo vi la semana pasada
6 flotaba sobre las plantas más allá de estos co...
I’ve run a score test with the Windows speech recognition and all the audios scored over 40% confidence, which means there are no faulty audio files.
@carlfm01 Last time I had a similar failure, it was because of a bug in my alphabet file (a missing comment). Please make sure you reproduce starting from scratch. Check whether any of your #
characters could be a different but similar-looking Unicode codepoint.
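One quick way to check for a look-alike codepoint is to scan the file and report the Unicode name of anything that resembles `#`. This is only a sketch: `find_odd_hashes` and the look-alike set are mine, not part of DeepSpeech.

```python
import unicodedata

# Common look-alikes for ASCII '#':
# U+FF03 FULLWIDTH NUMBER SIGN, U+FE5F SMALL NUMBER SIGN.
LOOKALIKE_HASHES = {"\uff03", "\ufe5f"}

def find_odd_hashes(path):
    """Return (line number, char, Unicode name) for every '#' look-alike."""
    odd = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            for ch in line:
                if ch in LOOKALIKE_HASHES:
                    odd.append((lineno, ch, unicodedata.name(ch)))
    return odd
```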
Can you share the output of the xxd tool on your alphabet file?
xxd alphabet.txt
00000000: efbb bf23 2045 6163 6820 6c69 6e65 2069 ...# Each line i
00000010: 6e20 7468 6973 2066 696c 6520 7265 7072 n this file repr
00000020: 6573 656e 7473 2074 6865 2055 6e69 636f esents the Unico
00000030: 6465 2063 6f64 6570 6f69 6e74 2028 5554 de codepoint (UT
00000040: 462d 3820 656e 636f 6465 6429 0a23 2061 F-8 encoded).# a
00000050: 7373 6f63 6961 7465 6420 7769 7468 2061 ssociated with a
00000060: 206e 756d 6572 6963 206c 6162 656c 2e0a numeric label..
00000070: 2320 4120 6c69 6e65 2074 6861 7420 7374 # A line that st
00000080: 6172 7473 2077 6974 6820 2320 6973 2061 arts with # is a
00000090: 2063 6f6d 6d65 6e74 2e20 596f 7520 6361 comment. You ca
000000a0: 6e20 6573 6361 7065 2069 7420 7769 7468 n escape it with
000000b0: 205c 2320 6966 2079 6f75 2077 6973 680a \# if you wish.
000000c0: 2320 746f 2075 7365 2027 2327 2061 7320 # to use '#' as
000000d0: 6120 6c61 6265 6c2e 0a20 0a61 0a62 0a63 a label.. .a.b.c
000000e0: 0a64 0a65 0a66 0a67 0a68 0a69 0a6a 0a6b .d.e.f.g.h.i.j.k
000000f0: 0a6c 0a6d 0a6e 0a6f 0a70 0a71 0a72 0a73 .l.m.n.o.p.q.r.s
00000100: 0a74 0a75 0a76 0a77 0a78 0a79 0a7a 0ac3 .t.u.v.w.x.y.z..
00000110: bc0a c3a1 0ac3 a90a c3ad 0ac3 b30a c3ba ................
00000120: 0ac3 b10a 2320 5468 6520 6c61 7374 2028 ....# The last (
00000130: 6e6f 6e2d 636f 6d6d 656e 7429 206c 696e non-comment) lin
00000140: 6520 6e65 6564 7320 746f 2065 6e64 2077 e needs to end w
00000150: 6974 6820 6120 6e65 776c 696e 652e 0a ith a newline..
It is showing the following chars as dots:
ü
á
é
í
ó
ú
ñ
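For what it’s worth, those dots are just how xxd renders bytes outside printable ASCII; the hex pairs themselves (`c3bc`, `c3a1`, …) are exactly the UTF-8 encodings of those letters, which a quick Python check confirms:

```python
# Each accented letter from the xxd dump is a valid two-byte UTF-8
# sequence; xxd simply prints any non-ASCII byte as '.'.
samples = {
    b"\xc3\xbc": "ü",
    b"\xc3\xa1": "á",
    b"\xc3\xa9": "é",
    b"\xc3\xad": "í",
    b"\xc3\xb3": "ó",
    b"\xc3\xba": "ú",
    b"\xc3\xb1": "ñ",
}
for raw, expected in samples.items():
    assert raw.decode("utf-8") == expected
```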
Saving the .txt as ASCII and reading it with latin-1
instead of utf-8
seems to work; now it prints the chars correctly. Running 1 epoch…
It works!
Test - WER: 0.785235, CER: 0.373803, loss: 53.251236
--------------------------------------------------------------------------------
WER: 5.333333, CER: 59.000000, loss: 349.425049
- src: "llámame cuando puedas"
- res: "no tante los ingas petion extend us on indios por la rima a ate e pura"
--------------------------------------------------------------------------------
WER: 3.000000, CER: 5.000000, loss: 21.311464
- src: "voltéala"
- res: "note a la"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 8.000000, loss: 16.086720
- src: "tiene sobrepeso"
- res: "y e e s ordres"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 7.000000, loss: 27.520256
- src: "tosí sangre"
- res: "to s i am be"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 8.000000, loss: 32.984531
- src: "quién ayuna"
- res: "i m a i n"
--------------------------------------------------------------------------------
WER: 2.250000, CER: 10.000000, loss: 40.845211
- src: "era alrededor del mediodía"
- res: "e a l e e or de me ilia"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 2.000000, loss: 3.153902
- src: "oremos"
- res: "pore mos"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 4.859083
- src: "tienes"
- res: "ye es"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 6.062123
- src: "desafina"
- res: "de asia"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 6.154467
- src: "dormí"
- res: "do i"
--------------------------------------------------------------------------------
I Exporting the model...
So, what was the culprit? We should have no problem using UTF-8.
It is not working with UTF-8: chars like á and ñ fail. If I print the alphabet as it is being read with UTF-8, it prints white spaces for á, é, í, ó, ú and ñ.
Then there is something wrong somewhere, that’s not right.
To decode the alphabet, latin-1
is required.
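For anyone hitting the same thing, here is a minimal loader sketch that makes the encoding explicit and strips the BOM the xxd dump shows (`ef bb bf`). The function name and comment handling are mine, not DeepSpeech’s code:

```python
import io

def load_alphabet(path, encoding="utf-8"):
    """Read one label per line, skipping '#' comments; strips a leading UTF-8 BOM."""
    labels = []
    with io.open(path, encoding=encoding) as f:
        for line in f:
            line = line.lstrip("\ufeff")  # drop BOM on the first line
            if line.startswith("#"):
                continue
            labels.append(line.rstrip("\n"))
    return labels

# Swap in encoding="latin-1" to reproduce the workaround described above.
```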
Now the native client prints:
As UTF-8:
CHAR:
CHAR: a
CHAR: b
CHAR: c
CHAR: d
CHAR: e
CHAR: f
CHAR: g
CHAR: h
CHAR: i
CHAR: j
CHAR: k
CHAR: l
CHAR: m
CHAR: n
CHAR: o
CHAR: p
CHAR: q
CHAR: r
CHAR: s
CHAR: t
CHAR: u
CHAR: v
CHAR: w
CHAR: x
CHAR: y
CHAR: z
CHAR: ü
CHAR: á
CHAR: é
CHAR: Ã
CHAR: ó
CHAR: ú
CHAR: ñ
As ANSI:
CHAR:
CHAR: a
CHAR: b
CHAR: c
CHAR: d
CHAR: e
CHAR: f
CHAR: g
CHAR: h
CHAR: i
CHAR: j
CHAR: k
CHAR: l
CHAR: m
CHAR: n
CHAR: o
CHAR: p
CHAR: q
CHAR: r
CHAR: s
CHAR: t
CHAR: u
CHAR: v
CHAR: w
CHAR: x
CHAR: y
CHAR: z
CHAR: ü
CHAR: á
CHAR: é
CHAR: í
CHAR: ó
CHAR: ú
CHAR: ñ
But with ANSI it prints the accented chars as weird symbols. (This may be on the C# side.)
We have no problem with UTF-8 on other platforms
Even using chars like á or ñ?
Yes, we have people hacking on Spanish, Russian and many other languages with UTF-8, and no problems like that.
This looks like a case of mojibake: treating UTF-8 encoded data as ISO 8859-1 (Latin-1) or Windows-1252. The xxd
output you shared shows properly UTF-8 encoded data; it even has a UTF-8 Byte Order Mark. And the Ã characters are a classic mark of this type of mojibake; see for example https://www.i18nqa.com/debug/bug-utf-8-latin1.html
So I think the problem could be external: for example, DeepSpeech is doing the right thing and outputting proper UTF-8, but then your terminal, your text editor, or something else is interpreting that output with the wrong encoding.
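The round trip is easy to demonstrate: encoding the Spanish letters as UTF-8 and then decoding those bytes as Latin-1 produces exactly the `Ã` pairs seen in the native client output.

```python
# UTF-8 bytes misread as Latin-1: every accented Spanish letter becomes
# a two-character sequence starting with 'Ã' (U+00C3), e.g. 'á' -> 'Ã¡'.
for ch in "üáéíóúñ":
    garbled = ch.encode("utf-8").decode("latin-1")
    assert len(garbled) == 2 and garbled[0] == "Ã"
```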
@lissyx @reuben any suggestions on this? The first model I trained used only 80 h; with the LM it was missing all the spaces. As I read in one issue that the problem might be weak predictions from the acoustic model, I then trained one with 230 h, with the same result: missing spaces and 99% WER. My LM was built from Wikipedia text plus the transcriptions of the audios; the text only contains a-z.
I already tried changing the weights of the LM, but nothing.
The audios are from LibriVox, VoxForge, USMA and a few other sources.
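A sanity check that may help debug the missing spaces is to verify that every character in the transcripts (including the space) actually appears in the alphabet. A sketch, assuming a DeepSpeech-style CSV with a `transcript` column; the function name is mine:

```python
import csv

def missing_chars(csv_path, alphabet_chars):
    """Return the set of transcript characters absent from the alphabet."""
    seen = set()
    with open(csv_path, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            seen.update(row["transcript"])
    return seen - set(alphabet_chars)
```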
Computing acoustic model predictions...
100% (447 of 447) |##############################################################| Elapsed Time: 0:16:20 Time: 0:16:20
Decoding predictions...
100% (447 of 447) |##############################################################| Elapsed Time: 0:22:37 Time: 0:22:37
Test - WER: 0.999991, CER: 0.483454, loss: 57.265446
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 1.557328
- src: "no es mía"
- res: "nomia"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 1.772164
- src: "no no tengo"
- res: "nanotango"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 1.954995
- src: "espera por mí"
- res: "espeapamí"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 2.138840
- src: "pon la mesa"
- res: "polama"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 2.865121
- src: "no tenía prisa"
- res: "notaníapisa"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 3.000000, loss: 3.313169
- src: "esto nos encanta"
- res: "estanosaencanta"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 3.449407
- src: "me lo comentaste"
- res: "medacamentaste"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 4.000000, loss: 3.533856
- src: "ve ahora mismo"
- res: "veaoramisma"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 5.000000, loss: 3.720070
- src: "yo no supe"
- res: "yanoup"