Ghost char that is not in the dataset

Hi, I’m trying to start testing my setup with a small Spanish dataset (80h); then I will process the whole dataset. The problem is that the output is showing a # character.

Problem:

WER: 7.000000, CER: 76.000000, loss: 516.562012
 - src: "iirn nh#ai  ct#hihc  "
 - res: " t#tE f  fh#it #n a  #ch# niit #h fh chs# i #tcnt #hts#i #sn tsn #r #  fh #iih#hts#i #lih c "
--------------------------------------------------------------------------------
WER: 3.000000, CER: 8.000000, loss: 39.043110
 - src: " ih#nn#hssts"
 - res: " t n #it"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 9.000000, loss: 21.328753
 - src: "iinhst#  Ehsit"
 - res: "iinh #it # ihst #it"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 6.723901
 - src: "ctsns"
 - res: "ct #nn"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 2.000000, loss: 8.263452
 - src: "hstnhfnhst "
 - res: "hstnhcn st "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 4.000000, loss: 13.701299
 - src: "iitlne  "
 - res: "it#nn n "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 6.000000, loss: 15.189466
 - src: "lhfh#ch#nn#a   "
 - res: "lhc #nni   "

Here’s my alphabet.txt

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
 
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
ü
á
é
í
ó
ú
ñ
# The last (non-comment) line needs to end with a newline.

The util/check_characters.py is showing:

### Reading in the following transcript files: ###
### ['/data/home/neoxz/Desktop/deepspeech/data/dev/dev.csv', '/data/home/neoxz/Desktop/deepspeech/data/test/test.csv', '/data/home/neoxz/Desktop/deepspeech/data/train/train.csv'] ###
### The following unique characters were found in your transcripts: ###
['m', 'h', 'z', 'v', 'j', 'k', 't', 'o', 'a', 'ñ', 'd', 'e', 'u', 'c', 'q', 'é', 'r', 'á', 'ü', 'l', 'b', 'ó', 'x', 'i', 'f', 's', 'n', 'g', ' ', 'y', 'ú', 'w', 'í', 'p']

I was using my own LM built from 2M sentences (including the training ones) with the default English settings to start testing it. I’ve removed the lm flags and am currently running 1 epoch to see if the LM is the problem.

I’m still getting familiar with the whole process, so let me know if I’m missing any important information about what I’m doing.

The ./bin/run-ldc93s1.sh script works perfectly with the current setup.

To build the LM I used the generated alphabet chars to check the LM sentences. All checks passed.

Any ideas?

Same result without the LM; no clue where the # symbol is coming from.

I have also manually checked that all the .csv and .txt files are UTF-8.
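One way to make that manual check repeatable is a small script that tries to strictly decode each file as UTF-8 (a minimal sketch; the file paths are illustrative, not necessarily the actual layout):

```python
# Sketch: verify that transcript/alphabet files decode as strict UTF-8.
from pathlib import Path

def is_strict_utf8(path):
    """Return True if the file's raw bytes decode as UTF-8 without errors."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Illustrative paths, adjust to your own layout:
for f in ["data/train/train.csv", "data/dev/dev.csv", "alphabet.txt"]:
    if Path(f).exists():
        print(f, is_strict_utf8(f))
```

A file saved as Latin-1 that contains accented characters (á, ñ, …) will fail this check, since bytes like 0xE1 are not valid UTF-8 sequences on their own.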

If I print the transcriptions column, everything looks OK.

./train.sh
Preprocessing ['/data/home/neoxz/Desktop/deepspeech/data/train/train.csv']
0        en el estadio hidalgo se develan reconocimient...
1                                       llámame alguna vez
2        sorteando los promontorios de los respaldos lo...
3                                   dónde están mis libros
4                          el interior es de una sola nave
5                                   lo vi la semana pasada
6        flotaba sobre las plantas más allá de estos co...

I’ve run a scoring test with the Windows speech recognition and all the audios scored over 40% confidence, which means there are no faulty audios. :confused:

@carlfm01 Last time I had a similar failure, it was because of a bug in my alphabet file (a missing comment). Please make sure you reproduce starting from scratch, and check whether any of your # characters could be a different, same-looking Unicode character.

Can you share the output of the xxd tool on your alphabet file?

@reuben

 xxd alphabet.txt
00000000: efbb bf23 2045 6163 6820 6c69 6e65 2069  ...# Each line i
00000010: 6e20 7468 6973 2066 696c 6520 7265 7072  n this file repr
00000020: 6573 656e 7473 2074 6865 2055 6e69 636f  esents the Unico
00000030: 6465 2063 6f64 6570 6f69 6e74 2028 5554  de codepoint (UT
00000040: 462d 3820 656e 636f 6465 6429 0a23 2061  F-8 encoded).# a
00000050: 7373 6f63 6961 7465 6420 7769 7468 2061  ssociated with a
00000060: 206e 756d 6572 6963 206c 6162 656c 2e0a   numeric label..
00000070: 2320 4120 6c69 6e65 2074 6861 7420 7374  # A line that st
00000080: 6172 7473 2077 6974 6820 2320 6973 2061  arts with # is a
00000090: 2063 6f6d 6d65 6e74 2e20 596f 7520 6361   comment. You ca
000000a0: 6e20 6573 6361 7065 2069 7420 7769 7468  n escape it with
000000b0: 205c 2320 6966 2079 6f75 2077 6973 680a   \# if you wish.
000000c0: 2320 746f 2075 7365 2027 2327 2061 7320  # to use '#' as
000000d0: 6120 6c61 6265 6c2e 0a20 0a61 0a62 0a63  a label.. .a.b.c
000000e0: 0a64 0a65 0a66 0a67 0a68 0a69 0a6a 0a6b  .d.e.f.g.h.i.j.k
000000f0: 0a6c 0a6d 0a6e 0a6f 0a70 0a71 0a72 0a73  .l.m.n.o.p.q.r.s
00000100: 0a74 0a75 0a76 0a77 0a78 0a79 0a7a 0ac3  .t.u.v.w.x.y.z..
00000110: bc0a c3a1 0ac3 a90a c3ad 0ac3 b30a c3ba  ................
00000120: 0ac3 b10a 2320 5468 6520 6c61 7374 2028  ....# The last (
00000130: 6e6f 6e2d 636f 6d6d 656e 7429 206c 696e  non-comment) lin
00000140: 6520 6e65 6564 7320 746f 2065 6e64 2077  e needs to end w
00000150: 6974 6820 6120 6e65 776c 696e 652e 0a    ith a newline..

It is showing the following chars as dots:

ü
á
é
í
ó
ú
ñ
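Those dots are just how xxd renders non-ASCII bytes; the underlying bytes in the dump are valid two-byte UTF-8 sequences for the accented characters, and the dump also starts with ef bb bf, the UTF-8 byte order mark. A quick check in Python:

```python
# The dots in the xxd dump are how xxd renders non-ASCII bytes.
# The bytes themselves are valid two-byte UTF-8 sequences:
accents = bytes.fromhex("c3bc c3a1 c3a9 c3ad c3b3 c3ba c3b1")
print(accents.decode("utf-8"))  # üáéíóúñ

# The dump also begins with ef bb bf, the UTF-8 byte order mark:
assert bytes.fromhex("ef bb bf") == "\ufeff".encode("utf-8")
```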

I saved the .txt as ASCII, and reading it with latin-1 instead of utf-8 seems to work; it now prints the chars correctly. Running 1 epoch…

It works! :slight_smile:

Test - WER: 0.785235, CER: 0.373803, loss: 53.251236
--------------------------------------------------------------------------------
WER: 5.333333, CER: 59.000000, loss: 349.425049
 - src: "llámame cuando puedas"
 - res: "no tante los ingas petion extend us on indios por la rima a ate e pura"
--------------------------------------------------------------------------------
WER: 3.000000, CER: 5.000000, loss: 21.311464
 - src: "voltéala"
 - res: "note a la"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 8.000000, loss: 16.086720
 - src: "tiene sobrepeso"
 - res: "y e e s ordres"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 7.000000, loss: 27.520256
 - src: "tosí sangre"
 - res: "to s i am be"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 8.000000, loss: 32.984531
 - src: "quién ayuna"
 - res: "i m a i n"
--------------------------------------------------------------------------------
WER: 2.250000, CER: 10.000000, loss: 40.845211
 - src: "era alrededor del mediodía"
 - res: "e a l e e or de me ilia"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 2.000000, loss: 3.153902
 - src: "oremos"
 - res: "pore mos"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 4.859083
 - src: "tienes"
 - res: "ye es"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 6.062123
 - src: "desafina"
 - res: "de asia"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 6.154467
 - src: "dormí"
 - res: "do i"
--------------------------------------------------------------------------------
I Exporting the model...

So, what was the culprit? We should have no problem using UTF-8.

It’s not working with UTF-8: chars like á and ñ fail. If I print the alphabet as it is read with UTF-8, á, é, í, ó, ú and ñ come out as white spaces.

Then there is something wrong somewhere; that’s not right.

Decoding the alphabet requires latin-1.

Now the native client prints:

As UTF-8:

CHAR:  
CHAR: a
CHAR: b
CHAR: c
CHAR: d
CHAR: e
CHAR: f
CHAR: g
CHAR: h
CHAR: i
CHAR: j
CHAR: k
CHAR: l
CHAR: m
CHAR: n
CHAR: o
CHAR: p
CHAR: q
CHAR: r
CHAR: s
CHAR: t
CHAR: u
CHAR: v
CHAR: w
CHAR: x
CHAR: y
CHAR: z
CHAR: ü
CHAR: á
CHAR: é
CHAR: í
CHAR: ó
CHAR: ú
CHAR: ñ

As ANSI :

CHAR:  
CHAR: a
CHAR: b
CHAR: c
CHAR: d
CHAR: e
CHAR: f
CHAR: g
CHAR: h
CHAR: i
CHAR: j
CHAR: k
CHAR: l
CHAR: m
CHAR: n
CHAR: o
CHAR: p
CHAR: q
CHAR: r
CHAR: s
CHAR: t
CHAR: u
CHAR: v
CHAR: w
CHAR: x
CHAR: y
CHAR: z
CHAR: ü
CHAR: á
CHAR: é
CHAR: í
CHAR: ó
CHAR: ú
CHAR: ñ

But with ANSI it prints the accented chars as weird symbols. (This may be on the C# side.)

We have no problem with UTF-8 on other platforms :confused:

Even using chars like á or ñ?

Yes, we have people hacking on Spanish, Russian and many other languages with UTF-8, and no problems like that.

This looks like a case of mojibake: treating UTF-8 encoded data as ISO 8859-1 (Latin-1) or Windows-1252. The xxd output you shared shows properly UTF-8 encoded data; it even has a UTF-8 byte order mark. And the Ã characters are a classic mark of this type of mojibake; see for example https://www.i18nqa.com/debug/bug-utf-8-latin1.html

So I think the problem could be external: for example, DeepSpeech is doing the right thing and outputting proper UTF-8, but then your terminal, your text editor, or something else is interpreting that output with the wrong encoding.
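The mojibake described above is easy to reproduce: decoding UTF-8 bytes with the wrong codec turns each accented character into a two-character Ã-prefixed sequence, and swapping the encodings back recovers the original (a minimal sketch):

```python
# Classic UTF-8 -> Latin-1 mojibake: decode UTF-8 bytes with the wrong codec.
text = "llámame"
garbled = text.encode("utf-8").decode("latin-1")
print(garbled)  # llÃ¡mame

# Reading it back with the encodings swapped recovers the original:
assert garbled.encode("latin-1").decode("utf-8") == text
```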

I’ve hardcoded the alphabet in the client for now; thanks @reuben @lissyx, I’ll hack on the issue later.