I use the French Common Voice dataset (74 hours), which I previously converted with the import_cv2.py script. My alphabet is:
### Reading in the following transcript files: ###
### ['/data/clips/train.csv'] ###
### The following unique characters were found in your transcripts:
###
â
z
'
!
œ
ë
í
g
q
=
ê
n
l
°
ñ
)
a
r
î
i
ç
e
—
ù
j
y
ï
á
…
½
«
û
w
;
p
’
é
/
ô
ö
ÿ
à
d
:
x
h
u
b
k
ü
º
»
–
s
è
v
m
c
o
t
f
### ^^^ You can copy-paste these into data/alphabet.txt ###
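For readers unfamiliar with how such a character list comes about, here is a minimal sketch of the idea, not the actual tool's code. It assumes the CSV has a `transcript` column, as DeepSpeech training CSVs do; the function name `unique_characters` and the two sample rows are made up for illustration.

```python
import csv
import io


def unique_characters(csv_file, column="transcript"):
    """Return the set of characters used in one column of a CSV file object."""
    chars = set()
    for row in csv.DictReader(csv_file):
        # Updating a set with a string adds each character individually.
        chars.update(row[column])
    return chars


# Tiny in-memory stand-in for train.csv (made-up rows, not real data).
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1000,ça va\n"
    "b.wav,2000,l’été\n"
)
found = unique_characters(sample)
for ch in sorted(found):
    print(repr(ch))
```

With a real file you would pass `open("/data/clips/train.csv", encoding="utf-8", newline="")` instead of the in-memory sample. Every distinct character becomes a label the model has to learn, which is why stray symbols in the transcripts matter.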
lissyx
Well, that’s not super surprising: you are on an old version of the dataset, and there is work that we need to do to improve its quality. Again, please join the efforts I linked above. There’s no point in everyone redoing the same work and hitting the same issues again and again; efforts need to be shared.
lissyx
Please be mindful, I never said it was “unusable”. We are just at the beginning, so some work is needed. Common Voice is not intended only for DeepSpeech, so some cleanup cannot be done in Common Voice itself and has to be done when training with DeepSpeech.
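As an illustration of the kind of training-side cleanup meant here, the sketch below normalizes a transcript before it is written to the training CSV. It is not the project's official cleanup: the `ALLOWED` set, the `REPLACEMENTS` table and the function name `clean_transcript` are all assumptions chosen for this example, based on the characters listed in the alphabet output above.

```python
import re

# Assumption: target alphabet is lowercase French letters, apostrophe and space.
ALLOWED = set("abcdefghijklmnopqrstuvwxyzàâçéèêëîïôöùûüÿœ' ")

# Assumption: typographic variants are mapped to plain equivalents.
REPLACEMENTS = {
    "\u2019": "'",  # right single quotation mark -> ASCII apostrophe
    "\u2014": " ",  # em dash -> space
    "\u2013": " ",  # en dash -> space
    "\u2026": " ",  # horizontal ellipsis -> space
}


def clean_transcript(text):
    """Lowercase, normalize punctuation, drop anything outside ALLOWED."""
    text = text.lower()
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    text = "".join(ch for ch in text if ch in ALLOWED)
    # Collapse the runs of spaces left behind by removed characters.
    return re.sub(r"\s+", " ", text).strip()


print(clean_transcript("« L’été… »"))
```

Silently dropping characters such as `½` or `°` changes the text without changing the audio, so for those samples it may be safer to skip the row entirely rather than clean it; which approach fits is a judgment call for the dataset at hand.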