Replacing non-ASCII characters when processing the corpus does not substitute the nearest ASCII character

Hi
I’m trying to build a Portuguese DeepSpeech model. I found a Portuguese corpus in Voxforge and modified import_voxforge.py to download and process it.
But the process does not replace the special characters with the nearest ASCII character; instead, it replaces them with spaces.
I understand that the issue is in:

# Imports assumed by this snippet (as in DeepSpeech's importer scripts):
import codecs
import re
import unicodedata
from glob import glob
from os import path

from tensorflow.python.platform import gfile

def _generate_dataset(data_dir, data_set):
    extracted_dir = path.join(data_dir, data_set)
    files = []
    for promts_file in glob(path.join(extracted_dir+"/*/etc/", "PROMPTS")):
        if path.isdir(path.join(promts_file[:-11],"wav")):
            with codecs.open(promts_file, 'r', 'utf-8') as f:
                for line in f:
                    id = line.split(' ')[0].split('/')[-1]
                    sentence = ' '.join(line.split(' ')[1:])
                    # sentence = re.sub("[^a-z']"," ",sentence.strip().lower())
                    sentence = re.sub("[^a-zàâäôéèëêïîçù']"," ",sentence.strip().lower())
                    transcript = ""
                    for token in sentence.split(" "):
                        word = token.strip()
                        if word!="" and word!=" ":
                            transcript += word + " "
                    transcript = unicodedata.normalize("NFKD", transcript.strip())  \
                                              .encode("ascii", "ignore")            \
                                              .decode("ascii", "ignore")
                    wav_file = path.join(promts_file[:-11],"wav/" + id + ".wav")
                    if gfile.Exists(wav_file):
                        wav_filesize = path.getsize(wav_file)
                        # remove audios that are shorter than 0.5s and longer than 20s.
                        # remove audios that are too short for transcript.
                        if (wav_filesize/32000)>0.5 and (wav_filesize/32000)<20 and transcript!="" and \
                            wav_filesize/len(transcript)>1400:
                            files.append((path.abspath(wav_file), wav_filesize, transcript))

where I replaced the original regex with
sentence = re.sub("[^a-zàâäôéèëêïîçù']"," ",sentence.strip().lower()),
and I supposed that
unicodedata.normalize("NFKD", transcript.strip()) \
.encode("ascii", "ignore") \
.decode("ascii", "ignore")
would replace each non-ASCII character with the nearest ASCII character and get rid of the rest.
But that does not happen.
For example, one original sentence in the corpus (from the PROMPTS file) is:

anonymous-20121216-xhy/mfc/088 ONDE FICA A ESTAçãO DO TREM

and the corresponding line in voxforge-dev.csv is

~/dev/anonymous-20121216-xhy/wav/088.wav,124044,onde fica a esta o do trem
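One plausible standalone reproduction of exactly that output (a sketch, not the importer itself): if the PROMPTS text is precomposed (NFC) while the character class in the script ended up in decomposed form (NFD), then the precomposed ç is no longer a member of the class and gets replaced by a space, and ã is missing from the class entirely, so it is stripped too.

```python
import re
import unicodedata

# NFC input, as the corpus text would typically be stored.
sentence = unicodedata.normalize("NFC", "onde fica a estação do trem")

# The class from the script; note that ã is absent. Force it into NFD to
# simulate a decomposed pattern: each accented letter becomes a base letter
# plus a separate combining mark inside the class.
allowed = unicodedata.normalize("NFD", "[^a-zàâäôéèëêïîçù']")

cleaned = re.sub(allowed, " ", sentence)
collapsed = " ".join(cleaned.split())
print(collapsed)  # -> "onde fica a esta o do trem"
```

The decomposed class contains `c` and the combining cedilla as two separate members, so the single-codepoint ç in the data matches neither and is replaced by a space, reproducing the truncated transcript.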

I searched the web and also tried unidecode, with no luck.
I would appreciate any suggestions.

@reuben has trained Portuguese models and wrote much of the code dealing with localization, so he is, without a doubt, your best resource.

@kdavis Thanks! In the end I took a detour and used R for the job.


Hi Oscar,

Sorry for not responding earlier, I’ve been behind on my inbox lately. One thing you should be careful with is Unicode normalization forms. It’s possible that your data has two codepoints for ç while your regex has one (c + combining cedilla vs. precomposed ç).
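A quick illustration of the two forms:

```python
import unicodedata

precomposed = "\u00e7"  # 'ç' as one codepoint: LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"  # 'c' followed by COMBINING CEDILLA (renders the same)

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

The two strings look identical on screen but compare unequal, which is why a regex character class written in one form silently fails to match data stored in the other.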

Maybe your R solution normalized the data and regex to the same form.
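One way to sidestep the normalization-form mismatch entirely (a sketch, not the actual import_voxforge.py code) is to do the NFKD-to-ASCII transliteration *before* the regex, so the character class only ever needs plain a-z:

```python
import re
import unicodedata

def to_ascii_transcript(sentence):
    # Decompose accented letters, drop the combining marks, then filter.
    # Because transliteration happens first, the regex never has to name
    # any accented character, so normalization forms can't bite.
    ascii_sentence = (unicodedata.normalize("NFKD", sentence.strip().lower())
                      .encode("ascii", "ignore")
                      .decode("ascii"))
    ascii_sentence = re.sub("[^a-z']", " ", ascii_sentence)
    return " ".join(ascii_sentence.split())

# Works the same whether the input is precomposed (NFC) or decomposed (NFD):
for form in ("NFC", "NFD"):
    s = unicodedata.normalize(form, "ONDE FICA A ESTAÇÃO DO TREM")
    print(to_ascii_transcript(s))  # -> "onde fica a estacao do trem"
```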

@reuben
Thanks for your response. I know R much better than Python, so it was a natural workaround for me.
As you know, I’m building a Portuguese model, and when I searched for open-source corpora on the net I found three sources listed in Igor Macedo Quintanilha’s dissertation, but I only found VoxForge’s corpus online. Do you know if Sid and LapsBM1.4 are still available (and still open source, of course)?
Better yet: do you know of another open-source Portuguese corpus?
Thanks in advance
Oscar

The FalaBrasil download links are currently broken but I found archived copies of LapsBM and MailBenchmark on archive.org. I reached out to the maintainers and they told me they’re re-uploading the data.

@reuben
Thanks for the info!
In the meantime, I found a suggestion from @elpimous_robot about handling the audio files with the Voice Corpus Tool, and I’m just about to try training a model on this ‘expanded’ corpus.
Oscar

Hi, here is a link to the LapsBM, but it is a small dataset :frowning: