Key error "\u200d"

Shruthi_Sridhar · December 1, 2019, 5:47pm

I am fine-tuning the DeepSpeech v0.5.1 pretrained model on a Hindi dataset.
I added the unique characters reported by util/check_characters.py to alphabet.txt, but I get the error below:

Key error “\u200d”
Your transcripts contain characters which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your {train,dev,test}.csv transcripts, and then add all these to data/alphabet.txt.

\u200d is the zero width joiner and \u200c is the zero width non-joiner. They are non-printable characters.
How do I overcome this error?

alchemi5t · December 1, 2019, 6:05pm

Just remove the zwj and zwnj from the transcripts. 200d and 200c are not used consistently and should be cleaned. They are only used for graphical purposes and do not add anything to the data other than noise.

Write a script to remove all zwj and zwnj. You could tweak the check_characters script. Making a cleaner to allow only the characters in devanagari range should work best.
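Something along these lines could do the cleanup. It is only a rough sketch, not anything shipped with DeepSpeech: the function names are made up for illustration, and it assumes the CSVs are UTF-8 with the text in a `transcript` column. `strip_joiners` removes just the two zero-width characters, while `keep_devanagari` is the stricter cleaner that keeps only the Devanagari block (U+0900 to U+097F) plus spaces.

```python
import csv

# U+200C is ZERO WIDTH NON-JOINER, U+200D is ZERO WIDTH JOINER.
ZERO_WIDTH = {"\u200c", "\u200d"}


def strip_joiners(text):
    """Remove ZWNJ and ZWJ, leaving every other character untouched."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def keep_devanagari(text):
    """Keep only characters in the Devanagari block (U+0900-U+097F) and spaces.

    Everything else (Latin letters, punctuation, zero-width characters) is
    dropped, then runs of spaces are collapsed to a single space.
    """
    kept = "".join(
        ch for ch in text if "\u0900" <= ch <= "\u097f" or ch == " "
    )
    return " ".join(kept.split())


def clean_csv(src_path, dst_path, column="transcript", cleaner=strip_joiners):
    """Copy a CSV, applying `cleaner` to one column of every row.

    Assumption: the file is UTF-8 and has a header row containing `column`.
    """
    with open(src_path, newline="", encoding="utf-8") as fin, \
            open(dst_path, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row[column] = cleaner(row[column])
            writer.writerow(row)
```

You would call `clean_csv` once each for the train, dev and test CSVs, then rerun util/check_characters.py on the cleaned files to confirm that the zero-width characters are gone. Pass `cleaner=keep_devanagari` if you want the stricter filter, keeping in mind that it also drops digits and punctuation outside the Devanagari block.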

lissyx · December 1, 2019, 6:20pm

You cannot change the alphabet when you just re-use the pretrained model to perform extra tuning. The transfer-learning2 branch might allow that, but it’s a bit outdated and not 0.5.1.

Shruthi_Sridhar · December 2, 2019, 4:48am

@lissyx Thanks for the reply. I have around 16 hours of Hindi data. For fine-tuning, I had added the unique Hindi characters to alphabet.txt alongside the English alphabet. Since you mentioned that the alphabet cannot be changed, could you kindly suggest:
1) Is it possible to fine-tune the pretrained model on a non-English dataset? If yes, could you share any helpful links?
2) Or do I have to train my model from scratch?

Shruthi_Sridhar · December 2, 2019, 4:51am

@alchemi5t Thank you for the reply. I will try removing them from the transcripts.

lissyx · December 2, 2019, 10:09am

You won’t get anything usable from 16 hours of audio.

I just told you how …
