Creating DeepSpeech Model for Hindi

@lissyx Hi bro,

I created a model for Hindi.
After training on my data, I get an error at the test step:

I Restored variables from best validation checkpoint at hindi_checkpoint/best_dev-90, step 90
Testing model on data/test/test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00
Fatal Python error: Segmentation fault

What could be the problem?

Without more context, it's going to be hard … How have you set things up?
I had a similar crash, resolved by re-creating the virtualenv from scratch …

Well, I recorded a few audio clips of my own and prepared the datasets.

Step 1: prepared the datasets, vocabulary and alphabet.txt, then created the ARPA file and lm.binary from the Hindi vocabulary using KenLM.
Step 2: using the native client Bazel build, created the trie file from lm.binary.
Step 3: cloned DeepSpeech and checked out 0.5.1.

I placed my trie and lm.binary into data/lm/.

I am running DeepSpeech.py, but after training I get a segmentation fault at the test step.

Is it because there is too little data in the test set?

Unlikely. You have not documented anything about how you set up the virtualenv … Did you read my reply?

Okay, I'll try again, creating the venv from scratch.

I started from scratch and created the venv again. Same error.

Well, sorry, but with so little detail, I can't even try to reproduce …

@cryptoaimdy Seriously, I would like to help you, but you keep not sharing your complete STR (steps to reproduce). This is making both of us lose valuable time. So once again, share detailed and complete STR of everything you do to reproduce the issue. And try to reproduce with our default data (LDC93S1, English model, LM and alphabet, our native client build), to make sure the problem is not coming from there.

There are 10+ variables in play here; we can't do divination from a single segfault.

Solved it this morning. The model is ready and gives a loss of around 60 on average. Now I am creating more Hindi datasets.

Would you mind sharing what the solution was?

As you suggested, I started from scratch and set up the venv again.

I think the problem was a version mismatch between the LM binary/trie and DeepSpeech: earlier I had created lm.binary and the trie outside the virtualenv, while I was running DeepSpeech inside the venv.
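A mismatch like the one described above can be caught with a small pre-flight check before training. This is only a minimal sketch: it assumes you note down yourself which release each tool was built from (for example, the contents of the checkout's VERSION file, if present), and the helper names `parse_version` and `same_release` are hypothetical, not part of DeepSpeech.

```python
def parse_version(text):
    """Turn a version string such as 'v0.5.1' into a comparable tuple (0, 5, 1)."""
    cleaned = text.strip().lstrip("vV")
    parts = []
    for piece in cleaned.split("."):
        # Keep only the digits of each dot-separated piece.
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def same_release(checkout_version, tool_version):
    """Return True when both version strings name the same release."""
    return parse_version(checkout_version) == parse_version(tool_version)


if __name__ == "__main__":
    # Hypothetical values: the training checkout and the release the
    # trie-building tool was compiled from.
    print(same_release("v0.5.1", "0.5.1"))
    print(same_release("0.5.1", "0.6.0"))
```

If the two strings differ, rebuilding lm.binary and the trie with tools from the same release as the training code is the safer route.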

Right, thanks, at least it confirms my first assumption. Glad to see it is working now.

Yes, the setup is what we need to do carefully.

But for Hindi, the src "" part of my test output is not accurate. During testing, src shows sentences that differ from my original test.csv, and they don't even make sense. It looks something like 'abcsdksbfdfa' (but in Hindi letters).

WER: 0.000000, CER: 0.000000, loss: 7.111163
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/010.wav
 - src: "ापखी खखैुेी"
 - res: "ापखी खखैुेी"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 8.599924
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/008.wav
 - src: "पेांीखतहाखाख खैुेी"
 - res: "पेांीखतहाखाख खैुेी"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 11.154081
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/007.wav
 - src: "ुबखंेदयदी"
 - res: "ुबखंेदयदी"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 12.389311
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/001.wav
 - src: " तपुबखैीबखुखतखप"
 - res: " तपुबखैीबखुखतखप"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 12.756799
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/006.wav
 - src: "बडैखखुखतखैयीखखपर"
 - res: "बडैखखुखतखैयीखखपर"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 17.487480
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/003.wav
 - src: "ाुखतरखदेतदीधीखुखाहखाख खैुखुपखखपन"
 - res: "ाुखतरखदेतदीधीखुखाहखाख खैुखुपखखपन"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 18.130671
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/002.wav
 - src: "तरखबडैीखैीखखापखी खखैुखाै खखपन"
 - res: "तरखबडैीखैीखखापखी खखैुखाै खखपन"

This is what I am getting. Is it because of too little data? I think src should be displayed exactly as it is in test.csv.

I don't understand, it looks like you have src == res, which would mean the computed transcription matches the expected transcription.
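To make the src == res remark concrete, here is a minimal sketch of how word and character error rates are commonly computed from an edit distance. It is an illustration under that assumption, not DeepSpeech's own evaluation code; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(src, res):
    """Word error rate: word-level edit distance divided by reference length."""
    src_words = src.split()
    if not src_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(src_words, res.split()) / len(src_words)


def cer(src, res):
    """Character error rate: character-level edit distance over reference length."""
    if not src:
        raise ValueError("reference transcript is empty")
    return edit_distance(src, res) / len(src)
```

With identical strings both rates are 0, which is why every entry where src and res match reports WER 0 and CER 0 regardless of whether the text itself is meaningful.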

WER: 1.000000, CER: 0.600000, loss: 33.510384
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/009.wav
 - src: "ैयीखखाखदुखखौखप हखपरा"
 - res: "पखाख खपरा"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 6.128268
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/004.wav
 - src: "ुखत"
 - res: "ुखत"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 6.713559
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/005.wav
 - src: "ापखी ख"
 - res: "ापखी ख"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 7.111163
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/010.wav
 - src: "ापखी खखैुेी"
 - res: "ापखी खखैुेी"
--------------------------------------------------------------------------------

For the first example I got a WER of 1.

Well, this is the test set showing worst examples. I don't see anything strange, please elaborate.

wav_filename,wav_filesize,transcript
001.wav,101000,तमहरआ कयआ नाम ह
002.wav,138000,मै आपकी कया सहायता कर सकता हू
003.wav,125000,सर मै विमवीशयोर से बात कर रहा हू
004.wav,78000,नाम
005.wav,99400,सहायता
006.wav,106000,आपका नाम क्या है
007.wav,80900,नई दिल्ली
008.wav,90700,हिंदी में बात करिए
009.wav,81900,क्या बोलना चाहते हैं
010.wav,88700,सहायता करिए

This is my original test.csv content. Compare it with the src values: they are totally different.

This is the original transcript.

This is what I get in src.
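One observation that may help narrow this down: each src string in the test output has the same number of characters as the corresponding transcript in test.csv, which is what you would see if the text were encoded to label indices with one alphabet and decoded with another. That is only a hypothesis, not something confirmed in this thread. The sketch below illustrates the effect with a toy alphabet; the function names are hypothetical, and the file format handling (one label per line, `#` lines as comments, `\#` for a literal hash) follows my understanding of alphabet.txt.

```python
def load_alphabet(lines):
    """Read labels in order: one per line, '#' lines are comments, '\\#' is a literal '#'."""
    labels = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("\\#"):
            labels.append("#")
        elif line.startswith("#"):
            continue  # comment line
        elif line:
            labels.append(line)
    return labels


def encode(text, labels):
    """Map each character to its index in the alphabet."""
    return [labels.index(ch) for ch in text]


def decode(ids, labels):
    """Map label indices back to characters."""
    return "".join(labels[i] for i in ids)


if __name__ == "__main__":
    # Toy alphabets with the same labels in a different order (assumption
    # made for illustration only).
    alphabet_a = load_alphabet(["# toy alphabet\n", "a\n", "b\n", " \n"])
    alphabet_b = load_alphabet([" \n", "b\n", "a\n"])
    ids = encode("ab a", alphabet_a)
    print(repr(decode(ids, alphabet_a)))  # round trip with the same alphabet
    print(repr(decode(ids, alphabet_b)))  # same length, scrambled characters
```

If this is what is happening, checking that the same alphabet.txt (same labels, same order) is used for building the trie, training, and testing would be the first thing to verify.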