I am trying to fine-tune the DeepSpeech model on YouTube data.
I downloaded audio and its subtitles from YouTube, then split that data by every sentence: OneSentence.zip (268.6 KB)
That trains normally,
but when I try to split the data into two-sentence chunks I get the following error: IndexError: index 0 is out of bounds for axis 0 with size 0
This is the second dataset: Twosentence.zip (268.2 KB)
There is no difference between these two datasets, except that in the first dataset every sentence is a separate file, and in the second dataset two sentences are merged together in one file.
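That "size 0" error often means some array built from a sample came out empty, e.g. an empty transcript row. Below is a minimal sketch (not DeepSpeech code, just an assumption-based check) that scans a training CSV in the standard DeepSpeech layout (`wav_filename`, `wav_filesize`, `transcript`) for rows with empty transcripts, which can produce an empty label array during training:

```python
import csv

def find_empty_transcripts(csv_path):
    """Return (line_number, wav_filename) for rows whose transcript is empty.

    An empty label array is one plausible cause of
    'IndexError: index 0 is out of bounds for axis 0 with size 0'.
    """
    bad = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Data rows start at physical line 2 (line 1 is the header).
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            if not row.get("transcript", "").strip():
                bad.append((line_no, row.get("wav_filename", "")))
    return bad
```

Running this over the two-sentence CSV and comparing against the one-sentence CSV would quickly show whether the merge step dropped any transcripts.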
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
@lissyx
Maybe I am doing something wrong; I will try to solve this problem.
What do you think about the above dataset? Do you think I can improve accuracy on Indian-accented speech with it?
I have around 7-8 hours of data like this.
lissyx
I already told you that I don’t understand your error exactly, because your description is too vague and sparse.
I think his problem is that the “two sentence” dataset triggers an error during training while the “one sentence” set works fine.
@Sushantmkarande I don’t think anybody can answer whether this is a good dataset or not. Accuracy depends on a lot of parameters, not just one fine-tuning dataset. It’s always good to try with your dataset and see what happens.
Thank you, that is exactly my problem.
Now I think the problem is occurring because the audio and transcripts are not in sync.
Sometimes words that are actually spoken in the audio are missing from the transcript.
So my guess is that the model expects the words in the transcript to appear in the audio, and when it tries to match the audio signal to the words and can’t find one, it throws an error.
Let me know if that’s the case.
lissyx
If you really concatenate both the audio and its transcription, it should work. As far as I recall, CTC should be able to deal with some differences; it does not know “how many words are spoken”.
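One hard constraint CTC does impose: the number of feature frames must be at least the label length (strictly more when there are repeated characters). A sample whose transcript is too long for its audio will fail. Here is a rough feasibility check, assuming a 20 ms feature step (DeepSpeech's default `feature_win_step`, but worth verifying for your version); the label-length bound is simplified and ignores repeats:

```python
import wave

def ctc_feasible(wav_path, transcript, win_step_ms=20):
    """Rough check that a clip has enough feature frames for its transcript.

    Assumes one character per CTC label and a fixed feature step;
    real CTC needs extra frames for repeated characters.
    """
    with wave.open(wav_path) as w:
        duration_s = w.getnframes() / w.getframerate()
    n_frames = int(duration_s * 1000 / win_step_ms)
    return n_frames >= len(transcript)
```

Running this over the merged two-sentence samples would show whether any of them hit that limit; clips near the boundary are also the ones most likely to fail once repeats are accounted for.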