Fewer illegible outputs by decreasing the size of the context window?

Hi,

First of all, thank you for making such an amazing piece of work open source. I have found no other open source implementation that is this comprehensive and detailed. Once again, THANK YOU :slight_smile:

I speak quite fast, and while running tests with the pre-trained model I noticed that it occasionally fails to separate words spoken in quick succession: it concatenates the phonemes of different words into one big illegible word.

For example, this is the original sentence:

so did you get a chance to look into it and do you have any initial thoughts about it? I’m open to suggestions here.

The transcript comes out like this:

so did you get a chance to look into it and you have an initial thoughts about him amwatfwornsussetionshere

Without invoking the language model, it looks like this:

so did you get a chance to look into it and you have an anutal thoughts about him amwat worn sussetions here
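
To make it easier to see where the words get fused, here is a throwaway word-level alignment of the original sentence against the no-LM transcript (just a difflib sketch; the word lists are copied from the examples above, nothing model-specific):

```python
import difflib

ref = ("so did you get a chance to look into it and do you have any "
       "initial thoughts about it i'm open to suggestions here").split()
hyp = ("so did you get a chance to look into it and you have an anutal "
       "thoughts about him amwat worn sussetions here").split()

# Walk the word-level alignment and print only the spans that differ,
# so the fused / dropped words stand out.
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, ref, hyp).get_opcodes():
    if tag != "equal":
        print(f"{tag:>7}: {' '.join(ref[i1:i2])!r} -> {' '.join(hyp[j1:j2])!r}")
```

Not needed to answer the question, it just makes the fused spans explicit.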

This happens rather frequently. So I was wondering whether reducing the context window would help alleviate such errors, if the issue is that the model cannot tell where one word ends and the next begins.

Or is this related to the language model?

I would greatly appreciate it if someone could shed some light on this before I invest significant time and effort into retraining the model with a new context window.