Script to get word usage stats

I created a script to get word usage info from the sentence files in the voice-web repository:

It’s an easy way to spot misspelled words and words that need more coverage. You can also link it to a dictionary and it will tell you which dictionary words don’t exist in the corpus.
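For anyone curious what this kind of check looks like, here is a minimal Python sketch of the idea. It is not the actual script; the function names (`word_counts`, `rare_words`, `missing_from_corpus`) and the simple tokenizer are assumptions made for illustration. It counts how often each word appears, lists words at or below a count threshold (typos tend to show up there), and reports dictionary words that never occur in the corpus.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters and apostrophes.
# A real corpus would likely need a locale-aware tokenizer.
WORD_RE = re.compile(r"[a-z']+")


def word_counts(lines):
    """Count occurrences of each lowercased word across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


def rare_words(counts, max_count=1):
    """Words seen at most max_count times: candidates for typos or low coverage."""
    return sorted(w for w, n in counts.items() if n <= max_count)


def missing_from_corpus(counts, dictionary_words):
    """Dictionary words that never appear in the corpus."""
    return sorted({w.lower() for w in dictionary_words} - set(counts))


# Hypothetical in-memory sample; a real run would pass an open sentence file instead.
sentences = ["The cat sat on the mat.", "The dog sat on teh mat."]
counts = word_counts(sentences)
print(counts.most_common(3))
print(rare_words(counts))
print(missing_from_corpus(counts, {"cat", "bird"}))
```

Since `word_counts` accepts any iterable of lines, an open file handle (for example `open(path, encoding="utf-8")`) can be passed directly in place of the sample list.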

This is interesting; we might want something similar on our site in the future as “advanced stats”.

Thanks for this! :slight_smile:

/cc @gregor @mkohler

This is really useful! Thank you.

Do you know whether the folder https://github.com/mozilla/voice-web/tree/master/server/data/en really includes all of the existing sentences? What about, for example, all those old sentences from The Alchemist? They don’t seem to be there, or have I missed them?

The Alchemist, Flickr image descriptions, and a bunch of others are no longer part of the repo, but you can find them if you go back to around December 2017.

I designed it to read just one file because I’m assuming that everything will eventually be passed through Sentence Collector and those extra files will no longer be needed.

The repo is now called cvtools and contains a new script to validate sentences.