Application for building corpus?

Hi,
Before I start from scratch: does anybody know of an application for building a model training corpus? E.g. it presents a sentence, prompts the user to record it, and saves the file (possibly in a DB). The app needs to work offline. I'm trying to build a highly domain-specific corpus.
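
To make the requirement concrete, here is a minimal sketch of the bookkeeping side in Python, using only the standard library so it works offline. The table layout and the function names (`open_corpus`, `add_prompts`, `next_prompt`, `save_recording`) are made up for illustration. Audio capture itself is left out; the sketch assumes the recorder hands over raw 16-bit mono PCM bytes.

```python
import sqlite3
import wave
from pathlib import Path


def open_corpus(db_path):
    """Open (or create) the SQLite file that tracks prompts and recordings."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prompts ("
        " id INTEGER PRIMARY KEY,"
        " sentence TEXT NOT NULL UNIQUE,"
        " wav_path TEXT)"  # NULL until the sentence has been recorded
    )
    return conn


def add_prompts(conn, sentences):
    """Insert sentences, silently skipping ones that are already present."""
    conn.executemany(
        "INSERT OR IGNORE INTO prompts (sentence) VALUES (?)",
        [(s,) for s in sentences],
    )
    conn.commit()


def next_prompt(conn):
    """Return (id, sentence) for the first unrecorded prompt, or None."""
    return conn.execute(
        "SELECT id, sentence FROM prompts WHERE wav_path IS NULL ORDER BY id LIMIT 1"
    ).fetchone()


def save_recording(conn, prompt_id, pcm_bytes, out_dir, sample_rate=16000):
    """Write 16-bit mono PCM to a WAV file and mark the prompt as recorded."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    wav_path = out_dir / f"prompt_{prompt_id:06d}.wav"
    with wave.open(str(wav_path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    conn.execute(
        "UPDATE prompts SET wav_path = ? WHERE id = ?", (str(wav_path), prompt_id)
    )
    conn.commit()
    return wav_path
```

The intended loop would be: call `next_prompt`, show the sentence, record, then call `save_recording` with the captured bytes, and repeat until `next_prompt` returns `None`.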

I don’t know if it fits all your requirements, but you may want to have a look at this: https://github.com/MycroftAI/mimic-recording-studio

Interesting, I will have a closer look.
There’s going to be a steep learning curve for modifying the front end, as my Node skills are nonexistent.

thanks for the link

That 100% fits the description of Common Voice, which can be installed locally, although a local install is not very well supported or documented at the moment. If you feel like you can do it and help improve the project, I’d suggest doing that.

If you are by yourself or with a friend, go with Mimic Recording Studio. If you are setting it up for an organization, go with Common Voice, as you will have SSO and S3 (for performance) built in, but it will take a lot longer to set up. Decoupling SSO and S3 is a mess :slight_smile:

As for the corpus, think about what your future input will be and look for NLP datasets in that area. European Parliament proceedings or Wikipedia are good for more formal language; we use licensed TV subtitles for more conversational language.

I can’t use S3, as some speakers will definitely be offline. I have no idea how much work it would be to break the S3 dependency. (Is https://github.com/Common-Voice/ the right place to look for the Common Voice code? The repos there look like supporting utilities.)
The idea is to get speakers from different locations to submit their WAV files to a central place where the acoustic/language models will be trained/built, so some kind of DB or at least file management needs to be in place.
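
For the central side, a rough sketch of the file management in Python (standard library only). It assumes speakers hand in a folder of WAV files by whatever means is available once they are back online or on site; the `ingest` function, the table layout and the directory names are hypothetical. Files are stored under their SHA-256 hash, so re-submitting the same folder does not create duplicates.

```python
import hashlib
import shutil
import sqlite3
import wave
from pathlib import Path


def ingest(conn, speaker, incoming_dir, store_dir):
    """Copy readable WAV files from incoming_dir into store_dir and index them.

    Returns the number of newly added files.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recordings ("
        " sha256 TEXT PRIMARY KEY,"
        " speaker TEXT NOT NULL,"
        " original_name TEXT NOT NULL,"
        " sample_rate INTEGER NOT NULL,"
        " duration_s REAL NOT NULL)"
    )
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    added = 0
    for src in sorted(Path(incoming_dir).glob("*.wav")):
        try:
            with wave.open(str(src), "rb") as wf:
                rate = wf.getframerate()
                duration = wf.getnframes() / rate
        except (wave.Error, EOFError):
            continue  # not a readable PCM WAV file, skip it
        digest = hashlib.sha256(src.read_bytes()).hexdigest()
        cur = conn.execute(
            "INSERT OR IGNORE INTO recordings VALUES (?, ?, ?, ?, ?)",
            (digest, speaker, src.name, rate, duration),
        )
        if cur.rowcount == 1:  # 0 means this exact file was already stored
            shutil.copyfile(src, store_dir / f"{digest}.wav")
            added += 1
    conn.commit()
    return added
```

A plain SQLite file plus a directory of WAVs is probably enough to start with; the same table could later move to a server database if several people need to ingest at once.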
Also found this https://developer.mozilla.org/en-US/docs/Web/API/MediaStream_Recording_API/Using_the_MediaStream_Recording_API which was news to me.
Thanks everyone for the replies.

Common Voice Repo is here.

A classic trade-off: you’ll need to decide whether to collect data manually with Mimic or invest tens of hours in adapting Common Voice to your needs :slight_smile: