Application for building corpus?

Bernie · March 19, 2020, 10:58pm

Hi,
Before I start from scratch, does anybody know of an application for building a model training corpus? E.g., present a sentence, prompt the user to record it, and save the file (possibly in a DB). The app needs to work offline. I’m trying to build a highly domain-specific corpus.
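
For illustration, the bookkeeping side of such a tool (queue sentences, hand out the next unrecorded one, save the audio and remember where it went) can be sketched offline with nothing but the Python standard library. This is only a rough sketch under stated assumptions: the table layout, the function names, the file naming scheme, and the 16 kHz 16-bit mono format are all made up for the example, and the actual microphone capture is left out (it would need a third-party audio library), so `pcm_bytes` is assumed to arrive from elsewhere.

```python
import sqlite3
import wave
from pathlib import Path


def open_corpus(db_path):
    """Open (or create) the local SQLite database that tracks prompts."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prompts ("
        " id INTEGER PRIMARY KEY,"
        " sentence TEXT UNIQUE NOT NULL,"
        " wav_path TEXT)"  # wav_path stays NULL until the sentence is recorded
    )
    return conn


def add_sentences(conn, sentences):
    """Queue sentences for recording; blanks and duplicates are skipped."""
    conn.executemany(
        "INSERT OR IGNORE INTO prompts (sentence) VALUES (?)",
        [(s.strip(),) for s in sentences if s.strip()],
    )
    conn.commit()


def next_prompt(conn):
    """Return (id, sentence) for the next unrecorded prompt, or None."""
    return conn.execute(
        "SELECT id, sentence FROM prompts "
        "WHERE wav_path IS NULL ORDER BY id LIMIT 1"
    ).fetchone()


def save_recording(conn, prompt_id, pcm_bytes, out_dir, sample_rate=16000):
    """Write 16-bit mono PCM to a WAV file and mark the prompt as recorded.

    The 16 kHz default is an assumption for this sketch, not a requirement.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    wav_path = out_dir / f"prompt_{prompt_id:06d}.wav"
    with wave.open(str(wav_path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    conn.execute(
        "UPDATE prompts SET wav_path = ? WHERE id = ?",
        (str(wav_path), prompt_id),
    )
    conn.commit()
    return wav_path
```

A recording loop would call `next_prompt`, show the sentence, capture audio, and pass the raw samples to `save_recording` until `next_prompt` returns `None`. Keeping the state in a single SQLite file means nothing here needs a network connection.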

dkreutz · March 20, 2020, 2:37am

I don’t know if it fits all your requirements, but you may want to have a look at this: https://github.com/MycroftAI/mimic-recording-studio

Bernie · March 20, 2020, 3:30am

Interesting, I’ll have a closer look. There’s going to be a steep learning curve for modifying the front end, though; my Node skills are nonexistent.

Thanks for the link.

lissyx · March 20, 2020, 10:12am

That 100% fits the description of Common Voice, which can be installed locally, although that is not very well supported or documented at the moment. If you feel like you can do it and help improve the project, I’d suggest doing that.

othiele · March 20, 2020, 10:31am

If you are by yourself or with a friend, go with Mimic Recording Studio. If you are setting it up for an organization, go with Common Voice, as you will have SSO and S3 for performance built in, but it will take a lot longer to set up. Decoupling SSO and S3 is a mess.

othiele · March 20, 2020, 10:32am

As for the corpus, think about what your future input will be and look for NLP datasets in that area. European Parliament proceedings or Wikipedia are good for more formal language; we use licensed TV subtitles for more conversational language.

Bernie · March 21, 2020, 8:15am

I can’t use S3, as some speakers will definitely be offline. I have no idea how much work it would be to break the S3 dependency. (Is https://github.com/Common-Voice/ the right place to look for the Common Voice code? The repos there look like supporting utilities.)
The idea is to get speakers from different locations to submit their WAV files to a central place where the acoustic/language models will be trained/built, so some kind of DB or at least file management needs to be in place.
I also found this, which was news to me: https://developer.mozilla.org/en-US/docs/Web/API/MediaStream_Recording_API/Using_the_MediaStream_Recording_API
Thanks everyone for the replies.
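
As a rough illustration of the "at least file management" option for the central place mentioned above, the sketch below merges one speaker's folder of WAV files into a central directory and records each file in a CSV manifest. Everything in it is an assumption made for the example: the function names, the `manifest.csv` layout, and the choice to name stored files by their SHA-256 hash. How the folders travel from offline speakers to the central machine (USB stick, later upload, and so on) is outside the sketch.

```python
import csv
import hashlib
import shutil
from pathlib import Path


def file_sha256(path):
    """Hash a file in chunks so large recordings need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_submission(submission_dir, central_dir, speaker_id):
    """Copy new WAV files from one speaker's folder into the central store.

    Returns the number of files added. Files whose content is already
    present (same SHA-256) are skipped, so re-submitting is harmless.
    """
    central_dir = Path(central_dir)
    central_dir.mkdir(parents=True, exist_ok=True)
    manifest = central_dir / "manifest.csv"  # hypothetical manifest name

    # Load hashes that are already in the central store.
    known = set()
    if manifest.exists():
        with open(manifest, newline="", encoding="utf-8") as f:
            known = {row["sha256"] for row in csv.DictReader(f)}

    added = 0
    write_header = not manifest.exists()
    fields = ["sha256", "speaker_id", "original_name"]
    with open(manifest, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        if write_header:
            writer.writeheader()
        for wav in sorted(Path(submission_dir).glob("*.wav")):
            digest = file_sha256(wav)
            if digest in known:
                continue  # identical audio was already ingested
            shutil.copy2(wav, central_dir / f"{digest}.wav")
            writer.writerow(
                {"sha256": digest, "speaker_id": speaker_id, "original_name": wav.name}
            )
            known.add(digest)
            added += 1
    return added
```

Naming files by content hash avoids collisions when different speakers use the same file names. A real setup would also need to carry the transcript for each recording, which this sketch does not track.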

othiele · March 21, 2020, 9:43am

The Common Voice repo is here.

It’s a classic trade-off: you’ll need to decide whether to collect data manually with Mimic Recording Studio or invest tens of hours adapting Common Voice to your needs.
