It may be worth discussing whether Common Voice could use those as a source for pronunciation “sentences”. That could help with transcribing addresses and other points of interest (POIs).
To be honest, I’m not sure whether the license is compatible with Common Voice’s public domain policy. Maybe the labels themselves are below the threshold where copyright applies, but IANAL.
I think it makes sense to include only complete sentences, not bare names. I was thinking of navigation use cases, where adding place names to the training set might boost that specific use case.
Of course, if DeepSpeech generalizes well enough, including specific words would not be needed.
Maybe pull the Wikipedia database, look for geonames records that have a Wikipedia page, extract sentences from those pages, and keep only the ones that contain the page title?
We cannot use Wikipedia data directly, as it is not CC-0. There is a legal agreement between Mozilla and Wikipedia, along with some scripts and a process for this purpose, which takes 3 sentences per page.
Please search for related posts; I’m on mobile right now.
Yeah, I meant that as a starting point for the script. If we whitelist those pages and add a “sentences must include $TEXT” rule, where $TEXT is the place name, then it’s the same process under the same agreement, right?
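To make the “sentences must include $TEXT” rule concrete, here is a rough sketch of the filter. It is not the existing extraction script: the function names, the naive sentence splitter, and the case-insensitive match are all my own assumptions, and the limit of 3 simply mirrors the “3 sentences per page” mentioned above.

```python
import re

# Assumption: mirrors the "3 sentences per page" limit described in the thread.
MAX_PER_PAGE = 3


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentences_with_title(page_text, title, limit=MAX_PER_PAGE):
    """Keep at most `limit` sentences that mention the page title ($TEXT)."""
    needle = title.casefold()
    hits = [s for s in split_sentences(page_text) if needle in s.casefold()]
    return hits[:limit]


if __name__ == "__main__":
    # Made-up page text, for illustration only.
    page = "Example Road is long. It is old. I live near Example Road."
    for sentence in sentences_with_title(page, "Example Road"):
        print(sentence)
```

A real run would need a proper sentence tokenizer for each language, since splitting on punctuation breaks on abbreviations and does not work for scripts that use other sentence delimiters.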
Wikidata is CC0, so perhaps we can extract all instances of road (such as Ketagalan Boulevard) into a CC0 database.
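As a rough sketch of what that extraction could look like, the snippet below builds a SPARQL query for the public Wikidata Query Service and reads labels out of the standard SPARQL JSON result format. The item ID used for “road” (`Q34442`) is my assumption and should be checked on Wikidata first; the helper names are hypothetical, and nothing here actually sends a request.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service
ROAD_QID = "Q34442"  # assumed item ID for "road"; verify before relying on it


def build_query(class_qid=ROAD_QID, lang="en", limit=100):
    """SPARQL for items that are instances of the class (or of a subclass)."""
    return (
        "SELECT ?item ?label WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} ;\n"
        "        rdfs:label ?label .\n"
        f'  FILTER(LANG(?label) = "{lang}")\n'
        "}\n"
        f"LIMIT {limit}"
    )


def build_url(query):
    """URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def labels_from_result(result):
    """Pull the label strings out of a SPARQL JSON result document."""
    return [row["label"]["value"] for row in result["results"]["bindings"]]


if __name__ == "__main__":
    print(build_query(lang="zh", limit=10))
```

The subclass path can be expensive on the public endpoint, so a full export would probably need paging with a small `LIMIT` or a pass over the Wikidata dumps instead.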
I believe this data can benefit Common Voice: some characters in road names rarely appear in everyday dialogue but are essential when building a general-purpose STT model.
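One way to check that claim would be to count which characters in the road names are missing or rare in the sentences we already have. A minimal sketch, with hypothetical names and a made-up threshold parameter:

```python
from collections import Counter


def rare_characters(corpus_sentences, road_names, max_count=0):
    """Characters used in road names that occur at most `max_count` times
    in the existing sentence corpus (whitespace is ignored)."""
    counts = Counter(
        ch for sentence in corpus_sentences for ch in sentence if not ch.isspace()
    )
    needed = {ch for name in road_names for ch in name if not ch.isspace()}
    return sorted(ch for ch in needed if counts[ch] <= max_count)


if __name__ == "__main__":
    # Toy input, for illustration only.
    print(rare_characters(["abc abc"], ["abd", "xe"]))
```

With `max_count=0` this lists characters the corpus does not cover at all; raising it surfaces characters that exist but are underrepresented.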