Use OpenStreetMap labels as sentences to pronounce

OpenStreetMap offers many labels in different languages for points of interest, addresses, and more.

https://wiki.openstreetmap.org/wiki/Names

https://wiki.openstreetmap.org/wiki/Multilingual_names

Maybe it’s worth discussing whether Common Voice could source those as pronunciation “sentences”. It might improve transcription of addresses and other POIs.

To be honest, I’m not sure the license is compatible with Common Voice’s public domain policy. Maybe the labels themselves don’t reach the level where copyright applies, but IANAL.

Any thoughts welcome.


Probably not compatible with CC0, but this data set is:

https://geonames.nga.mil/gns/html/index.html

Yeah, this is my concern, too. I’m not sure whether the labels alone meet the threshold of originality (Schöpfungshöhe in German law).

Maybe, but there are database rights in Europe, so unlike in the US, collections of unoriginal facts can be protected there.


I’ve been informed multiple times in the past that synthetic sentences and non-conversational single words are not good for the corpus.

I also need to feed common proper names and city/location names into the corpus, though…


I think it makes sense to only include complete sentences, not just names. I was thinking of navigation use cases, where adding place names to the training set might help boost that specific use case.
Of course, if DeepSpeech generalizes well enough, including specific words wouldn’t be needed.

Maybe pull the Wikipedia database dump, look for GeoNames records that have a Wikipedia page, extract sentences from those pages, and keep only the ones that contain the page title?
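Something like this, maybe. A minimal sketch of the filtering step, assuming we already have a page’s plain text and its title as the place name; the function name, the naive regex-based sentence split, and the three-sentence cap are all my own illustrative choices, not an existing Common Voice script:

```python
import re

def extract_candidate_sentences(page_text, place_name, max_sentences=3):
    """Split a page's plain text into sentences (naive split on
    sentence-ending punctuation) and keep only those that mention
    the place name, capped at max_sentences per page.

    This is just a sketch: a real pipeline would need proper sentence
    segmentation and Common Voice's length/quality filters on top.
    """
    sentences = re.split(r"(?<=[.!?])\s+", page_text.strip())
    matches = [s for s in sentences if place_name.lower() in s.lower()]
    return matches[:max_sentences]

text = ("Ketagalan Boulevard is a road in Taipei. "
        "It was renamed in 1996. "
        "This sentence mentions nothing relevant.")
print(extract_candidate_sentences(text, "Ketagalan Boulevard"))
```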

We cannot use Wikipedia data directly, as Wikipedia data is not CC0. There is a legal agreement between Mozilla and Wikipedia, as well as scripts and a process for this purpose, which takes three sentences per page.
Please search for the related posts; I’m on mobile right now.


Yeah, I meant as a starting point for the script. If we whitelist those pages and add a “sentence must include $TEXT” rule, where $TEXT is the place name, then it’s the same process under the same agreement, right?

Wikidata is CC0, so perhaps we can extract all instances of road (such as Ketagalan Boulevard) into a CC0 database.

I believe this data can benefit Common Voice: some characters in road names appear rarely in daily dialogue but are essential when building a general-purpose STT model.

A local contributor helped write a query for roads in Wikidata. Although it’s for a local region, you may take it as a reference: https://w.wiki/4E$t
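For a region-independent version, a query along these lines might work. This is a sketch only: it builds a SPARQL query for items that are instances of road (or a subclass), using the standard Wikidata IDs for “instance of” (P31), “subclass of” (P279), and “road” (Q34442); the function names, label language, and result limit are illustrative assumptions, not the contributor’s actual query.

```python
import urllib.parse

# Public Wikidata SPARQL endpoint (results as JSON via ?format=json).
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def build_road_label_query(language="en", limit=100):
    """Build a SPARQL query for items that are a road (Q34442) or an
    instance of any of its subclasses, returning labels in `language`."""
    return f"""
SELECT ?road ?roadLabel WHERE {{
  ?road wdt:P31/wdt:P279* wd:Q34442 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}
}}
LIMIT {limit}
""".strip()

def query_url(query):
    """URL-encode the query for a GET request against the endpoint."""
    return WIKIDATA_ENDPOINT + "?format=json&query=" + urllib.parse.quote(query)

print(query_url(build_road_label_query("zh-tw", limit=50)))
```

The resulting label list would still need the licensing and “complete sentences only” caveats discussed above before it could feed the corpus.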