Use OpenStreetMap labels as sentences to pronounce

OpenStreetMap offers many labels in different languages for points of interest, addresses, and more.

https://wiki.openstreetmap.org/wiki/Names

https://wiki.openstreetmap.org/wiki/Multilingual_names

Maybe it’s worth discussing whether Common Voice could source those as pronunciation “sentences”. It might improve transcription of addresses and other POIs.

To be honest, I’m not sure the license is compatible with Common Voice’s public domain policy. Maybe the labels themselves don’t reach the level where copyright applies, but IANAL.

Any thoughts welcome.


Probably not compatible with CC0, but this data set is:

https://geonames.nga.mil/gns/html/index.html

Yeah, this is my concern, too. I’m not sure whether the labels alone meet the threshold of originality (Schöpfungshöhe in German law).

Maybe, but there are database rights in Europe, so unlike in the US, collections of unoriginal facts can be protected there.


I’ve been informed multiple times in the past that synthetic sentences and non-conversational single words are not good for the corpus.

I also need to feed common proper names and city/location names into the corpus, though…


I think it makes sense to only include complete sentences, not just names. I was thinking of navigation use cases, where adding place names to the training set might help boost that specific use case.
Of course, if DeepSpeech generalizes well enough, including specific words wouldn’t be needed.

Maybe pull the Wikipedia database dump, look for GeoNames records that have a Wikipedia page, extract sentences from those pages, and keep only the ones that contain the page title?
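Something like this, maybe. A minimal sketch of the filtering step, assuming we already have a page’s plain text and its title as the place name; the function name, the naive regex-based sentence split, and the three-sentence cap are all my own illustrative choices, not an existing Common Voice script:

```python
import re

def extract_candidate_sentences(page_text, place_name, max_sentences=3):
    """Split a page's plain text into sentences (naive split on
    sentence-ending punctuation) and keep only those that mention
    the place name, capped at max_sentences per page.

    This is just a sketch: a real pipeline would need proper sentence
    segmentation and Common Voice's length/quality filters on top.
    """
    sentences = re.split(r"(?<=[.!?])\s+", page_text.strip())
    matches = [s for s in sentences if place_name.lower() in s.lower()]
    return matches[:max_sentences]

text = ("Ketagalan Boulevard is a road in Taipei. "
        "It was renamed in 1996. "
        "This sentence mentions nothing relevant.")
print(extract_candidate_sentences(text, "Ketagalan Boulevard"))
```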

We cannot use Wikipedia data directly, as Wikipedia data is not CC0. There is a legal agreement between Mozilla and Wikipedia, as well as scripts and a process for this purpose, which takes three sentences per page.
Please search for the related posts; I’m on mobile right now.


Yeah, I meant as a starting point for the script. If we whitelist those pages and add a “sentence must include $TEXT” rule, where $TEXT is the place name, then it’s the same process under the same agreement, right?

Wikidata is CC0, so perhaps we can extract all instances of road (such as Ketagalan Boulevard) into a CC0 database.

I believe this data can benefit Common Voice: some characters in road names appear rarely in daily dialogue but are essential when building a general-purpose STT model.

A local contributor helped write a query for roads in Wikidata. Although it’s for a local region, you may take it as a reference: https://w.wiki/4E$t
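For a region-independent version, a query along these lines might work. This is a sketch only: it builds a SPARQL query for items that are instances of road (or a subclass), using the standard Wikidata IDs for “instance of” (P31), “subclass of” (P279), and “road” (Q34442); the function names, label language, and result limit are illustrative assumptions, not the contributor’s actual query.

```python
import urllib.parse

# Public Wikidata SPARQL endpoint (results as JSON via ?format=json).
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def build_road_label_query(language="en", limit=100):
    """Build a SPARQL query for items that are a road (Q34442) or an
    instance of any of its subclasses, returning labels in `language`."""
    return f"""
SELECT ?road ?roadLabel WHERE {{
  ?road wdt:P31/wdt:P279* wd:Q34442 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}
}}
LIMIT {limit}
""".strip()

def query_url(query):
    """URL-encode the query for a GET request against the endpoint."""
    return WIKIDATA_ENDPOINT + "?format=json&query=" + urllib.parse.quote(query)

print(query_url(build_road_label_query("zh-tw", limit=50)))
```

The resulting label list would still need the licensing and “complete sentences only” caveats discussed above before it could feed the corpus.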