Just as the browser advertises with icons for Amazon, there could be a space for the project. What do you think?
This has been done in the past and was quite effective. We should do it again.
I would love to see this done in a more targeted way. We know that we don’t have a lot of samples from particular geographies where some languages are spoken - for example, we don’t have a lot of samples from people who speak English and identify as having Nordic accents, or who speak English with various African accents. Seeking more specific data would be a very good use of this as a promotional tool, without over-saturating people with it (to the point where they begin to ignore it).
But why place ads only in the languages that have the most sentences and not the ones with the least? It doesn’t make sense.
Because they wanted to avoid a sentence being recorded more than once. The engineers said that this would decrease the quality of the dataset for machine learning.
But it’s a cycle… Would it not be possible to avoid this by requiring the person to add a new phrase when speaking?
@stergro is correct: if we have lots of people speaking the same sentence, this reduces the quality of the voice data for machine learning practitioners training models. For speech recognition, we need more diverse data.
So, a different way to approach this could be to use the browser to elicit sentences rather than voice samples. However, donating sentences is more onerous for the data contributor.
One way to use this as a promotional tool could be to prompt the user to finish a sentence, for example:
When I went shopping today, I bought ...
I love the sound of ...
I feel so ... when ...
You could have a lot of variation in the prompts, to encourage diverse sentences.
There are so many ways to fix this; stopping advertising in languages with few contributions does not seem logical.
In an ideal world: one speaker, one sentence, with infinite sentences and speakers… But the main problem is the scarcity of text corpora, because of the CC0 requirement. In my opinion, this is where community work is most necessary.
Otherwise, AFAIK, a sentence being recorded by 2 or 3 different people (differing in accent, gender, age, …) may not be a bad thing, provided that the sentence corpus covers enough of the vocabulary (and thus the phonemes). The bad case is something like 5000 sentences recorded by 15 people each, which we do encounter in datasets.
Here is an example calculation for the requirements. In my language, I aim to not exceed 2 recs/sentence - for now.
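To make this kind of arithmetic concrete, here is a rough sketch in Python. It is only my own illustration, not the calculation referred to above: the function names are made up, and the cap of 2 recs/sentence is just the personal target mentioned in this post. It uses the "5000 sentences recorded by 15 people each" case from earlier.

```python
import math


def recordings_per_sentence(total_recordings: int, unique_sentences: int) -> float:
    """Average number of recordings each sentence has received."""
    if unique_sentences <= 0:
        raise ValueError("unique_sentences must be positive")
    return total_recordings / unique_sentences


def sentences_needed(total_recordings: int, max_recs_per_sentence: float) -> int:
    """Smallest corpus size that keeps the average at or below the cap."""
    if max_recs_per_sentence <= 0:
        raise ValueError("max_recs_per_sentence must be positive")
    return math.ceil(total_recordings / max_recs_per_sentence)


# Example from the discussion: 5000 sentences, each recorded by 15 people.
total = 5000 * 15
print(recordings_per_sentence(total, 5000))  # 15.0
print(sentences_needed(total, 2))            # 37500
```

So the same number of recordings spread at 2 recs/sentence would need a corpus of 37500 sentences instead of 5000, which is why the text corpus is the bottleneck.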
There are many ways to generate a text corpus, but all require a lot of work. The sentences MUST be correct. One problem one needs to solve is the “domain” problem. Every one of us can create sentences from our daily life, which would result in a base that can, for example, understand “aspirin” but not “acetylsalicylic acid” (which is aspirin). IMHO, this is not hard jargon that should be left out. All the knowledge we gain in pre-university education should be understood by a system we build. For further levels, fine-tuning with a specially prepared domain-specific corpus would be needed.
So, if the text corpus is small and recs/sentence is high (say >5), I concur that promoting recording will be pointless, even harmful. But text-corpus generation could be promoted.
This is usually done through campaigns prepared by language community leads, but only a small percentage of languages have such active communities.
Another problem here is that these communities are on their own. They do not have enough time/resources for wider reach beyond family, friends, and colleagues - mainly highly educated large-city people. These are very good for creating the text corpus and for validation work (after some recording), but we need voice diversity from rural areas (local tongues) and from people living in other countries (regional dialects).
Such a global campaign would help these goals tremendously. But the aim should not be “to just record”; it should be to come to the project and join/form communities, so more knowledgeable people (read “those who know the dataset coverage and issues in it”) can direct the newcomers to the correct channel.
In short, I’d say: Do it for every language, but in a more civil-societal / community-based manner.
This could be solved by changing the homepage of the ad according to the need of the language, with a message explaining the priority at the moment.
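As a very rough sketch of what "according to the need of the language" could mean, something like the following Python could pick the message. Everything here is hypothetical: the function name, the labels, and the message texts are invented for illustration, and the only number taken from this thread is the ">5 recs/sentence" threshold mentioned above.

```python
# Hypothetical message texts; a real campaign would localize these.
MESSAGES = {
    "add_sentences": "This language needs more sentences. Can you write some?",
    "record_voice": "This language needs more voices. Can you record a few sentences?",
}


def pick_priority(unique_sentences: int, total_recordings: int,
                  high_ratio: float = 5.0) -> str:
    """Return a label for what the ad should ask for in this language.

    Assumption: if the average recordings per sentence is already above
    high_ratio (or there are no sentences at all), more recordings would
    mostly repeat existing sentences, so ask for sentences instead.
    """
    if unique_sentences <= 0:
        return "add_sentences"
    ratio = total_recordings / unique_sentences
    if ratio > high_ratio:
        return "add_sentences"
    return "record_voice"


# 5000 sentences with 75000 recordings (15 recs/sentence): ask for sentences.
print(MESSAGES[pick_priority(5000, 75000)])
# 10000 sentences with 12000 recordings (1.2 recs/sentence): ask for voices.
print(MESSAGES[pick_priority(10000, 12000)])
```

A real version would probably also consider validation backlog and accent/dialect coverage, but those numbers are not something discussed in this thread, so they are left out.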