I also heard Canadian accents, and we need more Canadian sources too, because there are words that are only used in Canada and not in Europe. Are their public debates available online (at least the ones from Québec)?
- The campaign site does not support more languages, only English;
- The campaign is temporary, not continuous, and it has no gamification to motivate contributors to keep making long-term contributions;
- There is no way to know the progress of language implementation. We have contributed around 2,000 words in Brazilian Portuguese and we do not know when Common Voice will start accepting voices in Brazilian Portuguese. Some contributors told me they feel their work was lost and not used by Mozilla, and I’m sure they will not help in the next campaign.
This is one element we discussed with @mikehenrty in the past, and I agree it needs to be dealt with properly. The only thing is that I have not had time to explore it.
Hi, a lot of propositions have been made and some of them are great.
I’ll just add that I’d like a system where we can either validate a written sentence before it is added to the pool, or report sentences when we have to speak them, because sometimes we find very strange ones (incorrect vocabulary, wrong language, typos, etc.).
Also, for French, a lot of sentences come from automatically parsed datasets (and I thank the French community for its efforts), and some of them are too long to fit in the allowed recording time or are unreadable.
I guess it is hard to check every sentence, but if we could at least report strange sentences (and maybe remove them after X reports?) from Common Voice, that would be great.
Also, as it has been said in the first post, it is really hard to find a place to contribute sentences so creating a place on Common Voice to contribute would be a huge improvement to me.
Thanks for your work!
For French, we filter sentences to be within limits of 3 to 15 words.
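Purely as an illustration of the filter described above (the 3–15 word bounds are from this thread; the function name and splitting logic are my own sketch, not the community’s actual script):

```python
# Hypothetical sketch of a word-count filter for candidate sentences.
# Bounds come from the French community's stated 3-to-15-word rule.
MIN_WORDS, MAX_WORDS = 3, 15

def within_limits(sentence: str) -> bool:
    """Return True if the sentence's whitespace word count is in bounds."""
    return MIN_WORDS <= len(sentence.split()) <= MAX_WORDS

sentences = [
    "Trop court.",                                   # 2 words: rejected
    "Le chat dort paisiblement sur le canapé.",      # 7 words: kept
    "Un " + "très " * 20 + "long exemple.",          # 23 words: rejected
]
filtered = [s for s in sentences if within_limits(s)]
```

A real filter would also need to handle punctuation-only tokens and hyphenated words, which a plain `split()` does not.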
Yes, but 15 French words, including 4 or 5 with 4+ syllables, take a lot of time to read. I saw some of these, and they are awfully long.
Agree with Luc, having a button to report sentences that we think are not relevant would be nice.
As far as I remember, the website allows at least 10 seconds of recording, and I don’t think it’s a hard limit, but I don’t know the details. Any contribution to improve the building of the dataset is welcome.
And a button to report bogus sentences would help improve quality.
I’m imagining no constraints as it’s said.
I’d like to suggest tagging sentences containing vulgar or sexual words. We can’t avoid such vocabulary entirely. So when collecting, we could tag them as sexual, and when presenting them for recording or listening, a profile option would let users choose whether they want to listen to and record such sentences.
Thanks everyone for your feedback, this is really helpful to define the requirements to improve Common Voice sentence collection, please keep it coming.
A quick note: I will be taking a few weeks away, so expect less iterations from staff until the end of August, but please, keep the feedback coming to this topic in the meantime, I will be checking it as soon as I’m back to inform a recommendation.
CC0 stuff is hard; everyone uses licenses ‘liberally’.
- Having a way to tag alternate sentences that mean the exact same thing.
- In Norway we have so many ‘official’ ways to say things; some people only use one of them, some use the other, and very few people use both.
- For example, English “to be” can be “å verta” or one of several other forms in Norwegian.
- We have a ton of this.
- Having a built-in system for translation of English sentences.
- I’ve done this manually for now; it’s boring, and I’m also a bit concerned that some interesting data (the connection between an English sentence and its Norwegian translation) is getting lost.
- It also needs to have several output sentences for one input sentence.
- When taking in new sentences, check all new words so we can verify they use the correct grammar.
- Sadly we even have “choose-your-own-adventure” grammar in Norwegian.
- You have to be internally consistent, but you can choose to write “to be” as either “å vera” or “å vere” (yes, in addition to the forms above).
- We would only want to have one of those forms in the corpus, so that the speech recognition only comes out in one consistent form.
- That’s a hard problem, and I think Norwegian has it worse than most, but anyone would benefit from rules, stats and information on importing (or in review).
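To make the idea of a consistency check at import time concrete, here is a small sketch (my own, not Common Voice code): pick one canonical spelling per variant group and flag sentences that use any of the others. The variant group and canonical choice below are illustrative assumptions.

```python
# Sketch of a form-consistency check for corpora with several official
# spellings of the same word (e.g. Nynorsk "vera" vs "vere").
VARIANT_GROUPS = [
    {"vera", "vere"},  # only one of these should appear in the corpus
]

def inconsistent_words(sentence: str, canonical: dict) -> list:
    """Return words that belong to a variant group but are not canonical."""
    flagged = []
    for word in sentence.lower().split():
        for group in VARIANT_GROUPS:
            if word in group and word != canonical[frozenset(group)]:
                flagged.append(word)
    return flagged

# The corpus maintainers choose "vere" as the canonical form (assumption).
canonical = {frozenset({"vera", "vere"}): "vere"}
```

A real import pipeline would build the variant groups from a dictionary or from corpus statistics rather than a hand-written list.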
- We could have a simple way for people to contribute their blogs as corpus.
- That’s how I’ve gotten most of the sentences I’m preparing for Norwegian.
- However, it needs rather intensive proofreading.
- Or even other places like Facebook / Twitter.
- Not for sentence collection, but we also need to be able to say which dialect the person identifies as speaking.
- They sound extremely different, so a good Norwegian speech recognition will need to have a good distribution.
- This is also my main interest in this project, as the commercial speech recognition systems I’ve tried won’t understand you unless you change the way you speak.
I think it would be nice to have sentence templates with placeholders for things like cities, countries, female or male names and so on. The reason I like them is that they could lead to fewer repetitions of the same word sequences over and over again. For this to happen, it has to be implemented right, though.
In German we currently have a lot of sentences of the form “$A is the capital of $B.” or “Can you walk from $A to $B?”. The main part of these sentences always stays the same, while the variables are substituted by geo-locations. This could lead to overfitting (and might be boring to read at some point).
If Common Voice supported real templates, it would be aware of the fact that there are multiple variations of the same sentence and such a template sentence would not show up more frequently than other sentences (or at least not much more).
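To sketch what “real templates” could mean in practice (all names and structure here are my assumptions, not an actual Common Voice design): the tool stores one template, expands it on demand, and serves only a capped sample of its expansions so the template is not over-represented.

```python
import itertools
import random

def expand(template: str, slots: dict) -> list:
    """Expand every combination of placeholder values in the template."""
    keys = list(slots)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in itertools.product(*(slots[k] for k in keys))
    ]

sentences = expand(
    "Is {a} the capital of {b}?",
    {"a": ["Berlin", "Paris"], "b": ["Germany", "France"]},
)
# Serve at most a few expansions so a single template does not dominate.
sample = random.sample(sentences, k=2)
```

Because the expansions are all tied to one template object, the serving logic can weight them as a single sentence rather than four independent ones.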
Related to sentence collection is sentence correction. We need an interface for that, one which takes the sentence’s bucket into account. Currently, if a sentence is in the “train” bucket and someone adds a missing comma via a GitHub pull request, the corrected sentence might land in the “test” bucket. That’s a problem.
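One way to avoid the bucket-hopping problem described above is to derive the bucket from a stable sentence ID assigned at first import, not from the sentence text, so that edits like adding a comma don’t move it. This is only a sketch under that assumption; the 80/10/10 split and the ID scheme are illustrative, not the project’s actual rules.

```python
import hashlib

def bucket_for(sentence_id: str) -> str:
    """Deterministically map a stable sentence ID to a dataset bucket."""
    digest = int(hashlib.sha256(sentence_id.encode()).hexdigest(), 16)
    r = digest % 100
    if r < 80:
        return "train"
    if r < 90:
        return "dev"
    return "test"

# A correction changes the text but reuses the ID, so the bucket is stable:
# bucket_for("de-000123") returns the same value before and after the edit.
```

With text-based hashing, the same correction would rehash to a different value and could silently leak a training sentence into the test set.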
When there is a public interface for contributing sentences, a set of detailed rules and guidelines would be helpful. The ones at https://voice-sprint.mozilla.community/contributing/ are a good start, but they leave a lot of questions open. Also, they do not contain any language-specific hints. Two examples:
- Are colloquial shortcuts or spellings allowed?
en: “want to” -> “wanna”, “going to” -> “gonna”
de: “heran” -> “'ran”, “nichts” -> “nix”, “deine Mutter” -> “deine Mudda” (yes, this is a common term and e.g. Google’s STT engine spells it like this)
Keep in mind: If such alternative spellings are in the text corpus, all voice contributors have to differentiate between these, as well.
- What about different spellings of names? Hanna vs. Hannah, Nils vs. Niels, Gustav vs. Gustaf, Jasmin vs. Yasmin and so on. Should only the most popular form be included (which is often hard to tell)?
As more people contribute to the corpus, more of these questions will arise. They have to be discussed in the community and the guidelines have to be regularly adapted.
News sites could be useful as well, with constantly updating content.
Here are some Punjabi news sites:
Sites like the BBC that have articles for many different languages could be a good source as well.
Additionally, while it may not be “everyday language”, all European Union documents need to be translated into many different languages by language professionals.
This could serve as a basis for scraping many different sentences: https://europa.eu/european-union/documents-publications/official-documents_en
Thanks everyone for your comments. In the following weeks we will scope and define a set of requirements and features we would like to see in an MVP of the tool, and also clearly define the user journey based on all your feedback and the existing lists of requirements from other groups.
I’ll be sharing a draft as soon as it’s ready and reviewed.
Quick update: We have the MVP requirements draft ready and we will be reviewing it this week. Once we feel it’s ready, I’ll be sharing it here for feedback.
Hi again, sorry for the delay, we had a team off-site last week and I wasn’t able to share this with you.
After checking the feedback in this topic, together with other channels, we drafted an MVP (minimum viable product) that we want to share with you for feedback.
- This MVP includes the things we considered most important for a first release.
- We will be gathering feedback in this topic until September 23rd.
- Based on feedback we will iterate the document and share with our User Experience experts for a final pass.
- Any visuals here are just quick mockups subject to change; they do not represent the final visual direction (no UX expert was involved in them).
Common Voice Sentence Collection MVP
- An input of sentences (categorized in language and source)
- A set of validation algorithms (ensuring length, license)
- An input of reviewed sentences.
- A way to transfer reviewed sentences to the final database.
- General metrics (number of sentences, validated, reviewed)
1. An input of sentences (categorized in language and source)
A web form for text input should be available. This form should:
- Allow single and multiple sentences in a form.
- Allow uploading txt files with multiple sentences, one per line.
- Ask for the source language (auto-detect browser language).
- Ask for the source of the sentence (your own, url, other)
2. A set of validation algorithms (ensuring length, license)
Once you submit the form, a backend will process each individual sentence and apply different validation algorithms:
- Length: Sentences should be 14 words or fewer.
- License: Sentences must not be copyrighted material; they must be in the public domain.
If issues are found, the validation results will be presented to the user, who can edit problematic sentences or submit only the valid ones.
Once submitted, the user will be asked to keep helping and presented with sentences to review.
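A minimal sketch of the validation step described above. Only the length rule is mechanical; licensing cannot be verified automatically, so it is modelled here as a self-declaration flag. The function names and the flag are my assumptions, not the MVP’s actual API.

```python
# Hypothetical backend validation for submitted sentences.
MAX_WORDS = 14  # hard requirement stated in the MVP draft

def validate(sentence: str, public_domain: bool) -> list:
    """Return a list of problems; an empty list means the sentence passed."""
    problems = []
    if len(sentence.split()) > MAX_WORDS:
        problems.append("too long (more than %d words)" % MAX_WORDS)
    if not public_domain:
        problems.append("not confirmed as public domain")
    return problems

ok, bad = [], []
for s in ["A short public domain sentence.", "word " * 20]:
    (ok if not validate(s, public_domain=True) else bad).append(s)
```

The validated/problematic split maps directly onto the UI behavior described above: valid sentences go through, problematic ones are returned to the user for editing.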
3. An input of reviewed sentences.
Users will be presented with sentences from other users in their language (auto-detected from the browser) to validate. People should be able to:
- Validate a sentence right away.
- Reject a sentence right away.
- Edit a sentence and submit it for validation
The way information is presented should be really similar to the review system for localization tools:
A way to submit more sentences should also be presented to the user in this screen.
Any user should be able to access the review screen anytime and select the language preference to review.
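Combining the review actions above with the earlier suggestion to remove sentences after X reports, the tallying could look something like this sketch. The two-vote threshold is an arbitrary assumption for illustration, not a Common Voice rule.

```python
from collections import Counter

THRESHOLD = 2  # assumed number of matching votes needed to decide

def review_status(votes: list) -> str:
    """Tally reviewer decisions for one sentence."""
    counts = Counter(votes)
    if counts["validate"] >= THRESHOLD:
        return "validated"
    if counts["reject"] >= THRESHOLD:
        return "rejected"
    return "pending"
```

An edited sentence would simply re-enter the queue as a new candidate with an empty vote list.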
5. General metrics (number of sentences, validated, reviewed)
At any given moment, the user should be able to see a quick reference of how many sentences were validated and reviewed for the current language.
Spanish: Validated (1300) Reviewed (567)
A page with all languages metrics should also exist.
- Guidance page: Where to find sentences? What is a good sentence?
- An input form to write or an upload mechanism (txt files)
- A way to see post-validation output.
- A system to review other people’s sentences.
- General metrics (number of sentences, validated, reviewed)
1. Guidance page: Where to find sentences? What is a good sentence?
A link to a documentation page should be present in the tool at any given moment, and very visible from the submission form.
This page should contain:
- Description of the three current good strategies to gather sentences
- How do I get public license sentences from large sources? (examples)
- How do I get linguists involved in the project? (examples)
- How do I submit original sentences myself?
- Description of what constitutes a good sentence
- Hard requirements: Length, license, grammar.
- Nice to have: Names, cities, diverse sounds…
For 2, 3, 4, 5 see explanation in the previous section.
The question is, what will happen to our previous contributions that are awaiting validation? Will they be added to the system by you people, or?
Also, I think we may want to reconsider the sentence length limit, since depending on the language and the use of long words it may need to differ from the current count.
Good point. What might be better is a phoneme or syllable count. Unless the word count limit is set assuming the longest words, there will be a chance of unintentionally “too long” sentences.
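As a rough illustration of the syllable-based limit suggested above, vowel groups can serve as a crude syllable proxy. Real syllabification is language-specific, so this is only a sketch of the idea; the regex, function names, and the limit of 30 are all assumptions.

```python
import re

# Approximate syllables as runs of vowels (crude, language-dependent proxy).
VOWEL_RUN = re.compile(r"[aeiouyàâéèêëîïôùûü]+")

def approx_syllables(sentence: str) -> int:
    """Count vowel groups as a rough stand-in for syllables."""
    return len(VOWEL_RUN.findall(sentence.lower()))

def within_syllable_limit(sentence: str, limit: int = 30) -> bool:
    return approx_syllables(sentence) <= limit
```

This would let a 15-word sentence of short words pass while flagging a 15-word sentence full of four-syllable words, which is exactly the French case raised earlier in the thread.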
Depending on what you mean by “awaiting validation”: if they are already in our repos, we can probably copy and paste them into the tool to get all of them into validation and community review.
The idea is that the tool will help us tackle the current backlog of sentences we have for many languages.
Quick update: We have started some work on the tool backend and we are checking the frontend workflow with our UX experts. We are moving a bit slower than we expected, but we are making this a priority in the coming weeks.