We want your feedback: Improving the sentence collection

odinho · July 30, 2018, 9:27am

Finding CC0 stuff is hard; everyone ‘liberal’ uses CC-BY.
Having a way to tag alternate sentences that mean the exact same thing.
- In Norway we have so many ‘official’ ways to say things. Some people only use one of them, some use the other, and a very few people use both.
- The English “to be” can be either “å verta” or “å bli” in Norwegian.
- We have a ton of this.
Having a built-in system for translation of English sentences.
- I’ve done this manually for now. It’s boring, but I’m also a bit concerned that some interesting data (the connection English sentence <=> Norwegian sentence) is getting lost.
- It also needs to have several output sentences for one input sentence.
When taking in new sentences, check all new words so we can see whether they are in the correct grammatical form.
- Sadly we even have “choose-your-own-adventure” grammar in Norwegian.
- You have to be internally consistent, but you can choose to either write “to be” as “å vera” or “å vere”. Yes, in addition to “å bli”.
- We would only want to have one of those forms in the corpus, so that the speech recognition output comes out in one consistent form.
- That’s a hard problem, and I think Norwegian has it worse than most, but anyone would benefit from rules, stats and information at import time (or in review).
We could have a simple way for people to contribute their blogs as corpus material.
- That’s how I’ve gotten most of the sentences I’m preparing for Norwegian.
- However, it needs rather intensive proofreading.
- Or even other places like Facebook / Twitter.
Not for sentence collection, but we also need to be able to say what dialect the person identifies as speaking.
- The dialects sound extremely different, so a good Norwegian speech recognition will need a good distribution of them.
- This is also my main interest in this project, as the commercial speech recognition I’ve tried won’t understand you unless you change the way you speak.
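
To make the "check all new words" point a bit more concrete, here is a rough sketch in Python of what an import-time check could look like. Everything in it is an assumption for illustration: the variant table, the choice of “vera” as the form to keep, and the function names are made up, and none of it is taken from the actual sentence collection tool. It only flags sentences that use a non-preferred form and counts how often each known form shows up, so a reviewer has some stats to look at.

```python
import re
from collections import Counter

# Hypothetical variant table: each tuple lists spellings of the same word.
# The first entry is the form we (arbitrarily, for this example) keep in the corpus.
VARIANT_GROUPS = [
    ("vera", "vere"),
]


def build_lookup(groups):
    """Map every known spelling to the preferred spelling of its group."""
    lookup = {}
    for group in groups:
        preferred = group[0]
        for form in group:
            lookup[form] = preferred
    return lookup


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def review_sentences(sentences, groups=VARIANT_GROUPS):
    """Return (flagged, stats).

    flagged: list of (sentence, [(found_form, preferred_form), ...]) for
             sentences that contain a non-preferred form.
    stats:   Counter of how often each known form occurs in the input.
    """
    lookup = build_lookup(groups)
    flagged = []
    stats = Counter()
    for sentence in sentences:
        hits = []
        for word in tokenize(sentence):
            if word in lookup:
                stats[word] += 1
                if lookup[word] != word:
                    hits.append((word, lookup[word]))
        if hits:
            flagged.append((sentence, hits))
    return flagged, stats
```

Calling `review_sentences(["Det er godt å vera her.", "Det er godt å vere her."])` would flag only the second sentence, and the counts would show one occurrence of each form. A real table would have to come from people who know the language, and it would not catch cases where the same spelling is correct in one context and wrong in another.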
