We need a Q&A

Japanese version: 何で誰もQ&Aを作らないんですか? (Why doesn't anyone make a Q&A?)

Why doesn't anyone make a collection of questions and answers?
It's crazy!

If you've skimmed this Discourse, you'll understand.
People ask the same questions over and over again.
People give the same answers, over and over again!
How long will this repeat itself?

That's fine if we want to put ourselves through the hassle.
But the fact is that everyone wants an "answer".

I have so many questions:

  • What topic should I look at first?
  • Isn't there a topic for my language here?
  • How many times does one sentence need to be read by how many people?
  • Why can others read it 1,000 times and I can only read it 300 times?
  • Where is Common Voice's "Report" reported to? Who's going to fix it? When will it be fixed?
  • Can't I edit low-quality sentences in the source text? Or delete them?
  • Can't the pagination of the Collector be specified by number?
  • Can't I hide "skipped" sentences? (Whether it's Common Voice or the Collector.)
  • How many people would have to "OK" (or "No") a sentence to be adopted (or rejected)?
  • Is "total sentences" the sum of the Collector? (Does that include the source text?)
  • When will the adopted sentences be added to the source text?
  • Can't I verify the source of the sentence with the Collector?
  • Why doesn't Mozilla answer my question written in Japanese! (Yes, they're busy....)
  • There's also the question of my language.

Even if the answer is "no" or "can't" or "don't know", a proper answer should be given.
A list of those "answers" is what we need!

There's no Q&A on Common Voice. There are no corpus links. In fact, there aren't even topics for each language.
"Why?" I ask the very question itself.

Are we short-handed? Are we short of ideas?
I seek "answers". What are we missing? What do we need?

Insert: 2020-09-26

Foothold

If we want to activate volunteers in each language, I think we need to make it "easy to understand" in some way. To a non-English speaker, the "alphabetical string" is difficult to read on its own. So it's a lot easier just to keep the information centralized. Yes, even if it works, there will be frequent questions, especially in languages other than "English". It may be impossible to answer all of them. But I'd say it's better than it is now because we easily know where to look (and ask).
What is missing here is something that lets us get an overview. Perhaps the best we can do now is to make it possible to grasp the "whole". It's about being a foothold for people in minor languages. (Of course, even those who are fluent in English will have trouble sifting through the information.)

Volunteers will appear and disappear, but the important thing is to keep them informed. We should be available for someone to start or resume at any time.

Sharing Information

How do you guys do it? Do you write it down every time you ask a question and get an answer? Do you share that information with people who speak the same language? Maybe if we told writers (bloggers) about this project, they would be happy to provide their voices and sentences, but is anyone sharing the project with anyone else? For example, with a Firefox freak or a blogger? At the very least, I think there are close to zero Firefox users in Japan who know about Common Voice. How about in your language?

Hmmm. I also feel like it's "too early" for people to know, because the Collector and even the platform are still in their infancy. But we need a lot of people's voices, right?

Let me try to answer at least a few questions:

See Single Sentence Record Limit feature release.

No. We need to make sure edits are approved according to process. What do you mean by “low quality sentences” exactly?

I guess those two questions are directly connected? For the Sentence Collector as it works right now there is no way to explicitly know which sentences have been skipped intentionally. Before adding a third button, I’d definitely want to know the core problem there. Additionally, for the Sentence Collector the review queue gets sorted by most votes, so we can ensure that the most-reviewed sentences get into the data set. Otherwise we could end up with 10k sentences all having one vote, and none of those would make it into Common Voice. Because of that, I don’t think jumping to a specific page has any benefit for the project.

I can’t say for Common Voice right now, but for the Sentence Collector it’s 2 out of 3 votes being approvals.
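
To make the two rules above concrete (2 approvals out of 3 votes, and a review queue sorted by most votes), here is a minimal sketch in Python. It is not the Sentence Collector's actual code: the `Sentence` class and the `status` and `review_queue` functions are hypothetical names, and the rule that 2 "no" votes reject a sentence is only an assumption inferred from the 2-out-of-3 wording.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """Hypothetical record of one submitted sentence and its review votes."""
    text: str
    approvals: int = 0
    rejections: int = 0

    @property
    def votes(self) -> int:
        # Total number of review votes cast so far.
        return self.approvals + self.rejections


def status(sentence: Sentence) -> str:
    """Apply the 2-out-of-3 rule described in the thread."""
    if sentence.approvals >= 2:
        return "approved"
    # Assumption: with at most 3 votes, 2 rejections make approval impossible.
    if sentence.rejections >= 2:
        return "rejected"
    return "pending"


def review_queue(sentences: list[Sentence]) -> list[Sentence]:
    """Return undecided sentences, those with the most votes first."""
    pending = [s for s in sentences if status(s) == "pending"]
    return sorted(pending, key=lambda s: s.votes, reverse=True)


if __name__ == "__main__":
    queue = review_queue([
        Sentence("no votes yet"),
        Sentence("one approval", approvals=1),
        Sentence("already approved", approvals=2),
    ])
    for s in queue:
        print(s.votes, s.text)
```

Under these assumptions, a sentence that already has one vote is shown before a sentence with none, so each new review tends to finish a decision rather than spread single votes across many sentences.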

If you mean the number within the Sentence Collector, that’s for the Sentence Collector only.

Roughly every week, and these are then deployed to the Common Voice website when the Common Voice website gets a new release.

Right now no, but that might indeed be a low hanging fruit to fix. I’ve filed https://github.com/Common-Voice/sentence-collector/issues/330 for that.

Some of these questions have already seen some traction from other volunteers. You additionally might want to read Ongoing Common Voice & Mozilla Updates.

Sorry Michael. I didn't mean to give you a hard time. But you've been a big help. Thank you.

I think this is probably the first question a volunteer would have. So hopefully one day there will be a Q&A page where we can easily find it.


Low-quality sentences are those that are "unnatural" or "difficult to read" to a native speaker.

Some sentences:

  • are overlong or verbose (for example, repeating the same phrase over and over).
  • are hard to say smoothly, because they contain words of foreign origin or read like tongue twisters.
  • aren't complete (they are fragments cut from the middle of a sentence).
  • have foreign words mixed in.

I think you're right: third-party approval is necessary.


As for sorting the sentences, I don't think it's a problem.

I "ignore" sentences when I can't decide whether they are a "yes" or a "no". That could be 100 or 200 sentences. Then, when I resume work, I have to click the arrow button again and again to reach the sentences I haven't seen before. Or is there a way to move around the pages with ease?

...... Maybe, but I'm the only one on the Japanese language Collector right now. The numbers don't move at all while I'm away. So there are a lot of sentences left that I can't decide on. I don't know how the other languages are working.

I feel overwhelmed at having to review another 100,000 or 150,000 sentences, maybe alone. So, I tend to give a thumbs up to my own sentences. As for self-review, you mentioned it in Sentence Collector - Review before Submit.

Insert: 2020-09-27

Chronological information

I also think it's very important to summarize the information in chronological order. For example, isn't there a difference in the information you get from the Collector tool development stage and after it's released? With We want your feedback: Improving the sentence collection and Extending our sentence collection capabilities, we can't determine whether it's pre-release or later by title only. We can certainly sort the search results in chronological order. But searching for information would be easier if we could easily know when the Collector tool was released (i.e., get the facts) and whether the topic is no longer useful (i.e., has finished its function). Ideally, we are able to know before we read the topic.

Clear dates

The omission of dates is also a problem. Why do we omit dates? That's an important clue for anyone looking for information! (It's a hassle to hover the mouse over each one.)


Hmmm, should I post these two issues in the Meta category? But I think it's definitely important to the project.

I’m happy to see interest in the project from a Japanese speaker. As you may know, there are very few free speech resources available for Japanese. I’ve been surprised that Japanese contributions to Common Voice have grown so slowly. I hope your enthusiasm might translate into making the project more accessible and attractive to Japanese speakers.

I agree that documentation regarding sentence collection is currently difficult to find. I have to assume that some of the recent shortcomings have been due to 25% of Mozilla employees being fired in August. I’m glad to see, though, that Mozilla has not completely abandoned the project. You can follow development of the Sentence Collector source code, for example in its commit history (https://github.com/Common-Voice/sentence-collector/commits/master). There is also ongoing work on automatically importing sentences from Wikipedia for many different languages.

I would suggest that once the sentence collection process is stable and the major kinks are worked out, we develop a short guide with topics like those you’ve suggested, translate and adapt it into all of the languages, and link to it from the main page (not Discourse). Since you are making the effort to find answers to these questions specifically related to Japanese, perhaps you could begin drafting such a document in Japanese, collecting all of your findings into one place to help other contributors? I wonder if the project could host a wiki where we might work on this sort of documentation?

I think one of the best ways to contribute to Common Voice is recruiting new volunteers. I know very few Japanese people, but you probably know many. Wouldn’t collecting sentences be more fun and interesting with another reviewer? Another problem with collecting voices is that the majority are often male, which is not as good as gender-balanced data for training ASR; it would be a great contribution to recruit and include more female volunteers, as well as any volunteers in general.

Hi Craig. Thank you.
I don't know why I am so actively involved in this myself. I simply love to write and vocalize, and I'm clueless about voice recognition systems.

Yes, you're right, I should let people know. For starters, I wrote a quick guide on my website. In Japan we have an anonymous bulletin board called "5ちゃんねる" (5chan), so I think I'll start a thread there once I get a little more information.

Yes, for all intents and purposes, I think it should be a wiki. It's too much for me to handle on my own!

It seems that Discourse can create a wiki post, but I don't seem to have permission to do so.
Ref: What is a Wiki Post? - howto / faq - Discourse Meta
