Spontaneous Speech Mode is Coming to Common Voice

Spontaneous Speech mode and a new dataset are coming to Common Voice

At Common Voice, we’re adding a new, more natural way to contribute your voice: Spontaneous Speech. In Spontaneous Speech, you’ll be able to answer questions in your own words, and your answers will then be transcribed. This will create a new parallel dataset to accompany the existing read-speech datasets on the platform.

Spontaneous Speech is available now and can be accessed from the “Speak” menu dropdown at commonvoice.mozilla.org.

How to add your language to Spontaneous Speech mode on Common Voice:

You can request a new language using this form, being sure to tick that your request is to add this language to Spontaneous Speech. Adding a new language can optionally include localization of the Spontaneous Speech UI into your language, or you can allow your users to select from any existing UI localization.

Adding a new language to Spontaneous Speech mode will require writing 60 or more questions for contributors to answer. These prompts will be unique to each language, and can bring out the kind of responses that will be valuable to your community.

To better support our language communities, we’ll be running two drop-in open office-hours calls to walk you through writing good prompts: May 20th at 7:00 pm UTC (Registration here) and May 22nd at 6:00 am UTC (Registration here). We’ll also share the slides from these sessions with the community next week for those unable to attend.

Questions? Concerns? Feedback? Ideas on what we should build next? We always want to hear from you! Reply to this thread, message me directly, or email the team at commonvoice@mozilla.com.


I think this feature is going to be great. I’ve transcribed a number of clips and have some feedback.

  1. Transcribers need a fully-featured audio player. E.g., navigation, selecting a region, playback speed, spectrograms, etc. Oftentimes, there are one or two difficult-to-transcribe words that need to be listened to repeatedly. The only way to do that now is to listen to the entire clip and wait for it. This is very time consuming and surely will result in errors as transcribers give up.
  2. On transcribing numbers and symbols: contemporary ASR models transcribe denormalized text (“100%”, not “one hundred percent”). Though I understand this data is not solely for ASR, speech datasets are generally making this shift. Perhaps this is a good time to adopt a new standard, since this will be an entirely new dataset. If this rule remains unchanged, users will need to preprocess this dataset (likely with expensive-to-run LLMs) to meet current needs. (A rough sketch of that kind of preprocessing follows this list.)
  3. It may be a good idea to encourage speakers to transcribe their own data, or at least provide some information related to tricky proper nouns. E.g., I transcribed a clip where I heard “light district” and “hills of the penine”. However, I’ve been around enough speech datasets to know that these were proper nouns and probably spelled differently than they sound. So, given that the speaker had a UK accent, I googled “UK hills of the penine” and discovered that the correct transcript is “Pennine” (not penine)… and that there was a nearby “Lake District” (not light district). However, a typical transcriber will not put in this much effort, nor can I for every transcript, and a lot of (the most important) transcripts will be wrong.
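To make the preprocessing concern concrete, here is a minimal sketch of the kind of inverse text normalization a downstream user would have to run if transcripts keep spelled-out numbers. The lookup table and rules are purely illustrative; real pipelines use WFST-based tools or LLMs rather than a toy dictionary like this.

```python
# Toy inverse text normalization: spoken forms -> written forms.
# Illustrative only; real systems use WFST grammars or LLMs.
import re

NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "ten": "10",
    "twenty": "20", "one hundred": "100",
}

def toy_inverse_normalize(text: str) -> str:
    """Convert a few spoken forms ("one hundred percent") to written forms ("100%")."""
    # Longest keys first so "one hundred" wins over "one".
    for spoken in sorted(NUMBER_WORDS, key=len, reverse=True):
        written = NUMBER_WORDS[spoken]
        # "<number> percent" -> "<digits>%"
        text = re.sub(rf"\b{spoken} percent\b", f"{written}%", text)
        text = re.sub(rf"\b{spoken}\b", written, text)
    return text

print(toy_inverse_normalize("turnout was one hundred percent in two districts"))
# -> "turnout was 100% in 2 districts"
```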

Just gonna dump some more feedback here since I don’t know of a better place…

  1. It would help to display to the transcriber the source question, especially to disambiguate the first few words (which are typically related to the question). E.g., I transcribed a clip where I heard the beginning as “asacorwarder”. After listening to it many, many times, I realized it was “As I grew older…” which was part of the question and would have been very easy to resolve had I seen it. Many other examples I’ve seen would’ve benefitted as well.
  2. I don’t see anything in the guidelines about “unsure” annotations; e.g., a lot of ASR datasets adopt a standard such as “this transcript contains ((unsure words))”. This allows difficult data to be transcribed without forcing the transcriber to produce an error, and also lets downstream users choose how to handle it before training a model (e.g., some people may simply remove the brackets and train on it as-is, others may discard the sample; a small handling sketch follows this list).
  3. Reporting a clip requires too many clicks, and there are many clips that are in the wrong language, are silent, or where the user simply reads the question… I think it’s best to drop the minimalistic GUI in favor of more features that allow for faster transcribing. I “skip” clips instead of reporting them because it’s much faster, passing the burden onto the next transcriber. I would prefer a one-click report option for each common reason.
  4. The “Skip” button does not reset the state of the audio player or text area. After skipping, I have to highlight and delete what I’ve transcribed of the last clip, and click to cycle through pause/play for the next clip.
  5. Buttons should be closer together to minimize cursor movement. E.g., when validating clips, the cursor needs to move barely an inch between yes/play/no; however, transcribing requires a lot more cursor movement and clicking: play → text field → play again → text field → (repeat play/text field) → done → submit → repeat (occasionally report/skip, too). Given the difficulty of transcribing, this project will probably rely on “power users” much more than validation does, and these users want to move fast.
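Regarding point 2, here is a rough sketch of how downstream users could handle an ((unsure)) convention. The double-parenthesis marker is only the example from my post, not an existing Common Voice rule.

```python
# Two possible downstream treatments of ((unsure)) markers: keep the guess, or drop the sample.
import re
from typing import Optional

UNSURE = re.compile(r"\(\((.*?)\)\)")

def keep_unsure(transcript: str) -> str:
    """Strip the markers and train on the transcriber's best guess."""
    return UNSURE.sub(r"\1", transcript)

def drop_if_unsure(transcript: str) -> Optional[str]:
    """Discard any sample that contains uncertain words."""
    return None if UNSURE.search(transcript) else transcript

print(keep_unsure("we walked the ((Pennine)) way"))     # -> "we walked the Pennine way"
print(drop_if_unsure("we walked the ((Pennine)) way"))  # -> None
```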

Possibility to record one question as many times as I want

It should be possible for the same speaker to record the same questions many times. Maybe just reset the recorded status of all questions once they have all been answered (after a confirming click from the user). Even if my language doesn’t have many questions, answering them is still more interesting and produces more original content than reading the same phrases in scripted speech. If the limit exists to prevent boredom, it is strange that I can read the same phrase many times in scripted speech mode. That is much more boring and frustrating, especially because many of the sentences are outdated (old words are used because of the requirement that sentences be in the public domain or under CC0).

Question collection process

A good question for public participation datasets should:

  • Be easy to understand and respond to
  • Be generally relevant
  • Not use, or solicit, harmful or offensive language

Source: Guidelines

My question: are only questions accepted, or can a prompt also be a request? I mean something like: “Tell a random story”.

Questions which might solicit personally identifiable information

Where can I find a full list of the information that is considered personally identifiable? It is really difficult to write new questions if I don’t know it… I think all types of this information should be listed in the guidelines. For instance, is “Tell something about your profession” classified as sensitive personal information too? If so, what isn’t personal info? And how should I then follow this recommendation: “Speak naturally, as you would with a friend - use your own real variant, dialect and accent”? Most people talk with their friends about their life and personal experience, and that’s totally fine. These topics will actually be more interesting for the recorder/transcriber. Can I share this info if I want to and am ready to? Would it be possible to add a checkbox like “I’m ready to share this info”, or a separate category for it, or is it strictly forbidden by the laws of your jurisdiction? What if I change all the personal info, such as names? Something like: “I guarantee that if my answer contained any personal info, I changed it to alternatives and identification is no longer possible”?

Are questions strictly linked to answers?

As far as I understand it, the questions were added to give people a topic if they don’t want to come up with one themselves. But what if I want to record a monologue about some random topic? Can I do that, or must I strictly answer the question I get? For now I think it is not a big problem to do so, because the question is not shown to my transcribers. But if these question-answer pairs are used together, and not only the transcripts and audio are stored and used, then it might become a little problematic. In that case, do you have any plans to implement an additional type of spontaneous speech in which it would be possible to record spontaneous speech without linking it to a question?

Answer Questions. General guidelines

  • Record in a reasonably quiet place

Is this desirable or necessary? Noise is a part of life, and clips containing it can still be useful when the user’s speech can be heard and transcribed. In the earlier guidelines for scripted speech it was permitted.

  • Speak naturally, as you would with a friend - use your own real variant, dialect and accent

But artistic and dramatic speech is useful too! I like variants, dialects and accents, but why should only a natural voice and intonation be preferred? I suggest something like: “Feel free to speak as you want. Don’t be afraid to use your natural voice and intonation.”

  • Keep your volume consistent - don’t shout or sing.

Shouting is a normal occurrence in real life - for instance, if I am cleaning my room and move to another part of it, or to another room, while continuing to talk to my phone or another device lying on a table.

Transcribe Audio

  • Labeling noise events like coughing or laughing

These should be marked with [special tags], right?

The following special tags should be used to mark disfluencies, fillers and other types of non-verbal content (in English).

What does ‘in English’ mean in this case? Does it mean that special tags should be written only in English, or that the examples are given for the English language? Can I create my own special tags, or is there a specific pool of them (those in the guidelines, for example)? I think using only English for storage is the correct strategy, but then you need to create a full list of the tags and add them to Pontoon. Then, in CV’s front-end, they should be localized for the user:

User sees/enters: “Mi ne kompre-[bruo] vin” (I don’t under-[noise] you).
Stored as: “Mi ne kompre-[noise] vin”.

So the user sees and writes the localized version of these tags, but in the system itself the special tags are stored in English to make them easier to work with. Also, when I type the ‘[’ character, it should open a list of special tags and narrow it down to the most appropriate option as I type further characters (similar to how language selection works now).
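For illustration, here is a rough sketch of the mapping I have in mind. The tag names and the Esperanto strings are only examples; none of this exists in CV today.

```python
# Localized tags typed by the user are mapped to canonical English tags for storage.
CANONICAL_TAGS = ["noise", "laugh", "cough", "disfluency"]

# Localized display forms (e.g. maintained in Pontoon), keyed by locale.
LOCALIZED = {
    "eo": {"bruo": "noise", "rido": "laugh", "tuso": "cough", "hezito": "disfluency"},
}

def to_canonical(transcript: str, locale: str) -> str:
    """Replace localized [tags] entered by the user with the stored English tags."""
    for local_tag, english_tag in LOCALIZED.get(locale, {}).items():
        transcript = transcript.replace(f"[{local_tag}]", f"[{english_tag}]")
    return transcript

print(to_canonical("Mi ne kompre-[bruo] vin", "eo"))  # -> "Mi ne kompre-[noise] vin"
```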

Ah, yeah. And special tags should be highlighted in a different colour. It’s not so important which one, but it would make it easier for other users to check a transcript that someone has written.

  • Writing down disfluencies, including hesitations and repetitions
  • [disfluency] - A filler word or sound used as a placeholder whilst a speaker decides what to say. In English, some common hesitation sounds are “err”, “um”, “huh”, etc.

Which should I prefer: the [disfluency] tag or interjections such as “um”? Or should I use this tag only if I’m not sure which interjection to use? Or will there be some addition for this tag specifically? For instance [disfluency](Huh… Err…) - I mean that I mark it as a disfluency with the special tag and, in the parentheses, try to give a transcription of the disfluency. I think that would be a good option.

Same question for “[laugh]”: should I even try to transcribe it with an interjection like “AHAHAAHA” or something similar?
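To show what I mean, here is a tiny sketch of how the [tag](attempted transcription) notation I am proposing could be consumed downstream. The notation itself is only my suggestion, not part of the current guidelines.

```python
# Downstream users could keep either the special tag or the transcriber's attempt.
import re

TAGGED = re.compile(r"\[(\w+)\]\(([^)]*)\)")

def keep_tag_only(text: str) -> str:
    """Keep just the special tag, dropping the attempted transcription."""
    return TAGGED.sub(lambda m: f"[{m.group(1)}]", text)

def keep_attempt_only(text: str) -> str:
    """Keep the transcriber's attempt, dropping the tag."""
    return TAGGED.sub(lambda m: m.group(2), text)

sample = "I was [disfluency](um, err) not sure"
print(keep_tag_only(sample))      # -> "I was [disfluency] not sure"
print(keep_attempt_only(sample))  # -> "I was um, err not sure"
```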

  • Grammatical variation and slang should be recorded exactly as it occurs. Do not correct or edit people’s speech.

Which exact cases does this rule cover? For instance, in Russian there is the word ‘сейчас’ (now), and it is almost never pronounced as “sejchas”; almost every native speaker says something like “schas”. Usually everyone uses the ‘сейчас’ spelling, but occasionally people use the phonetic spelling “щас”. The pronunciation is the same in both cases; the difference is only in orthography. I think it is similar to “you = u” in English. The phonetic spelling is more accurate as a record of what was said, but it doesn’t feel right to me. What is the existing consensus on CV for cases like these?

  • Acronyms should be written as they are normally written in the language, following standard capitalization rules. They should not be transcribed phonetically. Example:

What about mixed-script words? Modern Russian has plenty of anglicisms and other borrowings, especially in slang and professional jargon, that are almost never written in Cyrillic script, or only partially. Some examples: mp3-плеер (MP3 player; “mp3” is usually not transliterated), dawка (DAW, digital audio workstation; “daw” is written in Latin letters and only the Russian suffix -ка is added), html-разметка (HTML markup), “Смотря какой fabric, смотря сколько details” (“Depends on which fabric, depends on how many details”). These are still Russian words, because at least part of them is Russian and/or they take Russian declension and agreement. What should I do in cases like these? Can I just transcribe them, even though they contain non-Cyrillic characters? Earlier, in scripted speech, it was strictly forbidden to include them in sentences.

Code-switching

I don’t understand this part of the documentation at all, sorry. Does it mean that I can mix several languages in one question/answer? But isn’t that against the rule “Don’t Add … Culturally specific questions → Questions which are very culturally specific, or make a lot of assumptions about the responder”? Code-switching is really culture-specific in many cases, and not everyone will be able to understand questions/answers from this category. For example, most Russian speakers will not understand the following sentences:

  • Я в шени метапелю у бабушки неподалёку от супера, где мы фрукты обычно покупаем. (Russian with Hebrew elements; it would be understood by Russian speakers who live in Israel.)
  • Наслайсить чизу? (Roughly “Shall I slice some cheese?” - Russian with English elements; it would usually be understood by Russian speakers who live in anglophone countries.)

Or do we have an exception to this rule for this case?

Outdated parts of the guidelines

Questions which someone would struggle to respond to in 15 seconds (the maximum clip length)

I have seen many clips, at least in Russian, that were around 30 seconds. And further down the same page it even says: “Try to keep your response to 15-30 seconds”.

OK, after some research:

Yes, many languages already use prompts that are requests rather than questions, and the examples given in the table confirm it too: Spontaneous Speech Prompts Drafts - Google Sheets:

  • Describe how to book a doctor’s appointment in your country.
  • Describe some foods that are healthy and nourishing.
  • Describe a visit to a cinema in your country.

moz-bozden:
Hallo @grotzbot, actually there is currently no hard limit on recording time. The 15-second suggestion is there to make transcription more feasible (in our experiments, a 1-minute recording can take 10 minutes to transcribe).

Contributors should not feel stressed about timing; it might hurt the naturalness. In the communities I help, I ask for a minimum of 15 seconds, a target of 30-60 seconds, and not to exceed 1 minute by much.

The whole project is about natural speech, and when people add more sentences to their answers, it produces more intonation and naturalness… Of course, don’t recite an encyclopedia or tell your whole life story :slight_smile:


Dear @shane.carroll and @Libra, we very much appreciate this feedback.

Some of these points are already in the plans, some are new to us - and that is great!

As you know, the project is in its beta phase; the first dataset release is in 10 days, and feedback from the first dataset users will be valuable.

There are already pre-defined steps being implemented one by one; the team is logging these issues and feature requests, and they will be prioritized for implementation.

Currently we have many upcoming projects on our hands, including the Code Switching upgrade to Spontaneous Speech (which you mentioned), which is in an alpha phase. Since the project is looking for natural speech patterns, it does not put hard limits on people’s speech - they might mix in some words from other languages - but “real” code-switching will be handled in that extension.

I’m sure the team will analyze these, and give more satisfying answers and solutions…

Again, thank you for all of this feedback.

Shane, I wanted to thank you so much for this feedback. I’ve shared this with the engineering team and have booked time to go through this in more detail with our UX expert.

Having speakers transcribe their own data is something we had looked to avoid, to try and add extra layers of validation by having different people looking at the clips and transcriptions at different stages, but I’ll bring this back to the team to discuss. This is so helpful and thank you for taking the time!

Thanks for the answer, but my post contained not only feedback and proposals but also questions :slight_smile: