Spontaneous Speech Mode is Coming to Common Voice

Spontaneous Speech mode and a new dataset are coming to Common Voice

At Common Voice, we’re adding a new, more natural way to contribute your voice: Spontaneous Speech. In Spontaneous Speech, you’ll be able to answer questions in your own words, and your answers will then be transcribed. This will create a new parallel dataset to accompany the existing read-speech datasets on the platform.

Spontaneous Speech is available now and can be accessed from the “Speak” menu dropdown at commonvoice.mozilla.org.

How to add your language to Spontaneous Speech mode on Common Voice:

You can request a new language using this form, being sure to tick that your request is to add this language to Spontaneous Speech. Adding a new language can optionally include localization of the Spontaneous Speech UI into your language, or you can allow your users to select from any existing UI localization.

Adding a new language to Spontaneous Speech mode will require writing 60 or more questions for contributors to answer. These prompts will be unique to each language, and can bring out the kind of responses that will be valuable to your community.

To better support our language communities, we’ll be running two drop-in open office-hours calls to walk you through writing good prompts: May 20th at 7:00 pm UTC (Registration here) and May 22nd at 6:00 am UTC (Registration here). We’ll also share the slides from these sessions with the community next week for those unable to attend.

Questions? Concerns? Feedback? Ideas on what we should build next? We always want to hear from you! Reply to this thread, message me directly, or email the team at commonvoice@mozilla.com.


I think this feature is going to be great. I’ve transcribed a number of clips and have some feedback.

  1. Transcribers need a fully-featured audio player. E.g., navigation, selecting a region, playback speed, spectrograms, etc. Oftentimes, there are one or two difficult-to-transcribe words that need to be listened to repeatedly. The only way to do that now is to listen to the entire clip and wait for it. This is very time consuming and surely will result in errors as transcribers give up.
  2. On transcribing numbers and symbols: contemporary ASR models transcribe denormalized text (“100%”, not “one hundred percent”). Though I understand this data is not solely for ASR, speech datasets are generally making this shift. Perhaps this is a good time to adopt a new standard, since this will be an entirely new dataset. If this rule remains unchanged, users will need to preprocess this dataset (likely with expensive-to-run LLMs) to meet current needs. (A rough sketch of that kind of preprocessing follows this list.)
  3. It may be a good idea to encourage speakers to transcribe their own data, or at least provide some information related to tricky proper nouns. E.g., I transcribed a clip where I heard “light district” and “hills of the penine”. However, I’ve been around enough speech datasets to know that these were proper nouns and probably spelled differently than they sound. So, given that the speaker had a UK accent, I googled “UK hills of the penine” and discovered that the correct transcript is “Pennine” (not penine)… and that there was a nearby “Lake District” (not light district). However, a typical transcriber will not put in this much effort, nor can I for every transcript, and a lot of (the most important) transcripts will be wrong.
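To make the preprocessing concern concrete, here is a minimal sketch of the kind of inverse text normalization a downstream user would have to run if transcripts keep spelled-out numbers. The lookup table and rules are purely illustrative; real pipelines use WFST-based tools or LLMs rather than a toy dictionary like this.

```python
# Toy inverse text normalization: spoken forms -> written forms.
# Illustrative only; real systems use WFST grammars or LLMs.
import re

NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "ten": "10",
    "twenty": "20", "one hundred": "100",
}

def toy_inverse_normalize(text: str) -> str:
    """Convert a few spoken forms ("one hundred percent") to written forms ("100%")."""
    # Longest keys first so "one hundred" wins over "one".
    for spoken in sorted(NUMBER_WORDS, key=len, reverse=True):
        written = NUMBER_WORDS[spoken]
        # "<number> percent" -> "<digits>%"
        text = re.sub(rf"\b{spoken} percent\b", f"{written}%", text)
        text = re.sub(rf"\b{spoken}\b", written, text)
    return text

print(toy_inverse_normalize("turnout was one hundred percent in two districts"))
# -> "turnout was 100% in 2 districts"
```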

Just gonna dump some more feedback here since I don’t know of a better place…

  1. It would help to display to the transcriber the source question, especially to disambiguate the first few words (which are typically related to the question). E.g., I transcribed a clip where I heard the beginning as “asacorwarder”. After listening to it many, many times, I realized it was “As I grew older…” which was part of the question and would have been very easy to resolve had I seen it. Many other examples I’ve seen would’ve benefitted as well.
  2. I don’t see anything in the guidelines about “unsure” annotations; e.g., a lot of ASR datasets adopt a standard such as “this transcript contains ((unsure words))”. This allows difficult data to be transcribed without forcing the transcriber to produce an error, and also lets downstream users choose how to handle it before training a model (e.g., some people may simply remove the brackets and train on it as-is, others may discard the sample; a small handling sketch follows this list).
  3. Reporting a clip requires too many clicks, and there are many clips that are in the wrong language, are silent, or where the user simply reads the question… I think it’s best to drop the minimalistic GUI in favor of more features that allow for faster transcribing. I “skip” clips instead of reporting them because it’s much faster, passing the burden onto the next transcriber. I would prefer a one-click report option for each common reason.
  4. The “Skip” button does not reset the state of the audio player or text area. After skipping, I have to highlight and delete what I’ve transcribed of the last clip, and click to cycle through pause/play for the next clip.
  5. Buttons should be closer together to minimize cursor movement. E.g., when validating clips, the cursor needs to move barely an inch between yes/play/no; however, transcribing requires a lot more cursor movement and clicking: play → text field → play again → text field → (repeat play/text field) → done → submit → repeat (occasionally report/skip, too). Given the difficulty of transcribing, this project will probably rely on “power users” much more than validation does, and these users want to move fast.
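Regarding point 2, here is a rough sketch of how downstream users could handle an ((unsure)) convention. The double-parenthesis marker is only the example from my post, not an existing Common Voice rule.

```python
# Two possible downstream treatments of ((unsure)) markers: keep the guess, or drop the sample.
import re
from typing import Optional

UNSURE = re.compile(r"\(\((.*?)\)\)")

def keep_unsure(transcript: str) -> str:
    """Strip the markers and train on the transcriber's best guess."""
    return UNSURE.sub(r"\1", transcript)

def drop_if_unsure(transcript: str) -> Optional[str]:
    """Discard any sample that contains uncertain words."""
    return None if UNSURE.search(transcript) else transcript

print(keep_unsure("we walked the ((Pennine)) way"))     # -> "we walked the Pennine way"
print(drop_if_unsure("we walked the ((Pennine)) way"))  # -> None
```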

Possibility to record one question as many times as I want

It should be possible for the same speaker to record the same questions many times. Maybe just reset the recorded status of all questions once they have all been answered (after a confirming click from the user). Even if my language doesn’t have many questions, answering them is still more interesting and produces more original content than reading the same phrases in scripted speech. If the limit exists to prevent boredom, it is strange that I can read the same phrase many times in scripted speech mode. That is much more boring and frustrating, especially because many of the sentences are outdated (old words are used because of the requirement that sentences be in the public domain or under CC0).

Question collection process

A good question for public participation datasets should:

  • Be easy to understand and respond to
  • Be generally relevant
  • Not use, or solicit, harmful or offensive language

Source: Guidelines

My question: are only questions accepted, or can a prompt also be a request? I mean something like: “Tell a random story”.

Questions which might solicit personally identifiable information

Where can I find a full list of the information that is considered personally identifiable? It is really difficult to write new questions if I don’t know it… I think all types of this information should be listed in the guidelines. For instance, is “Tell something about your profession” classified as sensitive personal information too? If so, what isn’t personal info? And how should I then follow this recommendation: “Speak naturally, as you would with a friend - use your own real variant, dialect and accent”? Most people talk with their friends about their life and personal experience, and that’s totally fine. These topics will actually be more interesting for the recorder/transcriber. Can I share this info if I want to and am ready to? Would it be possible to add a checkbox like “I’m ready to share this info”, or a separate category for it, or is it strictly forbidden by the laws of your jurisdiction? What if I change all the personal info, such as names? Something like: “I guarantee that if my answer contained any personal info, I changed it to alternatives and identification is no longer possible”?

Are questions strictly linked to answers?

As far as I understand it, the questions were added to give people a topic if they don’t want to come up with one themselves. But what if I want to record a monologue about some random topic? Can I do that, or must I strictly answer the question I get? For now I think it is not a big problem to do so, because the question is not shown to my transcribers. But if these question-answer pairs are used together, and not only the transcripts and audio are stored and used, then it might become a little problematic. In that case, do you have any plans to implement an additional type of spontaneous speech in which it would be possible to record spontaneous speech without linking it to a question?

Answer Questions. General guidelines

  • Record in a reasonably quiet place

Is this desirable or necessary? Noise is a part of life, and clips containing it can still be useful when the user’s speech can be heard and transcribed. In the earlier guidelines for scripted speech it was permitted.

  • Speak naturally, as you would with a friend - use your own real variant, dialect and accent

But artistic and dramatic speech is useful too! I like variants, dialects and accents, but why should only a natural voice and intonation be preferred? I suggest something like: “Feel free to speak as you want. Don’t be afraid to use your natural voice and intonation.”

  • Keep your volume consistent - don’t shout or sing.

Shouting is a normal occurrence in real life - for instance, if I am cleaning my room and move to another part of it, or to another room, while continuing to talk to my phone or another device lying on a table.

Transcribe Audio

  • Labeling noise events like coughing or laughing

These should be marked with [special tags], right?

The following special tags should be used to mark disfluencies, fillers and other types of non-verbal content (in English).

What does ‘in English’ mean in this case? Does it mean that special tags should be written only in English, or that the examples are given for the English language? Can I create my own special tags, or is there a specific pool of them (those in the guidelines, for example)? I think using only English for storage is the correct strategy, but then you need to create a full list of the tags and add them to Pontoon. Then, in CV’s front-end, they should be localized for the user:

User sees/enters: “Mi ne kompre-[bruo] vin” (I don’t under-[noise] you).
Stored as: “Mi ne kompre-[noise] vin”.

So the user sees and writes the localized version of these tags, but in the system itself the special tags are stored in English to make them easier to work with. Also, when I type the ‘[’ character, it should open a list of special tags and narrow it down to the most appropriate option as I type further characters (similar to how language selection works now).
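For illustration, here is a rough sketch of the mapping I have in mind. The tag names and the Esperanto strings are only examples; none of this exists in CV today.

```python
# Localized tags typed by the user are mapped to canonical English tags for storage.
CANONICAL_TAGS = ["noise", "laugh", "cough", "disfluency"]

# Localized display forms (e.g. maintained in Pontoon), keyed by locale.
LOCALIZED = {
    "eo": {"bruo": "noise", "rido": "laugh", "tuso": "cough", "hezito": "disfluency"},
}

def to_canonical(transcript: str, locale: str) -> str:
    """Replace localized [tags] entered by the user with the stored English tags."""
    for local_tag, english_tag in LOCALIZED.get(locale, {}).items():
        transcript = transcript.replace(f"[{local_tag}]", f"[{english_tag}]")
    return transcript

print(to_canonical("Mi ne kompre-[bruo] vin", "eo"))  # -> "Mi ne kompre-[noise] vin"
```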

Ah, yeah. And special tags should be highlighted in a different colour. It’s not so important which one, but it would make it easier for other users to check a transcript that someone has written.

  • Writing down disfluencies, including hesitations and repetitions
  • [disfluency] - A filler word or sound used as a placeholder whilst a speaker decides what to say. In English, some common hesitation sounds are “err”, “um”, “huh”, etc.

Which should I prefer: the [disfluency] tag or interjections such as “um”? Or should I use this tag only if I’m not sure which interjection to use? Or will there be some addition for this tag specifically? For instance [disfluency](Huh… Err…) - I mean that I mark it as a disfluency with the special tag and, in the parentheses, try to give a transcription of the disfluency. I think that would be a good option.

Same question for “[laugh]”: should I even try to transcribe it with an interjection like “AHAHAAHA” or something similar?
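To show what I mean, here is a tiny sketch of how the [tag](attempted transcription) notation I am proposing could be consumed downstream. The notation itself is only my suggestion, not part of the current guidelines.

```python
# Downstream users could keep either the special tag or the transcriber's attempt.
import re

TAGGED = re.compile(r"\[(\w+)\]\(([^)]*)\)")

def keep_tag_only(text: str) -> str:
    """Keep just the special tag, dropping the attempted transcription."""
    return TAGGED.sub(lambda m: f"[{m.group(1)}]", text)

def keep_attempt_only(text: str) -> str:
    """Keep the transcriber's attempt, dropping the tag."""
    return TAGGED.sub(lambda m: m.group(2), text)

sample = "I was [disfluency](um, err) not sure"
print(keep_tag_only(sample))      # -> "I was [disfluency] not sure"
print(keep_attempt_only(sample))  # -> "I was um, err not sure"
```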

  • Grammatical variation and slang should be recorded exactly as it occurs. Do not correct or edit people’s speech.

Which exact cases does this rule cover? For instance, in Russian there is the word ‘сейчас’ (now), and it is almost never pronounced as “sejchas”; almost every native speaker says something like “schas”. Usually everyone uses the ‘сейчас’ spelling, but occasionally people use the phonetic spelling “щас”. The pronunciation is the same in both cases; the difference is only in orthography. I think it is similar to “you = u” in English. The phonetic spelling is more accurate as a record of what was said, but it doesn’t feel right to me. What is the existing consensus on CV for cases like these?

  • Acronyms should be written as they are normally written in the language, following standard capitalization rules. They should not be transcribed phonetically. Example:

What about mixed-script words? Modern Russian has plenty of anglicisms and other borrowings, especially in slang and professional jargon, that are almost never written in Cyrillic script, or only partially. Some examples: mp3-плеер (MP3 player; “mp3” is usually not transliterated), dawка (DAW, digital audio workstation; “daw” is written in Latin letters and only the Russian suffix -ка is added), html-разметка (HTML markup), “Смотря какой fabric, смотря сколько details” (“Depends on which fabric, depends on how many details”). These are still Russian words, because at least part of them is Russian and/or they take Russian declension and agreement. What should I do in cases like these? Can I just transcribe them, even though they contain non-Cyrillic characters? Earlier, in scripted speech, it was strictly forbidden to include them in sentences.

Code-switching

I don’t understand this part of the documentation at all, sorry. Does it mean that I can mix several languages in one question/answer? But isn’t that against the rule “Don’t Add … Culturally specific questions → Questions which are very culturally specific, or make a lot of assumptions about the responder”? Code-switching is really culture-specific in many cases, and not everyone will be able to understand questions/answers from this category. For example, most Russian speakers will not understand the following sentences:

  • Я в шени метапелю у бабушки неподалёку от супера, где мы фрукты обычно покупаем. (Russian with Hebrew elements; it would be understood by Russian speakers who live in Israel.)
  • Наслайсить чизу? (Roughly “Shall I slice some cheese?” - Russian with English elements; it would usually be understood by Russian speakers who live in anglophone countries.)

Or do we have an exception to this rule for this case?

Outdated parts of the guidelines

Questions which someone would struggle to respond to in 15 seconds (the maximum clip length)

I have seen many clips, at least in Russian, that were around 30 seconds. And further down the same page it even says: “Try to keep your response to 15-30 seconds”.

OK, after some research:

Yes, many languages already use prompts that are requests rather than questions, and the examples given in the table confirm it too: Spontaneous Speech Prompts Drafts - Google Sheets:

  • Describe how to book a doctor’s appointment in your country.
  • Describe some foods that are healthy and nourishing.
  • Describe a visit to a cinema in your country.

moz-bozden:
Hallo @grotzbot, actually there is currently no hard limit on recording time. The 15-second suggestion is there to make transcription more feasible (in our experiments, a 1-minute recording can take 10 minutes to transcribe).

Contributors should not feel stressed about timing; it might hurt the naturalness. In the communities I help, I ask for a minimum of 15 seconds, a target of 30-60 seconds, and not to exceed 1 minute by much.

The whole project is about natural speech, and when people add more sentences to their answers, it produces more intonation and naturalness… Of course, don’t recite an encyclopedia or tell your whole life story :slight_smile:


Dear @shane.carroll and @Libra, we very much appreciate this feedback.

Some of these points are already in the plans, some are new to us - and that is great!

As you know, the project is in its beta phase; the first dataset release is in 10 days, and feedback from the first dataset users will be valuable.

There are already pre-defined steps being implemented one by one; the team is logging these issues and feature requests, and they will be prioritized for implementation.

Currently we have many upcoming projects on our hands, including the Code Switching upgrade to Spontaneous Speech (which you mentioned), which is in an alpha phase. Since the project is looking for natural speech patterns, it does not put hard limits on people’s speech - they might mix in some words from other languages - but “real” code-switching will be handled in that extension.

I’m sure the team will analyze these, and give more satisfying answers and solutions…

Again, thank you for all of this feedback.

Shane, I wanted to thank you so much for this feedback. I’ve shared this with the engineering team and have booked time to go through this in more detail with our UX expert.

Having speakers transcribe their own data is something we had looked to avoid, to try and add extra layers of validation by having different people looking at the clips and transcriptions at different stages, but I’ll bring this back to the team to discuss. This is so helpful and thank you for taking the time!

Thanks for the answer, but my post contained not only feedback and proposals but also questions :slight_smile: