SPS: Questions and Problems in the English Dataset

As far as I can see, the Common Voice Spontaneous Speech V3 datasets have been released. Out of curiosity, I downloaded the English dataset and found many strange things that I want to mention in this topic.

Clips in other languages in the English dataset

This is the one I knew about even before downloading. It is really hard to contribute in the “Transcribe” tab when I need to report almost every audio clip because it is in another language.

For some reason, there are a great many audio clips in other languages in the English dataset, and the guidelines give no clear explanation of how to handle this case. Because of that, people behave differently. I have seen at least three approaches:

  • Reports. Some users (including me now) report audio in other languages, because it does not belong in this dataset. Since this is available as a report reason, I believe it is the right choice. But, again, because it is not documented anywhere, other users behave differently.
Illustrations for 'Reports' option
  • Using [other language]/[language name] special tags.
Illustrations for language special tags
  • Writing a transcription in the language of the audio, even if it is not the dataset's language.
Illustrations for transcription in the non-dataset language option

I think the guidelines should define which practice to use.

All table records that don't have any English letter in transcriptions
client_id	audio_id	audio_file	duration_ms	prompt_id	prompt	transcription	votes	age	gender	accents	variant	language	prompt_upvotes	prompt_reports	is_edited	split	char_per_sec	quality_tags
8c562d3b5077db1fafe584ad21c353d2	76492	spontaneous-speech-en-76492.mp3	1836	8187	Which language’s recordings do you enjoy reviewing the most, and why?	あ	1	twenties	female_feminine			English	1	0	0	train	0.544662309368192	short-audio|transcription-length
93f5a435221b9c28adee627fbd21ce66	76545	spontaneous-speech-en-76545.mp3	2448	5198	Do you exercise often?	あ	1		female_feminine			English	0	0	0	train	0.408496732026144	transcription-length
25b6abbc19fbdbd74ce904b3bfc1b9a3	76782	spontaneous-speech-en-76782.mp3	1980	6311	Why do you use common voice?	わかりません。	1		female_feminine			English	1	0	0	dev	3.53535353535354	short-audio
25b6abbc19fbdbd74ce904b3bfc1b9a3	76784	spontaneous-speech-en-76784.mp3	2556	6666	How do you choose climbing locations?	気分です	1		female_feminine			English	1	0	0	dev	1.56494522691706	mixed-script-words|transcription-length
25b6abbc19fbdbd74ce904b3bfc1b9a3	76787	spontaneous-speech-en-76787.mp3	3168	135	Has the weather changed in your country over the last 10 years?	とても、熱くなりました	1		female_feminine			English	0	0	0	dev	3.47222222222222	mixed-script-words
02d32c193f011d23485868e7acba2be5	76820	spontaneous-speech-en-76820.mp3	4608	7216	What’s the weirdest place you have had a meal?	今まで食べた中で最も奇妙な食事場所はどこですか。	1	twenties	intersex			English	1	0	0	test	5.20833333333333	mixed-script-words
02d32c193f011d23485868e7acba2be5	76825	spontaneous-speech-en-76825.mp3	3420	6297	Are there any foods you hate to eat?	食べるのを嫌いな食べ物はありますか。	1	twenties	intersex			English	1	0	0	test	5.26315789473684	mixed-script-words
02d32c193f011d23485868e7acba2be5	76841	spontaneous-speech-en-76841.mp3	5148	169	How do people vote in your country?	あなたの国では人々はどのように投票するのでしょうか。	1	twenties	intersex			English	0	0	0	test	5.05050505050505	mixed-script-words
02d32c193f011d23485868e7acba2be5	76850	spontaneous-speech-en-76850.mp3	3168	126	What kinds of plants or flowers have you seen this week?	今週はどんな植物や花を見ましたか。	1	twenties	intersex			English	0	0	0	test	5.36616161616162	mixed-script-words
02d32c193f011d23485868e7acba2be5	76862	spontaneous-speech-en-76862.mp3	4500	168	What was your favourite subject in school and why?	学校で一番好きだった科目はなんですか またその理由はなんですか	1	twenties	intersex			English	0	0	0	test	6.66666666666667	mixed-script-words
02d32c193f011d23485868e7acba2be5	76863	spontaneous-speech-en-76863.mp3	3636	6311	Why do you use common voice?	なぜ共通音声を使用するのですか	0	twenties	intersex			English	1	0	0		4.12541254125412	mixed-script-words
02d32c193f011d23485868e7acba2be5	76878	spontaneous-speech-en-76878.mp3	3096	6685	What are your thoughts on indoor smoking?	屋内の喫煙についてどう思いますか。	1	twenties	intersex			English	1	0	0	test	5.49095607235142	mixed-script-words
02d32c193f011d23485868e7acba2be5	76916	spontaneous-speech-en-76916.mp3	3960	132	What seasons or weather variations do you have in your country?	あなたの国にはどんな季節や天候の変化がありますか。	1	twenties	intersex			English	0	0	0	test	6.31313131313131	mixed-script-words
f97bbf62128acbca0578b0e0b4efddaa	77044	spontaneous-speech-en-77044.mp3	3060	8838	When you have a cold, what is your go-to remedy for feeling better?	よく寝ることです。	1	twenties	female_feminine			English	1	0	0	train	2.94117647058824	transcription-length|mixed-script-words
9bee29873139997442278775acae3634	77122	spontaneous-speech-en-77122.mp3	2880	6376	What do you value most?	しょうたくん。	1	twenties	female_feminine			English	1	0	0	train	2.43055555555556	transcription-length
dc21763baa0cd20c370e10d4e8ee1613	77129	spontaneous-speech-en-77129.mp3	5400	7239	for voice cloning inside parents home.	両親の家の中での音声クローン作成用	0	twenties				English	1	0	1		3.14814814814815	mixed-script-words
3a645782468d201226d2c14185a8c8c9	77181	spontaneous-speech-en-77181.mp3	5760	7215	What language would you want to learn, if you had the time?	時間があったら 学びたいです	1	twenties				English	1	0	0	dev	2.25694444444444	transcription-length|mixed-script-words
523b94bed5eb57a88074ad5240b1adf5	77259	spontaneous-speech-en-77259.mp3	2664	6874	How many cats do you want?	猫は一匹ほしいです	0	twenties	female_feminine			English	1	0	0		3.37837837837838	mixed-script-words
523b94bed5eb57a88074ad5240b1adf5	77266	spontaneous-speech-en-77266.mp3	1692	193	How do you feel about participating in team sports versus individual sports?	良いと思う	1	twenties	female_feminine			English	0	0	0	train	2.95508274231678	transcription-length|short-audio|mixed-script-words
523b94bed5eb57a88074ad5240b1adf5	77272	spontaneous-speech-en-77272.mp3	4176	8187	Which language’s recordings do you enjoy reviewing the most, and why?	英語を録音するのが好きです	0	twenties	female_feminine			English	1	0	0		3.11302681992337	mixed-script-words
4c5826d54d465d51a918645c2c9b9450	77345	spontaneous-speech-en-77345.mp3	2736	9266	大学で何を学んでいますか?	社会学	0	teens	female_feminine			English	1	1	0		1.09649122807018	transcription-length
3d5b3299fbc0ec35ef6d137944f9112a	77467	spontaneous-speech-en-77467.mp3	9756	132	What seasons or weather variations do you have in your country?	自分の国には四つの国の天気の変化があります。春夏秋冬があります。	1	twenties				English	0	0	0	test	3.280032800328	mixed-script-words
a53e186525adb393e49d4364a56bbf69	77472	spontaneous-speech-en-77472.mp3	1944	6368	What are the assumptions you make about people?	バカ	1	twenties				English	1	0	0	dev	1.02880658436214	short-audio|transcription-length
00a1abf4c4bd78a254176b6f91d10972	77633	spontaneous-speech-en-77633.mp3	2520	5349	Describe your favorite comfort food.	説明してください	0	teens				English	0	0	0		3.17460317460317	mixed-script-words
8d4aabc5e44eedcd0226c3238532ec49	77885	spontaneous-speech-en-77885.mp3	1548	6305	What is your favorite video game?	エッチ	1	twenties	female_feminine			English	1	0	0	test	1.93798449612403	short-audio|transcription-length
8d4aabc5e44eedcd0226c3238532ec49	77995	spontaneous-speech-en-77995.mp3	2340	6329	What do you believe about god?	分からない	0	twenties	female_feminine			English	1	0	0		2.13675213675214	mixed-script-words|transcription-length
f80e04bc95527e8d1dee5868e9597c08	78148	spontaneous-speech-en-78148.mp3	1980	233	What is your favorite local radio program?	素のまんま	0	twenties	female_feminine			English	0	0	0		2.52525252525253	transcription-length|short-audio|mixed-script-words
657984e2becb4df020e32de96ba7434c	78358	spontaneous-speech-en-78358.mp3	1728	6374	What is your philosophy of life based on?	انګرېزي	1	twenties				English	1	0	0	train	4.05092592592593	short-audio
6669d9e9df207cfff13b04a81a034947	78391	spontaneous-speech-en-78391.mp3	1836	9269	Do you have a hobby? Why is this hobby worth the effort? Would you recommend it to others?	スポーツをすること	0	sixties	intersex			English	1	0	0		4.90196078431372	short-audio|mixed-script-words
d0c90c765ae4575d52230d175d95d1ca	78526	spontaneous-speech-en-78526.mp3	2880	167	How did you learn the languages that you speak?	日常生活で	0	thirties	female_feminine			English	0	0	0		1.73611111111111	mixed-script-words|transcription-length
76d5a99607ec39eb40049d6e47c0da92	78654	spontaneous-speech-en-78654.mp3	2520	6329	What do you believe about god?	あなたは神について何を知っていますか	0	teens				English	1	0	0		7.14285714285714	mixed-script-words
76d5a99607ec39eb40049d6e47c0da92	78673	spontaneous-speech-en-78673.mp3	3240	7358	What is one bad quality about yourself?	ތަކުރާރުކޮށް އެކި ކުށްތައް ކުރާ 1،000 މީހުން މާލެތެރެއަށް ދޫވެގެން އުޅޭ މައްސަލާގައި ހައްލު ހޯދަން، އެ މީހުން އެއްފަރާތްކުރަން ޕްރޮސިކިއުޓާ ޖެނެރަލް (ޕީޖީ)އާ އެކު މަޝްވަރާ ކުރަމުން އަންނަ ކަމަށް ފުލުހުން މިއަދު ބުނެފި އެވެ.	0	teens				English	1	0	0		60.8024691358025	speech-rate
ad50474e224ccf111785341d32cbd1e3	79643	spontaneous-speech-en-79643.mp3	31068	199	What do you wish you could use technology for?	زه خوشحالیږم چې ټلیفون او کمپیوټر ایفیډ ګوشکۍ قلم دا وکاروم  او دي نه ښه استفاده وکړم سره له ګوشکیانو سره او ټکنالوژي خو زیاته لوړه ده ووس دي موټرونو ټکنالوژی خوف لري سې ټول اتوماتیک دي همدا ډول باسیکلونه موټر سایکلونه ټکنالوژی خو یو  غټ نعمت ده البته دا ټول نعمتونه په استفاده خوشحالیږم سې په مثبت اړخ یې ګټه ورنه واخلم نوي دونیا دي ټکنالوژی ده	1	twenties		United States English		English	0	0	0	test	8.88373889532638	
5f49b5068910e71c1669f621c1384fa6	79692	spontaneous-speech-en-79692.mp3	6840	7214	If there’s one thing you could say to the leader of your country, what would it be?	リーダーシップを発揮してほしいです。 ただ、人の意見をしっかり聞いてほしいです	1	fifties				English	1	0	0	train	5.55555555555556	mixed-script-words
5f49b5068910e71c1669f621c1384fa6	79715	spontaneous-speech-en-79715.mp3	2808	5194	Do you like to work?	内容によります	0	fifties				English	0	0	0		2.49287749287749	mixed-script-words|transcription-length
6657d468e3b5e7833c5c313f63f1864f	80032	spontaneous-speech-en-80032.mp3	7920	134	What do you think about climate change?	気候変動についてどう思いますか これは温暖化の影響があると思います	0	fifties	female_feminine			English	0	0	0		4.04040404040404	mixed-script-words
d0da4607333af18f84f92d15ff000832	80622	spontaneous-speech-en-80622.mp3	3996	6672	What safety precautions do you take during water activities?	ライフジャケットの着用	1	teens				English	1	0	0	train	2.75275275275275	mixed-script-words|transcription-length
646ff82677d0cf57b53f50b6bf4f380d	80733	spontaneous-speech-en-80733.mp3	4716	7216	What’s the weirdest place you have had a meal?	今まで一番変わった食事をした場所はどこですか	0	fifties	female_feminine			English	1	0	0		4.66497031382528	mixed-script-words
8d24ca31d519723d2953adcecd199961	80924	spontaneous-speech-en-80924.mp3	2736	9388	What genre of book it that?	方角	0	twenties				English	1	0	0		0.730994152046784	transcription-length
8d24ca31d519723d2953adcecd199961	80939	spontaneous-speech-en-80939.mp3	3348	9420	Do you know what colour is Henry the fourth's white horse?	しるんじゃねえのかよ	0	twenties				English	1	0	1		2.9868578255675	transcription-length
8d24ca31d519723d2953adcecd199961	80963	spontaneous-speech-en-80963.mp3	7776	6652	What safety precautions are unique to adventure activities?	点検をしっかり	0	twenties				English	1	0	0		0.900205761316872	mixed-script-words|transcription-length
8d24ca31d519723d2953adcecd199961	80975	spontaneous-speech-en-80975.mp3	1260	6381	What do you do if someone farts nearby?	俺がした	1	twenties				English	1	0	0	train	3.17460317460317	short-audio|mixed-script-words
cdcb810c14b61f73d0ee10efb0519265	81554	spontaneous-speech-en-81554.mp3	2772	6308	What is your favorite flower?	サクラです	1		female_feminine			English	1	0	0	train	1.8037518037518	mixed-script-words|transcription-length
56ca7adde571980e45e39e05d9f5e2b5	81581	spontaneous-speech-en-81581.mp3	3096	7218	What is the most beautiful language?	日本語	0	fourties				English	1	0	0		0.968992248062015	transcription-length
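
A filter like the one behind this table can be sketched in Python. This is only an illustration, not the query that produced the table above: the function name `rows_without_english_letters` and the tiny inline sample are assumptions, and "English letter" is taken to mean the ASCII range A-Z/a-z.

```python
import csv
import io
import re

# Matches any basic Latin letter; a value "has no English letter" if this never matches.
ASCII_LETTER = re.compile(r"[A-Za-z]")

def rows_without_english_letters(tsv_text, column="transcription"):
    """Return rows whose `column` is non-empty and contains no A-Z/a-z letter.

    Hypothetical helper: expects tab-separated text with a header row.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [
        row for row in reader
        if row.get(column) and not ASCII_LETTER.search(row[column])
    ]

# Reduced sample with only three of the real columns.
sample = (
    "audio_id\tprompt\ttranscription\n"
    "76545\tDo you exercise often?\tあ\n"
    "1\tWhat are you proud of?\tI'm really proud of my wife\n"
    "80028\t大学で何を学んでいますか?\t[silence]\n"
)

print([row["audio_id"] for row in rows_without_english_letters(sample)])
# → ['76545']
```

Passing `column="prompt"` applies the same check to prompts. Note that a transcription consisting only of a tag such as `[silence]` still contains ASCII letters, so it is not flagged by this check.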

Approved prompts in the English dataset with content in another language

Here is the list. Using an SQLite query, I looked only for those that don’t include any English letters at all:

See the separate message below, because I hit the Discourse text limit for a single post.

I think prompts in which more than some percentage of the letters (I don't know, >50%-60% for instance) are not used in the chosen dataset language should be rejected automatically, even without a human check.
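
One way such an automatic check could look, as a rough sketch only: the function names, the 0.5 threshold, and the idea of detecting the script through Unicode character names are all assumptions, and this version only covers languages written in the Latin script.

```python
import unicodedata

def non_latin_letter_ratio(text):
    """Fraction of alphabetic characters that are not Latin-script letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode names of Latin letters contain "LATIN" (e.g. "LATIN SMALL LETTER A").
    non_latin = sum(
        1 for ch in letters if "LATIN" not in unicodedata.name(ch, "")
    )
    return non_latin / len(letters)

def should_auto_reject(prompt, threshold=0.5):
    """Hypothetical rule: reject when too many letters are outside the Latin script."""
    return non_latin_letter_ratio(prompt) > threshold

print(should_auto_reject("なんで生きているの"))        # → True
print(should_auto_reject("Do you exercise often?"))  # → False
```

A real implementation would need a per-language alphabet or script definition rather than a single Latin check, and the threshold would have to be tuned so that prompts that legitimately quote a foreign word are not rejected.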

Transcriptions that use angle brackets instead of square brackets

Table records with transcriptions that use <special tag> instead of [special tag] syntax
client_id	audio_id	audio_file	duration_ms	prompt_id	prompt	transcription	votes	age	gender	accents	variant	language	prompt_upvotes	prompt_reports	is_edited	split	char_per_sec	quality_tags
3cf166c6b73f030b4f67eeaeba301103	1	spontaneous-speech-en-1.mp3	10044	152	How do you take care of your body and health?	I workout a lot, running, stretching, and taking care of myself through <disfluency> a healthy diet	1					English	0	0	0	test	8.36320191158901	
3cf166c6b73f030b4f67eeaeba301103	2	spontaneous-speech-en-2.mp3	11880	142	What are you proud of?	I'm really proud of my wife, she's quite a fantastic person; <disfluency> inspiring, and just <disfluency> full of a really positive a good energy	1					English	0	0	1	test	10.3535353535354	
12b668a1ada1828ba795332f419d4ef7	7	spontaneous-speech-en-7.mp3	10584	188	Which sports do you think are fun to watch?	I really enjoy watching weightlifting like the Olympics and specifically <disfluency> hang cleans.	1					English	0	0	1	test	8.1254724111867	
1e65040d77567934e4ffed55c656a3cc	12	spontaneous-speech-en-12.mp3	12492	212	How do you try and save money in your family?	With a spreadsheet and budgeting for specific items like gas, food, entertainment, rent, <disfluency> insurance so many other things	1					English	0	0	0	train	9.12584053794428	
b476828992f393a09339cf6270d30aa8	15	spontaneous-speech-en-15.mp3	17748	204	What technology do you worry about?	I worry about any technology that is collecting more information about me than it needs to provide the service that I am currently using <disfluency> and I worry about technology that use ups people from creative roles	1					English	0	0	0	train	10.254676583277	
b476828992f393a09339cf6270d30aa8	18	spontaneous-speech-en-18.mp3	18720	202	Tell me about the time you got your first phone.	I got my first phone when I was 10 because my dad was out a lot <disfluency> working, he was a single dad.  And so, he got me a phone so that I could call him if I needed to when me and my little brother were home alone or walking home from school alone.	1					English	0	0	0	train	10.6303418803419	
3b782fa395459a50d498b04c3ed93093	47	spontaneous-speech-en-47.mp3	22032	221	How do banks keep your money safe?	<disfluency> they use security protocols like making sure that <disfluency> I know what my last few transactions were or that I remember passwords or secret answers; <disfluency> my current bank sometimes scans my face through my phone <disfluency> in order to make sure that I am who I say I am	1	thirties	non-binary			English	0	0	1	train	11.1201888162672	
3b782fa395459a50d498b04c3ed93093	49	spontaneous-speech-en-49.mp3	19008	137	What advice would you give a friend who is away from home and missing their family?	I would tell them to actually try and spend as little time as possible communicating with their family <disfluency> and try and go cold turkey [um] and really focus on <disfluency> immersing themselves in the place where they are, finding new friends, new activities.	1	thirties	non-binary			English	0	0	1	train	11.7845117845118	
344cd7b3f46503a43c5da437ee031ce4	124	spontaneous-speech-en-124.mp3	38520	169	How do people vote in your country?	Um in my, <noise> <disfluency> let me show you first and then I will talk about that so so I might say something like every four years, everyone in the country above the age of eighteen can vote they can either vote by post or they will turn up to their to their constituent polling station and then they vote there <disfluency> and then at the end at the end everyone votes they count the votes and it first passes the <unclear> and then the <unclear> the king will invite the the winning politicians party to form a government and you can see I have now spoken for thirty-five seconds so and okay fine	1					English	0	0	0	test	12.7206645898235	
851a915d9143a0a0a20344fc13d674ea	127	spontaneous-speech-en-127.mp3	36000	146	Does your family tell stories about their history?	<disfluency> well, so my family sometimes tells stories about the history maybe in particular <disfluency> my my mother's side <unclear> likes to tell stories about the history, especially during <disfluency> for example Chinese New Year or during someone's birthday or during an anniversary of my grandparent's death or <disfluency> something is coming up <disfluency> in the next few months, in the next few in the next few days <disfluency> which is a <unclear> festival so we might visit the the ancestor's tombs and that's a good place to tell stories and then we tell stories and also we will enjoy some snacks together <unclear> oh my goodness I talk too much already	1					English	0	0	0	train	15.6388888888889	
66a9c834fdd45fe172084eec0ff41c01	1917	spontaneous-speech-en-1917.mp3	21600	198	Do people of different ages use technology differently?	I think that younger people use technology more I see my children trying to find out answers to questions online where I would previously have consulted a book <disfluency> they also seem to want to stay in touch with their friends more through intermediaries like pictures and videos rather than talking to them directly where as my mom still calls me all the time.	1					English	0	0	0	train	14.0277777777778	
5efffd79ee6208229036f4ecaa306075	2069	spontaneous-speech-en-2069.mp3	16200	144	What wishes do you have for your children?	I hope that my children grow up healthy that they appreciate all the ways in which they are lucky <disfluency> and that they live lives that are full of joy.  I also really hope that they take what they have been given seriously and work very hard.	1	thirties	female_feminine			English	0	0	0	train	12.4074074074074	
81d64edb7df3bfb27007e4d752266039	6823	spontaneous-speech-en-6823.mp3	17856	152	How do you take care of your body and health?	<disfluency> I take care of my body and health by eating lots of vegetables not eating too much meat, not <disfluency> many sweet things, I don't drink too many sweet drinks, and <disfluency> I exercise as much as possible and that mostly means carrying my baby around everywhere because he is very heavy.	1					English	0	0	1	train	14.1689068100358	
cec36a169af48b14e81bdba2b8ca7dd1	8052	spontaneous-speech-en-8052.mp3	19764	202	Tell me about the time you got your first phone.	I think I got my first phone when I was thirteen or fourteen years old. I know that these days kids get them even younger, maybe even when they are ten.  It was not a smart phone.  It was a Nokia three three one o It was very heavy <disfluency> we used to say that it's like a brick. I mostly used it to call my parents and to play snake on it.	1					English	0	0	1	dev	13.6612021857923	
7d1fdfc85a38a4ea03ea03f85f4710ac	8087	spontaneous-speech-en-8087.mp3	20520	192	Describe a creative activity or project that you've ever worked on	My favorite creative project that I've ever worked on <am> was when I started my Ph.D., which doesn't sound very creative because it is in informatics but actually it is really fun because I get to think about questions I would like to ask younger people and children which my work doesn't allow	1					English	0	0	1	train	11.8421052631579	
a9fd17921ce5aea9c300189e7b4373d8	20204	spontaneous-speech-en-20204.mp3	17388	153	How do you relax and keep your mind healthy?	<disfluency> I have two children under the age of five so I do almost nothing to relax at all <disfluency> but the thing that I try and do is I try and read a book sometime before bed to help me go to sleep <disfluency> and I always feel better if I manage to get a walk in during the day	1					English	0	0	0	train	13.1124913733609	
5774e3cb76e3ad4370fba8ad982324c8	20215	spontaneous-speech-en-20215.mp3	26136	211	Does your family have a strict budget?	A strict budget for a family--I don't have <unclear> strict budget for a family, we just spend what is necessary and <disfluency> we don't overspend, we don't spend <unclear>  just buy what are needed	1	seventies	male_masculine			English	0	0	1	test	6.35139271502908	
5774e3cb76e3ad4370fba8ad982324c8	20219	spontaneous-speech-en-20219.mp3	38520	227	Describe your favourite movie, TV show, or play?	I don't have any favorite <unclear> movie, TV shows or play ... but <disfluency> when I switch on TV I prefer to watch <disfluency> documentary films, or <disfluency> series on traveling, travel <disfluency> places, something like <unclear>	1	seventies	male_masculine			English	0	0	1	test	5.29595015576324	
5774e3cb76e3ad4370fba8ad982324c8	20224	spontaneous-speech-en-20224.mp3	10260	229	Who is your favourite musician, singer, or songwriter?	No I don't have any particular or favourite <unclear> singers or songwriter	1	seventies	male_masculine			English	0	0	1	test	6.23781676413255	
5774e3cb76e3ad4370fba8ad982324c8	20237	spontaneous-speech-en-20237.mp3	35028	216	What kinds of businesses do you wish your community had more of?	<silence> the place where I came from currently <disfluency> they are focusing on the <disfluency> oil palm plantation <disfluency> which they could get <disfluency> more out of the <disfluency> oil palm	1	seventies	male_masculine			English	0	0	1	test	4.93890601804271	
5774e3cb76e3ad4370fba8ad982324c8	20238	spontaneous-speech-en-20238.mp3	14940	146	Does your family tell stories about their history?	<noise> unfortunately my family <silence> don't tell much about the history or history	1	seventies	male_masculine			English	0	0	1	test	4.95314591700134	
5774e3cb76e3ad4370fba8ad982324c8	20241	spontaneous-speech-en-20241.mp3	23436	133	Describe any plants you have grown or harvested	yeah, I like <disfluency> to plant some vegetables, I did plant veggie <disfluency> plants like okra, <disfluency> long beans and sweet potatoes, et cetera	1	seventies	male_masculine			English	0	0	1	test	5.63236047107015	
5774e3cb76e3ad4370fba8ad982324c8	20262	spontaneous-speech-en-20262.mp3	17496	231	Describe a visit to a cinema in your country	<noise> <unclear> it was a long long time ago but it was <unclear> experience	1	seventies	male_masculine			English	0	0	0	test	3.65797896662094	
5774e3cb76e3ad4370fba8ad982324c8	20268	spontaneous-speech-en-20268.mp3	7668	185	How do you feel about long walks? Why?	<noise> I would love to have long walks	1	seventies	male_masculine			English	0	0	0	test	4.17318727177882	
5774e3cb76e3ad4370fba8ad982324c8	20271	spontaneous-speech-en-20271.mp3	8928	125	What kinds of pets are common in your community?	cats <unclear> pets in Malaysia	1	seventies	male_masculine			English	0	0	0	test	3.0241935483871	

Was this syntax used while testing Spontaneous Speech? If so, why weren’t these transcriptions changed after the move to the ‘[special tag]’ syntax? Or are these user mistakes?
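
If these are simply leftovers of an older syntax, converting them would be mechanical. A minimal sketch, assuming that only the tag names seen in the table above should be converted (the `KNOWN_TAGS` list and the function name are my assumptions, not an official tag list):

```python
import re

# Tag names taken from the records above; anything else is left untouched.
KNOWN_TAGS = ("disfluency", "noise", "unclear", "silence")
ANGLE_TAG = re.compile(r"<(" + "|".join(KNOWN_TAGS) + r")>")

def normalize_tags(transcription):
    """Rewrite <tag> as [tag] for the known tag names only."""
    return ANGLE_TAG.sub(r"[\1]", transcription)

print(normalize_tags("<noise> I would love to have long walks"))
# → [noise] I would love to have long walks
```

Using an allowlist rather than converting every `<...>` keeps unexpected values such as `<am>` visible for manual review instead of silently turning them into something that looks like a valid tag.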

Not all table fields are described on the dataset page

Comparison Table
Table	Dataset Page on Mozilla Collective	Both
client_id	client_id - hashed UUID of a given user	:white_check_mark:
audio_id	audio_id - numeric id for audio file	:white_check_mark:
audio_file	audio_file - audio file name	:white_check_mark:
duration_ms	duration_ms - duration of audio in milliseconds	:white_check_mark:
prompt_id	prompt_id - numeric id for prompt	:white_check_mark:
prompt	prompt - question for user	:white_check_mark:
transcription	transcription - transcription of the audio response	:white_check_mark:
votes	votes - number of people that who approved a given transcript	:white_check_mark:
age	age - age of the speaker1	:white_check_mark:
gender	gender - gender of the speaker1	:white_check_mark:
accents		:cross_mark:
variant		:cross_mark:
language	language - language name	:white_check_mark:
prompt_upvotes		:cross_mark:
prompt_reports		:cross_mark:
is_edited		:cross_mark:
split	split - for data modelling, which subset of the data does this clip pertain to	:white_check_mark:
char_per_sec	char_per_sec - how many characters of transcription per second of audio	:white_check_mark:
quality_tags	quality_tags - some automated assessment of the transcription–audio pair, separated by |	:white_check_mark:
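
The same comparison can be repeated for any future release with a few lines of Python. This is a sketch: `DOCUMENTED` is copied by hand from the dataset page as listed above, and `undocumented_columns` is a hypothetical helper name.

```python
# Field names described on the dataset page (copied from the comparison above).
DOCUMENTED = {
    "client_id", "audio_id", "audio_file", "duration_ms", "prompt_id",
    "prompt", "transcription", "votes", "age", "gender", "language",
    "split", "char_per_sec", "quality_tags",
}

def undocumented_columns(header_line):
    """Return the TSV header columns that have no description on the dataset page."""
    columns = header_line.rstrip("\n").split("\t")
    return [name for name in columns if name not in DOCUMENTED]

header = (
    "client_id\taudio_id\taudio_file\tduration_ms\tprompt_id\tprompt\t"
    "transcription\tvotes\tage\tgender\taccents\tvariant\tlanguage\t"
    "prompt_upvotes\tprompt_reports\tis_edited\tsplit\tchar_per_sec\tquality_tags"
)

print(undocumented_columns(header))
# → ['accents', 'variant', 'prompt_upvotes', 'prompt_reports', 'is_edited']
```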

Quality and Special Tags

In addition, what happened to the other ‘quality tags’ and the [special tags]? They exist in the table values (at least mixed-script-words does), but they are no longer mentioned on the dataset page. Are they deprecated now? If so, what is the reason, and why have they not been deleted from the data? And why are the [special tags] not mentioned at all now?
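
To see which quality tags actually occur in the data, the `quality_tags` column can be split on `|` and counted. A small sketch, with `count_quality_tags` as a hypothetical helper and the sample values taken from the records above:

```python
from collections import Counter

def count_quality_tags(values):
    """Count individual tags in '|'-separated quality_tags values."""
    counts = Counter()
    for value in values:
        # Empty strings (rows without tags) contribute nothing.
        counts.update(tag for tag in value.split("|") if tag)
    return counts

sample_values = [
    "short-audio|transcription-length",
    "mixed-script-words|transcription-length",
    "speech-rate",
    "",
]

for tag, count in sorted(count_quality_tags(sample_values).items()):
    print(tag, count)
```

Running this over the full column would give the complete list of tags present in the release, which could then be compared with what the dataset page documents.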

All table records with prompts without any English letter
client_id	audio_id	audio_file	duration_ms	prompt_id	prompt	transcription	votes	age	gender	accents	variant	language	prompt_upvotes	prompt_reports	is_edited	split	char_per_sec	quality_tags
8361028fcdc492c06e20197577636a46	74478	spontaneous-speech-en-74478.mp3	6984	8929	عام وکړي		0	twenties	female_feminine			English	1	1	0			
bf629861b5d1c513e52ac9d47f39b480	75749	spontaneous-speech-en-75749.mp3	2448	9263	なんで生きているの		0	sixties		European English|French improved|Some say Pakistani|international accent		English	1	0	0			
bf629861b5d1c513e52ac9d47f39b480	75920	spontaneous-speech-en-75920.mp3	3636	9264	バレーボールが好きです。		0	sixties		European English|French improved|Some say Pakistani|international accent		English	1	1	0			
73f01ed66ad8307cb8da35e9aa149d6a	76205	spontaneous-speech-en-76205.mp3	2808	9264	バレーボールが好きです。		0	twenties				English	1	1	0			
06936db04243d8a57c5cc90e3c5f45e7	76241	spontaneous-speech-en-76241.mp3	6768	9263	なんで生きているの		0	twenties				English	1	0	0			
95e880ea490c9c894be09e422e606859	76337	spontaneous-speech-en-76337.mp3	3312	9265	日本語は外国人にとって難しいです。		0	twenties	female_feminine			English	1	1	0			
3b3461a14621a27a500846bb7c6b802a	76396	spontaneous-speech-en-76396.mp3	2916	9264	バレーボールが好きです。		0	twenties				English	1	1	0			
8c562d3b5077db1fafe584ad21c353d2	76491	spontaneous-speech-en-76491.mp3	900	9263	なんで生きているの		0	twenties	female_feminine			English	1	0	0			short-audio
f75cacc81c989b9caa2fb4720bf8f309	76590	spontaneous-speech-en-76590.mp3	5328	9265	日本語は外国人にとって難しいです。		0	twenties	do_not_wish_to_say			English	1	1	0			
e4e8281d2890cc84ec6b8c4ab2556e55	76702	spontaneous-speech-en-76702.mp3	7164	9265	日本語は外国人にとって難しいです。		0	twenties				English	1	1	0			
02d32c193f011d23485868e7acba2be5	76827	spontaneous-speech-en-76827.mp3	3420	9265	日本語は外国人にとって難しいです。		0	twenties	intersex			English	1	1	0			
02d32c193f011d23485868e7acba2be5	76888	spontaneous-speech-en-76888.mp3	2016	9264	バレーボールが好きです。		0	twenties	intersex			English	1	1	0			
3140741074e360591b97c50ce6d3e742	77012	spontaneous-speech-en-77012.mp3	3060	9264	バレーボールが好きです。		0					English	1	1	0			
3140741074e360591b97c50ce6d3e742	77024	spontaneous-speech-en-77024.mp3	3816	9265	日本語は外国人にとって難しいです。		0					English	1	1	0			
f97bbf62128acbca0578b0e0b4efddaa	77048	spontaneous-speech-en-77048.mp3	6516	9265	日本語は外国人にとって難しいです。		0	twenties	female_feminine			English	1	1	0			
5e12a73c58060fe0300ba70b23756e2a	77166	spontaneous-speech-en-77166.mp3	3240	9265	日本語は外国人にとって難しいです。		0	twenties	female_feminine			English	1	1	0			
5e12a73c58060fe0300ba70b23756e2a	77171	spontaneous-speech-en-77171.mp3	4500	9264	バレーボールが好きです。		0	twenties	female_feminine			English	1	1	0			
b869dd668d2d71a3b036334d7cbe0541	77194	spontaneous-speech-en-77194.mp3	4140	9263	なんで生きているの		0	teens				English	1	0	0			
67d9e934bf1cf8055bafd7adb09515f4	77223	spontaneous-speech-en-77223.mp3	5868	9264	バレーボールが好きです。		0	teens	female_feminine			English	1	1	0			
67d9e934bf1cf8055bafd7adb09515f4	77232	spontaneous-speech-en-77232.mp3	2268	9263	なんで生きているの		0	teens	female_feminine			English	1	0	0			
9ee6fb4a3781fb86904930ec45f77c6b	77334	spontaneous-speech-en-77334.mp3	2916	9263	なんで生きているの		0	twenties				English	1	0	0			
4c5826d54d465d51a918645c2c9b9450	77342	spontaneous-speech-en-77342.mp3	2700	9264	バレーボールが好きです。		0	teens	female_feminine			English	1	1	0			
4c5826d54d465d51a918645c2c9b9450	77345	spontaneous-speech-en-77345.mp3	2736	9266	大学で何を学んでいますか?	社会学	0	teens	female_feminine			English	1	1	0		1.09649122807018	transcription-length
4cd8e84fc90f1f42885cbaf124e05813	77427	spontaneous-speech-en-77427.mp3	4968	9265	日本語は外国人にとって難しいです。		0	twenties				English	1	1	0			
881cb6ae1b925ddba9a81233e73b99b9	77528	spontaneous-speech-en-77528.mp3	3852	9265	日本語は外国人にとって難しいです。		0	twenties				English	1	1	0			
00a1abf4c4bd78a254176b6f91d10972	77582	spontaneous-speech-en-77582.mp3	3600	9264	バレーボールが好きです。		0	teens				English	1	1	0			
3af2ab5be747757784ab846bdb90ca70	77691	spontaneous-speech-en-77691.mp3	3960	9266	大学で何を学んでいますか?		0	teens	female_feminine			English	1	1	0			
3af2ab5be747757784ab846bdb90ca70	77695	spontaneous-speech-en-77695.mp3	3060	9263	なんで生きているの		0	teens	female_feminine			English	1	0	0			
3af2ab5be747757784ab846bdb90ca70	77736	spontaneous-speech-en-77736.mp3	3348	9264	バレーボールが好きです。		0	teens	female_feminine			English	1	1	0			
edbdcf2b74201fbf1a64c4b7d0480dfe	77791	spontaneous-speech-en-77791.mp3	4968	9265	日本語は外国人にとって難しいです。		0	twenties				English	1	1	0			
3f52e83d753bfdb58dfa1bf8da1ff2fd	77834	spontaneous-speech-en-77834.mp3	1620	9263	なんで生きているの		0	twenties				English	1	0	0			short-audio
3f52e83d753bfdb58dfa1bf8da1ff2fd	77844	spontaneous-speech-en-77844.mp3	2376	9266	大学で何を学んでいますか?		0	twenties				English	1	1	0			
8d4aabc5e44eedcd0226c3238532ec49	77866	spontaneous-speech-en-77866.mp3	1368	9263	なんで生きているの		0	twenties	female_feminine			English	1	0	0			short-audio
1b54eb67011a365112f5f982daab2383	77903	spontaneous-speech-en-77903.mp3	1836	9265	日本語は外国人にとって難しいです。		0	teens				English	1	1	0			short-audio
3eceba229c1562f52e42971b58f8e77b	77907	spontaneous-speech-en-77907.mp3	2196	9266	大学で何を学んでいますか?		0	twenties				English	1	1	0			
8d4aabc5e44eedcd0226c3238532ec49	78021	spontaneous-speech-en-78021.mp3	2016	9268	好きなスポーツは何ですか?		0	twenties	female_feminine			English	1	0	0			
f80e04bc95527e8d1dee5868e9597c08	78165	spontaneous-speech-en-78165.mp3	1944	9265	日本語は外国人にとって難しいです。		0	twenties	female_feminine			English	1	1	0			short-audio
72d40f1272fcf3f1faa2e53f13fce74d	78272	spontaneous-speech-en-78272.mp3	3420	9265	日本語は外国人にとって難しいです。		0	thirties				English	1	1	0			
6669d9e9df207cfff13b04a81a034947	78387	spontaneous-speech-en-78387.mp3	3708	9263	なんで生きているの		0	sixties	intersex			English	1	0	0			
9b82dc245e7511d34b0862e45e13f914	78600	spontaneous-speech-en-78600.mp3	1188	9264	バレーボールが好きです。		0	twenties	female_feminine			English	1	1	0			short-audio
30c19ae2451fdb00cf6a24ca76e8efe3	78635	spontaneous-speech-en-78635.mp3	6480	9268	好きなスポーツは何ですか?		0	twenties	female_feminine			English	1	0	0			
30c19ae2451fdb00cf6a24ca76e8efe3	78640	spontaneous-speech-en-78640.mp3	9468	9266	大学で何を学んでいますか?		0	twenties	female_feminine			English	1	1	0			
3f1884a35e190dd9e93d746b6cbd6e6d	78706	spontaneous-speech-en-78706.mp3	3492	9265	日本語は外国人にとって難しいです。		0	twenties	female_feminine			English	1	1	0			
d30c84982e3235d004dc4c6c92f2d820	79472	spontaneous-speech-en-79472.mp3	2088	9266	大学で何を学んでいますか?		0	thirties	female_feminine			English	1	1	0			
ad50474e224ccf111785341d32cbd1e3	79644	spontaneous-speech-en-79644.mp3	9720	9263	なんで生きているの		0	twenties		United States English		English	1	0	0			
5f49b5068910e71c1669f621c1384fa6	79729	spontaneous-speech-en-79729.mp3	5400	9263	なんで生きているの		0	fifties				English	1	0	0			
203f28e5027ab49d9017121f658abcaa	80028	spontaneous-speech-en-80028.mp3	6516	9266	大学で何を学んでいますか?	[silence]	1					English	1	1	0	test	1.38121546961326	transcription-length
920a519e3cc1b4b4f327ef967a13a85d	80206	spontaneous-speech-en-80206.mp3	2088	9445	تاسی ولي زموږ مخه نیسي؟		0	twenties	female_feminine			English	1	1	0			
3a29e13b24eda367b83f34b30fae9d00	80617	spontaneous-speech-en-80617.mp3	2340	9266	大学で何を学んでいますか?		0	twenties	female_feminine			English	1	1	0			
d0da4607333af18f84f92d15ff000832	80632	spontaneous-speech-en-80632.mp3	3060	9263	なんで生きているの		0	teens				English	1	0	0			
cbe5728321e3fbeb0e140205ab5c10be	80893	spontaneous-speech-en-80893.mp3	5760	9266	大学で何を学んでいますか?		0	twenties				English	1	1	0			
8d24ca31d519723d2953adcecd199961	80932	spontaneous-speech-en-80932.mp3	2556	9263	なんで生きているの		0	twenties				English	1	0	0			
14f25e305d7cc51033fcaa2022eeaaa3	85954	spontaneous-speech-en-85954.mp3	5256	9268	好きなスポーツは何ですか?		0	sixties	do_not_wish_to_say			English	1	0	0			
646d7e2507e17edbb5beddf36f15ad70	87620	spontaneous-speech-en-87620.mp3	2916	9447	ستاسو په کلې کې د واده مراسم څنګه وي؟ په خپله لهجه یې په پنځو دقیقو کې موږ ته بیان کړئ		0	fifties	female_feminine	England English		English	1	0	0			
646d7e2507e17edbb5beddf36f15ad70	87625	spontaneous-speech-en-87625.mp3	2916	9268	好きなスポーツは何ですか?		0	fifties	female_feminine	England English		English	1	0	0			
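
For anyone who wants to reproduce a list like this, here is a minimal sketch of how such rows could be selected. It assumes the TSV uses the column names from the header above and only looks at the `transcription` column. Since the list includes a `[silence]` row, the sketch also assumes that bracketed tags are ignored before checking for letters. The function name and the inline sample rows are made up for illustration.

```python
import csv
import io
import re

# Any basic Latin letter; a deliberately crude stand-in for "English letter".
ASCII_LETTER = re.compile(r"[A-Za-z]")
# Bracketed tags such as [silence]; assumed to be ignored by the check.
TAG = re.compile(r"\[[^\]]*\]")


def rows_without_english_letters(tsv_file):
    """Yield rows whose transcription has no A-Z/a-z letter outside of tags."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        text = TAG.sub("", row.get("transcription") or "")
        if not ASCII_LETTER.search(text):
            yield row


# Tiny illustrative sample with only the columns needed here.
sample = io.StringIO(
    "audio_id\tprompt\ttranscription\n"
    "1\tDo you exercise often?\tあ\n"
    "2\tDo you exercise often?\tYes, twice a week.\n"
    "3\t好きなスポーツは何ですか?\t\n"
    "4\t大学で何を学んでいますか?\t[silence]\n"
)
selected = [row["audio_id"] for row in rows_without_english_letters(sample)]
print(selected)
```

Rows with an empty transcription are selected too, which matches the many untranscribed clips in the list.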

Hi @Libra, thank you for the report and feedback. We appreciate it.

You are right on some counts; here is some info:

  • As you might be aware, we did not release the SPS English dataset before, for the reasons you mentioned. We even ran Whisper language identification for research. In this release we removed all the non-English answers we auto-identified, by adding admin-level reports. All of these came from the early alpha contributions. But more may have been added since (at least the report stats suggest so).
  • We redesigned and reprogrammed the whole data pipeline for this release. As you might be aware, there are a lot more statistics/info in the Datasheets (what you see on the Download page). Some parts may be missing compared to earlier, and we can add them for the next release. But many of those stats are data dependent, e.g. for English, where no Variant is defined, there cannot be a variant :slight_smile: Likewise, having the prompt upvote count in the datasheet does not provide valuable info to the dataset user.
  • The whole issue is a UX problem: English is shown as the default language, and people who come from scripted speech, or who are not well informed or not reading, end up in the English dataset. We are currently redesigning the whole workflow to prevent that in the future.

Given your feedback, we decided to take the SPS English dataset down; it also does not fit our QA practices.

Then wouldn’t it be easier to delete all records from the alpha test period, if there are so many problems with them? Or are there obstacles to that, such as not having the dates when specific recordings were made, or something like that?

Then why is this info in the dataset at all? Is it hard to adapt this list to the needs of each language separately? Sorry for my stupid questions. Anyway, I think it would be useful to add at least something like “[field] is not relevant/used/defined in this language dataset”. When a field exists without any mention or explanation, it is strange and confusing.

I guessed that would be the problem. It confused me a couple of times as well.

A significant amount of the data was fine and of sufficient quality. So, given that you plan to continue cleaning it, perhaps it would be OK to publish it, maybe with a warning about possible quality issues.

Hey @Libra,

Many of the projects involved were correct, and only a portion of the data is this way, so deleting everything was a no-go. I had to listen to many of the flagged ones to find a pattern… There was none, except for those flagged by Whisper, so we decided to depend on the contributors.

But we (read “I”) made a mistake by including the SPS English dataset… It was not properly set up in our QA pipeline, e.g. `ForeignScriptCheck` was missing, and tag replacements (e.g. => `[disfluency]` conversions I told you about before) were not in place.
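
To give an idea of what a check like this does, here is a minimal sketch of a foreign-script test for an English dataset. This is not the actual `ForeignScriptCheck` from our pipeline; the function names and the "Latin only" rule are assumptions for illustration. It relies on Unicode character names, which begin with the script name for most letters.

```python
import unicodedata


def foreign_script_letters(text, allowed_prefixes=("LATIN",)):
    """Return the letters in `text` whose Unicode name lacks an allowed script prefix."""
    foreign = []
    for ch in text:
        # Only letters matter; digits, punctuation and spaces are shared across scripts.
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if not name.startswith(allowed_prefixes):
            foreign.append(ch)
    return foreign


def has_foreign_script(text):
    """True if `text` contains at least one letter outside the allowed scripts."""
    return bool(foreign_script_letters(text))


# Prompts taken from the table in the first post.
for prompt in ("Do you exercise often?", "好きなスポーツは何ですか?"):
    print(prompt, has_foreign_script(prompt))
```

A check like this only looks at the text of prompts and transcriptions. It cannot tell that an English-looking prompt was answered in another language, which is why audio language identification or listening is still needed.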

I’m already fixing the documentation-related issues you flagged; the docs will be in the cv-dataset repo (https://github.com/common-voice/cv-dataset), where they belong. But the others will need some more (manual) work…

I think we will just update the dataset file after that…

I wasted 2.5-3 hours today and reported something like 100-200 audio clips that had audio in other languages, and at some point I noticed that they are often recorded by the same users. Doesn’t that mean it should be possible to pick out recordings with the wrong audio language based on client ids (if the user with client id ‘x’ has many reports about incorrect language, then check whether all recordings with this client id can be taken down)? Or are there problems with this method? Maybe client ids were dynamic during the testing period?

Exactly this. SPS did not have any user connection to SCS, so on each visit a user got a new client_id. After listening, I did clean up 5-6 user records that stood out (300-400 audio clips), but I had to stop after that.
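
For reference, the client_id idea could be sketched roughly as below. The input format (one `(client_id, wrong_language_reported)` pair per clip) and both thresholds are hypothetical, and the result is meant as a list of candidates to listen to, not something to delete automatically. Because each visit produced a new client_id, it would only catch contributors who recorded many clips within a single visit.

```python
from collections import Counter


def suspicious_clients(clips, min_reports=5, min_share=0.5):
    """Return client_ids whose clips were mostly reported as wrong-language.

    `clips` is an iterable of (client_id, wrong_language_reported) pairs,
    one pair per audio clip.
    """
    total = Counter()
    flagged = Counter()
    for client_id, wrong_language in clips:
        total[client_id] += 1
        if wrong_language:
            flagged[client_id] += 1
    return sorted(
        cid
        for cid in total
        if flagged[cid] >= min_reports and flagged[cid] / total[cid] >= min_share
    )


# Illustrative data: "a" is mostly reported, "b" rarely, "c" has too few clips.
clips = (
    [("a", True)] * 6 + [("a", False)]
    + [("b", True)] + [("b", False)] * 9
    + [("c", True)] * 2
)
candidates = suspicious_clients(clips)
print(candidates)
```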

Thank you for your efforts… Very valuable - not a single second wasted… :star_struck: