Treatment of acronyms

jf99 · February 3, 2019, 8:25pm

Reviewing German sentences of other contributors, I found a bunch of sentences where the author circumvented the acronym detection by inserting dashes (BKA -> B-K-A). I don’t think this is how we should deal with abbreviations. The How-To should be adapted to be more explicit on this issue.

I must say, though, that I don’t like the strict prohibition of acronyms. In many cases (such as BKA), there really is only one way to pronounce them. Maybe we should have a peer-reviewed whitelist of acronyms?
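To make the two ideas above concrete, here is a rough sketch of a check that also catches the dashed spelling (B-K-A) and lets a whitelist override the rejection. The regular expression, the `ACRONYM_WHITELIST` set, and the function name are all assumptions made for illustration; this is not the sentence collector's actual validation rule.

```python
import re

# Hypothetical whitelist; the single entry is illustrative, not a reviewed list.
ACRONYM_WHITELIST = {"BKA"}

# Two or more uppercase letters in a row, optionally separated by single
# dashes, so "B-K-A" is caught as well as "BKA". Words such as "E-Mail" or
# "U-Bahn" do not match because the run must end at a word boundary.
ACRONYM_RE = re.compile(r"\b[A-ZÄÖÜ](?:-?[A-ZÄÖÜ])+\b")


def rejected_acronyms(sentence, whitelist=ACRONYM_WHITELIST):
    """Return the acronym-like tokens in `sentence` that are not whitelisted."""
    found = []
    for match in ACRONYM_RE.finditer(sentence):
        # Compare the dash-free form, so "B-K-A" and "BKA" are treated alike.
        normalized = match.group().replace("-", "")
        if normalized not in whitelist:
            found.append(match.group())
    return found
```

With a check like this, inserting dashes would no longer change the outcome: a sentence passes only if the acronym itself is on the reviewed list.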

dabinat · February 3, 2019, 9:26pm

Maybe treat them as individual letters? e.g. B K A, N S A, U S A.

Then acronyms that are supposed to be pronounced like a word, e.g. NASA, could be spelt nasa or Nasa.

This would allow the author to indicate the intended pronunciation and ensure all speakers say it the same.
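A minimal sketch of that convention, assuming the author supplies the set of acronyms that are read as a word (the `SPOKEN_AS_WORD` set and the `respell` function are hypothetical names, not part of any Common Voice tool):

```python
import re

# Hypothetical set of acronyms that are read as a word rather than spelled out.
SPOKEN_AS_WORD = {"NASA"}

# Any run of two or more ASCII uppercase letters standing alone as a word.
UPPER_RUN = re.compile(r"\b[A-Z]{2,}\b")


def respell(sentence, spoken_as_word=SPOKEN_AS_WORD):
    """Rewrite acronyms so the spelling shows the intended pronunciation."""
    def render(match):
        acronym = match.group()
        if acronym in spoken_as_word:
            return acronym.capitalize()  # NASA -> Nasa
        return " ".join(acronym)         # BKA -> B K A
    return UPPER_RUN.sub(render, sentence)
```

For example, `respell("NASA and the NSA")` would give `"Nasa and the N S A"`.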

rillke · March 23, 2019, 11:57am

@nukeador Is acronym prohibition really necessary? The engine being developed is speech-to-text, so multiple pronunciations can be mapped to a single acronym without confusing the engine, right?

eXeMeL → XML
iXeMeL → XML

We have a lot of recorded lectures making extensive use of acronyms. How will the engine be able to properly recognize use of acronyms without training?
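The many-to-one mapping in the XML example above can be pictured as a small lookup table. This is only an illustration of the idea; the `LEXICON` dictionary and both helper functions are hypothetical, and an actual speech-to-text engine would learn such a mapping from training data rather than from a hand-written table.

```python
# Hypothetical lexicon: several spoken variants map to one written form.
LEXICON = {
    "eXeMeL": "XML",
    "iXeMeL": "XML",
}


def written_form(spoken):
    """Map a spoken variant to its written form (unknown input is unchanged)."""
    return LEXICON.get(spoken, spoken)


def spoken_variants(written):
    """List every spoken variant recorded for one written form."""
    return sorted(s for s, w in LEXICON.items() if w == written)
```

Going from speech to text, both variants resolve to the same string, so the lookup is unambiguous in that direction.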

jf99 · March 23, 2019, 3:01pm

I think this is a good point @rillke, as long as the data set is used for STT only. Is anybody planning to build a TTS engine from it, as well? If yes, I’d still vote for the whitelist solution.

nukeador · March 25, 2019, 11:41pm

I’ll defer to @kdavis or @josh_meyer to answer this one but my understanding is that the engine should know how to pronounce individual letters.

rillke · April 4, 2019, 9:28pm

A quick glance at Google Scholar didn’t show anything like “Don’t use acronyms for training STT”. However, this was not an in-depth search.

josh_meyer · April 4, 2019, 11:16pm

@nukeador - There should be no acronyms at all.

It’s not as simple as “spelling them out” (for example “NASA” vs “CIA”). Furthermore, different people may say the same acronym differently, making the transcripts even less reliable.

This is not just an English-specific phenomenon… Russian acronyms can be really complicated.
