Treatment of acronyms


While reviewing German sentences from other contributors, I found a number of sentences in which the author circumvented the acronym detection by inserting dashes (BKA -> B-K-A). I don’t think this is how we should deal with abbreviations. The How-To should be more explicit on this issue.

I must say, though, that I don’t like the strict prohibition of acronyms. In many cases (such as BKA), there really is only one way to pronounce them. Maybe we should have a peer-reviewed whitelist of acronyms?
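To make the whitelist idea concrete, here is a minimal sketch of a check that flags the dashed workaround and lets whitelisted acronyms through. It is not the actual validation code of the sentence collector; the names `ACRONYM_WHITELIST` and `find_rejected_acronyms` and the regular expressions are assumptions made up for illustration.

```python
import re

# Hypothetical whitelist; in practice this would be the peer-reviewed list.
ACRONYM_WHITELIST = {"BKA"}

# Two or more capital letters in a row, e.g. "BKA".
PLAIN_ACRONYM = re.compile(r"\b[A-ZÄÖÜ]{2,}\b")
# Single capital letters joined by dashes, e.g. "B-K-A".
DASHED_ACRONYM = re.compile(r"\b[A-ZÄÖÜ](?:-[A-ZÄÖÜ])+\b")


def find_rejected_acronyms(sentence, whitelist=ACRONYM_WHITELIST):
    """Return the acronyms in a sentence that should be rejected."""
    rejected = []
    # The dashed spelling is always flagged, since it only hides an acronym.
    for match in DASHED_ACRONYM.finditer(sentence):
        rejected.append(match.group())
    # Plain acronyms are flagged unless they are on the whitelist.
    for match in PLAIN_ACRONYM.finditer(sentence):
        if match.group() not in whitelist:
            rejected.append(match.group())
    return rejected
```

With this sketch, "Das B-K-A ermittelt." is flagged, "Das BKA ermittelt." passes, and ordinary compounds such as "E-Mail" are left alone because the letter after the dash is followed by lowercase text.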



Maybe treat them as individual letters? e.g. B K A, N S A, U S A.

Acronyms that are supposed to be pronounced like a word, e.g. NASA, could then be spelt nasa or Nasa.

This would allow the author to indicate the intended pronunciation and ensure that all speakers say it the same way.
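As a rough sketch of that convention (not an existing tool), the rewrite could look like this. The set `WORD_LIKE` and both function names are hypothetical, and the tokenization is a plain whitespace split, so a token with attached punctuation such as "NASA." is left untouched.

```python
# Hypothetical list of acronyms that are read as a word, not letter by letter.
WORD_LIKE = {"NASA"}


def mark_pronunciation(token, word_like=WORD_LIKE):
    """Rewrite one all-caps token so the spelling shows how to say it."""
    # Leave ordinary words and single capital letters alone.
    if not (token.isalpha() and token.isupper() and len(token) > 1):
        return token
    if token in word_like:
        # Pronounced as a word: "NASA" -> "Nasa".
        return token.capitalize()
    # Pronounced letter by letter: "NSA" -> "N S A".
    return " ".join(token)


def rewrite_sentence(sentence):
    """Apply mark_pronunciation to every whitespace-separated token."""
    return " ".join(mark_pronunciation(t) for t in sentence.split())
```

For example, `rewrite_sentence("The NSA and NASA disagree")` gives `"The N S A and Nasa disagree"`.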


(Rainer Rillke) #3

@nukeador Is the acronym prohibition really necessary? The engine being developed is speech-to-text, so multiple pronunciations can be mapped to a single acronym without confusing the engine, right?

eXeMeL -> XML
iXeMeL -> XML

We have a lot of recorded lectures that make extensive use of acronyms. How will the engine be able to recognize acronyms properly without being trained on them?
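The many-to-one relation in the XML example can be illustrated with a small lookup. This is only a sketch of the idea, not how a speech-to-text engine works internally (there the mapping would be learned from audio and transcript pairs); the table `SPOKEN_TO_WRITTEN` and the function `normalize_transcript` are hypothetical, and the spoken forms are the informal respellings from above.

```python
import re

# Hypothetical table: several spoken variants map to one written acronym.
SPOKEN_TO_WRITTEN = {
    "exemel": "XML",
    "ixemel": "XML",
}

# Match any spoken variant as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SPOKEN_TO_WRITTEN)) + r")\b",
    re.IGNORECASE,
)


def normalize_transcript(text):
    """Replace every known spoken variant with its written acronym."""
    return _PATTERN.sub(lambda m: SPOKEN_TO_WRITTEN[m.group(1).lower()], text)
```

Both `normalize_transcript("parse the eXeMeL file")` and `normalize_transcript("parse the iXeMeL file")` return `"parse the XML file"`, which is the point: the written target is unambiguous even when the pronunciation is not.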



I think this is a good point @rillke, as long as the data set is used for STT only. Is anybody planning to build a TTS engine from it, as well? If yes, I’d still vote for the whitelist solution.


(Rubén Martín [away until April 24th]) #5

I’ll defer to @kdavis or @josh_meyer to answer this one, but my understanding is that the engine should know how to pronounce individual letters.


(Rainer Rillke) #6

A quick glance at Google Scholar didn’t turn up anything like “Don’t use acronyms for training STT”. However, this was not an in-depth search.



@nukeador - There should be no acronyms at all.

It’s not as simple as “spelling them out” (for example “NASA” vs “CIA”). Furthermore, different people may say the same acronym differently, making the transcripts even less reliable.

This is not just an English-specific phenomenon… Russian acronyms can be really complicated.
