Sentences that include groups of uppercase characters

txopi · November 29, 2018, 10:08pm

I have a question about how to prepare sentences that include groups of uppercase characters used for abbreviated terms. I’ll give examples in English, although my target language is Basque, because I think this issue is similar in many languages.

Abbreviations like FBI, BBC, KGB… are spelled out letter by letter. Should I leave them just as they are in the text ("FBI")? Or should I separate the letters with spaces ("F B I") so that post-processing recognizes them as separate letters and expects speakers to spell them out in the recordings?

Abbreviations like NATO, UNESCO, NASA… are pronounced as words because of their syllable structure. Should I leave them just as they are in the text ("NATO")? Or should I put them in lowercase ("nato") to distinguish them from the letter-by-letter abbreviations?

My understanding is that writing both types of abbreviations the way they are normally written will make the trained result poorer. But perhaps I’m wrong and nothing has to be done.

That is my question. I would appreciate it if someone could shed some light on this subject; otherwise I’ll decide to avoid all abbreviations in the collected sentences. Unfortunately, I suspect this decision could also produce a worse trained system, but I have no background in deep learning techniques and perhaps I’m absolutely wrong.

What do you recommend I do with Basque letter-by-letter and syllabic abbreviations?

davidak · December 4, 2018, 12:18pm

I think abbreviations should be pronounced the way people normally would, since an STT engine should be able to handle that.

txopi · December 4, 2018, 1:56pm

Hi davidak. My question isn’t how a speaker should pronounce the words, but how to write them in the data set. Do you mean just write them as usual (NATO, BBC, UNESCO, FBI…) and the STT engine will do all the work by itself?
NOTE: There are also hybrid abbreviations like JPEG.

davidak · December 4, 2018, 3:05pm

Those too: I would just write them as they occur in normal text.

nukeador · December 4, 2018, 11:05pm

@r_LsdZVv67VKuK6fuHZ_tFpg and I will be meeting with the Deep Speech team soon to talk about this “cleaning” of the sentences so that they are fully useful for the engine. We will let you know once we have the details.

nukeador · January 9, 2019, 11:24am

The sentence collection tool’s how-to now has some guidance on requirements:

https://common-voice.github.io/sentence-collector/#/how-to

dabinat · January 9, 2019, 5:45pm

Here’s a question: if the engine is capable of recognizing the letters U, S and A individually, does it really need to be taught the acronym USA?

txopi · January 9, 2019, 9:36pm

As far as I know, the engine doesn’t work the way you describe. The how-to gives this explanation: Abbreviations and acronyms like “USA” or “ICE” should be avoided in the source text because they may be read in a way that does not coincide with their spelling. Additionally, there may be multiple accurate readings for a single abbreviation. For example, the acronym “ICE” could be pronounced “I-C-E” or as a single word.
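
For anyone preparing a sentence list by hand, here is a minimal sketch of how sentences containing such abbreviations could be filtered out before submitting them. It is only a heuristic of my own, not the sentence collector’s actual validation rule: the function names `has_acronym` and `filter_sentences` are hypothetical, and I am assuming that an abbreviation is any token of two or more letters written entirely in uppercase.

```python
import re

# Runs of Unicode letters (digits, underscores and punctuation split tokens).
WORD_RE = re.compile(r"[^\W\d_]+")


def has_acronym(sentence: str) -> bool:
    """Return True if any token looks like an uppercase abbreviation.

    Assumption: a token of two or more letters, all uppercase, is treated
    as an abbreviation (FBI, NATO, JPEG). Single capitals such as "I" are
    ignored, and dotted forms like "U.S.A." are NOT caught by this check.
    """
    for token in WORD_RE.findall(sentence):
        if len(token) >= 2 and token.isupper():
            return True
    return False


def filter_sentences(sentences):
    """Keep only the sentences without uppercase abbreviations."""
    return [s for s in sentences if not has_acronym(s)]


if __name__ == "__main__":
    candidates = [
        "The FBI opened an investigation.",
        "I listened to the radio this morning.",
        "Save the picture as JPEG.",
    ]
    for sentence in filter_sentences(candidates):
        print(sentence)
```

A sentence written entirely in capitals would also be rejected by this check, so the discarded sentences are worth reviewing by hand rather than deleting blindly.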
