Abbreviations in Dataset Transcript

  1. When creating a dataset, what is the best way to transcribe abbreviated words?
    Here is a list of a few examples:
    CAN, which is spelled out as “see a n” rather than said as “can” - should it be “C A N”?
    BEE, which is spelled out as “be e e” rather than said as “bee” - should it be “B E E”?

  2. What is the best way to transcribe a year?
    Example: 1985. Should it be “nineteen eighty five” or simply 1985?

Your advice would be highly appreciated.

For 2, it’s safest to transcribe the year in words exactly as it was said, since then you can be sure the text matches the audio no matter how it is processed later. This matters most where the reading is ambiguous, like 2012, which is sometimes said “two thousand and twelve” and sometimes “twenty twelve”.
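To make the ambiguity concrete, here is a minimal sketch that lists the common spoken readings of a four-digit year. The function names (`two_digits`, `year_readings`) are made up for this example and it only covers the simple English patterns mentioned above. You would still need to listen to the recording and pick the reading that was actually said.

```python
# Hypothetical helper: list plausible spoken forms of a four-digit year.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out a number from 0 to 99 without hyphens."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def year_readings(year: int) -> list[str]:
    """Return candidate readings for a year between 1000 and 9999."""
    if not 1000 <= year <= 9999:
        raise ValueError("only four-digit years are handled in this sketch")
    readings = []

    # Pairwise style: "nineteen eighty five", "twenty twelve".
    high, low = divmod(year, 100)
    if low == 0:
        if high % 10 == 0:
            readings.append(f"{ONES[high // 10]} thousand")
        else:
            readings.append(f"{two_digits(high)} hundred")
    elif low < 10:
        readings.append(f"{two_digits(high)} oh {ONES[low]}")
    else:
        readings.append(f"{two_digits(high)} {two_digits(low)}")

    # "Thousand and" style, only for years like 2000 to 2099.
    thousands, rest = divmod(year, 1000)
    if rest < 100:
        full = f"{ONES[thousands]} thousand"
        if rest:
            full += f" and {two_digits(rest)}"
        if full not in readings:
            readings.append(full)
    return readings


print(year_readings(1985))
print(year_readings(2012))
```

For 1985 there is only one candidate, but 2012 gives two, which is exactly the case where writing the digits would leave the transcript ambiguous.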

For 1, it depends on how phonemizer would say it: you want your transcription to be turned into the same sounds/words by phonemizer as were actually spoken in your recording.
I’m not by my computer to test it, but I think that for English abbreviations, espeak-ng (used through phonemizer) generally reads capitalised words as their individual letters. Thus “BBC” becomes the equivalent of “bee bee see”, and provided that is how it was said in the recording, it would be fine to transcribe it as BBC.
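If you want to check this yourself, something like the sketch below should do it. It assumes the `phonemizer` package and espeak-ng are installed. The helper names (`phonemize_candidates`, `espeak_phonemes`) are made up for this example, and I haven't run it against your data, so treat the output as something to inspect rather than a guarantee of how any particular word comes out.

```python
from typing import Callable, Dict, Iterable


def phonemize_candidates(candidates: Iterable[str],
                         to_phonemes: Callable[[str], str]) -> Dict[str, str]:
    """Map each candidate transcription to the phonemes the front end gives."""
    return {text: to_phonemes(text) for text in candidates}


def espeak_phonemes(text: str) -> str:
    """Phonemize one string with the espeak backend (needs espeak-ng)."""
    # Imported here so the rest of the sketch loads without phonemizer.
    from phonemizer import phonemize
    return phonemize(text, language="en-us", backend="espeak", strip=True)


def show(candidates: Iterable[str]) -> None:
    """Print each candidate next to its phonemes for a manual comparison."""
    results = phonemize_candidates(candidates, espeak_phonemes)
    for text, phonemes in results.items():
        print(f"{text!r:10} -> {phonemes}")


# Example call, to run once phonemizer and espeak-ng are set up:
# show(["CAN", "C A N", "can", "BEE", "B E E", "bee"])
```

Whichever spelling produces phonemes matching what was said in the recording is the one to use in the transcript.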