Translation of sentences from other-language corpuses

FYI, both the it and ru locales have special rules in SC. Pls. have a look at these:

The whole CV system is designed for recordings of at most 10 sec. AFAIK, this was chosen for several reasons, but two important ones are:

  1. With anything longer, people can run out of breath before reading it in one go, which also increases the error rate in recordings.
  2. It is optimized for training on commonly found GPUs with 8 GB of VRAM, with reasonable batch sizes.

After 10 sec in a recording, the system gives an error, so a slow-speaking volunteer might have problems. Therefore, before putting limits into those validation files (a word count limit, which defaults to 14 in English, and/or a character limit), one should get a good sample of recordings and calculate the character speed. Btw, Italian only has a char limit, which is 125.
For cleaning/validation of characters, please check this Discourse; there are two recent discussion topics on them.
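
To make the character-speed idea concrete, here is a rough sketch of how one could derive a character limit from a sample of recordings. Everything in it is an assumption for illustration: the function names, the `(sentence, duration)` input format, and the choice of taking a slow speaker (roughly the slowest 10%) as the reference are mine, not something SC or CV prescribes.

```python
MAX_RECORDING_SECONDS = 10.0  # recording limit mentioned above


def chars_per_second(samples):
    """Return the speaking rate of each (sentence, duration_seconds) pair."""
    return [len(text) / duration for text, duration in samples if duration > 0]


def suggest_char_limit(samples, max_seconds=MAX_RECORDING_SECONDS, slow_fraction=0.1):
    """Suggest a character limit that a slow speaker can still finish in time.

    slow_fraction is a hypothetical knob: 0.1 means "use the rate found
    about 10% of the way up the sorted list of speaking rates".
    """
    rates = sorted(chars_per_second(samples))
    if not rates:
        raise ValueError("need at least one sample with a positive duration")
    slow_rate = rates[int(slow_fraction * (len(rates) - 1))]
    return int(slow_rate * max_seconds)


# Made-up sample: (sentence text, measured recording duration in seconds)
samples = [("a" * 100, 10.0), ("b" * 120, 8.0), ("c" * 60, 4.0)]
print(suggest_char_limit(samples))
```

With a real sample you would want many recordings from many speakers, and probably some safety margin below the suggested value.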

I try to follow the contributors in our language community rather closely. Some of them are speech artists with good pronunciation who pause on commas etc., and whenever the sentence length goes above 100-110 characters, they have problems. I pre-process our texts with the SC rules before entering them into SC (normalization of those chars, elimination of illegal ones, converting numbers to text, then checking the lengths etc.). After that I re-read/correct the sentences, then I get statistics for each resource (see here). It is a lengthy process, but it pays off…
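
A minimal sketch of what such a pre-processing pass could look like, just to illustrate the steps. It is not the actual SC validation code: the replacement table, the allowed punctuation set and the function names are hypothetical, and the 125 limit is simply the Italian value mentioned above used as a placeholder. Number-to-text conversion is language specific, so here sentences with digits are only flagged for manual handling.

```python
import re
import unicodedata

CHAR_LIMIT = 125  # placeholder: the Italian limit mentioned above
# Hypothetical tables; adapt them to the rules of your own locale.
REPLACEMENTS = {"’": "'", "‘": "'", "“": '"', "”": '"', "…": "..."}
ALLOWED_PUNCTUATION = set(" ,.?!;:'\"-")


def normalize(text):
    """Unify Unicode form, typographic characters and whitespace."""
    text = unicodedata.normalize("NFC", text)
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return re.sub(r"\s+", " ", text).strip()


def check(text, char_limit=CHAR_LIMIT):
    """Return a list of problems; an empty list means the sentence passes."""
    problems = []
    if any(ch.isdigit() for ch in text):
        problems.append("contains digits")
    # Digits are reported separately, so they are skipped here.
    if any(not (ch.isalpha() or ch.isdigit() or ch in ALLOWED_PUNCTUATION) for ch in text):
        problems.append("illegal character")
    if len(text) > char_limit:
        problems.append("too long")
    return problems


for raw in ["  Ciao,   mondo…  ", "Ho 3 gatti", "costo: 5€"]:
    sentence = normalize(raw)
    print(repr(sentence), check(sentence))
```

Sentences that come back with problems still need the manual re-read/correct step; the script only sorts out the obvious cases.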
