Translating sentences from other-language corpora

Flay · October 5, 2022, 10:06pm

Hi. Can I translate sentences for the Russian sentence collection from corpora of other languages, for example from Belarusian, Ukrainian, and English? And if so, what is the best way to do it? Should I do it in «Sentence Collector» or in a separate file?

manalog · October 8, 2022, 12:47am

I guess there is no problem, especially if you translate from languages of the same family, so that the translation is good. In general, Common Voice prefers “original” sentences, because the point is to find sentences that are as natural as possible, but I guess translating from Ukrainian or Belarusian should work. But I also suggest: why not write sentences of your own? I think translating a sentence and inventing a new one take roughly the same effort. Or, even better, try to find a CC-0 source in Russian.

Flay · October 8, 2022, 10:09am

OK, thanks for your response!

I know I can try to find CC0 sources or write new original sentences, but both options have problems.

The first is very difficult, because there are too few Russian websites that use a CC0 license or are in the public domain, and 99% of them are about classic Russian literature, which means they often contain the same texts. These licenses are very rare. I know for certain that ru.wikisource.org has a big library of public-domain texts, but it is not universal. Plus, in Russia copyright expires 70 years after the author’s death, not after 90. That’s a problem.
Honestly, I don’t understand why CV decided to use this particular license. I know some Russian sites that use the CC-BY license, and many English sites that use it too. But I don’t know a single website that uses modern Russian and has a CC0 license. I can understand why CV doesn’t want to use CC-BY-SA, CC-BY-NC, etc., since they restrict use, but why not CC-BY? CC-BY has only one condition: we would have to keep a list of sentence sources. And everyone who wants to use the voice dataset would just have to write: “We use the Common Voice dataset”. That’s all! We could still use our dataset for anything. Commercial use, private use, changing the dataset, everything is allowed. And thanks to that, we would be able to get sentences with modern words. Many sentences. Sorry, Manalog, do you know who decided to use CC0 in Common Voice, and when and why?

Second. Yes, I can write new sentences, and I do it sometimes. I’m trying to create sentences with modern slang and words. But often I have no ideas for them. And the word limit per sentence infuriates me. When I create new sentences I often write more than 14 words, so I always have to count the words before I add sentences, or change the sentences after submitting them. That really infuriates me. If I translate sentences from Belarusian or Ukrainian I don’t have this problem, because the Russian translation almost always has the same number of words.
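
The manual word counting described above could be automated before submission. The following is only a minimal sketch: it assumes words are separated by whitespace and uses the 14-word figure from this post, whereas the actual Sentence Collector rules for a given locale may tokenize differently or use another limit. The names `count_words` and `too_long` are hypothetical.

```python
MAX_WORDS = 14  # figure quoted in this thread; the real per-locale limit may differ


def count_words(sentence: str) -> int:
    """Count whitespace-separated tokens (an assumption about tokenization)."""
    return len(sentence.split())


def too_long(sentences, max_words=MAX_WORDS):
    """Return (word_count, sentence) pairs for sentences above the limit."""
    flagged = []
    for sentence in sentences:
        n = count_words(sentence)
        if n > max_words:
            flagged.append((n, sentence))
    return flagged
```

Running `too_long` over a draft file before pasting it into Sentence Collector would list only the sentences that need shortening.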

Flay · October 8, 2022, 10:06am

Oh, and sorry, why wouldn’t translated sentences be good? I don’t use machine translation.

manalog · October 8, 2022, 12:55pm

About CC-0, I think it’s a good policy.
The idea of the Common Voice dataset is to have a dataset that can be used easily by anyone interested in ASR and TTS, from the single amateur to the big corporation. Having a Creative Commons license that requires attribution could be problematic.
An example that comes to mind is applications from big corporations. Without getting into how much better open source is than closed source, and speaking very practically: nowadays there is a big gap in some services; for example, Google Translate doesn’t even offer audio for a lot of languages. Even Google could use our dataset, and they would probably stop if they had to acknowledge Mozilla. OK, that’s probably not in the spirit of the project, but it can be immensely helpful for all the communities whose languages are not represented. And I think this is very important.
Having a public-domain license means that this dataset can have really broad applications, will probably never get lost, parts of it can live on in other datasets, and so on… it’s very cool and easy this way.
On the other hand, imagine that we import sentences that require attribution. Then imagine the mess of acknowledging every single sentence source… Imagine an academic paper with a list of a thousand websites to acknowledge. It would be a complication, and not a worthwhile one, because it is not hard to find sentences.
Russian, for example, already has 46643 sentences, which is a good number. The number to improve is the 217 hours of recordings, which is very low.

About sentence length in Sentence Collector, I agree, and we should report it. It only accepts very short sentences, shorter than many that are already in production. One day I tried to add a large collection of sentences and wasted my time, because 90% of them were rejected, either for length or because they contained the " ’ " character, which is very common in Italian.

manalog · October 8, 2022, 1:01pm

About translations, I think the point is to avoid adding literally translated sentences. It depends on the quality of the translation and on the original language. I don’t know Russian, but if you translate from Ukrainian or Belarusian you will probably get nice sentences, while from English you should take care not to end up with English-constructed sentences rendered in Russian.
I say this because my language, for example, is FULL of sentences, especially on scientific topics and especially on websites like Wikipedia, that are correct but very ugly, because it’s evident that the construction is English and not Italian. It’s a pity, because it damages the language and produces text that is harder to understand and less beautiful.

If you want to translate, it’s OK because they are all CC-0 sentences; just do it in a very Russian way, not mixed with other styles. I mean, it’s not a big deal even if the sentences are not nice, because they will be used to train a neural net and not for a beauty contest, but it’s still nice to keep the quality high when possible and not introduce words that don’t exist. In my language you sometimes see words that at first glance look correct, but then you realize they are just English words in which “-tion” has been changed to “-zione”, so they look Italian but do not exist.

With this in mind, if you want, you can proceed with translations; I can’t see any issue with your idea.
Good work

bozden · October 8, 2022, 3:33pm

FYI, both the it and ru locales have special rules in SC. Please have a look at these:

github.com

common-voice/sentence-collector/blob/main/server/lib/validation/VALIDATION.md

# How to add a new language validation

1. Copy the `en.js` file in the `languages` folder and name it according to the new language
2. Adjust the content of the file to represent the new requirements for this specific language
3. Feel free to add comments to better explain what each validation does - if the error message can't be phrased descriptive enough
4. In `index.js` add a new require (as example for German - de)

```
const de = require('./languages/de');
```

5. Expose the new require in the `VALIDATORS` object

```
const VALIDATORS = {
  en,
  de,
};
```

This file has been truncated.

The whole CV system is designed for max 10 sec recordings. AFAIK, this was chosen for several reasons, but two important ones are:

- With anything longer, people may run out of breath trying to read it in one go, which also increases the error rate in recordings.
- It is optimized for training on commonly available 8 GB VRAM GPUs, with reasonable batch sizes.

After 10 sec in a recording, the system gives an error, so a slow-speaking volunteer might have problems. Therefore, before putting rules in those validation files (a word-count limit, default 14 in English, and/or a character limit), one should take a good sample and calculate the character speed. Btw, Italian only has a char limit, which is 125.
For cleaning/validation of characters, please check this Discourse; there are two recent discussion topics on them.
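
As a rough illustration of the character-speed calculation mentioned above, here is a minimal sketch. It assumes you already have a sample of (sentence text, clip duration in seconds) pairs from real recordings. The function names and the `safety` margin for slow speakers are hypothetical and not part of Sentence Collector.

```python
MAX_SECS = 10  # recording limit discussed in this thread


def chars_per_second(samples):
    """samples: iterable of (sentence_text, duration_seconds) pairs."""
    samples = list(samples)
    total_chars = sum(len(text) for text, _ in samples)
    total_secs = sum(secs for _, secs in samples)
    return total_chars / total_secs


def suggest_char_limit(samples, max_secs=MAX_SECS, safety=0.8):
    """Estimate a character limit that fits max_secs.

    `safety` is a hypothetical margin (< 1.0) that leaves room for
    slower-than-average speakers.
    """
    return int(chars_per_second(samples) * max_secs * safety)
```

A stricter variant could use the speed of the slowest speakers in the sample instead of the overall average, since the average alone says nothing about how many volunteers would hit the 10 sec cutoff.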

I try to follow the contributors in our language community rather closely. There are some voice artists with good pronunciation who pause at commas etc., and whenever the sentence length goes above 100-110 characters, they have problems. I pre-process our texts with the SC rules before entering them into SC (normalizing those chars, eliminating illegal ones, converting numbers to text, then checking the lengths, etc.). After that I re-read/correct the sentences, then I get statistics for each resource (see here). It is a lengthy process, but it pays off…
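
A pre-processing pass of the kind described above might look roughly like the sketch below. This is not the actual script used here and not the Sentence Collector validator: the apostrophe mapping is a hypothetical example, the 125-character default is the Italian limit quoted earlier, and sentences with digits are only flagged for manual conversion rather than converted to text.

```python
import re
import unicodedata

MAX_CHARS = 125  # Italian limit quoted in this thread; other locales differ
# Hypothetical mapping of apostrophe variants to a plain ASCII apostrophe.
APOSTROPHES = str.maketrans({"\u2019": "'", "\u2018": "'", "`": "'"})


def normalize(sentence: str) -> str:
    """NFC-normalize, unify apostrophe variants, collapse whitespace."""
    text = unicodedata.normalize("NFC", sentence).translate(APOSTROPHES)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(sentences, max_chars=MAX_CHARS):
    """Return (kept, rejected); rejected holds (sentence, reason) pairs."""
    kept, rejected = [], []
    for raw in sentences:
        s = normalize(raw)
        if any(ch.isdigit() for ch in s):
            # Flag for manual number-to-text conversion.
            rejected.append((s, "contains digits"))
        elif len(s) > max_chars:
            rejected.append((s, "too long"))
        else:
            kept.append(s)
    return kept, rejected
```

The rejected list is kept with reasons so that sentences can be fixed by hand and re-run, instead of being silently dropped.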

Francis_Tyers · October 10, 2022, 1:35pm

Others have already responded, but in general my opinion is that the focus should be on authentic sentences in the language from specific domains of interest. For example, dialogue, subtitles, commands, anything that you can imagine. Translations, unless done by a very competent translator, usually result in some degree of translationese. From Ukrainian or Belarusian it might be OK because they are related languages, but I wouldn’t translate from English.

manalog · October 12, 2022, 9:00pm

It’s interesting to read that the 10-second limit is about breathing and optimizing model training. Do you know why the DeepSpeech playbook talks about clips between 10 and 20 seconds long?
I am sure the Mozilla developers chose the 10-second limit carefully, so I am curious whether there is an error in the playbook or whether there are other criteria that make very short clips preferable for Common Voice (I quickly checked the IT corpus as an example, and the average clip seems to be around 30 KB, 4 seconds).

My worry, given the importance (as @Francis_Tyers also pointed out) of authentic sentences in the dataset, is that the current mix of Wikipedia-style text and short clips could cause some bias. I have recorded and validated many clips, and I realized that this way I am almost always forced to speak in the same way, a pretty unnatural and schoolbook way of talking… (for my language I can describe it, very roughly, as: voice going up + voice going down for text between commas + voice going up for the conclusion, which is the usual way primary school kids are taught to read sentences). I don’t know whether this is an issue or not, since I haven’t yet tried to train something like DeepSpeech, but I can say for sure that, at least for my language, 90% of clips follow this basic scheme, which doesn’t reflect dialogue.

bozden · October 12, 2022, 9:20pm

I didn’t know that. I was talking about the actual importers (line 31):

github.com

mozilla/DeepSpeech/blob/aa1d28530d531d0d92289bf5f11a49fe516fdc86/bin/import_cv2.py#L31


    get_imported_samples,
    get_importers_parser,
    get_validate_label,
    print_import_report,
)
from ds_ctcdecoder import Alphabet

FIELDNAMES = ["wav_filename", "wav_filesize", "transcript"]
SAMPLE_RATE = 16000
CHANNELS = 1
MAX_SECS = 10
PARAMS = None
FILTER_OBJ = None


class LabelFilter:
    def __init__(self, normalize, alphabet, validate_fun):
        self.normalize = normalize
        self.alphabet = alphabet
        self.validate_fun = validate_fun

It is also valid for the successor, Coqui STT…

kathyreid · October 25, 2022, 3:10am

My guess is that it’s because of the memory requirements of larger samples. The way I would investigate this is to increase the parameter (possibly even passing it as a command line argument to DeepSpeech.py) and see if it causes memory failures.

An even better approach might be to identify samples in the .tsv file for the language that are > 20 seconds long, and split each of them into two slices of data.
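
Identifying over-long samples could be sketched as below. This assumes a tab-separated table with a header row and two hypothetical columns, `clip` and `duration_ms`; the real Common Voice .tsv files may name these differently or may not include durations at all, in which case they would have to be measured from the audio first.

```python
import csv
import io

MAX_SECS = 10  # limit discussed in this thread; pass 20 for the case above


def long_clips(tsv_text, max_secs=MAX_SECS):
    """Return (clip, seconds) for rows whose duration exceeds max_secs.

    Expects hypothetical columns `clip` and `duration_ms` in the header.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = []
    for row in reader:
        secs = int(row["duration_ms"]) / 1000
        if secs > max_secs:
            result.append((row["clip"], secs))
    return result
```

For a file on disk, the text can be read with `open(path, encoding="utf-8").read()` and passed to `long_clips`; the returned list is then the set of candidates for splitting.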

I was curious about how many utterances in the CV dataset are > 10 seconds long. Without running a Python script over all the .mp3 files in a dataset (which I could do but don’t want to go down that rabbit hole), I took a look at the average utterance duration for all the languages - so this visualisation:

Most of the languages have an average clip duration of well under 7 seconds.

There might be some outliers, but based on this data, I don’t think we have a lot of clips that are > 10 seconds, or that could be split into chunks (of, say, 2 x 10 seconds or even 2 x 7 seconds).

bozden · October 25, 2022, 3:39am

Hi @kathyreid, no need to re-analyze the durations; I’ve already done it for my Our Voices application. Although the values are not CV-wide (that is on my to-do list), the distributions are currently available per dataset/split. The current beta is here:

https://cv-dataset-analyzer.netlify.app/

E.g. here are the values for v11.0 validated clips for Turkish. As can be seen, 106 recordings out of 82,351 are longer than 10 seconds.

Actually, the CV software limits the recording duration to 10 secs, so these outliers must either date from before the limit was introduced or be caused by measuring/rounding errors.

kathyreid · October 25, 2022, 3:56am

That’s super interesting @bozden, thank you! I actually wonder if the 10 second limit is now a poor limit to have, given that GPU hardware has increased in capacity so much in the past few years. Would we have more accurate models if we didn’t apply a max duration limit on the utterances? Would we be able to elicit more natural sentences, that are a better reflection of language as spoken in the “real world”?

bozden · October 25, 2022, 4:27am

I also think that the limitation can be relaxed a bit. AFAIK, it was optimized for 8 GB VRAM, and nowadays 16/24 GB ones are becoming more common on GTX series (although still pricey). We should check some market values.

On the other hand, the whole workflow has been based on this, starting from Sentence Collector. The rules in SC (number of words and/or number of characters) are based on the total recording duration, and are usually set by communities that measure the character speed by sampling actual recordings.

So, if we double the duration to 20 sec for 16 GB VRAM, the data throughout the whole workflow would double. The limits in the exporters/importers would also have to double. That would be quite an undertaking…

If it is done, that would be wonderful news for CC-0 text-corpus collection, because many sentences are longer than 14 words (the default value) and are left out.

In the Coqui STT group there are people working on longer recordings from other sources (e.g. French); they would know the results better than I do. But scientifically, to get a measure, one should fix the training duration for comparison, because it would affect the results much more than the length of the individual recordings.

Zebastjan_Johanzen · November 25, 2022, 8:18pm

The 10 second limit is rubbish. People do not speed read academic jargon in real life–but academic jargon is needed for the full range of vocabulary. As for breathing, people do that in real life, so your system will need to handle it. Also in real life different folks speak at different rates, and your foolish limit is pushing away all but the most rapid speakers, biasing your sample–garbage in, garbage out. The memory issues are something that your programmers can look into. Bottom line, get rid of this limit!!