Sentence collection for Serbian

ftyers · May 3, 2021, 8:35pm

There has been some discussion about collecting sentences for Serbian. See here.

TL;DR:

Here are 6747 sentences. These are from SETimes, a public domain news site.
Here is a list of the top-5000 utterances from OpenSubtitles. These will need to be checked for orthography and to make sure that they are public domain (e.g. do not contain any identifiable proper names).

I have included both Latin and Cyrillic.

darigovresearch · May 3, 2021, 9:05pm

Continuing the discussion here: yeah, feel free to submit more Cyrillic sentences from the source and we can review them in batches.

Unless anyone has any issues they would like to raise here.

ftyers · May 3, 2021, 9:59pm

I’ve added another batch of 200 or so. Let me know when you finish and I can add some more. I want to start the batches off small, but if you get into a rhythm we can gradually make them bigger. The most important thing is to not swamp people with stuff.

darigovresearch · May 3, 2021, 10:29pm

Thanks for making another submission. I don’t think it will affect their system, as there are other languages that have 1000x more sentences and many thousands outstanding.

No need to select the shortest ones to submit. From a cursory look at the dataset, even the longest sentence appears to be within their length guidelines. It may be easier for you to go through it top to bottom, but we leave that with you.

Thanks again for the help and guidance, looking forward to getting this live!

ftyers · May 4, 2021, 12:08am

It won’t have an effect on their system. What I fear is that the people doing the reviewing might not like having a lot of sentences not of their choosing dumped on them.

If you and your team are happy to go through and review all of them (two reviewers per sentence) then I’m happy to dump them in. But I’d rather not add a lot of sentences that other people might not think are good.

Also, there might be mistakes in the transliteration (I just found one bug with Lj → Лј because I missed characters in title case).
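To illustrate the kind of title-case bug described above, here is a minimal sketch of a Latin → Cyrillic transliterator. It is not the script that was actually used for these batches; the function name `latin_to_cyrillic` and the tables are assumptions made for illustration. The point is that the digraphs lj, nj and dž have to be matched case-insensitively, before single letters, so that "Lj" becomes Љ rather than Лј.

```python
# Hypothetical sketch, not the script used in this thread.
# Digraphs map to a single Cyrillic letter and must be tried first.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д",
    "đ": "ђ", "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и",
    "j": "ј", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о",
    "p": "п", "r": "р", "s": "с", "š": "ш", "t": "т", "u": "у",
    "v": "в", "z": "з", "ž": "ж",
}


def latin_to_cyrillic(text):
    """Transliterate Serbian Latin to Cyrillic, preserving case."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        # Lower-casing the pair catches "lj", "Lj" and "LJ" alike.
        if len(pair) == 2 and pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)  # punctuation, digits, spaces, etc.
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


print(latin_to_cyrillic("Ljubav"))
```

One known limitation of this greedy approach: in a small number of words (for example "injekcija" or "nadživeti") the letter pair is not a digraph, so output still needs a human check or an exception list.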

ftyers · May 4, 2021, 12:57am

Note that it would also be good to be able to include sentences in ijekavski too. I wonder if there are any good systems for doing conversion or if there are good sources.

irvin · May 4, 2021, 12:36pm

Are you sure it is Public Domain? Cause the corpus I found states it’s released under a CC BY-SA license:
http://nlp.ffzg.hr/resources/corpora/setimes/

irvin · May 4, 2021, 12:42pm

saw this discussion, looks good. Interesting source.

github.com/common-voice/common-voice

Add possibility for users to choose preferred orthography

opened 01:52PM - 23 Oct 20 UTC

ftyers

Up front: I do not expect this feature request to be implemented any time soon (…if ever), I'm just filing it so that I don't forget about it.

**Is your feature request related to a problem? Please describe.**
Some languages have multiple competing orthographies used in the same country. The sound is the same, but the letters are different. In many cases the conversion (at least in one direction) is mechanical. For example:

- Serbian Cyrillic and Serbian Latin
- Kazakh Cyrillic and Kazakh Latin
- Basaa General and Basaa Missionary
- Punjabi Shahmukhi and Punjabi Gurmukhi

In some cases this is because of a stated move from one orthography to another (Kazakh) in other cases both orthographies may be maintained (Serbian).

**Describe the solution you'd like**
For certain languages it should be possible to provide sentences in more than one orthography and for users specify a preferred orthography. In order for this to be valid:

- All sentences must be in both orthographies

**Describe alternatives you've considered**
We definitely do not want to split these into separate locales, and have separate communities for e.g. Serbian Latin and Serbian Cyrillic. The audio will be identical and we would just end up fragmenting the community of contributors and validators.

**Additional context**
In some countries, some languages are not taught in schools. They may be taught at private institutions or organisations or they may be self-taught or taught in the home. There may be two or more competing orthographies. And some speakers may be more or less able to read in each of them. When there is an official orthography, this should be used, but we should also consider the needs of people who might not have received an education in that particular orthography.

This is purely a user interface consideration. On the ASR side of things the orthographies could be converted automatically. But if we want to display them to people we should be able to display them in the orthography of their choosing.
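As a rough illustration of the "mechanical in one direction" point in the issue, here is a minimal sketch for Serbian Cyrillic → Latin, where every Cyrillic letter maps to exactly one Latin letter or digraph. The function name `cyrillic_to_latin` and the casing heuristic are assumptions for illustration, not part of Common Voice.

```python
# Hypothetical sketch: each of the 30 Serbian Cyrillic letters has exactly
# one Latin equivalent, so a per-character lookup is enough.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text):
    """Transliterate Serbian Cyrillic to Latin, preserving case."""
    out = []
    for i, ch in enumerate(text):
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)  # leave anything outside the alphabet untouched
        elif ch.isupper():
            nxt = text[i + 1:i + 2]
            prev = text[i - 1:i] if i > 0 else ""
            # Heuristic (an assumption): treat the letter as part of an
            # all-caps word if a neighbouring letter is also upper case,
            # so Љ gives "LJ" in "ЉУБАВ" but "Lj" in "Љубав".
            all_caps = nxt.isupper() or (not nxt.isalpha() and prev.isupper())
            out.append(lat.upper() if all_caps else lat.capitalize())
        else:
            out.append(lat)
    return "".join(out)


print(cyrillic_to_latin("Љубав"))
```

The reverse direction is less clean, because a Latin letter pair such as "nj" is not always a digraph, which is one reason to keep sentences stored in both orthographies and have them reviewed.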

ftyers · May 4, 2021, 2:40pm

I also managed to extract some short sentences from the public domain sources at Gutenberg that you mentioned. They have been added with a link to the source text file.

darigovresearch · May 8, 2021, 3:49pm

Yeah don’t worry about the transliteration like Lj → Лј we caught it and we’ll catch others like it as well. Thanks again for uploading!

ftyers · May 9, 2021, 5:04am

No problem! Let me know if you need anything else

Fooftilly · May 21, 2021, 5:23pm

Hey, I’ve just finished reviewing all of those added sentences. Can someone add the roughly 180 more sentences needed to reach the 5000-sentence goal? I would like to contribute to the voice part too. I would have added more sentences myself, but I couldn’t find public domain sources for the Serbian language. Government files aren’t in the public domain so I can’t use them, and I have no idea where to look for sentences next. The Internet Archive only has files in old or archaic Serbian, and those are of no use.

ftyers · May 21, 2021, 5:31pm

Done. There are more sentences here: https://models.omnilingo.cc/sr/setimes.cand.Latn-Cyrl.txt

Fooftilly · May 26, 2021, 4:53pm

The 5000 sentences are validated. Any idea on how long it would take for the voice part to get started?

mkohler · May 26, 2021, 5:08pm

Once the 5000 sentences are reached and the website translations are mostly done, the automatic export will notice this and enable it in the config files. Then it’s just a matter of getting released. As far as I know the next release is planned for next Wednesday, so it should be part of that if everything is done.
