Sentence collection for Serbian

There has been some discussion about collecting sentences for Serbian. See here.

TL;DR:

  • Here are 6747 sentences. These are from SETimes, a public domain news site.
  • Here is a list of the top 5000 utterances from OpenSubtitles. These will need to be checked for orthography and to make sure that they can be released as public domain (e.g. that they do not contain any identifiable proper names).

I have included both Latin and Cyrillic.

Continuing the discussion here: yes, feel free to submit more Cyrillic sentences from the source and we can review them in batches, unless anyone has any issues they would like to raise here.

I’ve added another batch of 200 or so. Let me know when you finish and I can add some more. I want to start the batches off small, but if you get into a rhythm we can gradually make them bigger. The most important thing is to not swamp people with stuff.

Thanks for making another submission. I don’t think it will affect their system, as there are other languages that have 1000x more sentences and many thousands outstanding.

No need to select the shortest ones to submit. From a cursory look at the dataset, it appears that even the longest sentence is still within their guidelines. It may be easier for you to go through it top to bottom, but we’ll leave that with you.

Thanks again for the help and guidance, looking forward to getting this live!

It won’t have an effect on their system. What I fear is that the people doing the reviewing might not like having a lot of sentences not of their choosing dumped on them.

If you and your team are happy to go through and review all of them (two reviewers per sentence) then I’m happy to dump them in. But I’d rather not add a lot of sentences that other people might not think are good.

Also, there might be mistakes in the transliteration (I just found one bug with Lj → Лј because I missed characters in title case) :slight_smile:
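For anyone who wants to double-check the transliteration, here is a minimal sketch of a Latin → Cyrillic converter that handles the title-case digraphs (Lj, Nj, Dž) as well as the lower-case and upper-case ones. This is not the script that produced the files above; the function name `latin_to_cyrillic` and the tables are just illustrative, and it assumes the input uses precomposed letters (č, ć, š, ž, đ as single characters).

```python
# Sketch only: a simple Serbian Latin -> Cyrillic transliterator.
# Digraphs are matched before single letters, case-insensitively,
# so "lj", "Lj" and "LJ" all map to one Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}
SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}


def latin_to_cyrillic(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        cyr = DIGRAPHS.get(pair.lower())
        if cyr is not None:
            # The case of the first letter decides the output case,
            # which covers the title-case "Lj" that is easy to miss.
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)  # digits, punctuation, spaces pass through
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


print(latin_to_cyrillic("Ljubav"))
```

One known limitation: a few words contain "nj" or "dž" across a morpheme boundary where they are not digraphs (injekcija, nadživeti), so those would still need an exception list or a manual check during review.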

Note that it would also be good to be able to include ijekavski sentences too. I wonder whether there are any good systems for doing the conversion, or any good sources.

Are you sure it is public domain? The corpus I found states that it’s released under a CC BY-SA license:
http://nlp.ffzg.hr/resources/corpora/setimes/

Saw this discussion, looks good. Interesting source.

I also managed to extract some short sentences from the public domain sources at Gutenberg that you mentioned. They have been added with a link to the source text file.

Yeah, don’t worry about transliteration errors like Lj → Лј. We caught that one and we’ll catch others like it as well. Thanks again for uploading!

No problem! Let me know if you need anything else :slight_smile:

Hey, I’ve just finished reviewing all of the added sentences. Can someone add the roughly 180 more sentences needed to reach the 5000-sentence goal? I would like to contribute to the voice part too. I would have added more sentences myself, but I couldn’t find public domain sources for the Serbian language. Government files aren’t in the public domain, so I can’t use them, and I have no idea where to look for sentences next. The Internet Archive only has files in old or archaic Serbian, and those are of no use.

Done. There are more sentences here: https://models.omnilingo.cc/sr/setimes.cand.Latn-Cyrl.txt

The 5000 sentences are validated. Any idea on how long it would take for the voice part to get started?

Once the 5000 sentences are reached and the website translations are mostly done, the automatic export will notice this and enable the language in the config files. Then it’s just a matter of it getting released. As far as I know, the next release is planned for next Wednesday, so it should be part of that if everything is done.
