Sentences Collector tricks for non-programmers?

Hi all
I was just wondering if there’s a way to collect sentences without the complicated scripts and command-line headaches that many non-programmers can’t deal with!
My technique is quite rudimentary!

  1. I collect my texts
  2. Split sentences with some text tool (first by comma, then by period, question mark, etc.)
  3. Fix any sentences that no longer make sense after splitting
  4. Correct casing (first letter) with a text tool
  5. Then copy and paste into Sentences Collector

That was just to see how the Sentences Collector works! I have another trick up my sleeve. But first answer my question. :slight_smile:

Cheers all

You can do quite a lot with the search and replace feature of a text editor.

Many text editors support search and replace with new lines. One simple way to split sentences is to insert a new line after every full stop, every question mark and every exclamation mark. This produces some wrong splits where the text contains abbreviations, but it is a good starting point for manual (or command line^^) filtering.
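
For anyone curious what that search and replace looks like as a script, here is a minimal sketch in Python. The function name `split_sentences` and the sample text are made up for illustration, and the pattern is only the naive "break after . ? !" rule described above, not a proper sentence splitter.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '?' or '!' when whitespace follows."""
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    # Drop empty pieces (for example when the input is empty).
    return [part for part in parts if part]


sample = "Hello there. How are you? I saw Dr. Smith today!"
for line in split_sentences(sample):
    print(line)
```

Note that "Dr." gets its own line here, which is exactly the abbreviation problem mentioned above, so the output still needs a manual pass.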

Actually, I find most text editors quite limited. Many online text tools do much better, for example adding a period at the end of each sentence or sorting lines by length.
It’s time-consuming, but I submitted 200 manually collected sentences today alone.
Cheers

Which online tools do you use?

My favorite is https://sortmylist.com/
I’ve been using it for years for all sorts of purposes.

Learn to program in Python. It’s not that hard. I can share my code with you. Here is how I do it:

  1. I take the book as a text file.
  2. Using a Python script and the sentenize library, I break it up into sentences.
  3. Then I apply filtering rules, for example that a sentence must have at least 2 words, etc.
  4. I upload the result.
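
A rough sketch of steps 1 to 3, for readers who want to see the shape of such a script. This is not the poster's actual code: `naive_split` is a plain regex stand-in rather than the sentenize library, and the names `keep`, `collect`, `collect_from_file` and the file name `book.txt` are assumptions for illustration. Only the "at least 2 words" rule comes from the post.

```python
import re
from pathlib import Path

MIN_WORDS = 2  # the one rule named in the post; other rules are not shown


def naive_split(text: str) -> list[str]:
    # Stand-in splitter (assumption): NOT the sentenize library.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def keep(sentence: str, min_words: int = MIN_WORDS) -> bool:
    # Count whitespace-separated words and apply the minimum-length rule.
    return len(sentence.split()) >= min_words


def collect(text: str) -> list[str]:
    # Split the text, then keep only sentences that pass the rule.
    return [s for s in naive_split(text) if keep(s)]


def collect_from_file(path: str) -> list[str]:
    # Step 1: the book as a text file (hypothetical path, e.g. "book.txt").
    return collect(Path(path).read_text(encoding="utf-8"))


print(collect("Yes. The sun rose early. Stop! Who is there?"))
```

The one-word sentences "Yes." and "Stop!" are dropped by the 2-word rule; the remaining lines are what would go on to the upload step.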

Thank you,
I do know some Python; I am just looking for something simple that other people can use, because there will probably be dozens of us adding sentences.
But I am interested in your method too. Can you elaborate a little or share it?
Cheers

Which text editors have you found limited in this way?

Thanks
Forked. What language was it originally made for?

For Bashkort. I also have a script that auto-reviews all sentences, as here: https://commonvoice.mozilla.org/sentence-collector/#/review
I upload text from books, so I know that the sentences are correct.

Auto-review is awesome if the sentences come from books, yes. I also have some books too. Will check that out.
Thanks

Auto-review can be useful in some situations, but please make sure that you review the sentences at least once. In my experience, sentences from books can also contain many errors, for example when the extraction went wrong, when foreign words are part of the story, or when a character speaks in a broken or foreign language. (For example, Robinson Crusoe is full of wrong sentences because Friday speaks in broken language.)

But I think that the obligation to review every sentence twice is a little too much for small languages. Automating the second review can be very useful to speed things up when you know that you have reviewed everything once and that no second person will be there to review the sentences. The Selenium IDE plugin is another simple way to automate small tasks in the browser.

The official position of Mozilla is that we should not use automation at all because they want to keep the quality high.

Yes, I do it before adding sentences.

IMO there are some situations where automation is clearly okay. For example, when you have reviewed everything manually and only a big collection of wrong sentences is left (each with one yes and one no vote). In such a situation it is completely okay to automate the rejection of all these wrong sentences.

I use Excel for it: the first column has the value 1 by default, and the second column contains the sentences. If a sentence isn’t correct, I change the value in the first column to 0.
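
If the sheet is exported as CSV, the accepted sentences can be pulled out with a few lines of Python. This is only a sketch under assumptions: the CSV export, the function name `accepted_sentences` and the demo rows are invented here; the column layout (flag first, sentence second) is the one described above.

```python
import csv
import io


def accepted_sentences(csv_file) -> list[str]:
    """Return sentences whose review flag (first column) is 1."""
    kept = []
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        flag, sentence = row[0].strip(), row[1].strip()
        if flag == "1" and sentence:
            kept.append(sentence)
    return kept


# Demo data standing in for an exported sheet (made up for illustration).
demo = io.StringIO(
    '1,The river was calm.\n'
    '0,Teh rivr calm was.\n'
    '1,"Yes, she answered."\n'
)
print(accepted_sentences(demo))
```

With a real export you would pass an opened file instead of the demo data, for example `open("reviewed.csv", newline="", encoding="utf-8")`, where the file name is again just a placeholder.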

Ah, okay this sounds good to me. Much better than just scrolling over the sentences or something.

When you have big collections of sentences, you can import them directly after a manual review like this. For example, for the EUROPARL dataset we reviewed a random selection of 4000 sentences in an Excel file, taken from a bigger collection. After that, we could import hundreds of thousands of sentences without reviewing each one individually. So you don’t have to use the Sentence Collector, but the other process is a little more complicated.
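
Drawing such a random selection is easy to script. The sketch below is not the procedure that was actually used for EUROPARL; the function name `sample_for_review`, the fixed seed and the generated demo sentences are assumptions, and only the sample size of 4000 comes from the post.

```python
import random


def sample_for_review(sentences: list[str], k: int = 4000, seed: int = 0) -> list[str]:
    """Pick up to k distinct sentences at random for manual review."""
    rng = random.Random(seed)    # fixed seed so the same sample can be redrawn
    k = min(k, len(sentences))   # small collections are returned in full
    return rng.sample(sentences, k)


# Demo collection standing in for a large sentence list (made up).
pool = [f"Sentence number {i}." for i in range(10000)]
picked = sample_for_review(pool)
print(len(picked))
```

The picked sentences could then be written to a spreadsheet with a review column, in the same 1/0 style described earlier in the thread.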
