Add Esperanto

tirifto · November 28, 2017, 9:59am

Original post

Hello!

I’d like to propose that Esperanto be added among the next supported languages. It is the most widely used (neutral) international language, which makes it appropriate for inclusion in this open and international project.

Esperanto has a very regular and phonetic pronunciation, which might facilitate speech recognition. Its speakers also come from all over the world, so the many possible accents to compare would be very diverse, with none of them being native.

The language has been around for over a century now, so there should be some content in the public domain to take sentences from. The age of those texts shouldn’t be a problem, as the basic grammar and vocabulary are defined in the “Fundamento” and are supposed to remain unchanged (so that the language won’t fall apart). General usage has respected this, so the language hasn’t changed in any substantial way.

(Note: “Fundamento” itself contains a set of exemplary sentences in Esperanto, which should all be in the public domain.)

Can I do something to help make this happen? Should I collect valid texts in the public domain and possibly restructure them in some way?

Text sources

Fundamenta Krestomatio (Contains example phrases, stories, dialogue, and poetry.)

mhenretty · November 28, 2017, 10:03am

Great idea! Our goal is to open up to multi-language in early 2018, so watch this space!

Yes! This would be extremely helpful if you can find public domain text, preferably conversational (e.g. movie scripts are better than poetry), in Esperanto. That way we can move faster!

tirifto · November 28, 2017, 10:58am

Alright! Should I modify the first post whenever I find new sources, to add the links to them? (Or: Is that the best way I could submit text, and if yes, can I edit the post unlimited times?)

I suppose I could also write up some sentences or conversations myself and release them into the public domain. Would such a contribution be welcome and appropriate?

mhenretty · November 28, 2017, 11:08am

Sure, you can post links here! You can modify the original post, or send a new one.

And yes, personal sentences are definitely welcome!

But to make a good speech database, you need many thousands of sentences (10K is OK, 100K is better, 1 million is ideal). Writing all of that yourself would take some time, so it’s better to find a large source to pull from.

Pablo_Busto · December 5, 2017, 2:25pm

Here you can find more sentences:

http://tekstaro.com

Pablo_Busto · December 5, 2017, 2:58pm

Are those numbers for the number of contributors or for the number of sentences?

mhenretty · December 5, 2017, 3:03pm

If I understand your question correctly, the numbers I mentioned are indeed for the number of sentences. As for the number of people reading those sentences, the more the merrier! For instance, for English we have over 20K speakers.

tirifto · December 5, 2017, 6:57pm

Can the works included in “Tekstaro” be accessed whole, without having to search for keywords? Also, could you please point out a statement or indication that all of them are in the public domain? (I couldn’t find either on the website, so I haven’t added it to the list yet.)

Pablo_Busto · December 5, 2017, 7:05pm

It seems they can’t be accessed directly through the website, but if this project goes ahead, it probably won’t be difficult to get access to them.

Mte90 · January 2, 2018, 3:20pm

Esperantists!
I missed this post!

nicolaruggiero1986 · March 26, 2018, 1:25am

I would like to help add Esperanto!

Mte90 · March 26, 2018, 10:19am

More Italian Esperantists?
There are many videos on YouTube entirely in Esperanto. Maybe we can organize a working group to find all these resources? I remember that there were lists of a lot of material (including podcasts) on Duolingo and on Reddit.

liordon · December 22, 2018, 4:45pm

There are also Esperanto ebooks on Project Gutenberg.

Now that this dataset exists, is there a pre-trained DeepSpeech model for it? It seems that only English is available pre-trained.

nukeador · August 26, 2019, 12:11pm

6 posts were split to a new topic: Esperanto TTS

stergro · September 4, 2019, 8:38am

Hey,
I created the wiki import script for Esperanto. To get the sentences onto the website, we need at least two people to estimate the error rate and post it here on GitHub:

github.com/common-voice/cv-sentence-extractor

rules and blacklist for Esperanto

common-voice:master ← stefangrotz:master

opened 07:07PM - 02 Sep 19 UTC

stefangrotz

+1218198 -0

- The list of disallowed words is produced with the scripts from the readme-file…. I've chosen to exclude all words that occur fewer than 80 times.
- The script produces 128 000 sentences, around 96k without repetitions.
- Two people have read over 300 random sentences, one fluent speaker and one intermediate speaker. Both estimated that the error rate is between 7 and 10%. I added a few more rules and more abbreviations to the file based on their feedback. Is this enough, or should someone confirm this here on GitHub?
- The rule file excludes most letters that are not part of the Esperanto alphabet and a lot of abbreviations. I also exclude character patterns that are extremely unusual in Esperanto but very common in other languages (like "the" or "sch").
- I have a pretty long list of sentences and patterns I want to delete manually when the official file is available. Will it be sorted alphabetically? This would make things a lot easier.

I am currently rerunning the script to see whether the few new rules broke anything major, but I could also create another file with 300 random sentences and get confirmation here on GitHub if this is necessary. The extraction generally takes a little more than 2 h on my four-year-old ThinkPad E540.
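To make the filtering described in the pull request easier to picture, here is a minimal Python sketch of the same ideas: dropping repeated sentences, rejecting characters outside the Esperanto alphabet, rejecting letter sequences such as "the" or "sch", and excluding sentences that contain rare words. This is not the cv-sentence-extractor's actual code or rule-file format. The function name `filter_sentences`, the set of allowed punctuation, and the way word frequencies are counted (over the input list itself) are assumptions made for illustration; only the threshold of 80 and the two example patterns come from the description above.

```python
import re
from collections import Counter

# The 28 letters of the Esperanto alphabet (it has no q, w, x, or y).
ESPERANTO_LETTERS = set("abcĉdefgĝhĥijĵklmnoprsŝtuŭvz")
# Assumption: a small set of punctuation that is allowed in sentences.
ALLOWED_PUNCTUATION = set(" .,!?'-")
# Letter sequences named in the PR description as unusual for Esperanto.
FOREIGN_PATTERNS = ("the", "sch")
# Words seen fewer times than this are treated as disallowed.
MIN_WORD_COUNT = 80

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of Unicode letters


def words(sentence):
    """Return the lowercase words of a sentence."""
    return WORD_RE.findall(sentence.lower())


def filter_sentences(sentences, min_word_count=MIN_WORD_COUNT):
    """Hypothetical helper: keep only sentences that pass all four checks."""
    # Assumption: word frequencies are counted over the input itself.
    counts = Counter(w for s in sentences for w in words(s))
    seen = set()
    kept = []
    for sentence in sentences:
        if sentence in seen:  # drop exact repetitions
            continue
        seen.add(sentence)
        lowered = sentence.lower()
        if any(ch not in ESPERANTO_LETTERS and ch not in ALLOWED_PUNCTUATION
               for ch in lowered):
            continue  # contains a character outside the allowed set
        if any(pattern in lowered for pattern in FOREIGN_PATTERNS):
            continue  # looks like text from another language
        if any(counts[w] < min_word_count for w in words(sentence)):
            continue  # contains a rare word
        kept.append(sentence)
    return kept
```

With a small corpus, a lower threshold can be passed, for example `filter_sentences(corpus, min_word_count=2)`.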

@liordon @Mte90 @nicolaruggiero1986 @Pablo_Busto @tirifto @mhenretty could you help me with this please? There is a linked file in the pull request with 300 random sentences, and I need a few people to read through at least 100 sentences each and estimate the error rate.
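For reviewers who want to see how such an estimate comes together, here is a small Python sketch of the process: draw a random sample from the extracted sentences, have a reviewer mark each one as faulty or fine, and report the share of faulty ones. The function names `sample_for_review` and `error_rate`, the fixed seed, and the list-of-booleans format for the reviewer's marks are assumptions for illustration and are not part of the actual tooling.

```python
import random


def sample_for_review(sentences, sample_size=300, seed=0):
    """Hypothetical helper: pick a reproducible random sample without repeats."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(sample_size, len(sentences)))


def error_rate(faulty_flags):
    """Share of reviewed sentences that the reviewer marked as faulty.

    faulty_flags holds one boolean per reviewed sentence (True = faulty).
    """
    if not faulty_flags:
        raise ValueError("no sentences were reviewed")
    return sum(faulty_flags) / len(faulty_flags)
```

A reviewer who reads 100 sentences and marks 8 of them as faulty would report `error_rate([True] * 8 + [False] * 92)`, which falls inside the 7 to 10% range mentioned in the pull request.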

The script creates 96 000 sentences. If we keep working at our present speed, this is enough to create unique recordings for the next two or three years.

But I believe that it is still a good idea to add different kinds of sentences from Esperanto books, because Wikipedia articles have a very calm style, and we could miss emotional phrases and spoken language if we only use them. We need a mix of all sorts of sentences, which is why we should keep adding phrases from other sources. I thought about adding sentences from Alico en Mirlando. We can now add and review new sentences with the Sentence Collector. I already put some sentences from the Declaration of Human Rights there.
