Add Esperanto

Original post

Hello!

I’d like to propose that Esperanto be added among the next supported languages. It is the most widely used (neutral) international language, which makes it appropriate for inclusion in this open and international project.

Esperanto has a very regular and phonetic pronunciation, which might facilitate speech recognition. Its speakers also come from all over the world, so the many possible accents to compare would be very diverse, with none of them being native.

The language has been around for over a century now, so there should be some content in the public domain to take sentences from. The age of those shouldn’t be a problem, as the basic grammar and vocabulary are defined in the “Fundamento” and are supposed to remain unchanged (so that the language won’t fall apart). General usage has respected this, and so the language hasn’t changed in any substantial way.

(Note: “Fundamento” itself contains a set of exemplary sentences in Esperanto, which should all be in the public domain.)

Can I do something to help make this happen? Should I collect valid texts in the public domain and possibly restructure them in some way?

Text sources

Fundamenta Krestomatio (Contains example phrases, stories, dialogue, and poetry.)

Great idea! Our goal is to open up to multi-language in early 2018, so watch this space!

Yes! This would be extremely helpful if you can find public domain text, preferably conversational (e.g. movie scripts are better than poetry), in Esperanto. That way we can move faster!

Alright! Should I modify the first post whenever I find new sources, to add the links to them? (Or: Is that the best way I could submit text, and if yes, can I edit the post unlimited times?)

I suppose I could also write up some sentences or conversations myself and release them as public domain. Would such contribution be welcome and appropriate?

Sure, you can post links here! You can modify the original post, or send a new one.

And yes, personal sentences are definitely welcome!

But to make a good speech database, you need many thousands of sentences (10K is OK, 100K is better, 1 million is ideal). Writing all of that yourself would take some time, so it’s better to find a large source to pull from.

Here you can find more sentences:

http://tekstaro.com

Are those numbers for the number of contributors or for the number of sentences?

If I understand your question correctly, the numbers I mentioned are indeed for the number of sentences. For the number of people reading those sentences, the more the merrier! For instance, for English we have over 20K speakers.

Can the works included in “Tekstaro” be accessed whole, without having to search for keywords? Also, could you please point out a statement or indication that all of them are in the public domain? (I couldn’t find either on the website, so I haven’t added it to the list yet.)

It seems they can’t be accessed directly through the web, but if this project goes ahead, getting access to them probably won’t be difficult.

Esperantists!
I missed that post!

I would like to help adding Esperanto!

Any other Italian Esperantists?
There are many videos on YouTube entirely in Esperanto. Maybe we can organize a working group to find all these resources? I remember there were lists of a lot of material (including podcasts) on Duolingo and on Reddit.

There are also Esperanto ebooks on Project Gutenberg.

Now that this dataset exists, is there a pre-trained DeepSpeech model for it? It seems that only English is available pre-trained.

6 posts were split to a new topic: Esperando TTS

Hey,
I created the wiki import script for Esperanto. To get the sentences onto the website, we need at least two people to estimate the error rate and post it here on GitHub:

@liordon @Mte90 @nicolaruggiero1986 @Pablo_Busto @tirifto @mhenretty could you help me with this please? There is a linked file in the pull request with 300 random sentences, and I need a few people to read through at least 100 sentences and estimate the error rate.
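
For anyone wondering how to turn a review into a number, here is a rough sketch in Python. It is only an illustration: the function name and the choice of a Wilson score interval are my own assumptions, not part of the import script or of any official review procedure. You count how many of the sentences you read contain an error, and it returns the observed error rate together with an approximate 95% range.

```python
import math


def estimate_error_rate(reviewed: int, errors: int, z: float = 1.96):
    """Estimate the error rate of a sentence sample.

    Hypothetical helper for illustration only.
    reviewed: number of sentences a reviewer read
    errors:   number of those sentences that contained an error
    z:        normal quantile (1.96 gives roughly a 95% interval)

    Returns (observed_rate, lower_bound, upper_bound), where the bounds
    come from the Wilson score interval.
    """
    if reviewed <= 0 or not 0 <= errors <= reviewed:
        raise ValueError("need reviewed > 0 and 0 <= errors <= reviewed")

    p = errors / reviewed
    z2 = z * z
    denom = 1 + z2 / reviewed
    centre = (p + z2 / (2 * reviewed)) / denom
    half = z * math.sqrt(p * (1 - p) / reviewed
                         + z2 / (4 * reviewed * reviewed)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Example: a reviewer read 100 sentences and found 5 with errors.
    rate, low, high = estimate_error_rate(reviewed=100, errors=5)
    print(f"observed {rate:.1%}, plausible range {low:.1%} to {high:.1%}")
```

With only 100 sentences the range is fairly wide, which is one reason it helps to have several people review the sample.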

The script produces 96 000 sentences, which is enough for two or three years of unique recordings if we keep working at our present speed.

But I believe it is still a good idea to add different kinds of sentences from Esperanto books, because Wikipedia articles have a very calm style, and we could miss emotional phrases and spoken language if we only use them. We need a mix of all sorts of sentences, which is why we should keep adding phrases from other sources. I thought about adding sentences from Alico en Mirlando. We can now add and review new sentences with the Sentence Collector. I have already put some sentences from the Declaration of Human Rights there.