Basque dataset ready

sentence-collection

(Txopi) #1

I know the sentence collection tool is coming and that it will help with uploading and reviewing sentences. But over the last two months we have collected more than 6,000 CC0 sentences in Basque, and we have already checked, fixed, cleaned and reviewed them. The dataset is small but covers diverse grammar and vocabulary.
Here is a pull request with those first Basque sentences.
NOTE: the website is already translated into Basque.


(Rubén Martín) #2

Thanks for this work @txopi

Note that we are currently improving our process for including sentences in the corpus, and we are working on the final guidelines for accepting sentences, which will be applied in the sentence collection tool.

This means we will probably need to wait a bit and add the sentences following these improved processes, to make sure they are 100% useful for training the speech engine. But Basque will definitely have a lot covered already thanks to this work :slight_smile:

Cheers.


(Txopi) #3

Did you change your decision? I realised some days ago that Basque appears as a Launched language on voice.mozilla.org!

Nobody seems to read the Slack chat, so I decided to ask for some fixes to the sentences (found in part thanks to the Sentence Collector). I also asked for the Basque accents to be loaded.

I think these changes should be made before Basque speakers start participating in Common Voice. @nukeador, shouldn’t we take a step forward or backward as soon as possible and resolve the situation of the Basque language? Please help!


(Rubén Martín) #4

Let me look into this, thanks for flagging.


(Txopi) #5

Did you reach any conclusion?


(Rubén Martín) #6

Hey @txopi

Did you run the sentences through the sentence collection tool to ensure all of them are valid? We understand the initial work was done before we announced the change in the process to submit and review sentences, so as long as these sentences are all valid we should be fine.

Cheers.


(Txopi) #7

Yes, I did, but keep in mind that when I uploaded 60 of them there was a bug, so I even uploaded some sentences with acronyms. To be absolutely sure that all the sentences comply with the rules, I would need to remove the Basque sentences from the Sentence Collector. Does that option exist? If so, I don’t know where it is.


(Rubén Martín) #8

Can you coordinate with @mkohler to test all your sentences through the sentence collector? We should make sure the PR doesn’t have these invalid sentences.

/cc @gweber


(Michael Kohler) #9

That bug has been fixed in the meantime. And on Monday we deleted all the sentences that were submitted during the testing period, so none of the sentences submitted before Monday should be in there anymore.

I see there are currently 6212 sentences for Basque. I suppose those are all new ones that comply with the rules?


(Txopi) #10

As you can see in another topic on this forum, @gweber is about to accept a PR that fixes this.
Thank you for your answer @mkohler!


(Txopi) #11

All is ready for Basque!
We already started making recordings :heart_eyes: