nmstoker (Neil Stoker) | December 16, 2020, 1:02am | #1
This new transcribed dataset looks like it could be very helpful. I don’t know how feasible (or desirable) it would be to incorporate it into training for the next release of the main English model, but with 44.5k hours of transcribed English it compares very well with the earlier LibriSpeech dataset (1,000 hours).
This paper introduces Multilingual LibriSpeech (MLS) dataset, a large
multilingual corpus suitable for speech research. The dataset is derived from
read audiobooks from LibriVox and consists of 8 languages, including about
44.5K hours of English and...
It also has transcribed audio for the other languages; those quantities are less dramatic, but they could still be a big help compared with what’s currently available for those languages.
Awesome, thank you for the link!
Hi, we have implemented an MLS importer for the Italian speech dataset.
#!/usr/bin/env python3
import time
import os
import re

from corpora_importer import ArchiveImporter, Corpus

CORPUS_NAME = 'mls'

# Functions fix_apostrophe and load_mailabs_fixed_token:
# see https://github.com/MozillaItalia/DeepSpeech-Italian-Model/issues/124
def fix_apostrophe(text_normalized, fixed_token):
    # Restore apostrophes using the normalized -> apostrophised token map.
    tokens = text_normalized.split()
    for tok in tokens:
        if tok in fixed_token:
            # replace() is only safe when there are no ambiguous tokens,
            # e.g. destro -> d'estro, sera -> s'era
            text_normalized = text_normalized.replace(tok, fixed_token[tok])
    return text_normalized  # assumed return; the original file is truncated below
This file has been truncated; see the original on GitHub.
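For illustration, here is a minimal usage sketch of fix_apostrophe as defined above. The fixed_token map below is a hypothetical example in the spirit of the mapping discussed in issue #124, not the actual output of load_mailabs_fixed_token:

# Hypothetical token map: normalized form -> apostrophised form
fixed_token = {"lacqua": "l'acqua", "unora": "un'ora"}

text = "lacqua bolle da unora"
print(fix_apostrophe(text, fixed_token))
# -> "l'acqua bolle da un'ora"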
The work is not yet finalized; we would like to run various training experiments.
MLS has audio clips ranging from 10 to 20 seconds.
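As a sanity check on those lengths, a minimal sketch (my own, not part of the importer) that measures clip durations with the standard-library wave module and flags anything outside the 10 to 20 second range; the mls_italian/audio path and the flat layout of converted .wav files are assumptions:

import wave
from pathlib import Path

AUDIO_DIR = Path("mls_italian/audio")  # assumed location of converted .wav clips

for wav_path in sorted(AUDIO_DIR.glob("*.wav")):
    with wave.open(str(wav_path), "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if not 10.0 <= duration <= 20.0:
        print(f"{wav_path.name}: {duration:.1f}s outside expected 10-20s range")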
GitHub issue (opened 17 Dec 2020, 06:09 PM UTC; labels: help wanted, dataset):
## LIST OF ALL ITALIAN DATASETS FOUND
From issue #90 I'm putting here all the datasets that have been discovered.
Some of them are plug-and-play for DeepSpeech; others need to be created from scratch (splitting the audio up by sentence).
Feel free to pick up one that has not been done yet and check it out.
### NOTE
If one of these datasets needs deeper analysis, please do not start a discussion here; open a new issue instead and I will update this table with the issue reference.
## DATASETS
| dataset | hrs | url | plug-n-play | TODOs | doing | done | note |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
|**MLS** | 279.43 h | [↗](http://openslr.org/94/)| | | | | **HOT!!!!**
|VoxForge #111 | 20h | [↗](http://www.repository.voxforge1.org/downloads/it/Trunk/Audio/Main/16kHz_16bit/)| ✔ | <ul><li>- [x] url replace in DS import_voxforge.py script</li><li>- [x] fix import sys error </li></ul> | ✔ | |
|MAILABS | 127h40m | [↗](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/)| ✔ | | | ✔ |
|Evalita2009 | 5h | [↗](http://www.evalita.it/2009/tasks/digits)| | | | ✔ |
|MSPKA | 3h | [↗](http://www.mspkacorpus.it/)| | | | ✔ |
|SIWIS | 4.5h | [↗](https://phonogenres.unige.ch/index.php?page=téléchargement)| | | | ✔ |
|SUGAR | 1.5h | [↗](https://github.com/evalitaunina/SUGAR_Corpus)| | | | | sentences are not useful
|VociParlateWikipedia #34 | ? | [↗](https://it.wikipedia.org/wiki/Categoria:Voci_parlate)| |<ul><li>- [ ] sync audio with its page revision</li></ul> | | |
|EMOVO | ~12m | [↗](http://voice.fub.it/activities/corpora/emovo/index.html)| |<ul><li>- [ ] align filename codes with their sentences </li></ul> | | | interesting for emotions (disgust, happy..)
|ZIta | <1hr | [↗](https://github.com/ChMeluzzi/ZIta)| | | | | transcriptions do not follow recordings (eg: Lett_Z_Sp1_zero.wav)
|LIM_Veneti | <1hr | [↗](https://github.com/ChMeluzzi/LIM_Veneti)| | | | | no audio files?
|split-MDb | ~46m | [↗](http://www.parlaritaliano.it/index.php/en/corpora/644-spit-mdb-spoken-italian-multilevel-database)| |<ul><li>- [ ] parse&clean the .wrd files </li></ul> | | | based on CLIPS
|tg60 | 1h30m | [↗](http://www.parlaritaliano.it/index.php/it/dati/650-corpus-di-parlato-telegiornalistico-anni-sessanta-vs-2005)| |<ul><li>- [ ] long audio files to be split </li></ul> | | | maybe among the info files there are some timings that could be useful for splitting up? (see the splitting sketch after this table)
|PraTiD | 1h12m | [↗](http://www.parlaritaliano.it/index.php/en/corpora/645-corpus-pratid)| |<ul><li>- [ ] long audio files to be split </li></ul> | | | From CLIPS; maybe among the info files there are some timings that could be useful for splitting up?
|ParlatoCinematografico | ? | [↗](http://www.parlaritaliano.it/index.php/it/dati/659-corpus-di-parlato-cinematografico)| |<ul><li>- [ ] long audio files to be split </li></ul> | | | .lab files with speakers timings
|PerugiaCorpusPEC | ? | [↗](https://www.unistrapg.it/cqpwebnew/)| | | | | a login is needed. License?
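For the datasets above whose TODO is "long audio files to be split" (tg60, PraTiD, ParlatoCinematografico), here is a rough sketch of the kind of segmentation that could work once per-segment timings are recovered from the info/.lab files. The segments list and file names below are hypothetical, and it assumes the source audio is already WAV:

import wave

# Hypothetical (start, end, transcript) timings recovered from an info/.lab file
segments = [(0.0, 4.2, "prima frase"), (4.2, 9.8, "seconda frase")]

with wave.open("tg60_full.wav", "rb") as src:
    rate = src.getframerate()
    params = src.getparams()
    for i, (start, end, text) in enumerate(segments):
        src.setpos(int(start * rate))
        frames = src.readframes(int((end - start) * rate))
        with wave.open(f"tg60_{i:04d}.wav", "wb") as dst:
            dst.setparams(params)  # nframes is corrected on close
            dst.writeframes(frames)
        print(f"tg60_{i:04d}.wav\t{text}")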
lissyx ((slow to reply) [NOT PROVIDING SUPPORT]) | January 18, 2021, 11:40am | #4
Please share that as a PR as soon as you can!
MLS, like all the importers we have been writing recently, uses our utility module for common operations (corpora_importer.py), so it depends on that utility.
Then we have a collector that aggregates all the imported corpora into a final speech dataset.
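To make that split of responsibilities concrete, a rough sketch of what such a collector could look like; the CSV columns follow the DeepSpeech convention (wav_filename, wav_filesize, transcript), but the function, paths, and layout are my own assumptions, not the actual corpora_collector code:

import csv
from pathlib import Path

def collect(corpus_csvs, out_csv):
    # Aggregate the per-corpus CSVs produced by each importer into one dataset file.
    fieldnames = ["wav_filename", "wav_filesize", "transcript"]
    with open(out_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        for corpus_csv in corpus_csvs:
            with open(corpus_csv, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    writer.writerow({k: row[k] for k in fieldnames})

collect(sorted(Path("corpora").glob("*/train.csv")), "train_all.csv")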
The DeepSpeech EN repo uses a different strategy.
Should I also do a PR of the utilities?
In any case, the work on our corpora_collector is not yet complete; hopefully it will be soon.
lissyx ((slow to reply) [NOT PROVIDING SUPPORT]) | January 18, 2021, 2:50pm | #6
That’s a good question. Maybe it would be useful to move to that factorized code as well. What do you think, @reuben?