SeamlessM4T

Razmik-Badalyan · December 4, 2023, 9:37am

SeamlessM4T does a great job of speech recognition for Georgian and other languages. I wonder if they used Common Voice data to train the models. Does anyone have any information on this?

kathyreid · December 4, 2023, 12:14pm

Hi Razmik,

They’re very cagey / opaque about where they sourced their data:

“Audio pre-processing. We start with 4 million hours of raw audio originating from a publicly available repository of crawled web data. Table 10 provides statistics on the amount of raw audio for each language.”

That’s from page 18 of the arXiv paper.

As an educated guess, they scraped the Internet Archive. “Publicly available” is not the same as “open source licensed” …

Razmik-Badalyan · December 5, 2023, 8:25am

Hi Kathy, thank you for the info.