SeamlessM4T does a great job of speech recognition for Georgian and other languages. I wonder if they used Common Voice data to train the models. Does anyone have any information on this?

Hi Razmik,

They’re very cagey / opaque about where they sourced their data:

“Audio pre-processing. We start with 4 million hours of raw audio originating from a publicly available repository of crawled web data. Table 10 provides statistics on the amount of raw audio for each language.”

That’s from page 18 of the arXiv paper.

As an educated guess, they scraped the Internet Archive. “Publicly available” is not the same as “open source licensed” …


Hi Kathy, thank you for the info.