Hi @heyhillary @phire, thank you for creating this thread. I have several questions related to the dataset release:
(1) After downloading the Belarusian dataset, we found that the total duration of all recordings is larger than announced on the Common Voice website: 356 hours in the download vs. 325 hours shown in the website statistics as of 2021-07-29 (and presumably even less on 2021-07-21, when the dataset was created). Is the total duration shown in the statistics calculated with certain restrictions, e.g. dropping silence at the beginning and end of each clip, or excluding invalidated clips?
(2) I’m wondering if there is an established workflow for dealing with the sentences in reported.tsv. As an example, for Belarusian 5-6% of the sentences in the Wikipedia export are problematic, and many of them have been reported by speakers so far (although neither the precision nor the recall of reporting is perfect, i.e. some reported sentences are OK, and some problematic sentences have never been reported). Could we e.g. prepare a PR, based on reported.tsv, to remove known problematic sentences from the site data, so that they would no longer be available for recording? I'd like to know whether this kind of manual patching is the right way to go, and whether it is consistent with other proposed improvements, such as the automated workflow to run extraction from newly created Wikipedia articles, outlined by @mkohler here.
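To make the idea in (2) more concrete, here is a rough sketch of what such patching could look like. It is only an illustration under assumptions: I assume reported.tsv is tab-separated with a header row containing a `sentence` column, and the helper names `load_reported` and `drop_reported` are made up for this post. Since some reported sentences are actually fine, the reported set would need a manual review before anything is removed.

```python
import csv
import io

def load_reported(reported_tsv, column="sentence"):
    """Collect the reported sentences from an open TSV file.

    Assumes a header row with a `sentence` column (an assumption,
    not checked against the actual release format).
    """
    reader = csv.DictReader(reported_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[column].strip() for row in reader if row.get(column)}

def drop_reported(sentences, reported):
    """Keep only the sentences that are not in the reported set."""
    return [s for s in sentences if s.strip() not in reported]

# Tiny in-memory example instead of real files.
reported_file = io.StringIO(
    "sentence\treason\n"
    "Bad sentence one.\toffensive-language\n"
    "Bad sentence two.\tgrammar-or-spelling\n"
)
site_sentences = ["Good sentence.", "Bad sentence one.", "Another good one."]

reported = load_reported(reported_file)
kept = drop_reported(site_sentences, reported)
print(kept)
```

With real data, the reviewed subset of reported sentences would replace `reported`, and `kept` would be written back to the sentence file that goes into the PR.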
(3) Tangentially to the above, the comments that contributors entered in reported.tsv for Belarusian are not displayed correctly: all Cyrillic characters have been replaced with question marks (probably an encoding issue at some stage of the data pipeline). Should we file an issue in the common-voice repo, or is this already on the radar?
Thanks in advance for any comments.