SHA256 checksum seems to be wrong for "Common Voice Delta Segment XX.X", and what is a "Delta Segment"?

First of all, can anyone point me to documentation describing what a “Delta Segment” (e.g. “Common Voice Delta Segment XX.X”) is and how it differs from the “Corpus” (e.g. “Common Voice Corpus XX.X”)?

Secondly, and MOST IMPORTANTLY, selecting “Common Voice Delta Segment XX.X” doesn’t seem to show the correct “sha256 checksum” value under Download the Dataset. The displayed checksum only seems to be correct when “Common Voice Corpus XX.X” is selected for download. So we can never know whether a “Common Voice Delta Segment XX.X” download is corrupted or has been tampered with. I only checked a few languages, but here is one example where the values differ:


  • Common Voice Delta Segment 12.0

| Info | sha256 checksum |
| --- | --- |
| CommonVoice Site *1 | 778825e1dc4a29f4fa5201e7bac51740a205763c6055632636dffc6aea669342 |
| Actual *2 | a916e827b04383ccb4bc13a54f34e5e94a85e07e41d597dc800f76f6e09b2f4c |

*1 This is actually the correct “sha256 checksum” for the Turkish “Common Voice Corpus 12.0”.
*2 Obtained by running CertUtil -hashfile cv-corpus-12.0-delta-2022-12-07-th.tar.gz SHA256 on Windows 10. Run against the Turkish “Common Voice Corpus 12.0” archive, the same command returns the value published on the CommonVoice site, so CertUtil (the tool) is not the problem.
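
For anyone who wants to repeat this check without CertUtil, here is a minimal Python sketch using only the standard library's hashlib. The helper names `sha256_of_file` and `matches_published` are my own (hypothetical), and the published value has to be copied by hand from the download page; nothing here talks to the CommonVoice site.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file.

    The file is read in chunks so that multi-gigabyte archives
    do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a file's digest with a checksum copied from a web page.

    Whitespace and letter case are normalised first, because tools such as
    CertUtil and web pages do not always agree on formatting.
    """
    return sha256_of_file(path) == published_hex.strip().lower()


# Example usage (the file name is the one from this post):
# print(sha256_of_file("cv-corpus-12.0-delta-2022-12-07-th.tar.gz"))
# print(matches_published("cv-corpus-12.0-delta-2022-12-07-th.tar.gz",
#                         "778825e1dc4a29f4fa5201e7bac51740a205763c6055632636dffc6aea669342"))
```

If the tool is working, this should print the same digest that CertUtil reports for the same file, so a mismatch with the site would point at the published value rather than at the hashing tool.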

Well, for the first question of what a “Delta Segment” is, a Google search led me to what seems to be a non-CommonVoice page here:

The files entitled “Delta Segment” consist of any additional audio files that were added after the main dataset was created. This means that they are basically updates that allow you to download them as they come in, without having to download the full dataset again.

It was the fourth result on Google for the search “What is ‘Delta Segment’ in CommonVoice”. I wonder whether the problem is Google Search or inadequate documentation on the part of CommonVoice. I would still like to know whether “Delta Segment” is mentioned on any official CommonVoice page.

This might help for your first question:

I have no idea about the checksums, but what I do know is this: the v10 and v11 delta releases on the download page are of no use, as they are missing files.

In addition, I’ve seen many similar problems in unrelated systems where hashes come out with different values, for many reasons. I don’t know if that is the case here.

Btw: cv-corpus-12.0-delta-2022-12-07-th.tar.gz is not the Turkish one :slight_smile:

Thank you for your response and for showing me where the documentation is. I did not realise it was on Mozilla Discourse.

If a Delta Segment contains a lot of errors or is incomplete, then I think it should not be offered for download in the first place. There will be a lot of 残念 (disappointed :disappointed_relieved:) people. That is just IMHO (in my humble opinion).

Actually regarding the following:

I was looking at my edit history, but I think I had it correct for Turkish :sweat_smile: (I had it correctly with tr):