First of all, can anyone point me to documentation describing what a “Delta Segment” (e.g. “Common Voice Delta Segment XX.X”) is and how it differs from the “Corpus” (e.g. “Common Voice Corpus XX.X”)?
Secondly, and MOST IMPORTANTLY, selecting “Common Voice Delta Segment XX.X” does not seem to show the correct “sha256 checksum” value under Download the Dataset. The displayed “sha256 checksum” only seems to update to the correct value when “Common Voice Corpus XX.X” is selected for download. So we can never tell whether a “Common Voice Delta Segment XX.X” download is corrupted or has been tampered with. I only checked a few languages, but here is one example where the values differ:
Turkish
- Common Voice Delta Segment 12.0
(cv-corpus-12.0-delta-2022-12-07-tr.tar.gz)
Info | sha256 checksum |
---|---|
CommonVoice Site *1 | 778825e1dc4a29f4fa5201e7bac51740a205763c6055632636dffc6aea669342 |
Actual *2 | a916e827b04383ccb4bc13a54f34e5e94a85e07e41d597dc800f76f6e09b2f4c |
*1 This is the value shown on the site for the Delta Segment, but it is actually the correct “sha256 checksum” for Turkish “Common Voice Corpus 12.0”.
*2 Obtained by executing CertUtil -hashfile cv-corpus-12.0-delta-2022-12-07-th.tar.gz SHA256 on Windows 10. Running the same command on the Turkish “Common Voice Corpus 12.0” archive returns the same value that the “CommonVoice Site” lists as its “sha256 checksum”, so CertUtil (the tool) is not the problem.
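
For anyone who wants to reproduce this comparison without CertUtil, here is a minimal Python sketch that hashes a downloaded archive and compares it with the value shown on the site. The helper names (sha256_of_file, matches_published) are my own, and the file name and checksum in the usage section are simply the ones from the table above; nothing here comes from official Common Voice tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the lowercase hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that multi-gigabyte archives
    do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare the file's digest with a published checksum string.

    Whitespace and letter case in the published value are ignored.
    """
    return sha256_of_file(path) == published_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local file name and the checksum displayed on the site
    # for the Delta Segment (both taken from the table in this post).
    archive = Path("cv-corpus-12.0-delta-2022-12-07-tr.tar.gz")
    site_value = "778825e1dc4a29f4fa5201e7bac51740a205763c6055632636dffc6aea669342"

    if archive.exists():
        print("computed :", sha256_of_file(archive))
        print("site     :", site_value)
        print("match    :", matches_published(archive, site_value))
    else:
        print("archive not found:", archive)
```

This should give the same digest as CertUtil for the same file, since both compute a plain SHA-256 over the file bytes. A mismatch would therefore point at the value published on the site (or at the download itself), not at the tool.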