I’m using the Greek CommonVoice dataset for a speech recognition project, and I can’t extract the sentences from each row: every encoding I try fails with a parsing error. Is there any resource I could read on that?
Hi, what file are you trying to read, and how are you trying to access it? Can you give us some more information, ideally with a test case? I trained a model for Greek and had no problems with encodings. You can also join us on Matrix to get real-time help.
OK, so, just to answer my own question for anybody else who uses Windows:
If the dataset archive has a .tar extension, rename it so that it ends in “.tar.gz”, and WinRAR will decompress it correctly. The problem I had was caused by bad decompression, not by the text encoding.
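
For anyone who would rather skip the rename, here is a minimal Python sketch of the same idea. It lets the standard-library `tarfile` module detect the compression itself and then reads a tab-separated file as UTF-8. The helper names `extract_archive` and `read_sentences` are made up for illustration, and the `sentence` column name is an assumption about the dataset layout, so adjust them to whatever your download actually contains.

```python
import csv
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive, letting tarfile detect the compression.

    Hypothetical helper, not part of any CommonVoice tooling.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" detects gzip/bz2/xz transparently, so a gzip-compressed
    # archive that is named ".tar" is still unpacked correctly.
    with tarfile.open(archive_path, mode="r:*") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return dest


def read_sentences(tsv_path, column="sentence"):
    """Return one column of a tab-separated file read as UTF-8.

    The default column name "sentence" is an assumption.
    """
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences as plain text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row[column] for row in reader]
```

Typical use would be to call `extract_archive` on the downloaded archive and then `read_sentences` on one of the extracted `.tsv` files. If the text still looks garbled after a clean extraction, the encoding passed to `open` is the next thing to check.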