Encoding of CommonVoice Greek Dataset

Tzouraguy · June 21, 2021, 8:31am

Hello!

I’m using the Greek Common Voice dataset for a speech recognition project, but I can’t extract the sentences from each row: every encoding I try fails with a parsing error. Is there any resource I could read on this?

Many thanks in advance.

ftyers · June 21, 2021, 10:19pm

Hi, what file are you trying to read? How are you trying to access it? Can you give us some more information, ideally with a test case? I trained a model for Greek and had no problems with encodings. You can also join us on Matrix to get real-time help.

Tzouraguy · June 22, 2021, 10:55am

Right now, I’m trying to recreate the steps in this video about speech recognition, which uses this Python script in particular to decode the Common Voice dataset.

At the moment, the error I’m experiencing is this:

It is caused by the characters at the following position, which are effectively the first Greek characters in the file:
[image: Screenshot_2]

I’m not sure whether this is caused by bad decompression or by running the process on Windows, but I haven’t found a fix so far…

Tzouraguy · June 23, 2021, 8:30am

OK, so, just to answer my own question for anybody else who uses Windows:

If the dataset file has a .tar extension, rename the archive so that it ends in “.tar.gz”, and WinRAR will decompress it correctly. The problem I had was due to bad decompression, not to the encoding.
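
For anyone who would rather script the extraction than rely on renaming the file, here is a minimal Python sketch. It assumes the archive is gzip-compressed despite its .tar name (which is what the fix above suggests), that the extracted TSV is UTF-8 encoded, and that it has a column named `sentence`. The function names `extract_archive` and `read_sentences` are made up for illustration, and I have not run this against the actual Greek download.

```python
import csv
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive, letting tarfile detect the compression itself."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" detects gzip/bz2/xz from the file contents rather than the
    # file name, so a gzip-compressed archive named ".tar" is still handled.
    with tarfile.open(archive_path, mode="r:*") as archive:
        archive.extractall(dest)
    return dest


def read_sentences(tsv_path, column="sentence"):
    """Return the values of one column from a UTF-8 encoded TSV file.

    The column name "sentence" is an assumption about the dataset layout.
    """
    # Pass the encoding explicitly: on Windows the default is often a legacy
    # code page rather than UTF-8, which can break on Greek text.
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row[column] for row in reader]
```

Usage would look something like `read_sentences(extract_archive("el.tar", "data") / "validated.tsv")`, where the archive name and the path inside it are placeholders; check the extracted folder for the real layout.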