OK, I recently started with ML and came here to “donate” to one language dataset. Here is what I noticed after a couple of hundred record/listen sessions:
- Many sentences have foreign-language words embedded (mostly names, and mostly very hard to pronounce). I don’t know why they are in the dataset.
- I rarely heard a female voice: only about one recording in 50.
- Speakers with dialects dominate (dialects are fine and good, as long as they don’t dominate).
- I see that users are racing to contribute more, but that means their voices/dialects will become dominant.
- Most recordings lack the correct intonation (e.g. in questions); they sound like a machine speaking. This is also fine in itself, but such recordings dominate.
I don’t know about other languages, but I would like to hear about your experience with them.
I think the points above will be problematic. As a newbie in this area, I would like to hear your thoughts on these items and possible remedies.
Thanks in advance