There's a 99.99% chance it's against the terms of use of the TTS service, and it's also going to break the usefulness of the dataset.
Hi Linus.
I find the idea excellent!
It would encourage blind people to contribute, as well as people who cannot read or write. My mother is a native speaker of Kabyle and she would love to contribute to Common Voice, but she is illiterate and would need assistance. I hope the idea will make its way.
I guess it depends on how you set it up. If Common Voice just pushes a text string to some TTS engine and the recording is kept separate, I guess the licence of the TTS engine won't affect the final dataset, but it might be tricky to make it that modular. If that's not possible, there are a couple of open-source TTS engines, but most seem old and mostly come in English and maybe a few other languages, so maybe not too useful.
Another way could be to use a Common Voice TTS, but that would only help once quite a lot of data has already been collected; I don't know how much is needed for a usable TTS model. Here are some examples of the English one; there are even decent examples that are a few months old, and the English dataset still has a few hundred hours to go to reach the 1200 h goal.
Those are reasons I didn't even think of, but really interesting! I guess I'm of the podcast generation and like to do one thing with my ears and another with my hands. Do you know of any existing TTS engines that support Kabyle, open or proprietary?
The point of Common Voice is to collect human voices. Using TTS completely defeats that.
I'm not sure you've understood my point in that case.
My idea is just to replace the "view text on screen, then read" part of the process, not the "speak and record" part.
I.e. you listen to a sentence read by some TTS, the recording starts, and you repeat the sentence in your own voice and pronunciation.
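To make the idea concrete, here is a minimal sketch of the flow I have in mind. All the names are hypothetical (`synthesize` and `record` are placeholders for whatever TTS engine and recorder would actually be used); the point is only that the TTS audio is a prompt and is never stored:

```python
# Hypothetical sketch of the proposed "listen, then repeat" flow.
# synthesize() and record() are stand-ins for a real TTS engine
# and the existing Common Voice recorder.

def synthesize(sentence: str) -> bytes:
    """Placeholder TTS: would return prompt audio for the sentence."""
    return f"tts:{sentence}".encode()

def record() -> bytes:
    """Placeholder: would capture the contributor's own voice."""
    return b"human-recording"

def collect_clip(sentence: str) -> dict:
    prompt_audio = synthesize(sentence)  # played back to the contributor
    clip = record()                      # contributor repeats in their own voice
    # Only the human clip is kept; the TTS prompt audio is discarded,
    # so the TTS engine's licence would not touch the dataset itself.
    return {"sentence": sentence, "clip": clip}

result = collect_clip("The quick brown fox jumps over the lazy dog.")
```

Whether the real client could be made this modular is exactly the open question above.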
OK, you're just making the process longer and more complicated; I really don't see what value or improvement this brings.
As I stated, there are many situations where it is not possible to contribute if you need to read from a screen: e.g., as sifaks mentioned, if you are disabled in some way; or, for me, it would be useful to be able to contribute while walking or doing something else. How do you suggest that could be done in a simpler way?
Likely @belkacem77 could guide you here. The fact that he's so active on Kabyle and interested in TTS makes me believe there is currently no solution.
Blind people would already have a screen reader.
If you're walking, you should look ahead so you don't bump into things :-).
I don't see any simple way to integrate that. Some of your points might be valid, but I'm not sure the Common Voice team has the bandwidth to address them.
Maybe, if you can, sending a PR is the best way to try and get traction?
Maybe that's the solution; probably not optimized for this, but maybe good enough.
Yes, exactly, that's why I would prefer to take in the info with my ears, not my eyes, so I can look ahead.
I get that there is not a great chance that such a project will get prioritized by the team, and I'm unfortunately still quite bad at coding, but I was more after some feedback, and maybe some hint of where to look further. I'll probably investigate what I can do myself anyhow and hopefully come up with a PR some day, or suggest some modular setup if others are interested.
An interesting idea, but it would also destroy diversity where there are multiple variants of the same language. To take one example, assuming the TTS system speaks US English, then British, Welsh, Scottish, Australian, Indian and many other speakers will be "encouraged" (in fact almost forced) to copy "incorrect" US pronunciation and stress patterns. Also, even the best TTS systems make frequent errors in pronunciation and stress, especially with unusual words and proper names. Listeners will simply replicate what they hear and lock those errors into the recordings database.
That's actually a valid point: if the TTS is bad, there's a risk that it spills over into the actual human recording. That's what you meant, right? Hard to tell to what degree; I hadn't thought of that.
I like this idea because users would no longer be reading text but reciting it from memory, which may lead to a more natural, conversational style of speech.
But as @Michael_Maggs pointed out, there would be a lot of challenges in implementing it reliably.
This just argues for more guidance when people record: with the current setup you can (and should) read the sentence once first and then recite it from memory. Also, I think it's good that we don't have only this type of speech; people not speaking fluently, etc., is just as valid IMHO.
What about not using TTS, but having the possibility to re-record the sentence we are listening to? It would still be useful for blind people and people not able to read. All sentences have to be recorded several times anyway!
Well, in fact, ideally we only want/need one recording per sentence. Having a lot of recordings of the same sentence is not very useful for the Deep Speech algorithms.
Right. By that I mean more opportunity to choose the best one. Especially when we vote NO on a sentence, we could record a "better one" on the spot.