There are several issues with the current sentences provided by Common Speech for Tamil. Often they are not complete sentences, nor do they express a coherent thought. I assume this is still acceptable. The collection also suffers from some of the issues noted in this discussion thread: Extending our sentence collection capabilities.
First, we need to be able to provide sentences from more varied sources than ancient literature. We could possibly get permission from contemporary authors to release their work, for example blog posts, under the required license (CC0).
Second, Tamil has notably different spoken and written (literary) variants. The written language is more formal and is understood across dialects and regions, whereas spoken Tamil differs considerably from it. Can we crowdsource sentences for spoken Tamil?
Please point me to documentation on how to contribute source sentences.
Also, is it possible to use existing audio transcripts of public domain works to create training datasets?
Thank you.