This is awesome work @J-b and will be hugely impactful. I’d love to help you carry this forward. I’ll answer some of your questions below:
Agreed, the license is great, thank you for writing this tool!
And yes, we are definitely looking for more conversational-style text. That said, this Wikipedia text will have some advantages, mainly that it will contain some proper nouns and terminology we have missed so far. Several thousand (maybe even 10K) sentences from a variety of Wikipedia articles can only help the dataset.
More generally, we want our sentence collections in various languages to come from a variety of sources. If we can include Wikipedia, that’s awesome. But we shouldn’t base any language on Wikipedia if we can help it.
Yup, it would be best if we can have these fully written out (e.g. Celsius, kilograms, degrees, etc.).
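To make the "fully written out" idea concrete, here is a rough Python sketch of how a script might expand unit abbreviations. The `UNIT_WORDS` table and the `expand_units` name are made up for illustration and are not part of any existing tool; a real version would need a table per language.

```python
import re

# Hypothetical abbreviation table (illustration only, English only).
UNIT_WORDS = {
    "°C": "degrees Celsius",
    "kg": "kilograms",
    "km": "kilometers",
}

# A number, optional whitespace, then one of the known abbreviations.
# Longer abbreviations are tried first; (?!\w) avoids matching inside words.
_UNIT_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*("
    + "|".join(re.escape(u) for u in sorted(UNIT_WORDS, key=len, reverse=True))
    + r")(?!\w)"
)


def expand_units(sentence: str) -> str:
    """Replace abbreviations such as 'kg' after a number with the full word."""
    return _UNIT_PATTERN.sub(
        lambda m: f"{m.group(1)} {UNIT_WORDS[m.group(2)]}", sentence
    )


print(expand_units("The parcel weighs 5 kg."))
```

This only handles the unit itself: it always uses the plural form (so "1 kg" would come out as "1 kilograms") and leaves the digits untouched, so those cases would still need separate handling.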
Also note, any scripts you write would definitely be useful for our language collection tool. So please continue to share your work with us!