I don’t think it is a good idea for the following reasons:
- For languages where there is enough LLM training data:
  - there is already enough real text to select sentences from
  - a better approach to expanding the corpus is to work with specific domains
  - the generated sentences would need to be checked anyway
- For languages where there isn’t enough LLM training data:
  - the sentences still have to be checked, and most of them will be rubbish (we tried this recently with varieties of Nahuatl, and it was a disaster); it would create headaches for reviewers
If you wanted to use GPT to generate sentences and then run them through the normal review process, I don’t see a problem with that. But it seems to me that working on a specific task or application for the larger languages (English, German, Esperanto) would be more productive.