@wannaphong has brought up a topic on GitHub I’d like to discuss here. Namely, the sentences in the sentence-collector.txt file in the Common Voice data folder do not necessarily match the sentences in the Sentence Collector DB, and are therefore not always findable through the Sentence Collector API.
This happens in the following scenario:
- Somebody uploads a sentence
- That sentence passes validation
- The language has some cleanup transformations specified, which get applied when exporting to that folder
- The sentence in the sentence-collector.txt file is the cleaned-up sentence, while the sentence in the Sentence Collector DB is still the “old” one
- Therefore the API logic does not find the sentence and returns nothing
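To make the mismatch concrete, here is a minimal Python sketch of the scenario above. Everything in it is hypothetical: the `cleanup` rule (normalizing a curly apostrophe), the in-memory `db` set, and the exact-match `lookup` function are stand-ins for illustration, not the actual Sentence Collector code or API.

```python
# Hypothetical stand-ins for illustration; not the real Sentence Collector code.

def cleanup(sentence: str) -> str:
    """Assumed example cleanup rule: straighten curly apostrophes, trim whitespace."""
    return sentence.replace("\u2019", "'").strip()


# The database keeps the sentence exactly as it was uploaded and validated.
db = {"It\u2019s raining today."}

# The export step applies the cleanup, so the file holds the cleaned sentence.
exported = [cleanup(s) for s in db]


def lookup(sentence: str) -> bool:
    """Exact-match lookup, standing in for searching the API by sentence text."""
    return sentence in db


# Look up every exported sentence in the database.
found = [lookup(s) for s in exported]
print(exported)
print(found)
```

The exported sentence uses a straight apostrophe while the stored one still has the curly apostrophe, so the exact-match lookup fails for it.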
This can currently happen with English, French and Thai; other languages do not have any cleanups specified. These cleanups exist mostly because some validations were missing at the beginning, and they were adjusted whenever validations were adjusted, to catch already existing sentences that should be fixed.
In general, I have always been of the opinion that we should avoid the cleanup scripts as much as possible. However, thinking about this problem has changed my mind. I think the cleanup can be a good way of letting contributors upload sentences and then correcting them as needed, which means fewer “errors” for contributors. For certain things, we can accept the sentence and apply a transformation to it, rather than simply rejecting it and leaving the contributor to figure out what needs to be fixed.
Therefore, I’m suggesting the following:
- Instead of running the cleanup while exporting, let’s run it after validating sentences and before writing them to the Sentence Collector database
- This would mean that there is no discrepancy between the sentence-collector.txt file and the database, and the API would also work correctly
- For all the existing sentences, we would need to run a migration that applies all the cleanups to the sentences in the database before moving the cleanup step earlier in the process.
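The proposal above could look roughly like the following Python sketch. All names here (`validate`, `cleanup`, `add_sentence`, `migrate`, `export`) and the rules inside them are assumptions made for illustration; the real validation and cleanup rules are language-specific and live in the Sentence Collector code.

```python
# Hypothetical sketch of the proposed flow; names and rules are assumptions.

def validate(sentence: str) -> bool:
    """Assumed placeholder validation: reject empty or whitespace-only input."""
    return bool(sentence.strip())


def cleanup(sentence: str) -> str:
    """Assumed example cleanup rule: straighten curly apostrophes, trim whitespace."""
    return sentence.replace("\u2019", "'").strip()


def add_sentence(db: set, sentence: str) -> bool:
    """Validate first, then clean up, then write the cleaned sentence to the DB."""
    if not validate(sentence):
        return False
    db.add(cleanup(sentence))
    return True


def migrate(db: set) -> set:
    """One-off migration: apply the cleanup to every sentence already stored.

    Because this returns a set, sentences that become identical after
    cleanup collapse into a single entry.
    """
    return {cleanup(s) for s in db}


def export(db: set) -> list:
    """Export no longer transforms anything; it writes what the DB holds."""
    return sorted(db)


# Existing data is migrated once, then new uploads go through add_sentence.
db = migrate({"It\u2019s raining today."})
add_sentence(db, "Don\u2019t stop now.")

# Every exported sentence is now findable by exact match in the database.
print(all(sentence in db for sentence in export(db)))
```

With the cleanup moved in front of the database write, the export becomes a plain dump, so the exported file and the database cannot drift apart through cleanup rules.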
There might be something I’m missing, so I’d love to hear all your input on this. What do you think? Can you imagine something that would break if we did this? Do you see any downsides?