This is an interesting reflection. In general, the sentence collector should be used only when no other large source of text is available for mass validation, because manually reviewing each sentence is painful, as you described.
Has your language already gone through the Wikipedia process or, if it is a European language, the Europarl one? This should give a language enough of a buffer for close to 1000 hours without repetitions.