What are the rules for what constitutes accuracy?

The question asks whether or not the sentence was spoken accurately.

The options are Yes or No.

One was accurate, but with a couple of stutters, e.g. ‘d-decide’ instead of ‘decide’. Another had a “dropped” consonant, e.g. ‘goin’ instead of ‘going’.

Such enunciation is not uncommon. Nor is occasional slurring, inappropriate intonation, or generous amounts of ums and ahs. In fact, I would say that speech that was 100% correct would likely strike the listener as unnatural.

tl;dr: is “good enough” a Yes or a No?

@kdavis can you give us some thoughts here?

Note, we are working on being more specific about validation criteria on the site itself:


If the audio does not match the text, e.g. a dropped consonant “goin” instead of “going”, then the audio should be marked as invalid.

If, however, the audio has slight stutters, e.g. “d-decide” instead of “decide”, then I think the audio should be marked as valid.

The logic is that normal speech contains disfluencies, e.g. “d-decide” instead of “decide”, which an STT system should learn to deal with. However, the STT system needs to learn a correct mapping from audio to text from the data we are creating, which it cannot do if it is given incorrect audio-transcript pairs, e.g. the “goin” audio being transcribed as “going”.
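
To make the rule concrete, here is a rough sketch in Python. It is only an illustration of the reasoning above, not how the site validates clips: validation is done by people listening. The names `collapse_stutter` and `is_valid_clip` are made up for this example, and `heard` is assumed to be a faithful written rendering of what the reviewer heard, with stutters written with hyphens as in “d-decide”.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its words, keeping inner hyphens and apostrophes."""
    return re.findall(r"[a-z][a-z'’-]*", text.lower())


def collapse_stutter(word: str) -> str:
    """Turn a stuttered word such as 'd-decide' into 'decide'.

    Assumption: a stutter is written as hyphen-separated fragments,
    each of which is a prefix of the final, complete word.
    Ordinary hyphenated words such as 'well-known' are left untouched.
    """
    *fragments, final = word.split("-")
    if fragments and all(f and final.startswith(f) for f in fragments):
        return final
    return word


def is_valid_clip(prompt: str, heard: str) -> bool:
    """Return True if the clip should be marked valid under the rule above.

    Stutters are tolerated; any other difference in the words
    (for example a dropped consonant) makes the clip invalid.
    """
    prompt_words = tokenize(prompt)
    heard_words = tokenize(heard)
    if len(prompt_words) != len(heard_words):
        return False
    return all(
        h == p or collapse_stutter(h) == p
        for p, h in zip(prompt_words, heard_words)
    )


# The two cases from this thread:
print(is_valid_clip("I will decide later.", "I will d-decide later"))  # stutter
print(is_valid_clip("We are going home.", "We are goin home"))  # dropped consonant
```

The first call corresponds to the stutter case (mark as valid) and the second to the dropped-consonant case (mark as invalid). The sketch deliberately says nothing about ums and ahs, slurring, or intonation, since those are not covered by the answer above.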
