The question asks whether the sentence was spoken accurately.
The options are Yes or No.
One recording was accurate, but with a couple of stutters, e.g. ‘d-decide’ instead of ‘decide’. Another had a “dropped” consonant, e.g. ‘goin’ instead of ‘going’.
Such enunciation is not uncommon. Nor is occasional slurring, inappropriate intonation, or generous amounts of ums and ahs. In fact, I would say that speech that was 100% correct would likely strike the listener as unnatural.
If the audio does not match the text, e.g. a dropped consonant “goin” instead of “going”, then the audio should be marked as invalid.
If, however, the audio has slight stutters, e.g. “d-decide” instead of “decide”, then I think the audio should be marked as valid.
The logic is that in normal speech people have disfluencies, e.g. “d-decide” instead of “decide”, which an STT system should learn to deal with. However, the STT system needs to learn a correct mapping from audio to text from the data we are creating, which it cannot do if it’s given incorrect audio-transcript pairs, e.g. the “goin” audio being transcribed as “going”.
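
To make the rule above concrete, here is a rough Python sketch of how I would apply it. It is only an illustration under my own assumptions: the function names are made up, and `heard` stands for a literal note of what the reviewer actually heard, not anything the validation tool provides. Stutters are collapsed before comparing; any other difference from the prompt makes the clip invalid.

```python
import re

def tokens(text):
    """Lowercase words, keeping internal hyphens and apostrophes."""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())

def collapse_stutter(token):
    """Turn 'd-decide' or 'd-d-decide' into 'decide'; leave 'well-known' alone."""
    *starts, word = token.split("-")
    if starts and all(word.startswith(s) for s in starts):
        return word
    return token

def is_valid_clip(prompt, heard):
    """True if the words heard match the prompt once stutters are collapsed."""
    prompt_words = tokens(prompt)
    # Hyphenated words that appear in the prompt itself are left untouched.
    heard_words = [t if t in prompt_words else collapse_stutter(t)
                   for t in tokens(heard)]
    return heard_words == prompt_words

print(is_valid_clip("I need to decide", "I need to d-decide"))  # stutter only
print(is_valid_clip("I am going home", "I am goin home"))       # dropped consonant
```

The first call treats the stutter as an acceptable disfluency, while the second flags the dropped consonant because the audio no longer matches the text. Real reviewing is of course done by ear; this just spells out the decision.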