There has been some research into different modes of speaking: solo speech, whispered speech, NAM (non-audible murmur) speech, etc. Currently, Common Voice discards whispered data.
Why this is important: There are profound acoustic differences between whispered and normal speech; most notably, whisper lacks the vocal-fold vibration (and hence the fundamental frequency) of phonated speech. With whispered input, a model might learn a better representation of the human voice. A few whispered-speech datasets do exist (CHAINS, wTIMIT, wMRT), but they are small and cover a limited number of speakers.
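To make the acoustic difference concrete, here is a minimal, self-contained sketch (using synthetic stand-in signals, not real recordings): phonated speech is quasi-periodic, so its normalized autocorrelation shows a strong peak at the pitch period, while whisper is essentially filtered noise and shows no such peak. The `voicing_strength` function and the toy signals are illustrative assumptions, not part of any Common Voice pipeline.

```python
import numpy as np

def voicing_strength(signal, sr, fmin=75, fmax=400):
    """Peak of the normalized autocorrelation within the plausible
    pitch-lag range [sr/fmax, sr/fmin].  Voiced (phonated) speech
    yields a value near 1; whisper, lacking glottal periodicity,
    yields a value near 0."""
    x = signal - signal.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / ac[0]                      # lag 0 normalized to 1
    lo, hi = int(sr / fmax), int(sr / fmin)
    return float(ac[lo:hi].max())

sr = 16000
n = 4000                                 # 0.25 s of audio
t = np.arange(n) / sr
rng = np.random.default_rng(0)

# Toy "phonated" signal: 120 Hz pulse train plus a little noise.
voiced = np.sign(np.sin(2 * np.pi * 120 * t)) + 0.1 * rng.standard_normal(n)
# Toy "whispered" signal: broadband noise, no periodic glottal source.
whispered = rng.standard_normal(n)

print(voicing_strength(voiced, sr))      # high (strong pitch peak)
print(voicing_strength(whispered, sr))   # low (no pitch peak)
```

An ASR or speech-representation model trained only on phonated speech never sees the "low voicing" regime, which is one argument for keeping rather than discarding whispered contributions.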
Bottom line: Should we instead create a 'whisper' tag and collect that data?