Tagging an audio clip as whisper/voice

There has been some research into different modes of speaking: solo speech, whispered speech, NAM (non-audible murmur) speech, etc. Currently, Common Voice discards whispered data.

Why this is important: There are profound differences in acoustic characteristics between whispered and normal speech [1]. With whispered input, a model might learn a better representation of the human voice. A few datasets of whispered speech exist (CHAINS, wTIMIT, wMRT), but they are quite small and have a limited number of speakers.
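For context, one of the key acoustic differences is that whispering lacks vocal-fold vibration, so a rough way to spot likely-whispered clips is to check how many frames carry voicing. Here is a minimal sketch, assuming librosa is available; the function name and threshold are hypothetical illustrations, not anything Common Voice actually does (in practice the tag would presumably be self-reported by contributors or confirmed during review):

```python
# Heuristic sketch (hypothetical, not part of Common Voice):
# flag a clip as likely whispered if very few frames show voicing,
# since whispered speech has no vocal-fold vibration / fundamental frequency.
import librosa
import numpy as np

def is_probably_whispered(path, voiced_ratio_threshold=0.1):
    """Return True if the clip at `path` looks whispered.

    `voiced_ratio_threshold` is an assumed tuning parameter chosen for
    illustration only.
    """
    y, sr = librosa.load(path, sr=None)
    # pYIN pitch tracking gives a per-frame voiced/unvoiced decision.
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
    )
    voiced_ratio = np.mean(voiced_flag)
    return voiced_ratio < voiced_ratio_threshold
```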

Bottom line: should we instead just create a ‘whisper’ tag and collect that data?


I think this would be a question for the #deep-speech team; if this is a requirement from their side, we will adapt how we tag and collect data in Common Voice.