Public on-site acquisition and/or validation without touching

Dear CommonVoice-Team & Community,

I am working for a public museum in Germany and we love the platform you have created!

In fact, we and our partners (several institutes around Europe) love it so much that we’d like to contribute to it via a hackathon.

The idea is to bring the speech acquisition and/or validation into the public space so that our visitors can be part of this.

We have noticed that you have already prepared posters and slides for instructions.

However, mainly due to the pandemic, devices that require touching are not an option for interaction.

As we saw that the development of third-party apps is not desired, we thought about setting up a challenge to create a set of tools that interact with the website using gestures, eye tracking, or similar.

I’d like to invite all of you to share your thoughts on this. Do you know if there are similar attempts?

Thanks a lot for your work!



I would say the coolest solution would be voice commands using a German DeepSpeech model. Since the website supports shortcuts, writing a script for this should be relatively easy. If the model is still not good enough, one could cheat a little and use the Google API as well.
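
A rough sketch of the scripting part, assuming the speech model already returns a text transcript. The German command words and the key letters are placeholders for illustration, not the site's confirmed shortcuts, and `transcript_to_key` is a made-up helper name:

```python
from typing import Optional

# Hypothetical mapping from spoken German command words to shortcut keys.
# The key letters are placeholders; check the site's shortcut help for the real ones.
COMMAND_TO_KEY = {
    "ja": "y",            # clip is correct
    "nein": "n",          # clip is incorrect
    "abspielen": "p",     # play the clip
    "überspringen": "s",  # skip the clip
}


def transcript_to_key(transcript: str) -> Optional[str]:
    """Return the shortcut key for the first known command word, or None."""
    for word in transcript.lower().split():
        # Drop punctuation that a recognizer might attach to a word.
        key = COMMAND_TO_KEY.get(word.strip(".,!?"))
        if key is not None:
            return key
    return None
```

The returned key would then be sent to the browser by whatever input-automation tool is used on the kiosk; that part is left out here on purpose.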

An easier solution would be to offer gloves or disinfect the computer after every use.


Hey Stefan,

thank you for the reply! The hint with the shortcuts is pretty helpful!

A German speech model would only allow us to interact with German speakers. We would love to invite as many people as possible.
The problem with speech models is that they are language-specific, which would require us to detect the language first (in fact, this is something we are currently looking into, but it’s not that easy :smiley:). Furthermore, we would need a speech model for every language.

One possibility might be to use a hand-tracking device that acts as a cursor.
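
A minimal sketch of the cursor idea, assuming the tracking device reports the hand position as normalized coordinates between 0 and 1. The class name `SmoothedCursor` and the smoothing factor are assumptions for illustration and not tied to any particular device SDK:

```python
from typing import Optional, Tuple


class SmoothedCursor:
    """Maps normalized hand positions to screen pixels with exponential smoothing."""

    def __init__(self, width: int, height: int, alpha: float = 0.5) -> None:
        self.width = width
        self.height = height
        self.alpha = alpha  # 1.0 = no smoothing, smaller = steadier but slower
        self._x: Optional[float] = None
        self._y: Optional[float] = None

    def update(self, nx: float, ny: float) -> Tuple[int, int]:
        # Clamp to the screen so a hand outside the tracked area stays at the edge.
        nx = min(max(nx, 0.0), 1.0)
        ny = min(max(ny, 0.0), 1.0)
        target_x = nx * (self.width - 1)
        target_y = ny * (self.height - 1)
        if self._x is None or self._y is None:
            # First sample: jump straight to the hand position.
            self._x, self._y = target_x, target_y
        else:
            # Move a fraction of the way toward the new position to damp jitter.
            self._x += self.alpha * (target_x - self._x)
            self._y += self.alpha * (target_y - self._y)
        return round(self._x), round(self._y)
```

Smoothing matters here because raw hand positions tend to jitter, which makes small buttons hard to hit.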

There are DeepSpeech models for over 40 languages already. As for detecting the language, you could ask people when they buy the ticket and print out a barcode for them, which they could wave in front of a barcode reader to switch language. Or they could just say the name of their language in their own language. Lots of possibilities :smiley:
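
Both options could be sketched roughly like this. The `LANG:` barcode format, the small language table, and both function names are made up for illustration, and the spoken variant assumes some recognizer already delivers a transcript:

```python
from typing import Optional

# Hypothetical lookup: language name in its own language -> locale code.
LANGUAGE_NAMES = {
    "deutsch": "de",
    "english": "en",
    "français": "fr",
    "español": "es",
}


def locale_from_barcode(payload: str) -> Optional[str]:
    """Assumes the ticket barcode encodes text like 'LANG:de' (made-up format)."""
    prefix = "LANG:"
    if not payload.startswith(prefix):
        return None
    code = payload[len(prefix):].strip().lower()
    # Only accept locales we actually have a model for.
    return code if code in LANGUAGE_NAMES.values() else None


def locale_from_spoken_name(transcript: str) -> Optional[str]:
    """Pick the locale from the first recognized language name in the transcript."""
    for word in transcript.lower().split():
        code = LANGUAGE_NAMES.get(word.strip(".,!?"))
        if code is not None:
            return code
    return None
```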

Hey Francis,

oh wow! Speech models for 40 languages is a lot! :open_mouth:

I actually considered this. I’d use a simple keyword spotting approach, but I am lacking the data. Do you know a good dataset to use? Actually, I haven’t checked whether this is already part of CommonVoice.
I think this could be one of the simplest and most natural solutions …

A real 3D keyboard (without a surface) may be possible soon.


Showing the sentence on a screen and playing the audio clip.
Tracking the hand movement with a camera for hand gestures.
Swiping left = clip is correct
Swiping right = clip is incorrect
Swiping down = skip clip
Swiping up = report an error in sentence
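
The gesture mapping above could be sketched like this, assuming the camera tracking yields a list of hand positions per gesture in image coordinates where y grows downward. The function name, the distance threshold, and the coordinate convention are assumptions, and a mirrored camera image would need left and right swapped:

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (x, y), normalized to 0..1, y grows downward

# Swipe direction -> action, as listed above.
ACTIONS = {
    "left": "clip is correct",
    "right": "clip is incorrect",
    "down": "skip clip",
    "up": "report an error in sentence",
}


def classify_swipe(track: Sequence[Point], min_distance: float = 0.2) -> Optional[str]:
    """Return 'left', 'right', 'up' or 'down', or None if the hand barely moved."""
    if len(track) < 2:
        return None
    dx = track[-1][0] - track[0][0]
    dy = track[-1][1] - track[0][1]
    # Ignore small movements so an idle hand does not trigger an action.
    if max(abs(dx), abs(dy)) < min_distance:
        return None
    # The dominant axis decides the direction.
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"
```

For example, a track that moves from the right edge to the left edge is classified as `"left"`, and `ACTIONS["left"]` gives the action to trigger.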