License agreement for using a third-party dataset as a training corpus

kubofhromoslav · July 21, 2021, 9:25pm

Hi folks! (I am not sure whether to put this into the CV or the DS category…)
I am working in a small informal team on speech recognition for Esperanto. In addition to the Esperanto Common Voice dataset, we would like to use other datasets. We have some options for recordings, but all of them are all-rights-reserved. The owners are willing to provide the recordings for us to use for training a model, but they do not necessarily want to release the dataset to everyone.

The DeepSpeech release notes (for example, for version 0.9.3) mention as one of the source training corpora “approximately 1700 hours of transcribed WAMU (NPR) radio shows explicitly licensed to use as training corpora.”

Could you please provide a template license agreement that we could adapt and use with our potential partners? Or point me to a license agreement that has already been prepared? Many thanks!
