Regarding the email on Deletion Request

Hi, today I received an email regarding a “Deletion Request”. Was it sent to everyone? Because I have not been able to find those IDs.

Osman, I also got it.

And as far as I can understand, these IDs can come from any of the many language datasets over there (124 in v17.0 + 120 in v16.1 + …). E.g., if these people requested deletion between v16.1 and v17.0, their data will not be in v17.0, but it might still exist in v16.1 or any prior version people have already downloaded.

From the dataset consumer's perspective, it is required (for ethical reasons and under local/international laws) to keep these IDs (and the ones from a prior e-mail) in a file and drop them from the datasets before consuming them (e.g. before training a model).

I’m aware that the best course of action would be to expand the datasets (.tar.gz), remove the recordings and related metadata, probably re-run the splitting algorithm, etc., but that would not be practical. I think it will be enough to filter any metadata table to exclude them before usage, so make such a table part of your workflow. And no, no lawyer will say that is enough :slight_smile:
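A minimal sketch of that filtering step with pandas, assuming the metadata identifies contributors via a `client_id` column (the IDs and file names here are made up for illustration):

```python
import pandas as pd

# Made-up stand-ins for a release's clips.tsv and a deletion-request list.
clips = pd.DataFrame({
    "client_id": ["a1", "b2", "c3", "a1"],
    "path": ["x.mp3", "y.mp3", "z.mp3", "w.mp3"],
})
deletion_requests = pd.DataFrame({"client_id": ["b2"]})

# Anti-join: keep only recordings whose contributor has NOT requested deletion.
kept = clips[~clips["client_id"].isin(deletion_requests["client_id"])]
print(len(kept))
```

In a real pipeline you would load both tables from TSV files and run this filter before any splitting or training step.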

It would be very nice if we could have an API endpoint to get the up-to-date cumulative list.


I have started something similar at:

but I haven’t yet updated it for the most recent release. It’s a TSV, not an API endpoint, but still useful.


Awesome, thank you @kathyreid !


This is correct and thanks so much for helping explain this before I could get to it!

I also added such a table to my workflow, right inside the splitting algorithms, using a simple pandas filtering function.
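A per-version/locale summary like the table below can be produced with a pandas groupby over the removal log. A minimal sketch, with a few made-up rows (one row per removed recording, tagged with the dataset version and locale it was removed from):

```python
import pandas as pd

# Made-up removal log for illustration.
removed = pd.DataFrame({
    "ver": ["9.0", "9.0", "9.0", "10.0"],
    "lc":  ["ca", "en", "en", "ru"],
})

# Count removals per (version, locale) pair.
summary = removed.groupby(["ver", "lc"]).size().reset_index(name="recordings")
print(summary)
```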

Here is the result of the removals when I ran it against 1336 datasets in total:

version	locale	recordings removed
7.0	en	24
7.0	it	75
7.0	ru	3
8.0	en	24
8.0	it	75
8.0	ru	6
9.0	ca	1
9.0	en	25
9.0	it	75
9.0	ru	7
10.0	ca	1
10.0	en	26
10.0	it	75
10.0	ru	15
10.0	zh-TW	3
11.0	ca	2
11.0	en	27
11.0	it	75
11.0	ru	22
11.0	zh-TW	10
12.0	ca	4
12.0	en	29
12.0	it	75
12.0	ru	45
12.0	zh-TW	13
13.0	ca	4
13.0	en	35
13.0	it	75
13.0	ru	46
13.0	zh-TW	18
14.0	ca	5
14.0	en	40
14.0	it	75
14.0	ru	46
14.0	zh-TW	23
15.0	en	27
15.0	it	75
15.0	ru	47
16.1	en	27
16.1	it	75
16.1	ru	47

So, if you do not work with these languages / dataset versions, you will be fine - for now. But it is best to implement this for future deletions. I’m relieved to see they result in a low number of deletions.