How is the 'audio' column generated?

thegame · October 17, 2023, 11:35am

Can anyone share the code that generates the ‘audio’ column?
My sample code is below, but I think the ‘audio’ column it produces is wrong.

import os

import librosa
import pandas as pd

# Define the directory containing audio and text files
data_directory = '/content/drive/MyDrive/1'

# List audio files and matching text files
audio_files = [os.path.join(data_directory, file) for file in os.listdir(data_directory) if file.endswith('.wav')]
text_files = [file.replace('.wav', '.txt') for file in audio_files]

# Initialize lists to store data for the CSV
path_column = []
sentence_column = []
audio_column = []
age_column = []
gender_column = []

# Constants for age and gender
age_value = 35
gender_value = 'male'

# Process audio and text files
for audio_file, text_file in zip(audio_files, text_files):
    # Read the text from the matching text file
    with open(text_file, 'r') as text:
        sentence = text.read()

    # Use the provided code to generate the 'audio' column
    audio, sampling_rate = librosa.load(audio_file, sr=48000)

    # Append data to the respective columns
    path_column.append(audio_file)
    sentence_column.append(sentence)
    audio_column.append(audio)
    age_column.append(age_value)
    gender_column.append(gender_value)

# Create a DataFrame to hold the data
data = {
    'path': path_column,
    'sentence': sentence_column,
    'audio': audio_column,
    'age': age_column,
    'gender': gender_column
}

df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
output_csv = '/content/drive/MyDrive/1/output.csv'
df.to_csv(output_csv, index=False)

print(f'CSV file "{output_csv}" has been created with the specified columns.')

bozden · October 17, 2023, 12:06pm

Hey @thegame, welcome. Is this really related to Mozilla Common Voice?

Common Voice datasets keep the audio as .mp3 files under the clips directory, and the metadata is in tab-separated .tsv files. For the audio, the .tsv files store only the path field, which is the file name of the clip, so use os.path.join(DATASET_DIR, "clips", CLIP_NAME) to access it. If your library does not support .mp3 files, convert them to .wav beforehand (and change .mp3 to .wav in the file name).
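As a rough illustration of the lookup described above, here is a minimal sketch that reads a tab-separated metadata file and resolves each clip's full path. The function names `iter_clips` and `to_wav_name` are made up for this example, and the default file name `validated.tsv` and the `sentence` column are assumptions about the release you downloaded, so adjust them to match your files.

```python
import csv
import os


def to_wav_name(clip_name):
    """Swap the extension to .wav, for clips converted beforehand (hypothetical helper)."""
    stem, _ext = os.path.splitext(clip_name)
    return stem + ".wav"


def iter_clips(dataset_dir, tsv_name="validated.tsv"):
    """Yield (clip_path, sentence) pairs from a Common Voice style .tsv file.

    Assumes the .tsv has 'path' and 'sentence' columns and that the audio
    lives in a 'clips' directory next to it.
    """
    tsv_path = os.path.join(dataset_dir, tsv_name)
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters inside sentences untouched.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            clip_path = os.path.join(dataset_dir, "clips", row["path"])
            yield clip_path, row["sentence"]
```

Something like `for clip_path, sentence in iter_clips(DATASET_DIR): ...` would then give you one path and transcript per row, without touching the audio itself.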

As a side note, librosa returns the audio as a floating-point time series in an np.ndarray, see here:
https://librosa.org/doc/latest/generated/librosa.load.html#librosa.load
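To make the side note concrete, here is a small sketch of why such an array is awkward as a CSV cell. It uses a silent one-second array as a stand-in for what `librosa.load(..., sr=48000)` would return (that stand-in is an assumption, no real file is read), and it assumes the CSV writer falls back to the array's string form.

```python
import numpy as np

sampling_rate = 48_000

# Stand-in for the array librosa.load would return for one second of mono audio.
waveform = np.zeros(sampling_rate, dtype=np.float32)

# One value per sample: a single second is already 48,000 numbers.
print(waveform.shape, waveform.dtype)

# The default string form of a long array is abbreviated with "...",
# so a text cell built from it no longer holds the full signal.
cell_text = str(waveform)
print(cell_text)
```

So keeping only the file path in the metadata, and decoding the audio when it is needed, is usually the safer layout.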

kathyreid · October 17, 2023, 12:08pm

I think this is from the Hugging Face Common Voice dataset. If so, this question should be directed to the Hugging Face forum.

bozden · October 17, 2023, 12:18pm

But although they have a different folder structure, the audio files are also .mp3 there (if you download them)…

I think this is a completely different dataset…

thegame · October 18, 2023, 5:58am

Yeah, I tried to follow this blog post, but the updated Colab has a column called ‘audio’ which has its own metadata.
I tried to create my own parallel dataset to match the blog’s dataset structure, but the Mozilla dataset does not seem to have the ‘audio’ column, and I’m not sure how the ‘audio’ column was generated.
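For what it's worth, in the Hugging Face `datasets` library an ‘audio’ column is usually not written by hand. A column of file paths is cast to the `Audio` feature, and the library decodes each file when a row is accessed. Below is a minimal sketch under that assumption. The helpers `build_columns` and `to_hf_dataset` are hypothetical names, the 16 kHz rate is only a placeholder, and the exact decoded structure depends on the library version, so this may not be how the blog's dataset was actually built.

```python
def build_columns(rows):
    """Turn (path, sentence) pairs into column lists (hypothetical helper).

    The 'audio' column starts out as plain file paths; nothing is decoded here.
    """
    paths = [path for path, _sentence in rows]
    sentences = [sentence for _path, sentence in rows]
    return {"path": paths, "audio": list(paths), "sentence": sentences}


def to_hf_dataset(columns, sampling_rate=16_000):
    """Cast the path column to the Audio feature (needs the `datasets` package)."""
    # Imported lazily so build_columns stays usable without the library.
    from datasets import Audio, Dataset

    dataset = Dataset.from_dict(columns)
    # After the cast, reading a row decodes the file and resamples it
    # to sampling_rate on the fly.
    return dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
```

With real files in place, `to_hf_dataset(build_columns(rows))` would give a dataset whose ‘audio’ entries carry the decoded samples and sampling rate alongside the path, which looks like the extra metadata mentioned above.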

bozden · October 18, 2023, 1:30pm

Yeah, I explained that part above. But I’m not familiar with that model/dataset…

thegame · October 19, 2023, 1:38pm

If I want to open-source my data, should I upload the dataset to Mozilla first or to Hugging Face?
I need it in the Hugging Face format first, though.

bozden · October 19, 2023, 1:45pm

As @kathyreid mentioned, you should ask format-related questions in the HF forums.

You cannot upload anything to Common Voice, as the voice corpus is recorded one by one by users. You should open-source the merged dataset on HF.