How is the 'audio' column generated?

Can anyone share the code that generates the ‘audio’ column?
I've put my sample code below, but I think the ‘audio’ column is wrong.

import os

import librosa
import pandas as pd

# Define the directory containing audio and text files
data_directory = '/content/drive/MyDrive/1'

# List audio files and matching text files
audio_files = [os.path.join(data_directory, file) for file in os.listdir(data_directory) if file.endswith('.wav')]
text_files = [file.replace('.wav', '.txt') for file in audio_files]

# Initialize lists to store data for the CSV
path_column = []
sentence_column = []
audio_column = []
age_column = []
gender_column = []

# Constants for age and gender
age_value = 35
gender_value = 'male'

# Process audio and text files
for audio_file, text_file in zip(audio_files, text_files):
    # Read the text from the matching text file
    with open(text_file, 'r') as text:
        sentence = text.read().strip()

    # Use the provided code to generate the 'audio' column
    audio, sampling_rate = librosa.load(audio_file, sr=48000)

    # Append data to the respective columns
    # (assumed: one value per column list, with the raw waveform as 'audio')
    path_column.append(audio_file)
    sentence_column.append(sentence)
    audio_column.append(audio)
    age_column.append(age_value)
    gender_column.append(gender_value)

# Create a DataFrame to hold the data
data = {
    'path': path_column,
    'sentence': sentence_column,
    'audio': audio_column,
    'age': age_column,
    'gender': gender_column
}

df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
output_csv = '/content/drive/MyDrive/1/output.csv'
df.to_csv(output_csv, index=False)

print(f'CSV file "{output_csv}" has been created with the specified columns.')

Hey @thegame, welcome. Is this really related to Mozilla Common Voice?

Common Voice datasets keep the audio in .mp3 format under the clips directory, and the metadata are in .tsv (tab-separated) files. In those .tsv files the path field holds only the file name of the audio file, so use os.path.join(DATASET_DIR, "clips", CLIP_NAME) to access it. If your library does not support .mp3 files, convert them to .wav beforehand (and change .mp3 to .wav in the file name).
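To make the path handling concrete, here is a minimal sketch that reads one Common Voice .tsv file and builds the full clip paths. The function name resolve_clip_paths and the audio_ext parameter are made up for illustration; the "path" and "sentence" column names are the ones I would expect in the .tsv header, so check them against your download.

```python
import csv
import os


def resolve_clip_paths(tsv_path, dataset_dir, audio_ext=None):
    """Return (full_audio_path, sentence) pairs for one Common Voice split.

    audio_ext is a hypothetical convenience: pass ".wav" if the clips were
    converted from .mp3 beforehand and renamed accordingly.
    """
    clips_dir = os.path.join(dataset_dir, "clips")
    pairs = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters inside sentences as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            clip_name = row["path"]  # file name only, no directory
            if audio_ext is not None:
                clip_name = os.path.splitext(clip_name)[0] + audio_ext
            pairs.append((os.path.join(clips_dir, clip_name), row["sentence"]))
    return pairs
```

Calling resolve_clip_paths("validated.tsv", DATASET_DIR, audio_ext=".wav") would then give paths pointing at the converted .wav files, assuming they sit in the same clips directory.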

As a side note, librosa.load returns the audio as a time series of floating-point samples in an np.ndarray, together with the sampling rate.
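A small sketch of what that means in practice: the waveform and the sampling rate come back as a pair, and they can be bundled with the file path into one record. The dictionary layout below (path, array, sampling_rate) is what I understand the decoded ‘audio’ entries in Hugging Face audio datasets to look like, so treat it as an assumption; both function names are hypothetical.

```python
def make_audio_entry(path, samples, sampling_rate):
    """Bundle one decoded clip into a single record.

    Assumed layout: file path, waveform samples, and sampling rate in Hz.
    """
    return {"path": path, "array": samples, "sampling_rate": sampling_rate}


def load_audio_entry(path, target_sr=16000):
    """Decode a file with librosa and wrap the result."""
    import librosa  # third-party; imported here so the helper above has no dependencies

    # librosa.load resamples to target_sr and returns (np.ndarray of floats, sampling rate).
    samples, sampling_rate = librosa.load(path, sr=target_sr)
    return make_audio_entry(path, samples, sampling_rate)
```

Note that writing such an array into a CSV cell only stores its text representation, which is likely why an ‘audio’ column built that way looks wrong.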

I think this is from the Hugging Face Common Voice dataset. If so, this question should be directed to the Hugging Face forum.

Although they use a different folder structure, the audio files there are also .mp3 (if you download them)…

I think this is a completely different dataset…

Yeah, I tried to follow this blog post, but the updated Colab has a column called ‘audio’ which has its own metadata.
I tried to create a parallel dataset that matches the blog's dataset structure, but the Mozilla dataset does not seem to have the ‘audio’ column, and I’m not sure how the ‘audio’ column was generated.
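For what it's worth, here is a hedged sketch of how such a column can be produced with the Hugging Face datasets library: the ‘audio’ column starts out as plain file paths and is then cast to the library's Audio feature, which decodes each file when a row is accessed. I am assuming this is roughly what the blog's notebook relies on; build_columns and to_hf_dataset are hypothetical helper names, and the age and gender defaults just mirror the constants in my sample code.

```python
def build_columns(pairs, age=35, gender="male"):
    """Turn (audio_path, sentence) pairs into column lists.

    The 'audio' column holds the same file paths as 'path' at this stage;
    decoding is left to the Audio feature below.
    """
    columns = {"path": [], "audio": [], "sentence": [], "age": [], "gender": []}
    for audio_path, sentence in pairs:
        columns["path"].append(audio_path)
        columns["audio"].append(audio_path)
        columns["sentence"].append(sentence)
        columns["age"].append(age)
        columns["gender"].append(gender)
    return columns


def to_hf_dataset(columns, sampling_rate=16000):
    """Build a Dataset whose 'audio' column is decoded on access."""
    from datasets import Audio, Dataset  # third-party; imported lazily

    dataset = Dataset.from_dict(columns)
    # Casting tells the library to decode (and resample) the files lazily.
    return dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
```

With this approach there is no need to store waveforms in a CSV at all: the metadata stays small, and the decoded samples and sampling rate are produced on demand when a row's ‘audio’ value is read.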

Yeah, I explained that part above. But I’m not familiar with that model/dataset…

For uploading the dataset, should I upload to Mozilla first or to Hugging Face if I want to open-source my data?
I need the Hugging Face format first, though.

As @kathyreid mentioned, you should ask the format-related questions in the HF forums.

You cannot upload anything to Common Voice, as the voice corpus is recorded one by one by users. You should open-source the merged dataset on HF.
