Can anyone share the code that generates the ‘audio’ column?
My sample code is below, but I think the ‘audio’ column it produces is wrong.
import os
import librosa
import pandas as pd
# Define the directory containing audio and text files
data_directory = '/content/drive/MyDrive/1'
# List audio files and matching text files
audio_files = [os.path.join(data_directory, file) for file in os.listdir(data_directory) if file.endswith('.wav')]
text_files = [file.replace('.wav', '.txt') for file in audio_files]
# Initialize lists to store data for the CSV
path_column = []
sentence_column = []
audio_column = []
age_column = []
gender_column = []
# Constants for age and gender
age_value = 35
gender_value = 'male'
# Process audio and text files
for audio_file, text_file in zip(audio_files, text_files):
    # Read the text from the matching text file
    with open(text_file, 'r', encoding='utf-8') as text:
        sentence = text.read().strip()
    # Use the provided code to generate the 'audio' column
    audio, sampling_rate = librosa.load(audio_file, sr=48000)
    # Append data to the respective columns
    path_column.append(audio_file)
    sentence_column.append(sentence)
    audio_column.append(audio)
    age_column.append(age_value)
    gender_column.append(gender_value)
# Create a DataFrame to hold the data
data = {
    'path': path_column,
    'sentence': sentence_column,
    'audio': audio_column,
    'age': age_column,
    'gender': gender_column
}
df = pd.DataFrame(data)
# Save the DataFrame to a CSV file
output_csv = '/content/drive/MyDrive/1/output.csv'
df.to_csv(output_csv, index=False)
print(f'CSV file "{output_csv}" has been created with the specified columns.')
Hey @thegame, welcome. Is this really related to Mozilla Common Voice?
Common Voice datasets keep the audio in .mp3 format under the clips directory, and the metadata is in .tsv (tab-separated) files. The path field in those .tsv files holds only the file name of the audio clip, not a full path, so use os.path.join(DATASET_DIR, "clips", CLIP_NAME) to access the file. If your library does not support .mp3 files, convert them to .wav beforehand (and change .mp3 to .wav in the file name).
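A minimal sketch of that lookup, assuming a Common Voice release unpacked into a folder with a .tsv file and a clips directory. The helper name read_clips is made up for illustration; the path and sentence column names are the ones Common Voice uses.

```python
import csv
import os


def read_clips(tsv_path, dataset_dir):
    """Return (full_clip_path, sentence) pairs from a Common Voice style .tsv."""
    rows = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # Tab-separated; QUOTE_NONE keeps stray quote characters in sentences as-is
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # The 'path' field is only the file name, so join it with clips/
            clip_path = os.path.join(dataset_dir, "clips", row["path"])
            rows.append((clip_path, row["sentence"]))
    return rows
```

Something like read_clips(os.path.join(DATASET_DIR, "validated.tsv"), DATASET_DIR) would then give you the full path of every clip next to its sentence.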
Yeah, I tried to follow this blog post, but the updated Colab has a column called ‘audio’ which has its own metadata.
I tried to create a parallel dataset that matches the structure of the blog's dataset, but the Mozilla dataset does not seem to have the ‘audio’ column, and I’m not sure how that column was generated.
If I want to open-source my data, should I upload it to Mozilla first or to Hugging Face?
I need the Hugging Face format first, though.
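On the ‘audio’ column: as far as I can tell it is not stored in the Common Voice files at all. The Hugging Face datasets library builds it when a column of file paths is cast to its Audio feature. That would also explain why the CSV approach looks wrong, since a NumPy array written into a CSV cell is saved only as its (usually truncated) string form. Below is a hedged sketch, not the blog's actual code: build_columns and to_hf_dataset are hypothetical helper names, the 48000 sampling rate is copied from the sample above, and it assumes datasets is installed along with an audio decoding backend.

```python
import os


def build_columns(data_dir, age=35, gender="male"):
    """Collect one row per .wav file that has a matching .txt transcript."""
    columns = {"path": [], "sentence": [], "audio": [], "age": [], "gender": []}
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".wav"):
            continue
        wav_path = os.path.join(data_dir, name)
        txt_path = wav_path[: -len(".wav")] + ".txt"
        if not os.path.exists(txt_path):
            continue  # skip clips without a transcript
        with open(txt_path, encoding="utf-8") as f:
            sentence = f.read().strip()
        columns["path"].append(wav_path)
        columns["sentence"].append(sentence)
        # Only the file path goes here; decoding happens after the cast below
        columns["audio"].append(wav_path)
        columns["age"].append(age)
        columns["gender"].append(gender)
    return columns


def to_hf_dataset(columns, sampling_rate=48000):
    """Turn the path strings in 'audio' into a decoded Audio feature."""
    # Imported lazily: needs `pip install datasets` plus an audio backend
    from datasets import Audio, Dataset

    dataset = Dataset.from_dict(columns)
    return dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
```

Calling to_hf_dataset(build_columns('/content/drive/MyDrive/1')) and indexing a row should give ‘audio’ as a nested entry carrying the path, the decoded samples, and the sampling rate, though the exact return type differs between datasets versions. Once you are logged in to the Hub, Dataset.push_to_hub can upload the result to Hugging Face.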