# Methodology for data collection for voice corpora
Starting with rough points to address:
- Sentences should come from an open data source, preferably public domain.
- What type of source should be preferred? Written text? Transcripts of speech? News, articles, books, social media?
- What type of register should be preferred?
- How modern should the source be? Is there any use for older data sources?
- The data set should be constructed to provide good coverage of different language components and aspects:
  - Words
  - Inflections: How good a coverage can and should we aim for? Which types of inflections (gender, tense, number, person, mood, etc.)?
  - Phonemes?
  - Accents? Gender or age of speakers? Geographical origin? What is even feasible here?
- Should any properties of the word and/or phoneme frequency distribution in the language be preserved? If so, on which corpora should the distribution be measured (should it be estimated on written corpora)?
- Should data on contributors be collected and kept, if possible? If so, which: age, gender, etc.?
- A train/validation/test split should be determined beforehand.
  - What is a proper ratio?
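
To make the point about fixing the split beforehand concrete, here is a minimal sketch of one possible approach: assign each contributor to a split by hashing an identifier, so the assignment is fixed before any recordings arrive and all clips from one speaker land in the same split. Everything named here is an assumption for illustration: the `assign_split` helper, the contributor ID strings, and especially the 80/10/10 ratio, which is only a placeholder since the proper ratio is still an open question.

```python
import hashlib

# Placeholder ratios (assumption): the proper ratio is still undecided.
SPLIT_RATIOS = (("train", 0.8), ("validation", 0.1), ("test", 0.1))


def assign_split(contributor_id: str, ratios=SPLIT_RATIOS) -> str:
    """Map a contributor ID to a split name, stable across runs and machines."""
    digest = hashlib.sha256(contributor_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as a number in [0, 1].
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, share in ratios:
        cumulative += share
        if point < cumulative:
            return name
    # Fallback in case floating-point rounding leaves the point uncovered.
    return ratios[-1][0]


if __name__ == "__main__":
    # Hypothetical clips as (contributor ID, file name) pairs.
    clips = [
        ("speaker-a", "clip-001.wav"),
        ("speaker-a", "clip-002.wav"),
        ("speaker-b", "clip-003.wav"),
    ]
    for contributor, filename in clips:
        print(filename, assign_split(contributor))
```

Hashing the contributor rather than the clip keeps speakers disjoint across splits, but it means the ratio holds over contributors, not over clips or hours of audio. If a few contributors record far more than others, the realized proportions can drift from the target.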