This is a very interesting paper from 2010 that explains in detail how Google built its text corpus and collected its voice recordings.
A few quotes:
On corpus gathering
To generate these lists we used a sample of common typed Google search queries, and in some instances an additional set of phrases in the target language. For each target language, we used a statistical language classifier and information about the user's locale settings to find common Google search queries in that language. We then removed misspelled words and pornographic terms from the list. Finally, the remaining queries were sampled uniformly and become the prompts given to speakers.
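To make the steps in that quote concrete, here is a rough Python sketch of the filter-then-sample idea. It is not Google's code: `build_prompt_list`, the `in_target_language` callback (a stand-in for their statistical classifier plus locale check), the `vocabulary` spell-check set, and the `blocklist` are all hypothetical names I made up. Dropping a whole query when it contains a bad word, and de-duplicating before sampling, are my assumptions; the paper only says the terms were "removed".

```python
import random


def build_prompt_list(queries, in_target_language, vocabulary, blocklist,
                      n_prompts, seed=0):
    """Filter raw queries, then sample prompts uniformly at random.

    All parameters are illustrative assumptions, not the paper's API:
      in_target_language: callable(str) -> bool, stand-in language classifier
      vocabulary: set of correctly spelled words (crude spell check)
      blocklist: set of terms that must not appear in a prompt
    """
    kept = []
    # dict.fromkeys de-duplicates while keeping the original order.
    for query in dict.fromkeys(queries):
        if not in_target_language(query):
            continue  # wrong language
        words = query.lower().split()
        if any(word in blocklist for word in words):
            continue  # contains a blocked term
        if any(word not in vocabulary for word in words):
            continue  # contains a word we treat as misspelled
        kept.append(query)

    # Uniform sampling without replacement over the surviving queries.
    rng = random.Random(seed)
    return rng.sample(kept, min(n_prompts, len(kept)))


# Toy usage with made-up data.
queries = ["weather today", "wether today", "xxx video",
           "meteo demain", "train times", "weather today"]
vocabulary = {"weather", "today", "train", "times", "video", "meteo", "demain"}
blocklist = {"xxx"}
prompts = build_prompt_list(
    queries,
    in_target_language=lambda q: "meteo" not in q,  # fake classifier
    vocabulary=vocabulary,
    blocklist=blocklist,
    n_prompts=10,
)
```

With real data the classifier and word lists would be the hard part; the sampling itself is a one-liner.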
On crowdsourcing voice gathering
Once the phones are configured, a single session of 500 utterances takes an average of 30 minutes of recording time in Auto-advance mode, and can be completed in as few as 20 minutes if the speaker reads quickly. Educating the speaker ahead of time can speed up the process. We walk the speaker through recording the first prompt manually, and then play back the recording to make sure the speaker is engaged in each step. The speaker then activates Auto-advance mode once they feel comfortable with the process.
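For a sense of the pace those numbers imply, a quick back-of-the-envelope calculation (the helper name `seconds_per_utterance` is mine, and it assumes the quoted session time is spread evenly over all 500 prompts):

```python
def seconds_per_utterance(session_minutes, utterances=500):
    """Average time per prompt, assuming the session time is spread evenly."""
    return session_minutes * 60 / utterances


average_pace = seconds_per_utterance(30)  # the 30-minute average session
fast_pace = seconds_per_utterance(20)     # the 20-minute fast reader
```

That works out to about 3.6 seconds per prompt on average and 2.4 seconds for a fast reader, which fits prompts being short search queries rather than full sentences.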
On who helped with the work in 17 languages
In many locales, we engaged university students, working relatively independently in the tradition of census takers and surveyors, to execute the data collection. The students set a target number of speakers with certain demographic and recording environment distributions and were able to leverage their social networking communities to recruit speakers. Hiring university students can also be efficient in that they are often just entering the work force and many are edge technology users requiring less technical support.
(Thanks @josh_meyer for sharing)