You may wish to make use of Hansard, the official record of parliamentary debates of the British Columbia Legislature. It is similar to the U.S. Congressional Record and is among the most advanced of its type in Canada or the Commonwealth of Nations. It features the separate voices of some 85 Members of the Legislative Assembly (MLAs): men and women of different ages, some of whom speak English as a second language or with a Chinese- or Punjabi-influenced accent.
The Legislature meets for several weeks in two annual sessions, and the proceedings are carefully transcribed by 24 sessional editor-transcribers and senior editors. All transcriptions go through two passes to confirm the accuracy of the work, and the editors avail themselves of a well-staffed research department that confirms the spellings of personal names, company names, Indigenous (First Nations) groups and occasional greetings in French, Chinese, Punjabi, Korean and B.C.'s 34 Indigenous languages.
The transcribed files are uploaded to the Internet within about two hours of the words being spoken in the Legislative Assembly, and every five minutes a linked timing point is inserted. Clicking on that time notation, set off in the right margin, opens a link to the original video recording of the actual Legislature debate from which the transcription, as just described, was prepared.
I think it would be feasible for you to extract sentences from the transcribed proceedings and connect them to the original audio. As noted above, the audio is already segmented into five-minute portions, each linked to the associated text, which is correctly spelled, relatively grammatical and reasonably punctuated.
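To illustrate the idea, here is a minimal Python sketch of how the text might be cut into five-minute segments and then into sentences. It rests on assumptions of mine: that the timing points can be recovered from the text as bracketed clock times such as "[1405]", and that a naive punctuation-based sentence splitter is good enough for a first pass. The names TIME_MARK, Segment, split_sentences and segment_transcript are hypothetical, and the real Hansard markup may well differ.

```python
import re
from dataclasses import dataclass, field

# Assumption: each timing point shows up in the text as "[HHMM]", e.g. "[1405]".
# The real Hansard pages may encode these differently.
TIME_MARK = re.compile(r"\[(\d{2})(\d{2})\]")


@dataclass
class Segment:
    start: str                      # clock time of the timing point, "HH:MM"
    sentences: list = field(default_factory=list)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Abbreviations such as "Hon." would be split incorrectly; a real
    pipeline would need a more careful tokenizer.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def segment_transcript(transcript):
    """Group sentences under the timing point that precedes them."""
    marks = list(TIME_MARK.finditer(transcript))
    segments = []
    for i, mark in enumerate(marks):
        # A segment runs from this timing point to the next one (or the end).
        end = marks[i + 1].start() if i + 1 < len(marks) else len(transcript)
        body = transcript[mark.end():end]
        start = f"{mark.group(1)}:{mark.group(2)}"
        segments.append(Segment(start, split_sentences(body)))
    return segments


if __name__ == "__main__":
    sample = (
        "[1405] I rise to introduce the bill. It is short. "
        "[1410] The motion is approved."
    )
    for seg in segment_transcript(sample):
        print(seg.start, seg.sentences)
```

Each resulting segment pairs a start time with a handful of sentences, so a candidate sentence only has to be located within roughly five minutes of audio rather than within a whole sitting.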
Furthermore, since you plan to create an open speech recognition system not for financial gain, licensing this rich repository of archived data should not be problematic. The Legislature is mainly concerned that material not be used out of context to embarrass individual MLAs or hold them up to public ridicule, and would likely be willing to work with you to further facilitate your process if their product is not already ideal for your needs.
For your information, here is the relevant disclaimer on their Copyright page:
The user acknowledges that the copyright in all material contained herein is claimed by the Legislative Assembly and the Queen’s Printer on behalf of, and rests with, Her Majesty the Queen in Right of the Province of British Columbia. No person may reproduce the material contained herein by any means for financial gain, or other than personal use, without the express written consent of the Speaker of the Legislative Assembly.
If this policy is not sufficiently liberal for the Common Voice project, I believe that by contacting Rob Sutherland, Hansard Services director, you could reach an agreement on fair use of the Hansard data. Such an agreement would not need to be more stringent than your current permission to harvest up to three sentences per Wikipedia page, and it would satisfy any concerns the Speaker of the House might have.
Here is a recent example of the sort of data that is available. B.C.'s Hansard record goes back to the 1970s, and is also human-indexed by subject, speaker and legislative debate step (first reading, second reading, committee stage, third reading and Royal Assent):
Here’s an example, from 2003, of the earliest data that is linked to audio/video:
For previous years from 1970 to 2002, text is available, but the corresponding audio has not been linked to the transcribed text.
Finally, here’s an example of a typical subject matter index, itself linked to the pages of Hansard at which debate related to a particular topic occurred.
I hope this idea may prove useful to the wonderful Common Voice project.
Erik Bjørn Pedersen
Victoria, B.C., Canada