16 Years of Human-transcribed B.C. Hansard data aligned with linked original video/audio

Hello,

You may wish to make use of Hansard, the official record of parliamentary debates of the British Columbia Legislature, similar to the U.S. Congressional Record, but which is among the most advanced of its type in Canada or the Commonwealth of Nations. This features the separate voices of some 85 Members of the Legislative Assembly — a combination of men and women of different ages, some of whom speak English as a Second Language or with a Chinese- or Punjabi-influenced accent.

The Legislature meets for several weeks in two annual sessions, and the proceedings are carefully transcribed by 24 sessional editor-transcribers and senior editors. All transcriptions go through two passes to confirm the accuracy of the work, and the editors avail themselves of a well-staffed research department that confirms the spellings of personal names, company names, Indigenous (First Nations) groups and occasional greetings in French, Chinese, Punjabi, Korean and B.C.'s 34 Indigenous languages.

The transcribed files are uploaded to the Internet within about two hours of the words being spoken in the Legislative Assembly, and every five minutes a linked timing point is inserted. Clicking on that time notation, set off in the right margin, opens a link to the original video recording of the actual Legislature debate from which the transcription, as just described, was prepared.

I think it would be feasible for you to extract sentences from the transcribed proceedings and connect them to the original audio; as I’ve pointed out, the audio is already segmented into five-minute portions linked to the associated correctly spelled, relatively grammatical and reasonably punctuated text.
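As a rough illustration of that pairing, here is a minimal Python sketch. It assumes the transcript has already been saved as plain text, and that each five-minute timing point sits on its own line in a hypothetical `[HHMM]` form; the actual markup on the Hansard pages may well differ, and the function name is my own invention. It simply groups sentences under the timing point that precedes them, so each group could later be matched to the corresponding stretch of audio.

```python
import re

# Assumption: a timing point is a line such as "[1005]" (meaning 10:05).
# This format is hypothetical and may not match the real Hansard markup.
MARKER = re.compile(r"^\[(\d{2})(\d{2})\]$")

# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def group_sentences_by_segment(transcript: str) -> dict[str, list[str]]:
    """Map each timing point ("HH:MM") to the sentences that follow it."""
    segments: dict[str, list[str]] = {}
    current = None
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = MARKER.match(line)
        if match:
            current = f"{match.group(1)}:{match.group(2)}"
            segments.setdefault(current, [])
            continue
        if current is None:
            # Text before the first timing point has no audio link; skip it.
            continue
        segments[current].extend(s for s in SENTENCE_END.split(line) if s)
    return segments


if __name__ == "__main__":
    sample = "[1005]\nGood morning. I call the House to order.\n[1010]\nWe continue debate on the bill."
    for time_point, sentences in group_sentences_by_segment(sample).items():
        print(time_point, sentences)
```

The sentence splitter here is deliberately naive and would break on abbreviations such as "Hon." or "B.C.", so a real pipeline would need a proper sentence tokenizer, plus forced alignment within each five-minute segment to obtain per-sentence audio.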

Furthermore, since you plan to create an open speech recognition system not for financial gain, licensing this rich repository of archived data should not be problematic. The Legislature is mainly concerned that material not be used out of context to embarrass individual MLAs or hold them up to public ridicule, and would likely be willing to work with you to further facilitate your process if their product is not already ideal for your needs.

For your information, here is the relevant disclaimer on their Copyright page:

The user acknowledges that the copyright in all material contained herein is claimed by the Legislative Assembly and the Queen’s Printer on behalf of, and rests with, Her Majesty the Queen in Right of the Province of British Columbia. No person may reproduce the material contained herein by any means for financial gain, or other than personal use, without the express written consent of the Speaker of the Legislative Assembly.

If this policy is not sufficiently liberal for the Common Voice project, I believe that through contacting Rob Sutherland, Hansard Services director, you could reach an agreement on fair use of the Hansard data that would not be more stringent than your current permission to harvest up to three sentences per Wikipedia page and that would satisfy any concerns the Speaker of the House might have.

Here is a recent example of the sort of data that is available. B.C.'s Hansard record goes back to the 1970s, and is also human-indexed by subject, speaker and legislative debate step (first reading, second reading, committee stage, third reading and Royal Assent):
[https://www.leg.bc.ca/documents-data/debate-transcripts/41st-parliament/4th-session/20191128am-Hansard-n301]

Here’s an example, from 2003, of the earliest data that is linked to audio/video:
[https://www.leg.bc.ca/documents-data/debate-transcripts/37th-parliament/4th-session/20031216pm-Hansard-v19n8].

For previous years from 1970 to 2002, text is available, but the corresponding audio has not been linked to the transcribed text.

Finally, here’s an example of a typical subject matter index, itself linked to the pages of Hansard at which debate related to a particular topic occurred.
[https://www.leg.bc.ca/documents-data/indexes/view#41st-parliament&3rd-session&2018-Subject-Indexmhds]

I hope this idea may prove useful to the wonderful Common Voice project.

Kind regards,
Erik Bjørn Pedersen
Victoria, B.C., Canada

Thanks for pointing out this resource.

@kdavis @r_LsdZVv67VKuK6fuHZ_tFpg This might be something to check with legal in case we want to use it directly to train Deep Speech?

Cheers.

What license is the data under?

From the first message, it seems to be all rights reserved, personal use only.

They don’t have a “clean” license separate from the copyright?

I think that, in general, records produced by the government in Canada and the Provinces fall under Crown Copyright, and may require a specific license to be included in Common Voice.

Is it an accurate transcript, or a transcript-for-nice-reading-later with a large amount of rewording?

Thanks for the question, @lissyx. It is what is called a substantially verbatim transcription that discards nothing meaningful. Our goal as transcribers is to have a text version that can be read smoothly in the same way as one can listen to the recorded audio/video file that is linked to the transcript at five-minute intervals. Of course, people do not always speak in complete sentences, so if a speaker says something substantive that ultimately trails off, we retain that kernel of potential meaning and indicate the trail-off with four-dot ellipses. If a person has a parenthetical statement but completes the original sentence afterwards, we set off the interruptive element with em dashes. A few false starts and unnecessary tics are deleted, but intentional rhetorical flourishes are retained. We treat elliptical sentences as grammatically complete, as when a politician says something like «Advanced Education.» before launching into an extensive disquisition on what she or he feels is worthwhile to note about this year’s post-secondary education budget. Boilerplate parliamentary procedure is often reduced to a simple style line or editorial comment, and routine recognitions of members by the Speaker of the House are subsumed by the member’s first initial and last name appearing in boldface. Try listening to a few paragraphs whilst reading along, and you will get a good idea of the flavour of things.

Thanks,

I’m asking because I wanted to leverage the same kind of dataset in French from the parliament, and it turned out:

  • they have no pure verbatim transcript kept anywhere (at least not to the knowledge of the people I talked to over the phone, who were in charge of the transcription archives)
  • they produce a “smooth to read” verbatim transcript

That makes me slightly worried.

That sounds close to the practices of the French parliament described above.

Would you have a few links to share so I can directly get a good picture of what the dataset looks like?