16 Years of Human-transcribed B.C. Hansard data aligned with linked original video/audio

erik.pedersen · January 4, 2020, 8:31am

Hello,

You may wish to make use of Hansard, the official record of parliamentary debates of the British Columbia Legislature, similar to the U.S. Congressional Record, but which is among the most advanced of its type in Canada or the Commonwealth of Nations. This features the separate voices of some 85 Members of the Legislative Assembly — a combination of men and women of different ages, some of whom speak English as a Second Language or with a Chinese- or Punjabi-influenced accent.

The Legislature meets for several weeks in two annual sessions, and the proceedings are carefully transcribed by 24 sessional editor-transcribers and senior editors. All transcriptions go through two passes to confirm the accuracy of the work, and the editors avail themselves of a well-staffed research department that confirms the spellings of personal names, company names, Indigenous (First Nations) groups and occasional greetings in French, Chinese, Punjabi, Korean and B.C.'s 34 Indigenous languages.

The transcribed files are uploaded to the Internet within about two hours of the words being spoken in the Legislative Assembly, and every five minutes a linked timing point is inserted. Clicking on that time notation, set off in the right margin, opens a link to the original video recording of the actual Legislature debate from which the transcription, as just described, was prepared.

I think it would be feasible for you to extract sentences from the transcribed proceedings and connect them to the original audio; as I’ve pointed out, the audio is already segmented into five-minute portions, each linked to the associated text, which is correctly spelled, relatively grammatical and reasonably punctuated.
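As a rough illustration of that idea, here is a minimal Python sketch that groups transcript lines under the most recent timing point, so that each group could later be paired with its five-minute portion of audio. The marker format (`[2:15 p.m.]` on its own line) and the names `TIME_MARKER`, `Segment` and `split_by_time_markers` are assumptions made for this example; the actual page markup may well differ and would need its own parser.

```python
import re
from dataclasses import dataclass, field

# Assumed marker format: a line such as "[2:15 p.m.]" standing alone.
# This is a guess for illustration; the real Hansard markup may differ.
TIME_MARKER = re.compile(r"^\[(\d{1,2}):(\d{2}) ([ap])\.m\.\]$")


@dataclass
class Segment:
    start_minutes: int                      # minutes after midnight
    lines: list = field(default_factory=list)


def to_minutes(hour, minute, meridiem):
    """Convert a 12-hour clock time to minutes after midnight."""
    hour = hour % 12                        # 12 a.m. -> 0, 12 p.m. -> 0 (+12 below)
    if meridiem == "p":
        hour += 12
    return hour * 60 + minute


def split_by_time_markers(transcript_lines):
    """Group non-empty lines under the most recent timing marker.

    Text that appears before the first marker is skipped, because it
    has no known position in the recording.
    """
    segments = []
    current = None
    for raw in transcript_lines:
        line = raw.strip()
        if not line:
            continue
        match = TIME_MARKER.match(line)
        if match:
            start = to_minutes(int(match.group(1)), int(match.group(2)), match.group(3))
            current = Segment(start)
            segments.append(current)
        elif current is not None:
            current.lines.append(line)
    return segments
```

Each resulting segment still covers about five minutes of speech, so sentence-level alignment inside a segment would need a further step, such as forced alignment, which this sketch does not attempt.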

Furthermore, since you plan to create an open speech recognition system not for financial gain, licensing this rich repository of archived data should not be problematic. The Legislature is mainly concerned that material not be used out of context to embarrass individual MLAs or hold them up to public ridicule, and would likely be willing to work with you to further facilitate your process if their product is not already ideal for your needs.

For your information, here is the relevant disclaimer on their Copyright page:

The user acknowledges that the copyright in all material contained herein is claimed by the Legislative Assembly and the Queen’s Printer on behalf of, and rests with, Her Majesty the Queen in Right of the Province of British Columbia. No person may reproduce the material contained herein by any means for financial gain, or other than personal use, without the express written consent of the Speaker of the Legislative Assembly.

If this policy is not sufficiently liberal for the Common Voice project, I believe that through contacting Rob Sutherland, Hansard Services director, you could reach an agreement on fair use of the Hansard data that would not be more stringent than your current permission to harvest up to three sentences per Wikipedia page and that would satisfy any concerns the Speaker of the House might have.

Here is a recent example of the sort of data that is available. B.C.'s Hansard record goes back to the 1970s, and is also human-indexed by subject, speaker and legislative debate step (first reading, second reading, committee stage, third reading and Royal Assent):
[https://www.leg.bc.ca/documents-data/debate-transcripts/41st-parliament/4th-session/20191128am-Hansard-n301]

Here’s an example, from 2003, of the earliest data that is linked to audio/video:
[https://www.leg.bc.ca/documents-data/debate-transcripts/37th-parliament/4th-session/20031216pm-Hansard-v19n8].

For previous years from 1970 to 2002, text is available, but the corresponding audio has not been linked to the transcribed text.

Finally, here’s an example of a typical subject matter index, itself linked to the pages of Hansard at which debate related to a particular topic occurred.
[https://www.leg.bc.ca/documents-data/indexes/view#41st-parliament&3rd-session&2018-Subject-Indexmhds]

I hope this idea may prove useful to the wonderful Common Voice project.

Kind regards,
Erik Bjørn Pedersen
Victoria, B.C., Canada

nukeador · January 7, 2020, 11:04am

Thanks for pointing out this resource.

@kdavis @r_LsdZVv67VKuK6fuHZ_tFpg This might be something to check with legal in case we want to use it directly to train Deep Speech?

Cheers.

kdavis · January 7, 2020, 1:03pm

What license is the data under?

nukeador · January 7, 2020, 1:05pm

kdavis · January 7, 2020, 1:13pm

They don’t have a “clean” license separate from the copyright?

xorgy · February 23, 2020, 11:57pm

I think that, in general, records produced by the government in Canada and the Provinces fall under Crown Copyright, and may require a specific license to be included in Common Voice.

lissyx · February 24, 2020, 9:49am

Is it an accurate transcript, or a transcript-for-nice-reading-later with a large amount of rewording?

erik.pedersen · February 25, 2020, 10:48am

Thanks for the question, @lissyx. It is what is called a substantially verbatim transcription that discards nothing meaningful. Our goal as transcribers is to have a text version that can be read smoothly in the same way as one can listen to the recorded audio/video file that is linked to the transcript at five-minute intervals. Of course, people do not always speak in complete sentences, so if a speaker says something substantive that ultimately trails off, we retain that kernel of potential meaning and indicate the trail-off with four-dot ellipses. If a person has a parenthetical statement but completes the original sentence afterwards, we set off the interruptive element with em dashes. A few false starts and unnecessary tics are deleted, but intentional rhetorical flourishes are retained. We treat elliptical sentences as grammatically complete, as when a politician says something like «Advanced Education.» before launching into an extensive disquisition on what she or he feels is worthwhile to note about this year’s post-secondary education budget. Boilerplate parliamentary procedure is often reduced to a simple style line or editorial comment, and routine recognitions of members by the Speaker of the House are subsumed by the member’s first initial and last name appearing in boldface. Try listening to a few paragraphs whilst reading along, and you will get a good idea of the flavour of things.
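For anyone wanting to screen sentences before reading along with the audio, the conventions above could be checked mechanically. The following is only a sketch under stated assumptions: it supposes that a trail-off appears as four ASCII periods at the end of a sentence and that an interruptive element is set off by a pair of em dash characters (U+2014). The function names `classify_sentence` and `keep_for_corpus` are hypothetical, not part of any Hansard or Common Voice tooling.

```python
import re

# Assumption: a trail-off is written as four ASCII periods at the end.
TRAIL_OFF = re.compile(r"\.{4}\s*$")
EM_DASH = "\u2014"


def classify_sentence(sentence):
    """Return a set of flags describing transcription conventions found."""
    flags = set()
    if TRAIL_OFF.search(sentence):
        flags.add("trail_off")
    # Two or more em dashes suggest a parenthetical set off inside the sentence.
    if sentence.count(EM_DASH) >= 2:
        flags.add("parenthetical")
    return flags


def keep_for_corpus(sentence):
    """Keep only sentences with none of the flagged conventions."""
    return not classify_sentence(sentence)
```

Flagged sentences are not necessarily inaccurate; they are simply the places where the written text is most likely to diverge from a plain left-to-right reading of the audio, so they may deserve a manual check.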

lissyx · February 25, 2020, 12:01pm

Thanks,

I’m asking because I wanted to leverage the same kind of dataset in French from the parliament, and it turned out:

they have no pure verbatim transcript kept somewhere (at least not to the knowledge of the people I talked to over the phone, who were in charge of the transcription archives)
they produce a verbatim transcript that is “smooth to read”

That makes me slightly worried.

That sounds close to the practices from French parliament described above.

Would you have a few links to share so I can directly get a good picture of what the dataset looks like?

erik.pedersen · February 26, 2022, 10:46am

Sure. Here’s some formal transcription of the Speech from the Throne, which marks the formal opening of a new session of parliament: [https://www.leg.bc.ca/documents-data/debate-transcripts/42nd-parliament/3rd-session/20220208pm-Hansard-n142]

A slight peculiarity of the B.C. Hansard is that we make an effort to transcribe non-English speech when it is used in the Legislative Assembly, so you may from time to time see transcriptions of short greetings in French, Punjabi, Chinese, Arabic, Hebrew, Korean or a Canadian First Nations (Indigenous) language. Alternatively, you may encounter an editorial comment like [A language other than English was spoken.]

However, from 2:15 p.m. to 2:50 p.m., that speech is ordinary English text, read aloud by the same person, Hon. Janet Austin, the Lieutenant-Governor of British Columbia.
Clicking on one of the time stamps, shown in blue to the right of the formatted text, takes you to a recorded video corresponding to the human-readable text transcript. The video record of the Speech from the Throne, it being a highly formal occasion, also includes a picture-in-picture image of a person signing the same speech in American Sign Language.

Another formal speech you might like to check out is read by Hon. Selina Robinson, the Minister of Finance, from 1:35 p.m. to about 2:10 p.m. [https://www.leg.bc.ca/documents-data/debate-transcripts/42nd-parliament/3rd-session/20220222pm-House-Blues]. This transcript is still in the preliminary draft stage, the so-called Blues, and will not be finalized for about ten days, at which time it, too, will be linked to the corresponding audiovisual data.

Texts for any particular sitting day can be found here: [https://www.leg.bc.ca/documents-data/debate-transcripts/42nd-parliament/3rd-session]. They are initially available only in HTML but are eventually supplemented by a PDF version and a detailed index.

I hope this information is helpful to you.

Kind regards,
Erik Pedersen

lissyx · February 26, 2022, 12:32pm

Hello,

Looks like this was two years ago. Unfortunately I’m not working on this anymore; you should continue hacking with Coqui.
