hellosct1
(Christophe)
August 30, 2019, 6:12am
1
Hi,
Preparing to add new sentences in Sentence Collector is an important step to avoid manual entry.
I have listed the different file formats we have used with CC0 sources, to make adding sentences easier.
The different file formats are:
that you can find here
They can be useful to everyone to facilitate content extraction.
Question: What do you think of creating a new repository that will serve as a reference?
Christophe
nukeador
(Rubén Martín [❌ taking a break from Mozilla])
August 30, 2019, 12:52pm
2
Are these scripts intended to convert from these formats to txt so you can c&p?
hellosct1
(Christophe)
August 30, 2019, 2:30pm
3
Hi,
Yes, the scripts let you read a source (file or URL) and produce a TXT file with lines of up to 14 words.
Then just go to the ADD page of Sentence Collector
and copy/paste the lines to add them.
Christophe
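For readers who want a concrete picture of the workflow described above, here is a minimal sketch in Python. It is not one of the scripts from the thread: the function names (`read_source`, `split_sentences`, `keep_short`), the naive punctuation-based sentence splitter, and the UTF-8 assumption are all hypothetical choices made for illustration. Only the 14-word limit and the "file or URL in, TXT out" idea come from the post.

```python
import re
import sys
from urllib.request import urlopen

MAX_WORDS = 14  # limit mentioned in the thread


def read_source(source):
    """Return the text of a local file or an http(s) URL (assumes UTF-8)."""
    if source.startswith(("http://", "https://")):
        with urlopen(source) as response:
            return response.read().decode("utf-8")
    with open(source, encoding="utf-8") as handle:
        return handle.read()


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    normalized = " ".join(text.split())  # collapse newlines and repeated spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", normalized) if s]


def keep_short(sentences, max_words=MAX_WORDS):
    """Keep only sentences with at most max_words whitespace-separated words."""
    return [s for s in sentences if len(s.split()) <= max_words]


def main(source, output):
    """Read source, filter sentences, write one sentence per line to output."""
    sentences = keep_short(split_sentences(read_source(source)))
    with open(output, "w", encoding="utf-8") as handle:
        handle.write("\n".join(sentences) + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python extract.py <file-or-url> <output.txt>
    main(sys.argv[1], sys.argv[2])
```

The resulting TXT file has one sentence per line, ready to be pasted into the Sentence Collector form. A real script would need a language-aware splitter and format-specific parsing for each source type.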
nukeador
(Rubén Martín [❌ taking a break from Mozilla])
August 30, 2019, 3:07pm
4
@dabinat is this somehow overlapping with your work on splitting text into sentences?
I think all of this work really informs a lot of future improvements in analyzing, extracting and filtering sentences from large sources of text.
dabinat
September 1, 2019, 3:17pm
5
@nukeador My script doesn’t extract the sentences, just cleans them up after extraction. It’s also English-specific, but people are welcome to fork it for other languages.