Sentence Collector: Import file formats


Preparing new sentences in bulk before adding them to Sentence Collector is an important step that avoids manual entry.

I have listed the different file formats we have used, with CC0 sources, to make adding sentences easier.

The different file formats are:

  • CSV
  • TXT
  • XML
  • JSON
  • EPUB

You can find them here.

They can be useful to everyone for content extraction.

Question: What do you think of creating a new repository that will serve as a reference?


Are these scripts intended to convert from these formats to TXT so you can copy and paste?


Yes, the scripts let you read a source (file or URL) and produce a TXT file with lines of up to 14 words.

Then just go to the ADD page of Sentence Collector and copy and paste the lines to add them.
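
For anyone wondering what such a conversion might look like, here is a minimal sketch in Python. It is not one of the actual scripts from this thread. It assumes a plain-text source, a naive split on sentence-ending punctuation, and that sentences longer than 14 words are simply dropped rather than broken up. The names `extract_sentences` and `convert_file` are made up for illustration.

```python
import re

MAX_WORDS = 14  # the limit mentioned in this thread


def extract_sentences(text, max_words=MAX_WORDS):
    """Return sentences from text that have at most max_words words.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    Real sources usually need smarter, language-specific splitting.
    """
    candidates = re.split(r"(?<=[.!?])\s+", text)
    kept = []
    for candidate in candidates:
        # Collapse internal newlines and repeated spaces into single spaces.
        sentence = " ".join(candidate.split())
        if sentence and len(sentence.split()) <= max_words:
            kept.append(sentence)
    return kept


def convert_file(source_path, output_path):
    """Read a text file and write one kept sentence per line to a TXT file."""
    with open(source_path, encoding="utf-8") as source:
        sentences = extract_sentences(source.read())
    with open(output_path, "w", encoding="utf-8") as output:
        output.write("\n".join(sentences) + "\n")
    return len(sentences)
```

Calling `convert_file("source.txt", "sentences.txt")` (both file names are placeholders) would give a file with one sentence per line, ready to paste into the ADD page.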


@dabinatis is this somehow overlapping with your work on splitting text into sentences?

I think all of this work really informs a lot of future improvements in analyzing, extracting and filtering sentences from large sources of text.

@nukeador My script doesn’t extract the sentences, just cleans them up after extraction. It’s also English-specific, but people are welcome to fork it for other languages.