Polish dataset download

Hey Kuba,

I've built a semi-automatic solution to extract sentences (with
NLTK) from old movie subtitles. The process is not fully automated,
and I'm trying to review the sentences before I add them to the
collector (sometimes I can miss a low-quality sentence). I'll try to
modify the process so that it takes only those sentences in which
every word is in a Polish dictionary (btw, my blacklist is still
growing, so the quality of the batches should improve with each
iteration).
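
A minimal sketch of the filter described above, assuming a plain set of known Polish words and a separate blacklist. The names `POLISH_WORDS`, `BLACKLIST`, `is_acceptable` and `filter_sentences` are hypothetical and are not taken from the actual scripts; the tiny word lists only stand in for a real dictionary file.

```python
import re

# Hypothetical stand-ins: a real run would load a full Polish word list
# and the growing blacklist from files instead of these few entries.
POLISH_WORDS = {"to", "jest", "dobry", "film", "bardzo", "kot", "cholera"}
BLACKLIST = {"cholera"}

# [^\W\d_]+ matches runs of Unicode letters, so Polish diacritics are kept.
WORD_RE = re.compile(r"[^\W\d_]+")


def is_acceptable(sentence, dictionary=POLISH_WORDS, blacklist=BLACKLIST):
    """Return True if every word is in the dictionary and none is blacklisted."""
    words = [w.lower() for w in WORD_RE.findall(sentence)]
    if not words:
        # Empty lines or lines with no letters are never accepted.
        return False
    return all(w in dictionary and w not in blacklist for w in words)


def filter_sentences(sentences, dictionary=POLISH_WORDS, blacklist=BLACKLIST):
    """Keep only the sentences that pass is_acceptable, preserving order."""
    return [s for s in sentences if is_acceptable(s, dictionary, blacklist)]
```

With these sample lists, `filter_sentences(["To jest kot.", "Hello kot."])` keeps only the first sentence, because "hello" is not in the dictionary. A manual review step would still be needed after such a filter, since a sentence made of valid words can still be low quality.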

Do I need to include the movie title in the description of a
sentence batch?

BR

Kuba,

What about the counters on the profile: is it safe to add new
sentences?

BTW, I've reviewed my code: I had the English pickle on the
tokenizer :wink: so new batches should be much better now (the
process is now armed with the Polish pickle).
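
For readers who hit the same problem, here is a minimal sketch of selecting the Polish Punkt model in NLTK instead of the English default. It assumes NLTK is installed and its Punkt data has been downloaded; the wrapper name `split_sentences` is hypothetical and this is not the code from the post.

```python
def split_sentences(text, language="polish"):
    """Split text into sentences with NLTK's Punkt model for `language`."""
    # Imported lazily so this module still loads where NLTK is missing.
    from nltk.tokenize import sent_tokenize

    # sent_tokenize defaults to language="english", so the language has to
    # be passed explicitly to get the Polish model.
    return sent_tokenize(text, language=language)
```

The Punkt data has to be fetched once beforehand with `nltk.download("punkt")`, or `nltk.download("punkt_tab")` on recent NLTK releases. Punkt models are trained per language, so the English one can mis-split text around Polish abbreviations.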

BR

No, this is just a display problem of the website, see here and here. You can add as many sentences as you want.

Hey Stefan,

thanks for the answer. I've got another issue: why doesn't the total
number of sentences change either? Is it the same issue as the
counters on the profile?

  • 25097 total sentences.

Best regards

Hey, how old are the movies we are talking about? Please note that if a movie was released any later than 1950, or probably even 1940, the scripts, and in effect the subtitles, are probably still protected by copyright, and thus unsuitable for Common Voice.

Hey, most of them are from the '90s, but the group that made the
translations doesn't exist anymore.

Btw, if you have ideas about where I can get source texts to make
sentences from, just let me know :wink:

You may consider manually scraping some stuff from https://pl.wikisource.org/wiki/Wikiźródła:Strona_główna; they have a fair collection of texts that are usually guaranteed to be CC0.

I tried adding a sentence from another profile, and the number of
added sentences didn't increase.

Ruben said not to add sentences from Wikipedia to the collector …

Wikisource is a different project by the Wikimedia Foundation, and serves basically only as a database of PD works. The restriction on Wikipedia applies only to the main project itself (and possibly some of the other projects), because the content there is originally created by Wikipedia contributors, who retain the rights to their content under the CC-BY-SA license or a similar one (and thus it is not CC0).

Thx for the hint :wink: !!! I'm going to prepare a batch from
Wikisource then :slight_smile:

OK, can you clean up my sentences from movie subtitles then?
(aiteam) I'm preparing a new batch from Wikisource.

Not me. @mkohler, could you please do it, unless you find a better way?

Will do this weekend.

@aiteam, please remember to add the source of the texts when uploading them :wink:.
Also, you can read this topic (among many others), which has some useful info on processing sentences: https://discourse.mozilla.org/t/help-needed-in-processing-large-polish-text-base/40778. Since it seems you are knowledgeable in NLP and stuff (at least more than me, an amateur), maybe you would like to share your scripts with the community? I am sure we will find more sentence sources waiting to be processed in many languages.

Hey Kuba,

What type of information should be in the source (link, type, e.g.
Wikisource, title of the novel/book)?

In my case NLP is just beginning ;) When I'm ready I'll publish my
sources :slight_smile:

BR

Good to hear about NLP :wink:. Well, I think the material/book title and author would be good, and maybe a short URL of the website? When uploading texts from wolnelektury.pl I tend to put something like: Author;Title;wolnelektury.pl. And by the way, you can skip this site, as I have already downloaded most of what was useful there.
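
A small sketch of building a source string in the `Author;Title;site` pattern mentioned above. The helper name `format_source` and the rule of rejecting fields that contain `;` are assumptions made for illustration, not a requirement of the collector.

```python
def format_source(author, title, site):
    """Join the fields as 'Author;Title;site'."""
    fields = [author.strip(), title.strip(), site.strip()]
    for field in fields:
        # A ';' inside a field would make the result ambiguous to split later.
        if ";" in field:
            raise ValueError(f"field contains the ';' separator: {field!r}")
    return ";".join(fields)
```

For example, `format_source("Adam Mickiewicz", "Pan Tadeusz", "wolnelektury.pl")` gives `Adam Mickiewicz;Pan Tadeusz;wolnelektury.pl`.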

Digression: is it only me, or do your posts have odd formatting? To me they show as normal text

interrupted by such text

OK, I'll put the source URL of the book/novel. Don't worry about
Wolne Lektury; I'm scraping Wikisource atm.

About the formatting: maybe it's a Mozilla Thunderbird issue :frowning:

I can only find sentences with the source “own choosed and edited”, can you be more specific here?