Issues in the Romanian dataset

Now, Q3 - in parts…

You can look at the data ONLY in the releases, or see it during validation. When you look at the metadata in a release:

  • You will see *_sentences.tsv files, which cover the text-corpus only
  • Every other metadata file also has a sentence column that is valid for that data.
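If you want to pull the sentences out of one of those files yourself, a minimal sketch could look like this. It assumes the column is named `sentence` and that the file is plain tab-separated text with a header row. The function name and the sample rows are made up for illustration, so check them against the actual release files.

```python
import csv
import io


def read_sentences(tsv_file, column="sentence"):
    """Return the non-empty values of one column from a TSV file object.

    The column name "sentence" is an assumption; adjust it to the header
    you actually find in the release metadata.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text
    # instead of treating them as CSV field quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row[column] for row in reader if row.get(column)]


# Tiny in-memory stand-in for a real metadata file (hypothetical rows).
sample = io.StringIO(
    "sentence_id\tsentence\n"
    "1\tO casă mare.\n"
    '2\t"Bună ziua", a spus el.\n'
)
sentences = read_sentences(sample)
```

For a real file you would pass `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` object.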

The first part is already there (or not). Please see the explanations and the problem described in this issue.

nouns, verbs, adjectives, and adverbs
compare it with a list from a bigger Romanian corpus

That would require language knowledge of course, and maybe some NLP tooling for that particular language. I don’t have either, so I only do generic tokenization and word counts for the frequencies in the Analyzer text-corpus tab.
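To give an idea of what "generic tokenization and word counts" means, here is a rough sketch. It is not the Analyzer's actual code. The regular expression, the lowercasing, and the function names are my assumptions for illustration, and it knows nothing about nouns, verbs, or any other part of speech.

```python
import re
import unicodedata
from collections import Counter

# A "word" here is a run of Unicode letters, optionally joined by a hyphen
# or apostrophe (so Romanian forms like "într-o" stay in one piece).
# This is a language-agnostic assumption, not a Romanian-specific rule.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def tokenize(sentence):
    """Split a sentence into lowercased word tokens."""
    # NFC normalization makes composed and decomposed diacritics compare equal.
    normalized = unicodedata.normalize("NFC", sentence).lower()
    return WORD_RE.findall(normalized)


def word_frequencies(sentences):
    """Count how often each token occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


tokens = tokenize("Într-o zi, am văzut o casă.")
freqs = word_frequencies(["O casă.", "o casă mare"])
```

`freqs.most_common(20)` then gives the top of the frequency list, which is what you could compare against a list from a bigger corpus.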

You should do these yourself, since you know the language, but anyway, I uploaded the intermediate files for you (v20.0 ro)… If our results do not match, please break the glass (open an issue).

Nope, AI-generated sentences are not allowed. Here is some info:

I think its time to talk about AI generated sentences again (read answers/resolution)

It is degenerative, and we will see the effects in LLMs in a few years (actually, the dementia is already there).