Now, Q3 - in parts…
You can look at the data ONLY in the releases, or see them during validation. When you look at the metadata in the release:
- You will see `*_sentences.tsv` files for the text-corpus only
- Each other metadata file also has a sentence column valid for that data.
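To pull the sentences out of any of those metadata files, something like the following could work. This is only a minimal sketch: the column name `sentence` and the helper name `read_sentences` are my assumptions for illustration, so check the header of the actual file in your release first.

```python
import csv
import io
from typing import Iterator, TextIO


def read_sentences(tsv_file: TextIO, column: str = "sentence") -> Iterator[str]:
    """Yield the non-empty values of one column from a tab-separated file.

    The default column name "sentence" is an assumption; adjust it to
    whatever the header of your metadata file actually says.
    """
    # QUOTE_NONE: treat quote characters inside sentences as plain text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        text = (row.get(column) or "").strip()
        if text:
            yield text


# Tiny in-memory stand-in for a metadata file (made-up rows, not real data).
sample = io.StringIO("sentence_id\tsentence\n1\tBună ziua.\n2\tCe mai faci?\n")
sample_sentences = list(read_sentences(sample))
```

For a real file you would pass `open(path, encoding="utf-8", newline="")` instead of the `StringIO` object.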
First part is already there (or not). Please see the explanations and a problem in this issue.
nouns, verbs, adjectives, and adverbs
compare it with a list from a bigger Romanian corpus
That would require language knowledge of course, and maybe some NLP tooling for that particular language. I don’t have them, so I do generic tokenization and word counting for the frequencies in the Analyzer text-corpus tab.
You should do these knowing the language, but anyway, I uploaded the intermediate files for you (v20.0 ro)… If our results do not match, please break the glass (open an issue).
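To give an idea of what "generic tokenization and counting" means, here is a rough sketch. It is not the actual Analyzer code: the regex and the function names `tokenize` and `word_frequencies` are assumptions for illustration, and it knows nothing about Romanian grammar, so it cannot separate nouns, verbs, adjectives, and adverbs.

```python
import re
from collections import Counter
from typing import Iterable, List

# One or more Unicode letters, optionally joined by a hyphen or apostrophe
# (so "într-o" stays one token). Digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and return its word-like tokens."""
    return WORD_RE.findall(sentence.lower())


def word_frequencies(sentences: Iterable[str]) -> Counter:
    """Count how often each token occurs across all sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


# Made-up example sentences, only to show the call.
freq = word_frequencies(["Casa este mare.", "Casa mea este aici."])
top_words = freq.most_common(10)
```

Comparing with a list from a bigger corpus would then be a matter of checking which tokens in `freq` are missing or over-represented in that list, but a proper comparison by part of speech needs language-specific tooling.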
Nope, AI generated sentences are not allowed. Here is some info:
I think its time to talk about AI generated sentences again (read answers/resolution)
It is degenerative, and we will see the effects in LLMs in a few years (actually, the dementia is already there).