I have made some progress on scraping bn.wiki: a more or less complete rules file (bn.toml) has been created with help from @mkohler. However, it is only generating 410-420 sentences from 90k articles on Bengali Wikipedia, which seems low compared to the 4.5k sentences from 139k articles on Hindi Wikipedia. Maybe it's an issue with the sentence tokeniser.
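To make the tokeniser suspicion easier to test, here is a minimal sketch of a splitter that treats the Bengali danda "।" (U+0964) as a sentence terminator alongside "?" and "!". It is not what the scraper does internally; `split_sentences` is a hypothetical helper, only meant for comparing sentence counts on a few sample articles.

```python
import re

# Split on whitespace that follows a danda (U+0964), "?" or "!".
# Assumption: sentences in the sample text end with one of these marks.
_SENTENCE_END = re.compile(r"(?<=[\u0964?!])\s+")


def split_sentences(text: str) -> list[str]:
    """Return the non-empty sentences in text, keeping their end marks."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


if __name__ == "__main__":
    sample = "আমি ভাত খাই। তুমি কী করো? ভালো!"
    for sentence in split_sentences(sample):
        print(sentence)
```

If a splitter like this finds far more sentences per article than the scraper extracts, that would point at tokenisation rather than at the rules file.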
Separately, there is an issue with blocklist creation, because cvtools is hardcoded for the Roman alphabet. A word statistics / blocklist generation tool would be particularly useful for scraping more public domain sources. Can anyone suggest an alternative tool? I am sure there are plenty, and it wouldn't be productive to modify cvtools if a good alternative already exists.
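As a stopgap while looking for a proper tool, the word statistics step could be approximated with a short script. The sketch below is not a replacement for cvtools and does not reproduce its behaviour. It assumes that a word is a run of characters from the Bengali Unicode block and that the blocklist is made of words seen fewer than `min_count` times. The names `word_frequencies` and `build_blocklist` and the threshold value are hypothetical.

```python
import re
from collections import Counter

# A "word" here is a run of characters from the Bengali block (U+0980-U+09FF).
# An explicit range is used because \w in Python's re module does not match
# Bengali vowel signs and other combining marks, which would split words.
_BENGALI_WORD = re.compile(r"[\u0980-\u09FF]+")


def word_frequencies(sentences):
    """Count how often each Bengali word occurs across the sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(_BENGALI_WORD.findall(sentence))
    return counts


def build_blocklist(counts, min_count=2):
    """Return words seen fewer than min_count times (placeholder threshold)."""
    return sorted(word for word, n in counts.items() if n < min_count)


if __name__ == "__main__":
    sentences = ["আমি ভাত খাই।", "আমি Python শিখি।"]
    counts = word_frequencies(sentences)
    for word in build_blocklist(counts):
        print(word)
```

Words containing a zero-width joiner or non-joiner would be split by this pattern, so a real run would probably need those code points added to the character class.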
All the files related to the progress can be found in the repo mentioned earlier.