We want your feedback: Improving the sentence collection

  • Finding CC0 material is hard; everyone ‘liberal’ uses CC-BY.
  • Having a way to tag alternate sentences that mean the exact same thing.
    • In Norway we have so many ‘official’ ways to say things: some people only use one of them, some use the other, and very few people use both.
    • For example, the English “to be” can be either “å verta” or “å bli”.
    • We have a ton of this.
  • Having a built-in system for translation of English sentences.
    • I’ve done this manually for now. It’s boring, but I’m also a bit concerned that some interesting data (the connection English sentence <=> Norwegian sentence) is getting lost.
    • It also needs to have several output sentences for one input sentence.
  • When taking in new sentences, flag all new words so we can check whether they follow the correct grammar.
    • Sadly we even have “choose-your-own-adventure” grammar in Norwegian.
    • You have to be internally consistent, but you can choose to either write “to be” as “å vera” or “å vere”. Yes, in addition to “å bli”.
    • We would only want to have one of those forms in the corpus, so that the speech recognition only comes out in one consistent form.
    • That’s a hard problem, and I think Norwegian has it worse than most, but any language would benefit from rules, stats and information at import time (or during review).
  • We could have a simple way for people to contribute their blogs as corpus.
    • That’s how I’ve gotten most of the sentences I’m preparing for Norwegian.
    • However, it needs rather intensive proofreading.
    • Or even other places like Facebook / Twitter.
  • Not for sentence collection, but we also need to be able to say which dialect the person identifies as speaking.
    • They sound extremely different, so a good Norwegian speech recognition will need to have a good distribution.
    • This is also my main interest in this project, as the commercial speech recognition I’ve tried won’t understand you unless you change the way you speak.
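To make the import check and the translation link a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration only: the names `VARIANT_GROUPS`, `build_lookup`, `review_sentence` and `TranslationEntry` are hypothetical and not part of any existing tool, and the variant table would have to be maintained per language by people who know the grammar.

```python
import re
from dataclasses import dataclass, field

# Hypothetical variant table: each tuple lists spellings that are all
# allowed by the grammar. The first entry is the one the corpus keeps.
VARIANT_GROUPS = [
    ("vera", "vere"),
]

WORD_RE = re.compile(r"\w+")


def build_lookup(groups):
    """Map every non-preferred spelling to its preferred form."""
    lookup = {}
    for preferred, *others in groups:
        for other in others:
            lookup[other] = preferred
    return lookup


def review_sentence(sentence, known_words, lookup):
    """Return (unknown_words, non_preferred) for one candidate sentence.

    unknown_words: words seen neither in the corpus nor in the variant table.
    non_preferred: allowed spellings that differ from the form the corpus uses,
                   mapped to the preferred form.
    """
    words = [w.lower() for w in WORD_RE.findall(sentence)]
    non_preferred = {w: lookup[w] for w in words if w in lookup}
    unknown = sorted(
        {w for w in words if w not in known_words and w not in lookup}
    )
    return unknown, non_preferred


@dataclass
class TranslationEntry:
    """Keeps the link between one source sentence and all its translations."""
    source: str
    translations: list = field(default_factory=list)


# Example: one sentence to review, and one English sentence with two outputs.
lookup = build_lookup(VARIANT_GROUPS)
known = {"eg", "vil", "vera", "her"}
unknown, non_preferred = review_sentence("Eg vil vere her no.", known, lookup)

entry = TranslationEntry("I want to be here.")
entry.translations.append("Eg vil vera her.")
entry.translations.append("Eg vil vere her.")
```

A reviewer would then see the unknown words and the non-preferred spellings next to each sentence, and could either fix the sentence or add the word to the known list. Keeping the translations on the source entry is one simple way to avoid losing the English <=> Norwegian connection.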