- Finding CC0 stuff is hard; everyone ‘liberal’ uses CC-BY.
- Having a way to tag alternate sentences that mean the exact same thing.
- In Norway we have so many ‘official’ ways to say things. Some people only use one of them, some use the other, and very few people use both.
- In English “To be” can be either “Å verta” or “Å bli”.
- We have a ton of this.
- Having a built-in system for translation of English sentences.
- I’ve done this manually for now. It’s boring, but I’m also a bit concerned that some interesting data (the connection English sentence <=> Norwegian sentence) is getting lost.
- It also needs to have several output sentences for one input sentence.
- When taking in new sentences, check all new words so we can verify that they follow the correct grammar.
- Sadly we even have “choose-your-own-adventure” grammar in Norwegian.
- You have to be internally consistent, but you can choose to either write “to be” as “å vera” or “å vere”. Yes, in addition to “å bli”.
- We would only want to have one of those forms in the corpus, so that the speech recognition only comes out in one consistent form.
- That’s a hard problem, and I think Norwegian has it worse than most, but anyone would benefit from rules, stats and information on importing (or in review).
- We could have a simple way for people to contribute their blogs as corpus.
- That’s how I’ve gotten most of the sentences I’m preparing for Norwegian.
- However, it needs rather intensive proofreading.
- Or even other places like Facebook / Twitter.
- Not for sentence collection, but we also need to be able to say what dialect the person identifies as speaking.
- The dialects sound extremely different, so good Norwegian speech recognition will need a good distribution of them.
- This is also my main interest in this project, as commercial speech recognition I’ve tried won’t understand you unless you change the way you speak.
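
To make the import-time check on word forms a bit more concrete, here is a minimal sketch of what I have in mind. Everything in it is an assumption for illustration: the `PREFERRED_FORM` table, the function names, and the choice of “vera” as the form the corpus standardises on are all hypothetical. It also only looks at the infinitive, while a real check would need inflected forms too.

```python
import re

# Hypothetical table: non-preferred form -> form the corpus standardises on.
# Which form is "preferred" is an arbitrary assumption for this sketch.
PREFERRED_FORM = {"vere": "vera"}

# In Python 3, \w matches Unicode letters such as "å" for str patterns.
WORD_RE = re.compile(r"\w+")


def find_inconsistent_forms(sentence):
    """Return (found, preferred) pairs for each non-preferred word."""
    hits = []
    for word in WORD_RE.findall(sentence.lower()):
        if word in PREFERRED_FORM:
            hits.append((word, PREFERRED_FORM[word]))
    return hits


def review_import(sentences):
    """Split incoming sentences into accepted ones and ones flagged for review."""
    accepted, flagged = [], []
    for sentence in sentences:
        hits = find_inconsistent_forms(sentence)
        if hits:
            flagged.append((sentence, hits))
        else:
            accepted.append(sentence)
    return accepted, flagged


# Illustrative sentences only, not taken from any real corpus.
accepted, flagged = review_import([
    "Det er godt å vera her.",
    "Det er godt å vere her.",
])
print(accepted)
print(flagged)
```

Flagging instead of rewriting automatically is deliberate here: a reviewer still gets to decide, and the flagged list doubles as the kind of import stats mentioned above.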