I am noticing very slow progress with steps while testing.
Please advise. I have also tested in normal mode (without --utf8), and testing seems fast there.
UTF-8 mode increases the alphabet size to 256, and since you’re not passing a scorer, the decoder also can’t trim out-of-vocabulary beams during decoding, so the decoding process gets slower. If you build a scorer, it should counteract any slowness you’re seeing.
The out-of-vocabulary trimming problem is also made worse by UTF-8 mode, because without a scorer the decoder does not trim invalid UTF-8 sequences, so it spends time doing useless work.
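To make "invalid UTF-8 sequences" concrete, here is a minimal Python sketch of what pruning them could look like. It is not the decoder's actual code. The helper name `is_valid_utf8_prefix` and the candidate lists are made up for illustration, and the check relies only on Python's standard incremental UTF-8 decoder.

```python
import codecs

def is_valid_utf8_prefix(byte_labels):
    """Return True if the byte labels could still grow into valid UTF-8 text.

    Hypothetical helper for illustration only. An incomplete multi-byte
    character at the end is accepted, because a later byte may complete it.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False: trailing incomplete sequences are buffered, not rejected.
        decoder.decode(bytes(byte_labels), final=False)
    except UnicodeDecodeError:
        return False
    return True

# Made-up candidate beams, each a list of byte labels (0..255).
candidates = [
    list("hé".encode("utf-8")),  # complete, valid text
    [0xE4, 0xBD],                # first two bytes of a three-byte character
    [0xC3, 0x28],                # lead byte followed by a non-continuation byte
    [0x80],                      # continuation byte with no lead byte
]

# Keep only the beams that can still become valid text.
kept = [c for c in candidates if is_valid_utf8_prefix(c)]
```

Beams like the last two can never decode to text, so any time spent extending them is wasted, which is the useless work described above.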
No. There’s no universal rule, as it depends on what you’re comparing it to. Alphabet mode can be super slow with very large alphabets, for example. With a scorer, UTF-8 mode should be just as fast in most cases, and much faster if you have a large alphabet.
In general, unless you have a specific issue with alphabet mode that can be fixed by using UTF-8 mode, you should stick to using an alphabet. Some examples of these issues:
My alphabet is too large and makes the model big and slow.
I want to easily train on several languages at once.
I want to easily transfer from one language to another including the final layer.
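As a rough illustration of the first issue in that list, the sketch below compares how many output labels a small corpus would need in each mode. The corpus and the helper name `label_counts` are made up for the example. The only assumption about UTF-8 mode is the one stated earlier in the thread: labels are byte values, so there can never be more than 256 of them.

```python
def label_counts(corpus):
    """Count distinct characters and distinct UTF-8 byte values in a corpus.

    Hypothetical helper for illustration only.
    """
    chars = set()
    byte_values = set()
    for line in corpus:
        chars.update(line)                        # one label per character
        byte_values.update(line.encode("utf-8"))  # one label per byte value
    return len(chars), len(byte_values)

# Made-up multilingual sample; a real corpus would be far larger.
corpus = ["你好世界", "hello world", "こんにちは"]
alphabet_size, byte_label_count = label_counts(corpus)

print("alphabet mode labels:", alphabet_size)
print("UTF-8 mode labels (capped at 256):", byte_label_count)
```

In alphabet mode the label count keeps growing as new characters appear in the data, for example with thousands of CJK characters or several languages mixed together. The byte-level label set stays bounded, which is why UTF-8 mode helps in the first two cases listed above.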