Use wordpieces to replace the LM

This may be related to the recent success of the UTF-8 mode that reuben added in his PR on GitHub. I wonder whether anyone has experimented with wordpiece output tokens. UTF-8 mode uses only 256 tokens and reaches a WER of about 0.13 without a language model. If we used 1k wordpieces or more, it might be possible to get below 10% WER without an LM, and the overall model size would shrink a lot since no external LM would be needed. Increasing the number of tokens does enlarge the output layer, but certainly not proportionally to the total model size.
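
To make the idea concrete, here is a minimal sketch of what I have in mind, using the SentencePiece library to train a 1k wordpiece (BPE) vocabulary on the training transcripts and encode text into output token ids. The file names `transcripts.txt` and `wp1k` are just placeholders, and this is not the actual DeepSpeech pipeline, only an illustration:

```python
# Sketch: train a 1k wordpiece/BPE vocabulary with SentencePiece and
# encode a transcript into the ids that would form the output alphabet.
# `transcripts.txt` (one transcript per line) is a hypothetical input file.
import sentencepiece as spm

# Train the subword model: 1000 tokens instead of 256 UTF-8 bytes.
spm.SentencePieceTrainer.Train(
    '--input=transcripts.txt --model_prefix=wp1k '
    '--vocab_size=1000 --model_type=bpe --character_coverage=1.0'
)

# Load the model and encode a transcript into wordpiece ids.
sp = spm.SentencePieceProcessor()
sp.Load('wp1k.model')

print(sp.EncodeAsPieces('hello world'))  # subword pieces, e.g. ['▁hello', '▁wor', 'ld']
print(sp.EncodeAsIds('hello world'))     # corresponding integer ids
print(sp.GetPieceSize())                 # 1000 -> size of the new output layer
```

With CTC this would mean roughly 1000+1 logits per timestep instead of 256+1, which should stay small relative to the rest of the network, while the external LM and scorer could be dropped entirely.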

This paper is my inspiration (you can read it via Sci-Hub). The wordpiece model achieves performance comparable to RNN-T and LAS in Table 5, and outperforms char+LM in Table 7.

I may try this myself, so any advice or previous experiment results would be appreciated. Thanks!