Pre-specified symbol embeddings

I’ve adjusted the code a bit to be able to specify multi-character, space-delimited phonemes that map to pre-specified d-vectors, in the hope of incorporating external information about phonemes (mouth positions etc.) and improving performance (without success so far). Would anyone find this useful? If so, I think the way input ultimately gets to the network might need some reorganization to accommodate a ‘straight in’ path that bypasses cleaning, the potential for phoneme generation, etc.

I use a pickle file that holds the symbol embeddings; pointing ‘characters’ in the config JSON at it is all that’s needed, and it replaces the nn.Embedding layer in the model with an nn.Linear of the appropriate dims.
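
Roughly, the mechanics look like this (a minimal sketch rather than the actual change; the file name, pickle format and helper are assumptions, with the table mapping each phoneme string to a fixed vector):

```python
import pickle
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed pickle format: {phoneme string: 1-D vector}; name is illustrative.
with open("symbol_embeddings.pkl", "rb") as f:
    emb = pickle.load(f)

symbols = sorted(emb)                     # fixed symbol order -> index
sym_to_id = {s: i for i, s in enumerate(symbols)}
table = torch.tensor([emb[s] for s in symbols], dtype=torch.float)  # (V, D)

# nn.Linear computes x @ W.T, so storing the (V, D) table transposed means
# a one-hot input row selects the corresponding pre-specified vector.
embed = nn.Linear(len(symbols), table.shape[1], bias=False)
with torch.no_grad():
    embed.weight.copy_(table.t())

def embed_phonemes(text):
    """Space-delimited multi-character phonemes -> (T, D) embeddings."""
    ids = torch.tensor([sym_to_id[p] for p in text.split()])
    return embed(F.one_hot(ids, num_classes=len(symbols)).float())
```

Since a one-hot input to nn.Linear just selects a column of the weight matrix, this behaves like the usual embedding lookup, only with the vectors fixed up front.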

It seems to work OK, but bypassing the other config checks etc. appropriately makes things a little inelegant. I haven’t yet worked out the right way to do this properly, but I’m happy to do so if anyone else wants the capability.

It would be useful to be able to insert additional steps around this whole area (both in training and in inference). In the past I’d manually make a few code edits to intercept the phoneme inputs and substitute alternatives when the original text contained a heteronym. I haven’t got the code to hand, but I could point out where it’s applied this evening after work, although I expect you’re making changes in the same locations.
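
The hook amounted to something like this (hypothetical names and entries; it sits where the phonemizer’s output would normally flow on towards the model):

```python
# Override table: (word, intended sense) -> phoneme string to substitute.
# The entries and sense labels here are purely illustrative.
OVERRIDES = {
    ("read", "past"): "r ɛ d",
    ("read", "present"): "r iː d",
}

def maybe_override(word, phonemes, sense=None):
    """Substitute an alternative phoneme string when the sense is known."""
    return OVERRIDES.get((word, sense), phonemes)
```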

I also experimented with the server (for inference) so that it would check the input text for the presence of phonemes and, if found, skip the phonemization step. This meant you could normally use regular text but still handle edge-case pronunciations (although my version only worked at the whole-sentence level, for simplicity). It’s quite useful for testing.
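
The check itself was crude but effective, something along these lines, where PHONEME_SET and phonemize() stand in for the project’s own symbol table and phonemization call:

```python
def resolve_input(text, phoneme_set, phonemize):
    """Whole-sentence check: if every token is a known phoneme, skip G2P."""
    tokens = text.split()
    if tokens and all(t in phoneme_set for t in tokens):
        return tokens           # already phonemes: the 'straight in' path
    return phonemize(text)      # regular text: normal cleaning + phonemization
```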

Making the processing around those steps a more configurable pipeline would be ideal, but it comes at the cost of extra complexity.
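
For what it’s worth, I picture something like an ordered list of named steps driven by the config, so stages can be skipped, reordered or replaced (everything here is hypothetical):

```python
from typing import Callable, List

TextStep = Callable[[str], str]

def run_pipeline(text: str, steps: List[TextStep]) -> str:
    for step in steps:
        text = step(text)
    return text

# e.g. the config selects steps = [clean_text, phonemize] for normal input,
# or steps = [] for the 'straight in' phoneme path.
```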

Any strong views either way on this?