So my reading is that the test audio output is produced in the same way (i.e. with the same settings) as audio would be produced if you were using the model “for real” (i.e. at inference time, which in this case means taking some text and using the model to generate new output audio).
This test audio is probably the best audio to focus on because it represents how the model will perform when you actually use it. The other audio can sound better than what you’ll get from the model in real use because of the teacher forcing that the wiki also mentions.
I hope that helps. I’m no expert myself, just piecing things together as best I can.
Maybe others will have more insight but I think it’s a little difficult to answer without knowledge of your dataset.
At the risk of stating the obvious, the things that will make the biggest difference are the quantity and quality of your data. There’s some discussion elsewhere in the forum on this kind of thing, and it might be worth running the two data notebooks mentioned in the wiki on your own dataset - they give an insight into things like the distribution of text lengths and so on.
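For a rough flavour of the kind of thing those notebooks report, here’s a minimal sketch in Python - the metadata.csv filename and pipe-separated column layout are just LJSpeech-style assumptions, so adjust for your own dataset:

```python
# Summarise the distribution of transcript lengths in an LJSpeech-style
# metadata.csv (assumed layout: id|raw_text|normalised_text).
lengths = []
with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("|")
        if len(parts) >= 2:
            lengths.append(len(parts[-1]))  # character count of the transcript

lengths.sort()
print(f"utterances: {len(lengths)}")
print(f"chars per utterance (min/median/max): "
      f"{lengths[0]} / {lengths[len(lengths) // 2]} / {lengths[-1]}")
```

The actual notebooks go a fair bit further (plots, outlier checks etc), so this is only to give an idea of what they’re looking at.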
Regarding quantity, it’s worth noting that LJSpeech has ~24 hrs of audio (https://keithito.com/LJ-Speech-Dataset/). I managed to get fair results with a private dataset of roughly 12 hrs. Unfortunately I don’t (yet) have info on how it would perform with smaller amounts, but I’m guessing that unless you use some kind of transfer learning approach, you’d need data approaching this order of magnitude.
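If you want to check how much audio you actually have, something like this gives the total duration - again just a sketch, assuming PCM WAV clips in a wavs/ folder (the folder name is an assumption, point it at your own clip directory):

```python
import wave
from pathlib import Path

# Sum up the duration of every WAV clip in the dataset's wavs/ folder.
total_seconds = 0.0
for path in Path("wavs").glob("*.wav"):
    with wave.open(str(path), "rb") as w:
        total_seconds += w.getnframes() / w.getframerate()

print(f"total audio: {total_seconds / 3600:.2f} hours")
```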