Eval Audio, Test Audio, Train Audio on TensorBoard

Hi,
I'm quite confused by the presence of these 3 types of audio (as stated above) on TensorBoard.

At epoch 470/1000,
why do the Eval Audio and Train Audio sound good, while the Test Audio is so bad I can't even recognize what it's saying?

Which one of these audios should I be **most** concerned with?

What do all these 3 audios mean?

And …

Is TestAudio a combination of phonemes from different words?

What, if anything, did you find in the wiki about this?

Ouhh yeah, sorry. I missed that :pray: I only just noticed it. But I don't understand what it means by:

> However, test results are obtained by using the exact same setting in inference time

What does that mean?

So my reading is that the test audio output is produced in the same way (i.e. with the same settings) as audio would be produced if you were to use the model “for real” (i.e. when doing inference, which in this case means taking some text and using the model to generate new output audio).

This test audio is probably the best audio to focus on, because it represents how the model will perform when you actually use it. The other audio can appear better than what you'll get from the model in real use because of the teacher forcing that the wiki also mentions.
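
If a code sketch helps, here's the idea in miniature. This is not the actual Mozilla TTS code; `DummyDecoder`, `decode_teacher_forced` and `decode_free_running` are all made up purely to show the feedback difference:

```python
import numpy as np

# Hypothetical stand-in for the real attention decoder, just so the
# sketch runs; the real Tacotron decoder is far more complex.
class DummyDecoder:
    def step(self, prev_frame):
        # Pretend to predict the next mel frame from the previous one
        return prev_frame * 0.9 + 0.1

def decode_teacher_forced(decoder, ground_truth_frames):
    """Train/Eval-style decoding: the decoder is fed the *real* previous
    frame at every step, so its mistakes never accumulate."""
    outputs, prev = [], np.zeros(80)   # 80 = a typical number of mel channels
    for target in ground_truth_frames:
        outputs.append(decoder.step(prev))
        prev = target                  # ground truth in -> teacher forcing
    return outputs

def decode_free_running(decoder, n_frames):
    """Test/inference-style decoding: the decoder is fed its *own* previous
    prediction, exactly as when you synthesize brand-new text."""
    outputs, prev = [], np.zeros(80)
    for _ in range(n_frames):
        frame = decoder.step(prev)
        outputs.append(frame)
        prev = frame                   # own output in -> errors can compound
    return outputs
```

The only difference between the two loops is one line, but in the free-running case any early mistake gets fed back into the decoder, which is why the Test Audio can sound much worse mid-training.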

I hope that helps. I’m no expert myself, just piecing things together as best I can :slightly_smiling_face:

Woow, good explanation :+1: I'm currently at epoch 500/1000 with a batch size of 5, on Tacotron 1.

The EvalStat graphs, postnet and decoder, have already converged, but TrainEpochStat hasn't converged yet.

Do you think a batch size of 5 and 1000 epochs are sufficient to make the model converge?

Maybe others will have more insight, but I think it's a little difficult to answer without knowing your dataset.

At the risk of stating the obvious, the things that will make a difference are the quantity and quality of your data. There's some discussion elsewhere in the forum on this kind of thing, and it might be worth running the two data notebooks mentioned in the wiki on your data (they give an insight into things like the distribution of text lengths, etc.).
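
If you want a very rough stand-in for those notebooks, something like this gives a first look at the text-length distribution. The `metadata.csv` name and the pipe-separated "id|text" format are assumptions based on LJSpeech's layout, so adjust them to your dataset:

```python
from collections import Counter

# Collect transcript lengths from an LJSpeech-style metadata file,
# where each line looks like "file_id|text" (adjust path/format as needed)
lengths = []
with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        text = line.rstrip("\n").split("|")[-1]
        lengths.append(len(text))          # characters in the transcript

print(f"utterances: {len(lengths)}")
print(f"min/mean/max chars: {min(lengths)} / "
      f"{sum(lengths) / len(lengths):.1f} / {max(lengths)}")

# Crude text-length histogram in 20-character buckets
buckets = Counter(length // 20 * 20 for length in lengths)
for start in sorted(buckets):
    bar = "#" * max(1, buckets[start] // 50)   # one '#' per ~50 utterances
    print(f"{start:4d}-{start + 19:4d} chars: {bar}")
```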

Regarding quantity, it's worth noting that LJSpeech has ~24 hrs of audio (https://keithito.com/LJ-Speech-Dataset/). I managed to get fair results with a private dataset of ~12 hrs. Unfortunately I haven't (yet) got info on how it would perform with smaller amounts, but I'm guessing that unless you use some kind of transfer-learning approach, you'd need data approaching this order of magnitude.
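
If it's useful, here's a minimal sketch for totting up how much audio you have. It assumes your clips are PCM .wav files in a `wavs/` folder (adjust the path as needed) and uses only the standard-library `wave` module:

```python
import wave
from pathlib import Path

# Sum the duration of every .wav file: frames / sample rate = seconds
total_sec = 0.0
for wav_path in Path("wavs").glob("*.wav"):
    with wave.open(str(wav_path), "rb") as w:
        total_sec += w.getnframes() / w.getframerate()

hours, rem = divmod(total_sec, 3600)
print(f"total audio: {int(hours)}h {int(rem // 60)}min")
```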

Yeah… I already did the dataset analysis on my dataset using the two notebooks. I've only got 5 hours 30 minutes of audio so far (more to come).

Anyway, if you don't mind me asking: are the TestAudios generated randomly, by picking phonemes from any text?