So my reading is that the test audio output is produced in the same way (i.e. with the same settings) as audio would be produced if you were using the model “for real” (i.e. at inference time, which in this case means taking some text and using the model to generate new output audio).
This test audio is probably the best audio to focus on because it represents how the model will perform when you actually use it. The other audio can sound better than what you’ll get from the model in real use because of the teacher forcing that the wiki also mentions.
I hope that helps. I’m no expert myself, just piecing things together as best I can.
Maybe others will have more insight but I think it’s a little difficult to answer without knowledge of your dataset.
At the risk of stating the obvious, the things that will make the biggest difference are the quantity and quality of your data. There’s some discussion elsewhere in the forum on this kind of thing, and it might be worth running the two data notebooks mentioned in the wiki on your own dataset - they give an insight into things like the distribution of text lengths and so on.
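For a rough flavour of the kind of thing those notebooks report, here’s a minimal sketch in Python - the metadata.csv filename and pipe-separated column layout are just LJSpeech-style assumptions, so adjust for your own dataset:

```python
# Summarise the distribution of transcript lengths in an LJSpeech-style
# metadata.csv (assumed layout: id|raw_text|normalised_text).
lengths = []
with open("metadata.csv", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("|")
        if len(parts) >= 2:
            lengths.append(len(parts[-1]))  # character count of the transcript

lengths.sort()
print(f"utterances: {len(lengths)}")
print(f"chars per utterance (min/median/max): "
      f"{lengths[0]} / {lengths[len(lengths) // 2]} / {lengths[-1]}")
```

The actual notebooks go a fair bit further (plots, outlier checks etc), so this is only to give an idea of what they’re looking at.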
Regarding quantity, it’s worth noting that LJSpeech has ~24 hrs of audio (https://keithito.com/LJ-Speech-Dataset/). I managed to get fair results with a private dataset of roughly 12 hrs. Unfortunately I don’t (yet) have info on how it would perform with smaller amounts, but I’m guessing that unless you use some kind of transfer learning approach, you’d need data approaching this order of magnitude.
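If you want to check how much audio you actually have, something like this gives the total duration - again just a sketch, assuming PCM WAV clips in a wavs/ folder (the folder name is an assumption, point it at your own clip directory):

```python
import wave
from pathlib import Path

# Sum up the duration of every WAV clip in the dataset's wavs/ folder.
total_seconds = 0.0
for path in Path("wavs").glob("*.wav"):
    with wave.open(str(path), "rb") as w:
        total_seconds += w.getnframes() / w.getframerate()

print(f"total audio: {total_seconds / 3600:.2f} hours")
```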