If you want to attempt this journey, good luck. There is other documentation (and plenty of comments) about running multiple cards in the same system that you can consult.
Training a viable model is a relatively massive undertaking. A good dataset means tens to hundreds of hours of WAV files, and you will need commensurate computational hardware to handle it. This has been the case since Tacotron (an 8 GB+ GPU was recommended on Keith Ito's repo 4 or 5 years ago). While it's possible to use a smaller GPU, it hasn't been found to produce models of the same quality, or as quickly. Perhaps that is changing, but if you are serious about training a quality model in, say, a month or less, you're going to have a difficult time without a higher-end GPU.
Also, you don't need to buy a big GPU. There's Google Colab for testing and getting a model started, and cloud GPUs can be rented fairly cheaply these days; on preemptible instances the price can be brought down quite a bit. I've rented one myself, despite having 8 GB GPUs locally, for batch-sizing and performance reasons. It's a matter of balancing your wants and needs.
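Since batch size is one of the main things VRAM buys you, a back-of-the-envelope estimate can help when comparing local and rented cards. The helper below is purely illustrative and not from any training framework: the per-sample memory cost is an input you would have to measure empirically for your own model (for example, by probing trial batch sizes until you hit an out-of-memory error), and the names and reserve figure are assumptions.

```python
def suggest_batch_size(vram_gb, per_sample_mb, reserve_gb=1.0):
    """Rough batch-size estimate from available GPU memory.

    vram_gb       -- total GPU memory on the card, in GB
    per_sample_mb -- measured peak memory per training sample, in MB
                     (hypothetical figure; varies with model and clip length)
    reserve_gb    -- headroom kept free for CUDA context, activations
                     spikes, etc. (assumed value, tune to taste)
    """
    usable_mb = (vram_gb - reserve_gb) * 1024
    # Clamp to at least 1 so a tiny card still returns a usable answer.
    return max(1, int(usable_mb // per_sample_mb))

# An 8 GB local card vs. a rented 24 GB card, assuming ~200 MB/sample:
print(suggest_batch_size(8, 200))   # -> 35
print(suggest_batch_size(24, 200))  # -> 117
```

The real ceiling depends on clip lengths, precision, and optimizer state, so treat the result as a starting point for trial runs, not a guarantee.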