Pip install tts: when speaking multiple sentences, each sentence is spoken in a different voice

Ubuntu_Lover · July 29, 2023, 3:05pm

I’m amazed at the quality of some of these voices. Has anyone else run into this problem, though? I thought I’d ask before spending a day digging into how this works. I will post my solution here, and hopefully I’ll be able to upstream any fixes. IMO a project this big, paid for with a grant, should deliver something usable out of the box, not just a demo.

Can anyone offer advice or links? Is this where the pip project lives, and who still holds the keys or carries the responsibility? Is this it: https://github.com/mozilla/TTS

Ubuntu_Lover · July 29, 2023, 3:04pm

I’m using the CLI like this:

tts --text "Once you've ironed that out you can work to overcome the practical matters. Or not. Who knows. Lets go!" --model_name $a

I get the model name from:

tts --list_models
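
For anyone who wants to try the same text against several models, the two commands above can be scripted. This is only a rough sketch, assuming the `tts` executable is on the PATH. `--text` and `--model_name` are the flags used above and `--out_path` is the CLI's output-file flag; `build_tts_argv`, `synthesize_with_models` and the file naming scheme are made up for illustration.

```python
import subprocess


def build_tts_argv(text, model_name, out_path):
    """Return the argument list for one `tts` CLI call (a list, so no shell quoting)."""
    return ["tts", "--text", text, "--model_name", model_name, "--out_path", out_path]


def synthesize_with_models(text, model_names, run=subprocess.run):
    """Call the CLI once per model and return the output paths that were requested."""
    paths = []
    for name in model_names:
        # Hypothetical naming scheme: one wav per model, slashes made filename-safe.
        out_path = name.replace("/", "--") + ".wav"
        run(build_tts_argv(text, name, out_path), check=True)
        paths.append(out_path)
    return paths
```

The `run` parameter exists only so the function can be exercised without actually invoking `tts`; by default it is `subprocess.run`.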

Ubuntu_Lover · July 29, 2023, 3:03pm

Okay, it looks like I am just missing a command line option that is poorly documented or buried in the documentation. https://tts.readthedocs.io/en/dev/inference.html is quite helpful!

When I create a custom Python script like the following, I get the same behaviour.

import numpy as np
import torch
from pydub import AudioSegment
from pydub.playback import play
from TTS.tts.configs.bark_config import BarkConfig
from TTS.tts.models.bark import Bark

# Use the GPU when PyTorch can see one, otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU.")
else:
    device = torch.device("cpu")
    print("GPU not available. Using CPU.")

# Load Bark from the locally downloaded checkpoint.
config = BarkConfig()
model = Bark.init_from_config(config).to(device)
model.load_checkpoint(config, checkpoint_dir="/home/ubuntu/.local/share/tts/tts_models--multilingual--multi-dataset--bark", eval=True)

while True:
    text = input("Enter a sentence (or 'exit' to quit): ")
    if text.lower() == "exit":
        break
    elif text != "":
        output_dict = model.synthesize(text, config, speaker_id="random", voice_dirs=None)
        audio_data = output_dict['wav']
        # pydub expects integer PCM; assuming float samples in [-1, 1], convert to 16-bit.
        if np.issubdtype(audio_data.dtype, np.floating):
            audio_data = (np.clip(audio_data, -1.0, 1.0) * 32767).astype(np.int16)
        # frame_rate assumes Bark's 24 kHz output.
        audio_segment = AudioSegment(data=audio_data.tobytes(), sample_width=audio_data.dtype.itemsize,
                                     frame_rate=24000, channels=1)
        play(audio_segment)
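
The script above passes speaker_id="random", which is a likely reason every call comes out in a different voice. Here is a sketch of the alternative, assuming (as the Bark page in the TTS docs describes) that `synthesize` also accepts a named `speaker_id` together with a `voice_dirs` folder holding that speaker's reference file. `synthesize_fixed_voice` is a hypothetical helper, not part of the library, and I have not confirmed this removes the problem.

```python
def synthesize_fixed_voice(model, config, sentences, speaker_id, voice_dirs):
    """Synthesize every sentence with the same speaker_id instead of "random".

    `model` is expected to expose
    synthesize(text, config, speaker_id=..., voice_dirs=...) and return a dict
    with a "wav" entry, as in the script above.
    """
    wavs = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty input, like the loop above does
        out = model.synthesize(sentence, config, speaker_id=speaker_id, voice_dirs=voice_dirs)
        wavs.append(out["wav"])
    return wavs
```

It would be called as synthesize_fixed_voice(model, config, ["First.", "Second."], speaker_id="my_speaker", voice_dirs="bark_voices/"), where "my_speaker" and "bark_voices/" are placeholders for a speaker folder you have prepared yourself.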

The following works fine “out of the box” when I go to http://127.0.0.1:5002, but it doesn’t seem to be using my GPU:

tts-server --model_name "tts_models/en/jenny/jenny"

Now I just need to figure out what the actual names of the voices are and how to use a shorter path… the basic things that the landing page or intro docs should cover. Why ship demos instead of things that are fully usable out of the box?
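
On the shorter path: the checkpoint directory in my script looks like the model name with "/" replaced by "--", placed under ~/.local/share/tts. Here is a sketch that builds it from the model name instead of hard-coding it. The naming rule is inferred from the single path in this thread, not from the documentation, and `model_dir` is a hypothetical helper.

```python
from pathlib import Path


def model_dir(model_name, base=None):
    """Map e.g. "tts_models/multilingual/multi-dataset/bark" to its local folder.

    Assumption: downloaded models live in <base>/<model name with "/" -> "--">,
    with base defaulting to ~/.local/share/tts as seen in the script above.
    """
    if base is None:
        base = Path.home() / ".local" / "share" / "tts"
    return Path(base) / model_name.strip("/").replace("/", "--")
```

With that, the load call could take checkpoint_dir=str(model_dir("tts_models/multilingual/multi-dataset/bark")), provided the folder really exists on your machine.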