Creating a Streaming Interface

Hello, everyone. I am trying to make mozillatts take in a paragraph and read it aloud while it is still inferencing; i.e., play one sentence while processing the next ones instead of outputting them all in one batch. I am doing this:

import os
import sys
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = sys.stdin
data = fp.read()
sens = tokenizer.tokenize(data)
for sen in sens:
    os.system('curl -G --output - --data-urlencode text="' + sen + '" http://localhost:5002/api/tts | aplay')
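As an aside, a sentence containing a double quote or a `$` would break the `os.system` call above, since the text is spliced into a shell command. A minimal sketch of a safer variant using `subprocess` with argument lists (no shell involved); `tts_argv` and `speak` are hypothetical helper names, and the endpoint is the same local server assumed above:

```python
import subprocess

def tts_argv(sentence, url="http://localhost:5002/api/tts"):
    # Build curl's argument list directly; since no shell parses it,
    # quotes and $ characters inside the sentence are harmless.
    return ["curl", "-sG", "--output", "-",
            "--data-urlencode", "text=" + sentence, url]

def speak(sentence):
    # curl writes the WAV to stdout, which is piped straight into aplay.
    curl = subprocess.Popen(tts_argv(sentence), stdout=subprocess.PIPE)
    subprocess.run(["aplay", "-"], stdin=curl.stdout)
    curl.wait()
```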


Then I run that through a script:

#!/bin/bash
cat /dev/stdin | python /home/zoomerhimmer/bin/scripts/feed-speak.py - &
trap 'kill $!; exit 0' INT
wait

On the terminal:

$ echo "Say something. How about this? OK, that's good." | foliate-speak.sh

Do not be deceived! I am not a real developer. I got all this code from who-knows-where over the web. However, there are at least two issues with the above: 1) it doesn't inference ahead; it only plays a sentence and then processes the next one; 2) for some reason it doesn't work with the foliate e-reader.

I have been using just the server so far, but I'm wondering if it wouldn't be better to write a custom synthesis script. My hunch is that I'll need to dig deeper (maybe at the synthesizer.py level?) to be able to queue the audio and inference simultaneously. And I'll actually have to learn python for real this time.
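For what it's worth, the "queue the audio and inference simultaneously" idea is a classic producer/consumer pattern. A minimal sketch with the actual TTS call and playback stubbed out as the stand-in callables `synthesize` and `play` (both hypothetical names), so only the queueing logic is shown:

```python
import queue
import threading

def stream_speak(sentences, synthesize, play):
    # Producer/consumer: synthesis keeps running in the background
    # while earlier sentences are played.
    q = queue.Queue(maxsize=2)  # small buffer: stay a sentence or two ahead
    DONE = object()             # sentinel marking the end of the stream

    def producer():
        for s in sentences:
            q.put(synthesize(s))  # blocks if the buffer is already full
        q.put(DONE)

    t = threading.Thread(target=producer)
    t.start()
    while True:
        wav = q.get()
        if wav is DONE:
            break
        play(wav)  # playback blocks here; the producer keeps inferencing
    t.join()
```

The bounded queue is the design choice that matters: it lets inference run ahead of playback without synthesizing the whole paragraph up front.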

Anyway, I will keep working on it and let you guys know if I succeed. Besides, it's probably something very simple, programmatically speaking. There's a deepspeech streaming script right here: https://github.com/mozilla/DeepSpeech-examples/blob/r0.9/mic_vad_streaming/mic_vad_streaming.py. So I don't think it should be impossible.

Of course, if anyone feels inclined to point me to a ready-made solution like deepspeech's, I would be more than happy to drop all my pride in workmanship and run gleefully through the easy gate. That way I can hold off on learning python until I next find myself in a sticky wicket :)

Previously, something like this worked with foliate:

curl -G --output - --data-urlencode text="$(cat /dev/stdin)" 'http://localhost:5002/api/tts' | aplay - &
trap 'kill $!; exit 0' INT
wait


But it would take a long time to process a page, though it might be practicable with a GPU.

I found a partial solution. But it still has three issues. I will write it up after I get better Internet and some sleep.

I made the following changes to the synthesizer.py file:

1,2d0
< from collections import deque
< import subprocess
20,22d17
< class MutableWrapper(object):
<     def __init__(self, value):
<         self.value = value
120,124d114
<     def play_sentence(self, process, queue):
<         if process.value == "start" or process.value.poll() is not None:
<             self.save_wav(queue.popleft(), '/home/zoomerhimmer/tmp/wavs/tmp.wav')
<             process.value = subprocess.Popen(["aplay", "/home/zoomerhimmer/tmp/wavs/tmp.wav"])
<
128,129d117
<         queue = deque()
<         process = MutableWrapper("start")
176,179d163
<         # added
<         queue.append(waveform)
<         self.play_sentence(process, queue)
<
182,185d165
<
<         # finish the queue
<         while queue:
<             self.play_sentence(process, queue)

Three problems persist, each followed by a tentative solution: 1) while still in the inferencing loop, the program has to wait until it reaches the play_sentence line (try an event/listener approach); 2) text in double quotes is treated as one sentence (use a different tokenizing technique); 3) when foliate turns pages, the audio can overlap (use the system's PIDs to check whether it's OK to begin playing).
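For problem 1, one way to sketch the event/listener idea is a dedicated player thread that waits on a condition variable: inference just appends a finished wav and moves on, and the player starts the next aplay the moment the previous one exits. `SentencePlayer` is a hypothetical class name, and `play_cmd` is a hypothetical hook (defaulting to aplay) so the logic can be exercised without audio hardware:

```python
import subprocess
import threading
from collections import deque

class SentencePlayer:
    """Background player: inference appends wav paths and keeps going;
    this thread plays them one after another as soon as each is ready."""

    def __init__(self, play_cmd=lambda path: subprocess.run(["aplay", path])):
        self.play_cmd = play_cmd
        self.queue = deque()
        self.cond = threading.Condition()
        self.done = False
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def add(self, wav_path):
        # Called from the inference loop; returns immediately.
        with self.cond:
            self.queue.append(wav_path)
            self.cond.notify()

    def close(self):
        # Signal end of input and wait for remaining sentences to play.
        with self.cond:
            self.done = True
            self.cond.notify()
        self.thread.join()

    def _run(self):
        while True:
            with self.cond:
                while not self.queue and not self.done:
                    self.cond.wait()   # sleep until a sentence arrives
                if not self.queue and self.done:
                    return
                path = self.queue.popleft()
            self.play_cmd(path)  # blocks until this sentence finishes playing
```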
I use this to interact with foliate:

curl -G --output - --data-urlencode text="$(cat /dev/stdin)" 'http://localhost:5002/api/tts' &
trap 'kill $!; exit 0' INT
wait

Of course, it completely breaks the way the original scripts were meant to be used, but I’m just trying to get it to work for now.