Our WebSocket-based TTS Streaming solution greatly reduces the latency of Text-to-Speech conversion and places no limit on the length of the text.

We support two kinds of streaming:

  • TTS Simplex streaming - allows you to send the text to the server once and receive the converted audio response as a stream.
  • TTS Duplex streaming - allows you to asynchronously send text to the server multiple times and receive back the converted audio. You control when the stream ends.

Both kinds are supported in our SDKs. If you use an SDK, you don't need to worry about the low-level details described in the rest of this article.

TTS Simplex streaming

First, connect to the WebSocket endpoint below. Specify the voice you want to use in the voiceId query parameter.

wss://api.gemelo.ai/v1/tts/stream/simplex/ws?voiceId=xxx
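As a minimal sketch, the endpoint URL can be assembled with the Python standard library so that the voice ID is safely escaped. The helper name stream_url and the BASE_URL constant are hypothetical; only the URL shape comes from this article. The same shape applies to the duplex endpoint described later.

```python
from urllib.parse import urlencode

# Common prefix of the streaming endpoints shown in this article.
BASE_URL = "wss://api.gemelo.ai/v1/tts/stream"


def stream_url(mode: str, voice_id: str) -> str:
    """Return the WebSocket URL for a streaming mode ("simplex" or "duplex").

    Hypothetical helper: the name and signature are illustrative only.
    """
    if mode not in ("simplex", "duplex"):
        raise ValueError(f"unknown streaming mode: {mode!r}")
    # urlencode escapes characters that are not safe in a query string.
    query = urlencode({"voiceId": voice_id})
    return f"{BASE_URL}/{mode}/ws?{query}"
```

For example, stream_url("simplex", "xxx") yields the simplex endpoint shown above.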

Next, send the following message to authenticate with the service:

{
  "type": "authApiKey",
  "clientKey": "xxx",
  "apiKey": "yyy" 
}

Then, send another message with the text to convert:

{
  "type": "convert",
  "text": "xxx"
}

Now listen for the incoming raw binary data. Do not close the connection yourself; the server closes it as soon as it finishes converting the text.
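The simplex steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the official client: it assumes the third-party websockets package is installed, the function name simplex_tts and its connect parameter are hypothetical, and any non-binary frame is simply ignored because this article only describes binary audio frames.

```python
import asyncio
import json
from urllib.parse import urlencode


async def simplex_tts(voice_id, client_key, api_key, text, connect=None) -> bytes:
    """Send one text over the simplex endpoint and return the collected audio bytes.

    Hypothetical helper. `connect` lets a caller inject another WebSocket
    factory (for example a fake in tests); by default the third-party
    `websockets` package is used, which is an assumption of this sketch.
    """
    if connect is None:
        import websockets  # third-party: pip install websockets

        connect = websockets.connect

    query = urlencode({"voiceId": voice_id})
    url = f"wss://api.gemelo.ai/v1/tts/stream/simplex/ws?{query}"
    audio = bytearray()

    async with connect(url) as ws:
        # Step 1: authenticate.
        await ws.send(json.dumps(
            {"type": "authApiKey", "clientKey": client_key, "apiKey": api_key}
        ))
        # Step 2: send the text to convert.
        await ws.send(json.dumps({"type": "convert", "text": text}))
        # Step 3: read frames until the server closes the connection.
        # With `websockets`, this loop ends on a normal close and raises
        # ConnectionClosedError on an abnormal one.
        async for frame in ws:
            if isinstance(frame, bytes):
                audio.extend(frame)

    return bytes(audio)
```

A caller could run it with asyncio.run(simplex_tts("xxx", "xxx", "yyy", "Hello")) and write the returned bytes to a file or feed them to an audio player. The audio encoding is not stated in this article, so the sketch treats the data as opaque bytes.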

TTS Duplex streaming

As with simplex streaming, connect to the WebSocket endpoint and choose the voice. The endpoint, however, is different:

wss://api.gemelo.ai/v1/tts/stream/duplex/ws?voiceId=xxx

Now, send the authentication command:

{
  "type": "authApiKey",
  "clientKey": "xxx",
  "apiKey": "yyy" 
}

From now on, you can asynchronously listen for incoming raw binary data and send the convert command to the server as many times as you want:

{
  "type": "convert",
  "text": "xxx"
}

When you are done, you can finish the stream by sending the close command. The server will flush the buffer and gracefully close the connection (you will receive a WebSocket close event when it does):

{
  "type": "close"
}

Alternatively, you can close the WebSocket connection yourself, using whatever close method the WebSocket client in your language provides.
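The duplex flow can be sketched in Python along these lines: one task receives audio while the main coroutine sends convert commands and finally the close command. As an illustration only, it assumes the third-party websockets package; the function name duplex_tts and its connect parameter are hypothetical, and non-binary frames are ignored because this article only describes binary audio frames.

```python
import asyncio
import json
from urllib.parse import urlencode


async def duplex_tts(voice_id, client_key, api_key, texts, connect=None) -> bytes:
    """Send several texts over the duplex endpoint and return the collected audio.

    Hypothetical helper. `connect` allows injecting another WebSocket factory;
    by default the third-party `websockets` package is used (an assumption).
    """
    if connect is None:
        import websockets  # third-party: pip install websockets

        connect = websockets.connect

    query = urlencode({"voiceId": voice_id})
    url = f"wss://api.gemelo.ai/v1/tts/stream/duplex/ws?{query}"
    audio = bytearray()

    async with connect(url) as ws:
        await ws.send(json.dumps(
            {"type": "authApiKey", "clientKey": client_key, "apiKey": api_key}
        ))

        async def receive():
            # Runs concurrently with the sends below and ends when the
            # server closes the connection.
            async for frame in ws:
                if isinstance(frame, bytes):
                    audio.extend(frame)

        receiver = asyncio.create_task(receive())

        # Send the convert command as many times as needed.
        for text in texts:
            await ws.send(json.dumps({"type": "convert", "text": text}))

        # Ask the server to flush its buffer and close gracefully,
        # then wait for the remaining audio.
        await ws.send(json.dumps({"type": "close"}))
        await receiver

    return bytes(audio)
```

Sending the close command and then waiting for the receiver, rather than closing the socket immediately, gives the server the chance to flush its buffer as described above. In a real application the texts would typically arrive over time (for example from a text generator) and the audio would be played as it arrives instead of being collected in memory.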