How to create a dialogue system using the OpenAI and Charactr APIs?

The main idea is to send three requests:

  1. sending an audio sample to Whisper (ASR) to convert speech to text,
  2. then sending this text to ChatGPT to get a text response,
  3. finally sending this response to gemelo.ai TTS to convert text to speech and get a voice response.

This is one iteration of the dialogue. To hold a longer conversation, we repeat these three steps many times and collect the text requests and responses to keep the context. Such a dialogue system can be implemented in a few small steps:

  1. First, install the charactr-api-sdk library (to use the gemelo.ai API for TTS) and the openai library (to use the OpenAI API for Whisper and ChatGPT).
pip install charactr-api-sdk openai

Also install the other required libraries.

pip install requests ipywebrtc ipython
  2. Load all required libraries.
import json
from typing import Dict, List

import IPython.display
import openai
import requests
from charactr_api import CharactrAPISDK, Credentials
from ipywebrtc import AudioRecorder, CameraStream
  3. Set up your keys for the OpenAI and gemelo.ai APIs so that you can connect to both services with your accounts.
openai_api_key = 'xxxx'
charactr_client_key = 'yyyy'
charactr_api_key = 'zzzz'

openai.api_key = openai_api_key
  4. Create a CharactrAPISDK instance using your gemelo.ai keys. It gives you access to TTS and lets you choose a voice.
credentials = Credentials(client_key=charactr_client_key, api_key=charactr_api_key)
charactr_api = CharactrAPISDK(credentials)
  5. Check the list of available voices to choose one of them.
charactr_api.tts.get_voices()

You get the list of available voices as output. Choose one and set the voice_id variable; this voice will be used to generate the voice responses.

voice_id = 136
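
If you prefer to pick a voice programmatically, you can loop over the returned list. A minimal sketch, assuming each entry is a dictionary that exposes id and name fields (inspect one entry of the actual output if the field names differ):

voices = charactr_api.tts.get_voices()
for voice in voices:
    # field names assumed; print(voice) to see the exact structure
    print(voice['id'], voice['name'])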
  6. Set up the model and other parameters for ChatGPT. You can find the list of all parameters and their explanations here: https://platform.openai.com/docs/api-reference/chat/create
model = 'gpt-3.5-turbo'
parameters = {
    'temperature': 0.8,
    'max_tokens': 150,
    'top_p': 1,
    'presence_penalty': 0,
    'frequency_penalty': 0,
    'stop': None
}

Using the above parameters, define a function that generates a text response with ChatGPT based on the conversation history (context) and the current request, and appends both the request and the response to the conversation.

def update_conversation(request: str, conversation: List[Dict]) -> None:
    """
    Run a request to ChatGPT to get a response 
    and update the conversation.
    """
    try:
        user_request = {'role': 'user', 'content': request}
        conversation.append(user_request)
        result = openai.ChatCompletion.create(model=model,
                                              messages=conversation,
                                              **parameters)
        response = result['choices'][0]['message']['content'].strip()
        bot_response = {'role': 'assistant', 'content': response}
        conversation.append(bot_response)
    except Exception as e:
        raise Exception(e)

Next, describe the bot with a system message and initialize the conversation variable.

conversation = [{'role': 'system', 'content': 'You are AI Assistant.'}]
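
To sanity-check the function with a plain text request before wiring in speech, you can run it once and print the latest response (the prompt below is just an illustration):

update_conversation('Hello! What can you help me with?', conversation)
print(conversation[-1]['content'])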
  7. Define a function that runs a request to Whisper to convert speech to text. You can find the list of all parameters and their explanations here: https://platform.openai.com/docs/api-reference/audio
def speech2text(audio_path: str) -> str:
    """Run a request to Whisper to convert speech to text."""
    try:
        with open(audio_path, 'rb') as audio_f:
            result = openai.Audio.transcribe('whisper-1', audio_f)
        text = result['text']
    except Exception as e:
        raise Exception(e)
    return text
  8. Create an audio recorder in a notebook.
camera = CameraStream(constraints={'audio': True, 'video': False})
recorder = AudioRecorder(stream=camera)
recorder

Use the record button in the notebook to record your request. Then save your recording.

temp_path = 'recording.webm'
with open(temp_path, 'wb') as f:
    f.write(recorder.audio.value)
  9. Generate a voice response to your recording using Whisper, ChatGPT, and gemelo.ai TTS.
# convert the recorded audio to text
input_text = speech2text(temp_path)
# generate a text response and update conversation
update_conversation(input_text, conversation)
# convert a text response to speech
tts_result = charactr_api.tts.convert(voice_id, conversation[-1]['content'])

The convert call returns an Audio object with the fields: data, type, duration_ms, size_bytes. To listen to the output voice response in a notebook, run:

IPython.display.Audio(tts_result['data'])

You can also save the output voice response as a file.

with open('output.wav', 'wb') as f:
    f.write(tts_result['data'])

To continue the conversation, repeat steps 8 and 9.
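
To make repeating them easier, you can wrap both steps in a small helper that reuses the objects defined above. A minimal sketch (the function name run_dialogue_turn is just an illustration):

def run_dialogue_turn(recorder, conversation, voice_id):
    """Save the current recording, run one dialogue iteration and return the TTS result."""
    temp_path = 'recording.webm'
    with open(temp_path, 'wb') as f:
        f.write(recorder.audio.value)
    # speech -> text, text -> ChatGPT response, response -> speech
    input_text = speech2text(temp_path)
    update_conversation(input_text, conversation)
    return charactr_api.tts.convert(voice_id, conversation[-1]['content'])

# record a new request with the recorder widget above, then run:
tts_result = run_dialogue_turn(recorder, conversation, voice_id)
IPython.display.Audio(tts_result['data'])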