
Resource for Röda Korset

Week 2 Summary

[Image by Röda Korset]

Hi,

This week, we started by creating a private OpenAI account and purchasing credits to test the APIs. Implementing OpenAI’s Whisper (speech-to-text), GPT-3.5-turbo, and TTS (text-to-speech) went smoothly, using the OpenAI Node.js library and environment variables for the API keys. We prompted GPT-3.5-turbo to translate the transcribed text into Swedish or Ukrainian, depending on the detected source language.
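For illustration, here is a minimal sketch of that pipeline using the v4 `openai` Node.js package; the file paths, voice, and prompt wording are placeholders, not our exact production code:

```ts
import fs from "fs";
import OpenAI from "openai";

// The API key is read from the OPENAI_API_KEY environment variable.
const openai = new OpenAI();

async function translateTurn(inputPath: string, outputPath: string) {
  // 1. Whisper: speech-to-text.
  const transcription = await openai.audio.transcriptions.create({
    model: "whisper-1",
    file: fs.createReadStream(inputPath),
  });

  // 2. GPT-3.5-turbo: translate based on the detected source language.
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "system",
        content:
          "If the text is Swedish, translate it to Ukrainian; " +
          "if it is Ukrainian, translate it to Swedish. " +
          "Reply with the translation only.",
      },
      { role: "user", content: transcription.text },
    ],
  });
  const translated = completion.choices[0].message.content ?? "";

  // 3. TTS: convert the translated text back into speech (MP3).
  const speech = await openai.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: translated,
  });
  fs.writeFileSync(outputPath, Buffer.from(await speech.arrayBuffer()));
}
```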

After implementing all three APIs, we began testing the button functionality. We programmed the "Prata" (speak) and "Avsluta" (end) buttons so that pressing them during playback stops the audio and either lets the user speak again or ends the session, respectively.
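A rough sketch of that behaviour, assuming playback goes through an HTMLAudioElement; `startRecording` and `endSession` are hypothetical stand-ins for the app's real helpers:

```ts
let playback: HTMLAudioElement | null = null;

// Hypothetical helpers; the real app wires these to the recorder and session state.
function startRecording(): void { /* ... */ }
function endSession(): void { /* ... */ }

function stopPlayback(): void {
  if (playback && !playback.paused) {
    playback.pause();
    playback.currentTime = 0;
  }
}

// "Prata": interrupt any ongoing playback and let the user speak again.
function onPrata(): void {
  stopPlayback();
  startRecording();
}

// "Avsluta": interrupt playback and end the session.
function onAvsluta(): void {
  stopPlayback();
  endSession();
}
```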

The major challenge this week has been the translation process. At first, we recorded each person's entire turn (multiple sentences) before sending that audio file through the translation pipeline. This meant a delay of at least five seconds between pressing the "Sluta prata" (stop speaking) button and hearing the translated audio, because we had to wait for Whisper to transcribe the whole file, GPT-3.5-turbo to translate it, and TTS to convert the text back into audio. We explored various methods for real-time playback and streaming, but found that only TTS supports streaming playback, which on its own wasn't enough to solve the overall delay.
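For completeness, this is roughly what consuming the TTS output as a stream looks like with the v4 SDK, assuming an ES module on Node 18+ (the response body is a web ReadableStream; the input text and output file are illustrative):

```ts
import fs from "fs";
import { Readable } from "node:stream";
import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.audio.speech.create({
  model: "tts-1",
  voice: "alloy",
  input: "Hej! Hur kan jag hjälpa dig?",
});

// Consume the MP3 bytes as they arrive instead of waiting for the full file.
// The cast bridges the DOM and node:stream/web ReadableStream types.
Readable.fromWeb(response.body as any).pipe(fs.createWriteStream("speech.mp3"));
```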

Therefore, we improved the pipeline with an automated script that cuts the audio as soon as there is a natural pause (between sentences); each snippet is then sent through the translation pipeline while the next snippet is still being recorded. This brought the waiting time down to less than one second before the first translated audio is heard, since recording, translation, and playback now happen simultaneously. The first snippets can be played immediately after pressing the "Sluta prata" button, while the remaining snippets are processed concurrently during playback.
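Conceptually, the pipelining works like this (a sketch only: `translateSnippet` stands in for the Whisper → GPT-3.5-turbo → TTS chain shown earlier, and `playAudio` and the snippet callback are hypothetical names):

```ts
// Translations are started eagerly as snippets are cut; results play in order.
const pendingTranslations: Promise<Buffer>[] = [];

// Stand-in for the Whisper -> GPT-3.5-turbo -> TTS chain sketched earlier.
async function translateSnippet(snippet: Buffer): Promise<Buffer> {
  return snippet; // placeholder
}

async function playAudio(mp3: Buffer): Promise<void> {
  // placeholder: hand the buffer to the app's audio player
}

// Called by the recorder whenever a natural pause is detected.
function onSnippetCut(snippet: Buffer): void {
  // Start translating immediately; recording of the next snippet continues.
  pendingTranslations.push(translateSnippet(snippet));
}

// Called when the user presses "Sluta prata".
async function onSlutaPrata(): Promise<void> {
  // The first snippet is usually translated already, so playback starts at once;
  // later snippets finish translating while the earlier ones play.
  for (const translation of pendingTranslations) {
    await playAudio(await translation);
  }
  pendingTranslations.length = 0;
}
```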

Next week, we will continue exploring how to handle small audio snippets. We discovered that Whisper sometimes struggles to transcribe short snippets without the context of the surrounding sentences (one possible remedy is sketched below). After that, we will implement real-time text display on the app screen, synchronized with the audio playback, so users can read along.
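One documented way to give Whisper that missing context is its optional `prompt` parameter, which can carry the transcript of preceding audio into the next request. A minimal sketch of the idea, not something we have implemented yet:

```ts
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Transcribe snippets in order, feeding each transcript back in as
// context for the next one via Whisper's optional `prompt` parameter.
async function transcribeWithContext(snippetPaths: string[]): Promise<string[]> {
  const texts: string[] = [];
  let context = "";
  for (const path of snippetPaths) {
    const result = await openai.audio.transcriptions.create({
      model: "whisper-1",
      file: fs.createReadStream(path),
      prompt: context, // transcript of the preceding snippets
    });
    texts.push(result.text);
    // Keep only recent context; the prompt has a limited token budget.
    context = (context + " " + result.text).slice(-500);
  }
  return texts;
}
```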