Project Overview

About the Project

AI technology has advanced significantly over the past few years. Familiar examples include voice assistants such as Siri and Alexa, generative models such as ChatGPT and DALL-E, self-driving systems, and image recognition. These tools are part of everyday life and support us in many ways. However, AI does not have emotions, and the AI we are most familiar with only generates text, so it is difficult to have a natural, human-like conversation with it.

Features

Speech to Text:

Skill: Web Speech API
Role: Converts user voice input into text data in real time.
Data format: Speech - Text (input)

AI Response:

Skill: OpenAI API
Role: Generates a text response based on the text input.
Data format: Text (input) - Text (response)

Emotional Analysis:

Skill: OpenAI API
Role: Analyzes the user's input text and the AI's response to determine the avatar's expression, chosen from: [smiling, sleepy, expectation, sad, embarrassed, surprised, angry]
Data format: Text (input + response) - Text (emotion/expression data)
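
The emotion step could be sketched roughly as follows. This is only an illustration, not the project's actual code: the helper names `buildEmotionPrompt` and `parseExpression`, the prompt wording, and the fallback expression are all assumptions. It shows how the two texts might be combined into one prompt, and how a free-form model reply might be mapped back onto one of the seven expressions.

```typescript
// The seven expression labels listed in the Emotional Analysis feature.
const EXPRESSIONS = [
  "smiling", "sleepy", "expectation", "sad",
  "embarrassed", "surprised", "angry",
] as const;

type Expression = (typeof EXPRESSIONS)[number];

// Hypothetical helper: build a prompt asking the model for one label.
// The wording here is an assumption, not the project's real prompt.
function buildEmotionPrompt(userText: string, aiText: string): string {
  return [
    "Choose the one expression that best fits this exchange.",
    `Answer with exactly one word from: ${EXPRESSIONS.join(", ")}.`,
    `User: ${userText}`,
    `AI: ${aiText}`,
  ].join("\n");
}

// Hypothetical helper: map the model's raw reply onto a known label.
// Returns the first label (in list order) found anywhere in the reply,
// or the fallback (an assumed default) when none is found.
function parseExpression(raw: string, fallback: Expression = "smiling"): Expression {
  const cleaned = raw.trim().toLowerCase();
  for (const label of EXPRESSIONS) {
    if (cleaned.indexOf(label) !== -1) return label;
  }
  return fallback;
}
```

Validating the reply against a fixed list means an unexpected answer from the model cannot put the avatar into an undefined state.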

Text to Speech:

Skill: Amazon Polly
Role: Converts response text into natural speech.
Data format: Text (response) - MP3

Audio Analysis and Play:

Skill: Web Audio API
Role: Plays back the MP3 audio generated by Amazon Polly, and analyzes the audio data to obtain frequency data for the avatar's lip sync.
Data format: MP3 - Frequency data
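
One simple way to turn frequency data into lip sync is to average the frequency bins and use the result as a mouth-open amount. The sketch below assumes the data comes from `AnalyserNode.getByteFrequencyData` (values 0 to 255). The helper name `mouthOpenness` and the `gain` factor are assumptions for illustration, and the project may map audio to the mouth differently.

```typescript
// Hypothetical helper: convert one frame of byte frequency data
// (0-255 per bin) into a mouth-open value between 0 and 1.
function mouthOpenness(freqData: Uint8Array, gain: number = 2.0): number {
  if (freqData.length === 0) return 0;

  let sum = 0;
  for (let i = 0; i < freqData.length; i++) {
    sum += freqData[i];
  }

  // Average bin level, normalised from 0..255 to 0..1.
  const average = sum / freqData.length / 255;

  // Speech rarely fills every bin, so boost by an assumed gain and clamp.
  return Math.min(1, average * gain);
}

// Typical browser usage (not run here), once per animation frame:
//   const buffer = new Uint8Array(analyser.frequencyBinCount);
//   analyser.getByteFrequencyData(buffer);
//   const open = mouthOpenness(buffer);
//   ...then apply `open` to the avatar's mouth-open parameter.
```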

Avatar Motion:

Skill: Live2D Cubism SDK for Web
Role: Provides the avatar with expressions and movements, and displays animations based on user responses.

Avatar:

Skill: Live2D
Role: Avatar made with Live2D

Auth:

Skill: Firebase Auth
Role: Manages user authentication and provides secure access to the system.

System

System Architecture

Data Flow

Technologies

Web Speech API (speech-to-text):

The Web Speech API is a native browser API specified through the W3C. It is free to use and does not require any auth keys. It supports both speech recognition (speech to text) and speech synthesis (text to speech) from JavaScript. It requires no backend of your own and is easy to implement. This project uses it for speech to text.
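
A minimal sketch of speech to text with this API might look like the following. The helper names `findRecognitionCtor` and `startListening`, the `RecognitionLike` interface, and the default language are assumptions made for this example. The interface declares only the parts of the browser's `SpeechRecognition` object that are used here, so the snippet does not depend on DOM typings.

```typescript
// Minimal shape of the browser's SpeechRecognition object (only what is used here).
interface RecognitionLike {
  lang: string;
  continuous: boolean;
  interimResults: boolean;
  onresult: ((event: any) => void) | null;
  start(): void;
}

type RecognitionCtor = new () => RecognitionLike;

// Look up the constructor; some browsers expose it with a webkit prefix.
// Returns null when speech recognition is not available.
function findRecognitionCtor(scope: any = globalThis): RecognitionCtor | null {
  return scope.SpeechRecognition ?? scope.webkitSpeechRecognition ?? null;
}

// Hypothetical helper: start listening and pass each final transcript to onText.
function startListening(
  Ctor: RecognitionCtor,
  onText: (text: string) => void,
  lang: string = "en-US", // assumed default language
): RecognitionLike {
  const recognition = new Ctor();
  recognition.lang = lang;
  recognition.continuous = true;      // keep listening across pauses
  recognition.interimResults = false; // only report final results

  recognition.onresult = (event) => {
    // event.results holds one entry per phrase; each entry lists ranked alternatives.
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      if (result.isFinal) onText(result[0].transcript.trim());
    }
  };

  recognition.start();
  return recognition;
}
```

In the browser, this could be used as `const Ctor = findRecognitionCtor(); if (Ctor) startListening(Ctor, (text) => console.log(text));`.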

OpenAI API:

The OpenAI API is a platform that provides a wide variety of functions built on natural language processing (NLP). After authenticating with an API key, developers can use text generation, image recognition, voice processing, image generation, video generation, and more.
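
As a rough illustration of API-key authentication and text generation, the sketch below builds a request for the Chat Completions endpoint and reads the reply text from the response. The helper names `buildChatRequest` and `extractReply` and the default model name are assumptions for this example, and the project may use a different endpoint or the official client library instead.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical helper: build the URL and fetch options for one chat request.
// The default model name is an assumption; use whichever model the project targets.
function buildChatRequest(apiKey: string, messages: ChatMessage[], model: string = "gpt-4o-mini") {
  return {
    url: "https://api.openai.com/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // The API key is sent as a bearer token.
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}

// Pull the reply text out of a parsed Chat Completions response.
// Returns an empty string when the expected fields are missing.
function extractReply(response: { choices?: { message?: { content?: string | null } }[] }): string {
  return response.choices?.[0]?.message?.content ?? "";
}

// Typical usage (not run here):
//   const req = buildChatRequest(apiKey, [{ role: "user", content: "Hello" }]);
//   const res = await fetch(req.url, req.init);
//   const reply = extractReply(await res.json());
```

Because the API key grants access to a paid account, it is usually safer to send this request from a server than to embed the key in browser code.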

Amazon Polly (text-to-speech):

Amazon Polly is a text-to-speech service that uses deep learning to generate natural, human-like speech. It is available in many languages. You can choose between standard voices and higher-quality neural voices, with 62 different voices that vary by gender and intended usage. The output files are encrypted with the 256-bit Advanced Encryption Standard (AES-256) for security. Amazon Polly is well suited to video narration and automated voice. To use it, you must create an AWS account.
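
The choice between standard and neural voices is made per request. The sketch below builds the parameters for Polly's SynthesizeSpeech operation. The field names follow that operation, but the helper name `buildSpeechParams`, the default voice "Joanna", and the MP3 output format are assumptions for illustration.

```typescript
// Field names follow Polly's SynthesizeSpeech operation; the values are assumptions.
interface SynthesizeSpeechParams {
  Engine: "standard" | "neural";
  OutputFormat: "mp3" | "ogg_vorbis" | "pcm";
  Text: string;
  VoiceId: string;
}

// Hypothetical helper: build request parameters for one response text.
function buildSpeechParams(
  text: string,
  voiceId: string = "Joanna",               // assumed default voice
  engine: "standard" | "neural" = "neural", // assumed default engine
): SynthesizeSpeechParams {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    throw new Error("Text to synthesize must not be empty");
  }
  return {
    Engine: engine,
    OutputFormat: "mp3", // assumed output format for browser playback
    Text: trimmed,
    VoiceId: voiceId,
  };
}
```

With the AWS SDK for JavaScript v3, an object like this would typically be passed to `SynthesizeSpeechCommand` from `@aws-sdk/client-polly` and sent with a `PollyClient`.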

Web Audio API:

The Web Audio API allows developers to generate, process, and analyze audio in real time in the browser. It can perform audio synthesis, frequency analysis, and more.

Live2D Cubism SDK for Web:

The Live2D Cubism SDK is a development kit for displaying and manipulating Live2D models in applications. It supports a variety of platforms including Unity, WebGL, Web, and consumer platforms such as PS5 and Nintendo Switch. It enables real-time rendering and character expression. This project uses the Web version.

Firebase Auth:

Demos

LINK

TXWES AI

Contact

For inquiries, please reach out to me at rtsunamura01@gmail.com.