Realistic lip-syncing is crucial for creating immersive and expressive AI characters. Lip-syncing matches a virtual character's mouth movements to the spoken dialogue so that the visual and auditory aspects of speech stay synchronized. This synchronization makes the character more believable and engaging, because viewers perceive speech through both visual and audio cues.
Accurate lip-syncing matters because it creates the illusion that the character is actually speaking the words. When mouth movements match the dialogue, the performance feels more natural and convincing, allowing your audience to focus on the content and emotion of the speech.
This article explores the challenges faced in achieving realistic lip-sync, such as quality, latency, and emotional expression, and how Convai is working to overcome these hurdles. You’ll also learn about integrating Convai’s lip sync component into characters with popular game engines like Unreal Engine and Unity.
Realistic lip-sync is essential for capturing the nuances of speech, such as the shape and timing of mouth movements. This process typically uses “visemes,” which are visual representations of phonemes (the distinct units of sound in speech; e.g., "cat" is three phonemes: /k/, /æ/, and /t/).
Visemes are the specific mouth shapes and movements associated with each phoneme. While there are many phonemes, visemes group together phonemes that look similar on the mouth, reducing the complexity required for animation. For instance, /b/, /p/, and /m/ all produce a closed-lip shape, so a single viseme can represent all three.
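To make the grouping concrete, here is a minimal sketch of a phoneme-to-viseme lookup in TypeScript. The viseme names loosely follow the OVR Lip Sync set described later in this article, but the exact mapping shown is illustrative rather than the one Convai uses.

```typescript
// Illustrative phoneme-to-viseme lookup: many phonemes collapse onto far
// fewer mouth shapes. Viseme names loosely follow the OVR Lip Sync set.
const phonemeToViseme: Record<string, string> = {
  b: "PP", p: "PP", m: "PP",   // bilabials share one closed-lip shape
  f: "FF", v: "FF",            // labiodentals share a lip-on-teeth shape
  k: "kk", g: "kk",            // velar stops
  t: "DD", d: "DD",            // alveolar stops
  "æ": "aa", "ɑ": "aa",        // open vowels share a wide jaw shape
};

// Map a phoneme sequence to mouth shapes, falling back to silence/neutral.
function visemesFor(phonemes: string[]): string[] {
  return phonemes.map((p) => phonemeToViseme[p] ?? "sil");
}

// "cat" -> /k/, /æ/, /t/ -> ["kk", "aa", "DD"]
console.log(visemesFor(["k", "æ", "t"]));
```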
Lip-syncing techniques range from simple rule-based systems to complex neural networks that produce highly detailed and expressive animations. Advanced techniques use AI and machine learning to analyze audio input and generate corresponding mouth shapes and facial expressions in real time.
When done well, lip-sync brings virtual characters to life and allows them to deliver dialogue naturally and convincingly. It is a key component of creating expressive characters that can engage in realistic conversations and convey emotions through their speech and facial animations.
Lip-sync is especially important for applications like video games, animated films, and virtual assistants, where the character's speech is a primary means of communication and storytelling.
Watch Convai and NVIDIA’s demo from CES 2024, which showcases accurate lip sync, action generation, and scene perception (read more about it on this blog).
At Convai, we have developed a multi-tiered approach to lip-syncing, addressing various needs and use cases for virtual characters.
The underlying goal is to create realistic, natural lip movements that match the spoken audio. We have seen this improve the overall immersion and engagement of conversational AI experiences for our users.
See Also: Convai Gallery (A collection of games, apps, demos, and samples built on the Convai Platform).
Our strategy uses existing industry-standard solutions and cutting-edge technology to provide you with flexible options for lip-syncing. We primarily focus on server-side processing to reduce the computational burden on your devices for smoother performance and broader compatibility.
There are three levels of lip-syncing: hardcoded blend shapes, OVR Lip Sync, and NVIDIA Audio2Face.
At Convai, we offer our users both the OVR Lip Sync component and the Audio2Face option. The sections that follow cover each level and how Convai allows you to use them.
Hardcoded blend shapes use predefined facial expressions or shapes (blend shapes) mapped to phonemes. ARKit, Apple's augmented reality framework, provides a set of blend shapes specifically designed for facial animation.
In this technique, each phoneme (or viseme) in the dialogue simply triggers its predefined blend-shape pose at the matching point in the audio.
This method is straightforward, but it can be limited in flexibility and naturalness because it relies entirely on those predefined mappings.
Convai supports ARKit blendshapes for Ready Player Me characters through heuristic mapping.
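As a rough sketch of how this rule-based approach plays out, the snippet below walks a timed viseme track and switches the face to a predefined ARKit pose. The ARKit blend-shape names are standard, but the timeline format, the heuristic mapping, and the weight values are assumptions made for illustration.

```typescript
// Rule-based lip sync: each viseme in a timed transcript switches the face to
// a predefined pose of ARKit blend-shape weights. Simple, but only as natural
// as the hand-authored mapping allows.
interface TimedViseme {
  timeSec: number; // when this mouth shape becomes active
  viseme: string;  // e.g. "PP", "aa", "sil"
}

// Heuristic viseme -> ARKit blend-shape pose (weights are illustrative).
const visemeToArkitPose: Record<string, Record<string, number>> = {
  sil: {},                                            // neutral face
  PP: { mouthClose: 1.0 },                            // /b/, /p/, /m/
  FF: { mouthPressLeft: 0.6, mouthPressRight: 0.6 },  // /f/, /v/
  aa: { jawOpen: 0.7 },                               // open vowels
  ou: { jawOpen: 0.3, mouthPucker: 0.8 },             // rounded vowels
};

// Return the blend-shape weights that should be active at time tSec.
function poseAt(timeline: TimedViseme[], tSec: number): Record<string, number> {
  let current = "sil";
  for (const entry of timeline) {
    if (entry.timeSec > tSec) break;
    current = entry.viseme;
  }
  return visemeToArkitPose[current] ?? {};
}

// Example: a short utterance where the lips close, open, then round.
const timeline: TimedViseme[] = [
  { timeSec: 0.0, viseme: "PP" },
  { timeSec: 0.12, viseme: "aa" },
  { timeSec: 0.28, viseme: "ou" },
];
console.log(poseAt(timeline, 0.2)); // -> { jawOpen: 0.7 }
```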
Convai's primary lip-sync solution is OVR Lip Sync, which Oculus (Meta) developed to enable realistic mouth animations for virtual characters based on audio input.
This technique takes raw audio bytes as input and outputs visemes (visual phonemes) at 100 frames per second. There are 15 visemes that get parsed on the client side (Unity, Unreal Engine, Three.js) and applied to the character's face.
With OVR Lip Sync, the audio processing happens on Convai's servers. The client sends raw audio data to the lip-sync server, which analyzes the audio and extracts the viseme information.
The server outputs the viseme data at 100 frames per second. (We apply smoothing here to avoid jittery animation.) This approach offloads the processing from your device and ensures consistent lip sync results across different platforms.
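On the client, the remaining work is light: receive each 15-weight viseme frame and copy it onto the character's face. The Three.js sketch below shows one way to do that, interpolating between the two most recent 100 fps frames so the animation stays smooth at typical display refresh rates. The morph-target naming, transport, and interpolation details here are illustrative assumptions, not Convai's actual SDK code.

```typescript
import * as THREE from "three";

// The 15 OVR Lip Sync viseme names (the order is assumed to match the frames).
const OVR_VISEMES = [
  "sil", "PP", "FF", "TH", "DD", "kk", "CH", "SS",
  "nn", "RR", "aa", "E", "ih", "oh", "ou",
];

// One server frame: 15 viseme weights in [0, 1], delivered at 100 fps.
type VisemeFrame = number[];

// Keep the two most recent frames so we can interpolate at render rate
// (displays usually refresh at 60 Hz, not 100 Hz).
let previous: VisemeFrame = new Array(15).fill(0);
let latest: VisemeFrame = new Array(15).fill(0);
let latestArrivedAt = 0;
const FRAME_INTERVAL_MS = 10; // 100 frames per second

// Call this whenever a viseme frame arrives from the lip-sync server.
function onVisemeFrame(frame: VisemeFrame): void {
  previous = latest;
  latest = frame;
  latestArrivedAt = performance.now();
}

// Call this once per render tick: blend the last two frames and write the
// result onto morph targets named after the visemes (naming is an assumption).
function applyLipSync(mesh: THREE.Mesh): void {
  if (!mesh.morphTargetDictionary || !mesh.morphTargetInfluences) return;
  const t = Math.min((performance.now() - latestArrivedAt) / FRAME_INTERVAL_MS, 1);
  OVR_VISEMES.forEach((viseme, i) => {
    const index = mesh.morphTargetDictionary![`viseme_${viseme}`];
    if (index === undefined) return; // this mesh has no target for the viseme
    mesh.morphTargetInfluences![index] = THREE.MathUtils.lerp(previous[i], latest[i], t);
  });
}
```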
While easy to implement, OVR Lip Sync has its limits: with only 15 mouth visemes, it captures lip movement well but not the broader facial expression or emotion that more advanced approaches can convey.
Convai offers a more advanced lip-sync solution for enterprise customers: Audio2Face, powered by NVIDIA. It uses pre-trained models and runs in NVIDIA containers. Unlike OVR Lip Sync, which uses 15 visemes, Audio2Face outputs weights for 52 ARKit blend shapes. These blend shapes cover a much wider range of facial expression, including mouth, emotional, and head motion.
The AI models behind Audio2Face are trained on large datasets of audio and corresponding facial motion. They analyze the audio signal and map it to the appropriate blend shape weights, generating realistic lip-sync and facial expressions. This AI-driven approach lets Audio2Face produce high-quality, expressive results that go well beyond what OVR Lip Sync can deliver.
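Consuming Audio2Face output on the client looks much like the viseme case, except that each frame carries weights keyed by the 52 standard ARKit blend-shape names. The sketch below is a hedged illustration: the frame format and the characterMesh variable are assumptions, and only the blend-shape names come from ARKit.

```typescript
import * as THREE from "three";

// One Audio2Face-style frame: weights in [0, 1] keyed by ARKit blend-shape
// name (52 in total, e.g. jawOpen, mouthSmileLeft, browInnerUp, ...).
type ArkitFrame = Record<string, number>;

declare const characterMesh: THREE.Mesh; // your character's face mesh (assumed)

// Write a frame of ARKit weights onto a mesh whose morph targets use the
// ARKit names. Unlike the 15-viseme case, this can also drive brows, eyes,
// and other non-mouth shapes, which is where the extra expressiveness comes from.
function applyArkitFrame(mesh: THREE.Mesh, frame: ArkitFrame): void {
  if (!mesh.morphTargetDictionary || !mesh.morphTargetInfluences) return;
  for (const [shapeName, weight] of Object.entries(frame)) {
    const index = mesh.morphTargetDictionary[shapeName];
    if (index === undefined) continue;          // skip shapes this mesh lacks
    mesh.morphTargetInfluences[index] = weight; // weights are already 0..1
  }
}

// Example frame: a slightly open, smiling mouth with raised inner brows.
applyArkitFrame(characterMesh, {
  jawOpen: 0.35,
  mouthSmileLeft: 0.4,
  mouthSmileRight: 0.4,
  browInnerUp: 0.2,
});
```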
Some key advantages of Audio2Face include its much wider expressive range (52 blend shapes versus 15 visemes), its ability to drive emotional and head motion alongside the mouth, and the overall quality and realism of the resulting animation.
While Audio2Face offers superior facial animation quality, it is slower than OVR Lip Sync. The trade-off comes from the more complex processing involved in analyzing the audio and generating blend shape weights.
In short: OVR Lip Sync is the faster, lighter option with 15 mouth-only visemes, while Audio2Face is slower but far more expressive, driving 52 ARKit blend shapes that include emotion and head motion.
We are actively working on improving Audio2Face based on your feedback and requests, with quality, speed, and ease of use as the key areas of focus. The goal is to keep raising the bar so that you can create convincing facial animations for virtual characters across a wide range of applications.
The Convai plugin helps you integrate two main lip-sync technologies: OVR Lip Sync and Reallusion CC4 Extended (Reallusion CC4+) Blendshapes. It supports three main character types: MetaHuman, Reallusion CC4+, and Ready Player Me.
Let’s see how to add lip sync to a MetaHuman character in Unreal Engine (UE) using the Convai plugin from the UE Marketplace.
NOTE: Before you begin, ensure your character model has a compatible facial rig using blend shapes or bones to control the mouth, jaw, and other facial features. Unreal Engine has a tutorial on YouTube on creating facial rigs for your specific MetaHuman character. Also, check out the documentation to learn more about it.
1. Modify the parent class of your MetaHuman to ConvaiBaseCharacter, as indicated in the documentation.
2. In the character Blueprint, add a new component, search for Convai Face Sync, and select it.
3. Lip sync is now added to your MetaHuman character. Compile and save, then try it out.
If you use Unity, you can install the Convai Unity SDK from the Asset Store instead. There is also a tutorial on creating lifelike NPCs in Unity with Reallusion CC4 characters:
To add the lip-sync component to your Ready Player Me character, follow the instructions in the documentation.
When integrating lip-sync functionality into your Unreal Engine project using Convai, it's essential to consider the performance implications. Convai has designed its lip-sync component to minimize the impact on client-side performance while optimizing server-side latency.
This section will discuss the minimal client-side performance impact and the server-side latency optimization techniques to ensure smooth and efficient lip-syncing.
One significant advantage of Convai's lip-sync solution is its minimal performance impact on the client side. Because the audio analysis runs on Convai's servers, the client only receives viseme data and applies it to the character's face, so the lip-sync component adds no significant overhead to your application.
In benchmarks conducted by our engineering team, the client-side portion of the pipeline added negligible processing time; most of the latency is introduced on the server, where the audio data is analyzed and the viseme data is generated. As a result, even low-end client devices (e.g., Android phones) get high-quality lip-sync without sacrificing performance.
You can be confident that enabling lip-sync will not adversely affect the performance of your projects.
Convai is continuously working on optimizing the server-side latency to provide the best possible lip-sync experience. The goal is to minimize the delay between the audio input and the corresponding lip-sync output. This ensures that the character's mouth movements are synchronized with the speech in real-time.
To achieve low latency, Convai keeps the heavy audio analysis on the server and streams the resulting viseme data to the client at 100 frames per second.
It's worth noting that while Convai's lip-sync solution is designed to minimize latency, there may still be some delay depending on factors such as network conditions and the complexity of the audio input.
However, our team is actively working on further optimizations to reduce latency and improve the overall performance of the lip-sync functionality.
We are actively working on several exciting developments to enhance our lip-sync and facial animation plugins. As demand grows for realistic and engaging virtual characters, we are building features that make characters more expressive and emotionally responsive, so they can better connect with users in various applications.
Here’s what we are doing:
One key area of focus is incorporating emotion detection based on the user's speech: we are working on detecting emotions in real time and using them to improve the realism of our lip-sync component.
Building on emotion detection, we also want characters to generate emotional responses of their own, improving their expressiveness and realism.
Improving facial expressions is just as important for achieving greater realism, and we are pursuing several initiatives at Convai to push this aspect further.
Our goal is to improve your experience by creating virtual characters that can truly connect with users on an emotional level.
Lip-syncing is crucial to creating realistic and engaging virtual characters. Convai offers multiple lip-sync solutions, including a basic form using hard-coded blend shapes, OVR Lip Sync from Oculus/Meta, and NVIDIA's more advanced Audio2Face tool.
While each approach has strengths and weaknesses, they all seek to synchronize the character's mouth movements with the spoken dialogue, improving the interaction's overall believability.
Our team continuously explores ways to improve the quality, expressiveness, and efficiency of lip-sync animations as users demand ever more immersive and interactive experiences. This includes incorporating emotion detection and generation, new AI techniques, and real-time performance optimization.
With ongoing advancements in this field, you can expect virtual characters to become increasingly lifelike, responsive, and engaging, opening up new possibilities for applications in entertainment, education, and beyond.
Additionally, see our article on Convai’s Safety guardrails for AI Characters to learn more about how Convai ensures safe conversations with intelligent virtual characters.