Creating realistic and expressive facial animations for AI characters is a complex challenge in developing virtual worlds and interactive experiences. Developers and animators often struggle to achieve natural and believable facial movements that accurately synchronize with speech and emotions. This can lead to a disconnect between the character's appearance and behavior, diminishing the overall immersion and engagement of users.
Fortunately, recent advancements in AI (LLMs, diffusion models, and agents) have given rise to new solutions that promise to revolutionize lip sync and facial animation. Convai's plugins for platforms such as Unreal Engine, Unity, and Roblox include components for applying context-aware facial animations and lip sync to talking characters.
In this article, we will look at why facial expressions matter for AI characters, how ARKit blend shapes and predefined emotional poses work, and how Convai's lip sync and facial animation system fits into your projects.
By the end of this article, you will know how to create highly realistic characters with precise lip sync and expressive facial animations.
See Also: Lip-Syncing Virtual AI Characters with Convai in Unreal Engine.
Facial expressions are crucial for creating realistic AI characters, as they convey emotions and enhance the overall immersive experience for users. Realistic facial animations can improve user engagement by making interactions with virtual characters more relatable and lifelike.
Expressive faces help communicate non-verbal cues essential for natural and effective communication. Without accurate facial expressions, AI characters may appear robotic and fail to evoke the intended emotional responses from users.
ARKit blend shapes provide a powerful tool for creating dynamic facial expressions. They are based on the Facial Action Coding System (FACS), which breaks facial movements down into individual action units covering regions such as the eyebrows, eyes, mouth, and cheeks.
Combining these action units produces a wide range of facial expressions that reflect different emotions. ARKit offers 52 distinct blend shapes, such as mouthSmile_L, eyeBlink_R, and browInnerUp, which can be combined to create nuanced expressions.
ARKit blend shapes are particularly effective because they provide a standardized way to achieve high-quality facial animations across different platforms and devices. These blend shapes are integral to generating realistic facial movements synchronized with audio for lip-syncing.
You use predefined poses—sets of blend shape values representing specific emotions—to simplify the creation of emotional expressions. For instance, a “happy” pose might involve raised eyebrows, a wide smile, and lifted cheeks, while a “sad” pose might include furrowed brows, a downturned mouth, and drooping eyelids.
Common predefined poses cover emotions such as happy, sad, and surprised.
Using these predefined poses, you can quickly apply realistic facial expressions to AI characters to improve their emotional depth and make them more engaging.
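To make this concrete, a predefined pose can be represented as a small set of ARKit-style blend shape weights. The shape names below follow the naming convention mentioned earlier, and the weights are illustrative values rather than Convai's actual presets:

```python
# Illustrative only: predefined poses expressed as ARKit-style blend shape
# weights in the 0.0-1.0 range. The weights are example values, not Convai's
# actual presets.

HAPPY_POSE = {
    "browInnerUp": 0.3,    # raised eyebrows
    "cheekSquint_L": 0.5,  # lifted cheeks
    "cheekSquint_R": 0.5,
    "mouthSmile_L": 0.8,   # wide smile
    "mouthSmile_R": 0.8,
}

SAD_POSE = {
    "browDown_L": 0.6,     # furrowed brows
    "browDown_R": 0.6,
    "mouthFrown_L": 0.7,   # downturned mouth
    "mouthFrown_R": 0.7,
    "eyeBlink_L": 0.3,     # drooping eyelids
    "eyeBlink_R": 0.3,
}

def apply_pose(character, pose):
    """Push each blend shape weight onto the character's rig.

    `character.set_blend_shape` is a stand-in for whatever your engine exposes
    (morph target weights in Unreal Engine, SkinnedMeshRenderer blend shapes
    in Unity).
    """
    for shape_name, weight in pose.items():
        character.set_blend_shape(shape_name, weight)
```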
In real-world interactions, people often experience and express mixed emotions simultaneously. To capture this complexity, your AI character must seamlessly blend multiple emotions. This involves combining different predefined poses based on the intensity of each emotion.
For example, a character might be primarily happy but also slightly surprised. The system would blend the "happy" and "surprised" poses, resulting in a nuanced expression that reflects both emotions.
The process of blending multiple emotions involves assigning an intensity to each emotion, weighting the corresponding predefined poses by those intensities, and combining the resulting blend shape values into a single expression, as sketched below.
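Here is a minimal sketch of that blending step, assuming each pose is a dictionary of blend shape weights like the ones above and each emotion carries a normalized intensity (the poses and values are illustrative):

```python
# Illustrative sketch of blending predefined poses by emotion intensity.
# Pose values are example ARKit-style weights, not Convai's internals.

HAPPY_POSE = {"mouthSmile_L": 0.8, "mouthSmile_R": 0.8, "browInnerUp": 0.3}
SURPRISED_POSE = {"browInnerUp": 0.9, "eyeWide_L": 0.7, "eyeWide_R": 0.7, "jawOpen": 0.4}

def blend_poses(weighted_poses):
    """Combine poses as a weighted sum of blend shape values, clamped to [0, 1].

    `weighted_poses` is a list of (pose, intensity) pairs.
    """
    blended = {}
    for pose, intensity in weighted_poses:
        for shape, value in pose.items():
            blended[shape] = blended.get(shape, 0.0) + value * intensity
    # Clamp so shapes shared by several poses (e.g. browInnerUp) stay in range.
    return {shape: min(1.0, value) for shape, value in blended.items()}

# A character that is mostly happy but slightly surprised.
expression = blend_poses([(HAPPY_POSE, 0.8), (SURPRISED_POSE, 0.2)])
print(expression)
```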
When you blend multiple emotions, AI characters can exhibit a wide range of expressive behaviors that make interactions with them more realistic and engaging. This is essential for gaming, virtual reality, and other interactive media applications where emotional authenticity improves the user experience.
Convai's lip sync and facial animation system is designed to provide seamless integration and an efficient workflow across various platforms. By generating animation data on the server and synchronizing it on the client, Convai keeps performance high while leaving room for customization when creating realistic character animations.
Convai handles the heavy lifting of generating lip sync and facial animation data on the server side. This approach reduces client-side performance impact by offloading the computation to the server.
Client devices, especially those with limited processing power like mobile phones or web browsers, can focus on rendering and other tasks. This ensures a smooth user experience without overburdening the client's hardware.
Convai provides integration support for popular game engines and platforms (Unity, Unreal Engine, Roblox, WebGL, Discord), making it straightforward to incorporate lip sync and facial animation into your projects.
On the client side, platforms like Unreal Engine, Unity, and WebGL synchronize the audio playback with the lip sync frames received from the server. Convai's plugin handles this synchronization process so that the character's lips move in perfect harmony with the audio.
Each animation frame is carefully timed to correspond with the audio, making the characters' speech appear natural and fluid. Synchronization is crucial for creating a believable and immersive character performance.
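The plugins handle this timing for you, but the core idea can be sketched as follows: the client looks up the lip sync frame that corresponds to the current audio playback position. The frame rate and the accessor names here are assumptions made for the illustration:

```python
# Conceptual sketch of client-side synchronization (Convai's plugins do this
# for you). Assumes lip sync frames arrive at a fixed rate alongside the audio;
# the frame rate and accessor names are illustrative, not Convai constants.

FRAMES_PER_SECOND = 100  # assumed rate of the lip sync frame stream

def frame_for_time(lipsync_frames, audio_time_seconds):
    """Return the blend shape frame matching the current audio playback time."""
    index = int(audio_time_seconds * FRAMES_PER_SECOND)
    index = min(index, len(lipsync_frames) - 1)  # hold the final frame at the end
    return lipsync_frames[index]

def tick(character, audio_player, lipsync_frames):
    """Called every render tick: keep the face in step with the audio."""
    frame = frame_for_time(lipsync_frames, audio_player.current_time)
    for shape_name, weight in frame.items():
        character.set_blend_shape(shape_name, weight)
```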
Convai offers flexibility and customization options to fine-tune the lip sync and facial animation system to your needs as a developer.
When sending the initial request to the Convai server, you can control which features to enable or disable. For example, you can receive only the audio without lip sync data or request both audio and lip sync frames for comprehensive character animation.
This granular control allows you to optimize performance and tailor the system to the project's requirements.
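For illustration, such a request might look like the sketch below. The field names are hypothetical placeholders, not Convai's actual API parameters; consult the Convai documentation for the real ones:

```python
# Hypothetical request payload: the field names below are placeholders invented
# for this example, not Convai's actual API parameters. The point is simply
# that audio and lip sync data can be requested independently.

audio_only_request = {
    "character_id": "my-character-id",  # placeholder
    "user_text": "Hello there!",
    "audio": True,
    "lipsync": False,  # skip lip sync frames when only audio is needed
}

full_animation_request = {
    "character_id": "my-character-id",
    "user_text": "Hello there!",
    "audio": True,
    "lipsync": True,   # also return per-frame facial animation data
}
```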
For custom characters with unique blend shapes, developers can map the visemes (visual representations of speech sounds) to their character's specific blend shapes. Convai provides a tutorial and step-by-step guidance on how to perform this mapping process.
You can ensure the lip sync looks natural and matches the character's unique facial structure by customizing the viseme-to-blend shape mapping.
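Conceptually, the result of that mapping is a lookup table from each viseme to your character's own blend shape names. The viseme labels below follow a commonly used viseme set, and the names on the right are placeholders for your character's morph targets:

```python
# Illustrative viseme-to-blend-shape mapping for a custom character. The viseme
# labels follow a commonly used viseme set; the names on the right are
# placeholders for your own character's morph targets.

VISEME_TO_BLENDSHAPE = {
    "sil": None,                       # silence: no mouth shape
    "PP": "MyChar_MouthPress",         # p, b, m
    "FF": "MyChar_LowerLipBite",       # f, v
    "TH": "MyChar_TongueOut",          # th
    "DD": "MyChar_JawOpenSmall",       # d, t
    "kk": "MyChar_JawOpenBack",        # k, g
    "CH": "MyChar_MouthPucker",        # ch, j, sh
    "SS": "MyChar_MouthStretch",       # s, z
    "aa": "MyChar_JawOpenWide",        # open vowel
    "E":  "MyChar_MouthWide",
    "oh": "MyChar_MouthRound",
    "ou": "MyChar_MouthPuckerTight",
}

def apply_viseme_frame(character, viseme_weights):
    """Apply one frame of viseme weights through the character-specific mapping."""
    for viseme, weight in viseme_weights.items():
        blend_shape = VISEME_TO_BLENDSHAPE.get(viseme)
        if blend_shape is not None:
            character.set_blend_shape(blend_shape, weight)
```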
In Unreal Engine, you can fine-tune lip sync and facial animations to achieve the desired results for your AI characters. The Convai plugin allows developers to adjust and customize how characters' mouths and facial features move during speech.
This customization is crucial for ensuring that animations align with the project's specific visual style and requirements, whether it's a realistic or stylized character.
The video below is a tutorial on integrating custom characters with Convai in Unreal Engine and adding components like facial animations and lip sync.
Here are the steps from that video on how to customize lip-sync and facial animations in Unreal Engine:
Convai currently uses viseme-based animations for lip sync rather than ARKit blend shapes; support for ARKit blend shapes is planned for future releases. To set up lip sync, map each viseme to Echo's corresponding blend shapes or bones.
Here's how to do it:
Note that the main challenge in this process is the mapping from visemes to your character's specific blend shapes or bones. This step can vary significantly depending on your character's design and facial structure.
Take your time to carefully identify the relevant blend shapes or bones for each viseme and adjust as needed to achieve the desired lip sync animation.
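When a viseme has no single matching blend shape on your character, one practical approach is to drive several blend shapes at scaled weights. The shape names and scale factors below are placeholders to be tuned per character; this is a general technique, not a Convai-specific API:

```python
# Illustrative only: when no single blend shape matches a viseme, drive several
# shapes at scaled weights. The shape names and scale factors are placeholders
# to be tuned for your own character.

# Each viseme maps to a list of (blend_shape, scale) pairs.
VISEME_TO_SHAPES = {
    "PP": [("MyChar_MouthClose", 1.0), ("MyChar_LipsPress", 0.6)],
    "aa": [("MyChar_JawOpen", 0.9), ("MyChar_MouthOpen", 0.5)],
    "ou": [("MyChar_MouthPucker", 0.8), ("MyChar_JawOpen", 0.2)],
}

def apply_viseme(character, viseme, weight):
    """Spread one viseme's weight across every blend shape it drives."""
    for blend_shape, scale in VISEME_TO_SHAPES.get(viseme, []):
        character.set_blend_shape(blend_shape, weight * scale)
```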
Convai's lip sync and facial animation components have undergone significant advancements over the past year. These changes have increased the realism and accuracy of the animations, resulting in a more immersive and engaging experience for users.
One of the key improvements has been the optimization of the viseme-based lip sync system. Previously, the lip movements were often inaccurate and noisy, with fast fluctuations that appeared unnatural.
However, by implementing server-side post-processing techniques, Convai has significantly improved the accuracy and smoothness of the lip sync, resulting in more realistic animations.
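Convai has not published the exact post-processing it applies, but the effect of suppressing fast fluctuations can be illustrated with a simple exponential moving average over successive viseme frames:

```python
# Illustrative smoothing of noisy viseme weights with an exponential moving
# average. This shows the general idea of suppressing fast fluctuations; it is
# not Convai's actual server-side post-processing.

def smooth_frames(frames, alpha=0.3):
    """Smooth a sequence of viseme-weight dicts.

    Lower `alpha` means heavier smoothing; 1.0 leaves the frames unchanged.
    """
    smoothed = []
    previous = {}
    for frame in frames:
        current = {}
        for viseme, weight in frame.items():
            prev_weight = previous.get(viseme, weight)
            current[viseme] = alpha * weight + (1 - alpha) * prev_weight
        smoothed.append(current)
        previous = current
    return smoothed

# A jittery "aa" channel becomes a smoother curve.
noisy = [{"aa": 0.9}, {"aa": 0.1}, {"aa": 0.8}, {"aa": 0.2}]
print(smooth_frames(noisy))
```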
We have also started rolling out NVIDIA's Audio2Face on the Convai server. Unlike the viseme-based system, Audio2Face outputs ARKit blend shapes directly, providing a more advanced and expressive solution.
It generates blend shapes not only for the lips but also for the eyebrows, eyes, and the entire face, enabling a wider range of emotions and expressions. Although Audio2Face is computationally expensive and in beta, it is gradually being rolled out for high-tier plans and partners.
Looking ahead, Convai is excited about the ongoing work on AI-generated facial expressions. We are training models to output emotional voice, emotional lip sync, and corresponding facial expressions.
Our users want to move beyond preset poses and achieve even greater realism. This transition will ensure that your character's voice, facial expressions, and lip sync are seamlessly synchronized to improve the overall believability of the animations.
Advanced lip sync and facial animation technology open up many potential applications. In the gaming industry, these improvements can lead to more immersive and engaging character interactions, enhancing the overall player experience.
Brand agents can benefit from more natural and expressive communication, making the interaction more human-like. Realistic facial animations can also help the education sector create compelling e-learning content.
Throughout this article, we reviewed facial animation, lip sync, and the complexities of creating realistic and expressive animations. We also covered how you can use the Convai plugin in a platform like Unreal Engine to integrate facial animations and apply lip sync to your character.
The integration process involves server-side computation for performance optimization and client-side synchronization for seamless playback. You also have customization options for tailoring animations to specific characters and use cases.
Convai is actively working on integrating facial capture technology and AI-generated facial expressions for even greater realism and expressiveness. Convai empowers developers to create engaging and believable character animations with AI across various platforms.
Check out our other article on lip-syncing AI characters, and sign up for our Discord if you have questions about using Convai.