MakeItTalk
Speaker-aware talking head animation for high-fidelity facial synchronization from a single image.
MakeItTalk is a speaker-aware talking-head animation framework originally introduced at SIGGRAPH Asia 2020. Unlike simple warping methods, MakeItTalk drives animation through sparse 3D facial landmarks, disentangling speech content from speaker identity so the same sentence produces different, identity-consistent motion for different speakers. By predicting landmark displacements directly from audio, it generates realistic animations from a single portrait image. In the 2026 landscape, MakeItTalk serves as a lightweight baseline for developers requiring real-time, landmark-based animation on edge devices where heavier generative models (such as EMO or LivePortrait) can be computationally prohibitive. The architecture captures not only lip movement but also non-verbal cues such as head tilts, eye blinks, and brow movements, synchronized with the audio's prosody. It is particularly valued in the research community for its ability to animate diverse subjects, including oil paintings, sketches, and 2D cartoon characters, making it a versatile tool for stylized digital content creation and legacy photo revitalization.
Uses a deep neural network to predict 3D facial landmarks from audio features, disentangling speaker identity from the speech content.
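The disentangled pipeline can be illustrated with a minimal PyTorch sketch: one branch encodes what is being said, a separate speaker embedding encodes the speaker's identity, and a decoder turns both into per-frame landmark displacements that are added to the landmarks of the still portrait. All module names, layer sizes, and tensor shapes below are illustrative assumptions, not the project's actual modules or API.

```python
# Minimal sketch of the content/speaker split driving 3D landmark prediction.
# Module names, layer sizes, and shapes are illustrative, not MakeItTalk's API.
import torch
import torch.nn as nn


class ContentEncoder(nn.Module):
    """Encodes 'what is being said' from a mel-spectrogram sequence."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)

    def forward(self, mel):                      # mel: (B, T, n_mels)
        out, _ = self.rnn(mel)                   # (B, T, 2 * hidden)
        return out


class LandmarkDecoder(nn.Module):
    """Maps content features plus a speaker embedding to landmark displacements."""
    def __init__(self, content_dim=512, speaker_dim=128, n_landmarks=68):
        super().__init__()
        self.n_landmarks = n_landmarks
        self.mlp = nn.Sequential(
            nn.Linear(content_dim + speaker_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_landmarks * 3),
        )

    def forward(self, content, speaker):         # content: (B, T, C), speaker: (B, S)
        spk = speaker.unsqueeze(1).expand(-1, content.size(1), -1)
        delta = self.mlp(torch.cat([content, spk], dim=-1))
        return delta.view(content.size(0), content.size(1), self.n_landmarks, 3)


# Animated landmarks = static landmarks of the portrait + audio-driven displacements.
mel = torch.randn(1, 100, 80)                    # ~4 s of mel frames (illustrative)
speaker_emb = torch.randn(1, 128)                # identity embedding from a speaker encoder
static_lmk = torch.randn(1, 68, 3)               # landmarks detected on the single input image
content = ContentEncoder()(mel)
animated = static_lmk.unsqueeze(1) + LandmarkDecoder()(content, speaker_emb)
print(animated.shape)                            # torch.Size([1, 100, 68, 3])
```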
Represents facial geometry with sparse 3D landmarks, allowing for realistic head rotation and perspective changes.
Animates non-photorealistic faces such as sketches and paintings, since motion is transferred through the landmark representation rather than raw pixels.
Predicts rhythmic head tilts and rotations based on the prosody and energy of the audio input.
Separates the audio signal into content (phonemes) and speaker (pitch/tone) components.
Runs significantly faster than pixel-level diffusion generation thanks to its lightweight landmark-based pipeline.
Implements a smoothing filter over the predicted sequence of landmarks to prevent jitter.
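The exact filter is not specified here; one common choice for removing frame-to-frame jitter in landmark trajectories is a Savitzky-Golay filter applied along the time axis, as in this rough sketch (the window length and polynomial order are illustrative defaults, not MakeItTalk's own settings).

```python
# Hypothetical post-processing: smooth each landmark coordinate over time to
# suppress jitter. Window/order values are illustrative, not MakeItTalk's own.
import numpy as np
from scipy.signal import savgol_filter


def smooth_landmarks(landmarks, window=7, polyorder=2):
    """landmarks: (T, 68, 3) array of per-frame 3D landmark positions."""
    n_frames = landmarks.shape[0]
    if n_frames < window:                        # too short to filter; return unchanged
        return landmarks
    flat = landmarks.reshape(n_frames, -1)       # (T, 204): one column per coordinate
    smoothed = savgol_filter(flat, window_length=window, polyorder=polyorder, axis=0)
    return smoothed.reshape(landmarks.shape)


raw = np.random.rand(100, 68, 3)                 # stand-in for a predicted sequence
print(smooth_landmarks(raw).shape)               # (100, 68, 3)
```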
Animating statues or old paintings where no video data exists.
Syncing an actor's lips to a new language track without reshooting.
Low-bandwidth video communication in areas with extremely poor connectivity, where only the landmark stream needs to be transmitted and the portrait is animated on the receiving device.
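As a back-of-the-envelope illustration of why sending landmarks instead of pixels suits poor connections, assume a 68-point landmark set, float32 coordinates, and 25 fps (illustrative values, not figures from the project):

```python
# Rough bandwidth estimate for streaming landmark data instead of video frames.
# All numbers are illustrative assumptions.
n_landmarks = 68          # common facial landmark count
coords = 3                # x, y, z per landmark
bytes_per_value = 4       # float32
fps = 25

bytes_per_second = n_landmarks * coords * bytes_per_value * fps
print(f"{bytes_per_second / 1024:.1f} KiB/s")    # ~19.9 KiB/s before any compression
```

Even uncompressed, that is roughly 20 KiB/s, well below typical compressed video bitrates, with the receiving device reconstructing the animated portrait locally.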