AnimaSync extracts emotion from speech and generates lip sync, facial expressions, and body motion in real time — no server required.
One engine handles the full animation pipeline — from raw audio to animated avatar.
ONNX neural inference maps speech phonemes to 52 ARKit blendshapes at 30fps. Crisp mouth movements with natural co-articulation.
Voice energy and pitch automatically drive brows, cheeks, eyes, and smile. Emotion follows the speaker naturally.
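Driving expressions from voice energy can be sketched as an RMS measurement mapped to blendshape weights. The function names and scaling constants below are illustrative assumptions, not AnimaSync's actual mapping:

```typescript
// Illustrative sketch: derive a few ARKit-style expression weights from
// voice energy. The scaling constants are assumptions, not the library's.
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// RMS energy of one audio chunk (Float32 PCM in [-1, 1]).
function rmsEnergy(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Louder speech lifts the brows, squints the cheeks, widens the smile.
function expressionFromEnergy(energy: number): Record<string, number> {
  return {
    browInnerUp: clamp01(energy * 2.0),
    cheekSquintLeft: clamp01(energy * 1.2),
    cheekSquintRight: clamp01(energy * 1.2),
    mouthSmileLeft: clamp01(energy * 1.5),
    mouthSmileRight: clamp01(energy * 1.5),
  };
}
```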
Stochastic blink injection at 2.5–4.5s intervals with 15% double-blink probability. No dead-eyed avatars.
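The blink scheduling above can be sketched as a small sampler: intervals drawn uniformly from [2.5, 4.5] seconds, with a 15% chance of a double blink. The interface is an illustrative assumption, not AnimaSync's API; it takes an injectable RNG for testability:

```typescript
// Sketch of stochastic blink scheduling. Names are illustrative placeholders.
interface BlinkEvent {
  at: number;      // seconds from now until the blink starts
  double: boolean; // true → play two blinks back to back
}

function nextBlink(rng: () => number = Math.random): BlinkEvent {
  const at = 2.5 + rng() * (4.5 - 2.5); // uniform in [2.5, 4.5) seconds
  const double = rng() < 0.15;          // 15% double-blink probability
  return { at, double };
}
```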
Embedded VRMA bone animation clips with smooth idle-to-speaking crossfade. Breathing, gestures, and posture shifts.
AudioWorklet captures microphone at 16kHz. Process chunks as they arrive — no need to wait for complete audio.
Rust/WASM + ONNX Runtime Web. No server, no API calls, no data leaves the browser. Works offline after first load.
Install from npm, initialize the engine, and start generating animation frames. Works with any Three.js + VRM setup.
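The integration shape might look like the sketch below: an engine turns each audio chunk into a frame of named blendshape weights, which are then written to the avatar's expression manager. The `LipSyncEngine` interface and the inline stub are hypothetical placeholders; consult the package's own docs for the real import and method names:

```typescript
// A per-frame result: ARKit blendshape weights keyed by name.
type BlendshapeFrame = Record<string, number>;

// Hypothetical engine interface; the real engine runs ONNX inference.
interface LipSyncEngine {
  process(chunk: Float32Array): BlendshapeFrame;
}

// Stand-in engine so the loop below is runnable without the library.
const engine: LipSyncEngine = {
  process: (chunk) => ({
    jawOpen: Math.min(1, chunk.length > 0 ? Math.abs(chunk[0]) : 0),
  }),
};

// Write one frame to a VRM avatar via a setter such as
// three-vrm's expressionManager.setValue (assumed integration point).
function applyFrame(
  frame: BlendshapeFrame,
  setValue: (name: string, weight: number) => void,
): void {
  for (const [name, weight] of Object.entries(frame)) setValue(name, weight);
}
```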
Two engines, one API surface. Pick the engine that fits your project.
Phoneme classification engine — 111-dim output with full expression control. Built-in IdleExpressionGenerator, VoiceActivityDetector, and VRM 18-dim mode.
Student distillation model — direct 52-dim ARKit blendshape prediction. Simpler post-processing, faster integration.
Interactive demos you can try right now — no install needed.
Upload a VRM avatar, then speak into your microphone or upload audio — watch real-time lip sync, facial expressions, eye blinks, and body motion. Fully client-side, no server needed.
6-step interactive tutorial. Download a VRM, wire up AnimaSync V1, apply lip sync, add mic streaming — with live demos at each step.
V1 phoneme engine — 111-dim output mapped to 52 ARKit blendshapes. ONNX inference with real-time visualization.
V2 student distillation model — 52 ARKit blendshapes with direct prediction. Crisp mouth, real-time rendering.
Same voice input, two animation engines, two avatars. See the difference live in a dual-panel view.
Two engines for different needs. Both produce ARKit-compatible output at 30fps.
| Feature | V1 (recommended) | V2 |
|---|---|---|
| Output | 111-dim output, mapped to 52 ARKit blendshapes | 52-dim ARKit blendshapes (direct) |
| Architecture | Phoneme classification + viseme mapping | Student distillation (direct) |
| Post-processing | OneEuroFilter + anatomical constraints | crisp_mouth + fade + auto-blink |
| Idle expressions | Built-in IdleExpressionGenerator | Blink injection in post-process |
| Voice activity | Built-in VoiceActivityDetector | — |
| Best for | Full expression control, custom avatars | Quick integration, lightweight |
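V1's post-processing names a OneEuroFilter. For reference, a minimal TypeScript implementation of the standard One Euro filter (an exponential smoother whose cutoff rises with signal speed, so slow jitter is smoothed hard while fast movements stay responsive); the parameter defaults are illustrative, not AnimaSync's tuning:

```typescript
// Minimal One Euro filter: adaptive low-pass smoothing for noisy signals.
class OneEuroFilter {
  private prevX: number | null = null;
  private prevDx = 0;

  constructor(
    private minCutoff = 1.0, // Hz; lower → more smoothing at rest
    private beta = 0.1,      // speed coefficient; higher → snappier when moving
    private dCutoff = 1.0,   // cutoff for the derivative estimate
  ) {}

  private static alpha(cutoff: number, dt: number): number {
    const tau = 1 / (2 * Math.PI * cutoff);
    return 1 / (1 + tau / dt);
  }

  // x: new sample; dt: seconds since the previous sample (e.g. 1/30).
  filter(x: number, dt: number): number {
    if (this.prevX === null) {
      this.prevX = x;
      return x;
    }
    // Smoothed estimate of how fast the signal is changing.
    const dx = (x - this.prevX) / dt;
    const aD = OneEuroFilter.alpha(this.dCutoff, dt);
    this.prevDx = aD * dx + (1 - aD) * this.prevDx;
    // Cutoff adapts to signal speed, then smooths the sample itself.
    const cutoff = this.minCutoff + this.beta * Math.abs(this.prevDx);
    const a = OneEuroFilter.alpha(cutoff, dt);
    this.prevX = a * x + (1 - a) * this.prevX;
    return this.prevX;
  }
}
```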