Six ways to put a talking, lip-synced face on the web — from pure client-side three.js to streaming a cinematic Unreal MetaHuman off a cloud GPU. Each is a real, runnable demo with a shared performance HUD so you can try them and measure them on your own machine.
The portable core. A glTF head driven by the 52 standard ARKit blendshapes — manual sliders, expression presets, and the embedded face-capture clip. This is the foundation everything else builds on.
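As a minimal sketch of what "driven by blendshapes" means in practice: three.js meshes loaded from glTF expose `morphTargetDictionary` (shape name to index) and `morphTargetInfluences` (index to weight), so a preset is just a set of named weights copied across. The `applyBlendshapes` helper, the `SMILE` preset values, and the stand-in `head` object are hypothetical and not taken from the demo.

```javascript
// Hypothetical helper: copy named ARKit weights onto a three.js-style mesh.
// Any object with morphTargetDictionary and morphTargetInfluences works.
function applyBlendshapes(mesh, weights) {
  const applied = [];
  for (const [name, value] of Object.entries(weights)) {
    const index = mesh.morphTargetDictionary[name];
    if (index === undefined) continue; // this head lacks the shape; skip it
    mesh.morphTargetInfluences[index] = Math.min(1, Math.max(0, value)); // clamp to 0..1
    applied.push(name);
  }
  return applied;
}

// Illustrative expression preset using standard ARKit shape names.
const SMILE = {
  mouthSmileLeft: 0.8,
  mouthSmileRight: 0.8,
  cheekSquintLeft: 0.3,
  cheekSquintRight: 0.3,
};

// Stand-in for a loaded head mesh that only carries three shapes.
const head = {
  morphTargetDictionary: { jawOpen: 0, mouthSmileLeft: 1, mouthSmileRight: 2 },
  morphTargetInfluences: [0, 0, 0],
};

const applied = applyBlendshapes(head, SMILE);
```

Skipping unknown names rather than throwing lets one preset drive heads that ship only part of the 52-shape set.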
Mouth shapes derived from a live audio signal (mic, file, or TTS) via the Web Audio API — the browser stand-in for NVIDIA Audio2Face. Speak and the face moves.
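The simplest version of this idea maps loudness to jaw opening. The sketch below assumes a buffer of time-domain samples such as the one filled by `AnalyserNode.getFloatTimeDomainData()`; the `gain` and `smoothing` values are hypothetical tuning knobs, and an amplitude-only mapping is far cruder than what Audio2Face does.

```javascript
// Root-mean-square level of one buffer of samples in [-1, 1].
function rms(samples) {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
  return samples.length ? Math.sqrt(sum / samples.length) : 0;
}

// Map loudness to a 0..1 jawOpen weight, easing toward the target so the
// jaw does not jitter from frame to frame.
function nextJawOpen(previous, samples, { gain = 4, smoothing = 0.5 } = {}) {
  const target = Math.min(1, rms(samples) * gain);
  return previous + (target - previous) * (1 - smoothing);
}

// One animation frame: silence keeps the mouth shut, a loud buffer opens it.
const silence = new Float32Array(4);
const loud = new Float32Array([0.5, -0.5, 0.5, -0.5]);
const jawAfterSilence = nextJawOpen(0, silence);
const jawAfterLoud = nextJawOpen(0, loud);
```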
Type text, the browser speaks it, and visemes drive the mouth in sync — the Azure / Ready Player Me pattern, here using the built-in Web Speech API so it needs no API key.
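The Web Speech API reports `boundary` events on a `SpeechSynthesisUtterance` but no visemes, so mouth shapes have to be estimated from the text. The following is a rough sketch of one way to do that, not the demo's actual method: the letter table, viseme labels, and weights are all illustrative assumptions, and real systems work from phonemes rather than spelling.

```javascript
// Coarse viseme groups mapped to ARKit blendshape weights (illustrative values).
const VISEME_SHAPES = {
  PP: { mouthClose: 0.9 },
  FF: { mouthRollLower: 0.6, jawOpen: 0.1 },
  aa: { jawOpen: 0.7 },
  O: { mouthFunnel: 0.7, jawOpen: 0.4 },
  U: { mouthPucker: 0.8, jawOpen: 0.2 },
  E: { mouthStretchLeft: 0.5, mouthStretchRight: 0.5, jawOpen: 0.3 },
};

// Spelling-based guess at the viseme for each letter (hypothetical table).
const LETTER_TO_VISEME = {
  p: 'PP', b: 'PP', m: 'PP',
  f: 'FF', v: 'FF',
  a: 'aa', o: 'O', u: 'U', w: 'U', e: 'E', i: 'E',
};

// Turn one word into a sequence of visemes, dropping unmapped letters and
// collapsing immediate repeats.
function visemesForWord(word) {
  const out = [];
  for (const ch of word.toLowerCase()) {
    const viseme = LETTER_TO_VISEME[ch];
    if (viseme && out[out.length - 1] !== viseme) out.push(viseme);
  }
  return out;
}

const bob = visemesForWord('Bob');
```

In a page, each word-level `boundary` event could trigger `visemesForWord` for the word at `event.charIndex`, with the resulting shapes spread across the time until the next boundary.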
The full thing, and the closest to your use case: speak → speech-to-text → a dialogue tree → reply → text-to-speech → lipsync. A face that talks back, entirely in the browser.
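The dialogue-tree step in the middle of that chain can be sketched as a lookup from the current node plus the recognized transcript to the next node and its reply. The tree contents and the `step` function are hypothetical; keyword matching is an assumed strategy, and the speech-to-text and text-to-speech ends are left out.

```javascript
// Hypothetical dialogue tree: each node has a reply and keyword-matched branches.
const TREE = {
  start: {
    reply: 'Hi! Ask me about the demos or about cost.',
    branches: [
      { keywords: ['demo'], next: 'demos' },
      { keywords: ['cost', 'price'], next: 'cost' },
    ],
  },
  demos: { reply: 'Each demo shares one performance HUD.', branches: [] },
  cost: { reply: 'Streaming cost scales with GPU hours.', branches: [] },
};

// Advance one turn: pick the first branch whose keyword appears in the
// transcript; with no match, stay on the current node and repeat its reply.
function step(tree, nodeId, transcript) {
  const heard = transcript.toLowerCase();
  const branch = tree[nodeId].branches.find((b) =>
    b.keywords.some((k) => heard.includes(k))
  );
  const next = branch ? branch.next : nodeId;
  return { nodeId: next, reply: tree[next].reply };
}

const turn = step(TREE, 'start', 'Tell me about the DEMOS');
```

The returned `reply` string is what would be handed to text-to-speech and then to the lipsync stage.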
The off-the-shelf route: a polished open-source avatar library and a Ready Player Me character with built-in lipsync, moods, and gestures. What you ship when you don't want to build the pipeline.
The only path to true MetaHuman fidelity on the web: Unreal renders the real thing on a cloud GPU and streams the video over WebRTC. Explainer, provider comparison, cost estimator, and a live connection tester.
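A back-of-the-envelope version of such a cost estimator is sketched below, assuming cost is dominated by GPU time and that each viewer occupies a GPU for the length of the session. The function name, parameters, and the example rate are placeholders rather than any provider's pricing, and idle capacity and bandwidth are ignored.

```javascript
// Rough monthly cost of pixel streaming from GPU hours alone.
function estimateMonthlyCost({
  sessionsPerDay,
  minutesPerSession,
  gpuHourlyUsd,
  sessionsPerGpu = 1, // viewers sharing one GPU at a time
  days = 30,
}) {
  const sessionHoursPerDay = (sessionsPerDay * minutesPerSession) / 60;
  const gpuHours = (sessionHoursPerDay * days) / sessionsPerGpu;
  return { gpuHours, usd: gpuHours * gpuHourlyUsd };
}

// Example with a made-up rate of 1 USD per GPU hour.
const estimate = estimateMonthlyCost({
  sessionsPerDay: 100,
  minutesPerSession: 6,
  gpuHourlyUsd: 1,
});
```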
The biggest perceived-quality lever: image-based lighting plus a post stack (ambient occlusion, bloom, depth of field, antialiasing). Toggle each pass and watch the jump from "raw WebGL" to "rendered."
Shading turns a plastic head into believable skin: fake subsurface scattering (SSS), sheen, detail normals, and clearcoat catchlights in the eyes. A/B against the flat material. Skin SSS is the real MetaHuman moat.
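One common way to fake subsurface scattering is wrap lighting, which lets diffuse light bleed past the shadow terminator instead of cutting off at 90 degrees. Whether the demo uses this particular trick is an assumption; the sketch shows the scalar math on the CPU, whereas in practice it would live in a shader.

```javascript
// Wrap lighting: nDotL is the dot product of surface normal and light
// direction (-1..1); wrap (0..1) controls how far light bleeds into shadow.
// With wrap = 0 this reduces to ordinary clamped Lambert diffuse.
function wrapDiffuse(nDotL, wrap = 0.5) {
  return Math.max(0, (nDotL + wrap) / (1 + wrap));
}

const facingLight = wrapDiffuse(1);   // fully lit
const atTerminator = wrapDiffuse(0);  // still partly lit, unlike Lambert
const lambertEdge = wrapDiffuse(0, 0); // plain Lambert goes dark here
```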
The behavioral half of realism, pure JS: blinks, saccades, breathing, idle sway, cursor gaze, and co-articulated speech. Toggle "dead" vs "alive" — this sells presence more than polygons do.
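Blinking is the easiest of these to sketch: a short close-and-reopen curve fired at irregular intervals. The timings below are assumed defaults, not values from the demo, and the resulting weight would be written to both `eyeBlinkLeft` and `eyeBlinkRight`.

```javascript
// Eyelid closure (0 open, 1 shut) at time t for a blink that began at `start`.
// Times are in seconds; a half sine closes and reopens the lid smoothly.
function blinkWeight(t, start, duration = 0.2) {
  const phase = (t - start) / duration;
  if (phase <= 0 || phase >= 1) return 0;
  return Math.sin(Math.PI * phase);
}

// Schedule the next blink a random gap after `now`. The random source is
// injectable so the schedule can be made deterministic.
function nextBlinkTime(now, random = Math.random, minGap = 2, maxGap = 6) {
  return now + minGap + random() * (maxGap - minGap);
}

const midBlink = blinkWeight(0.1, 0);            // halfway through: lid shut
const scheduled = nextBlinkTime(10, () => 0.5);  // fixed "random" value
```

Randomizing the gap matters more than the exact curve: evenly spaced blinks read as mechanical.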
The next-gen upgrade path: three.js's WebGPU renderer with a node-based (TSL) material on the head. Compute, better materials, and more headroom: this is the direction in which the web's gap to engine-quality rendering keeps closing.
The photoreal, non-Unreal path: render a real captured scene as 3D Gaussian splats. Captured reality, not mesh+shader. Animatable head splats are the frontier for a photoreal talking face without an engine.
The synthesis: lighting, post, skin, eyes, and micro-motion on one head, each independently toggleable with the perf HUD. A/B "raw WebGL" vs "full cinematic" and measure the fps cost of every layer on your hardware.
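Measuring the cost of one layer comes down to comparing average frame time with the layer off and on. The sketch below assumes frame times in milliseconds have already been collected (for example from successive `requestAnimationFrame` timestamps); the function names are hypothetical.

```javascript
// Mean of a list of frame times in milliseconds.
function averageMs(frameTimes) {
  return frameTimes.reduce((sum, ms) => sum + ms, 0) / frameTimes.length;
}

// Compare a baseline run against a run with one extra layer enabled.
function layerCost(baselineFrames, withLayerFrames) {
  const before = averageMs(baselineFrames);
  const after = averageMs(withLayerFrames);
  return {
    addedMs: after - before,
    fpsBefore: 1000 / before,
    fpsAfter: 1000 / after,
  };
}

// Example: a layer that doubles frame time from 10 ms to 20 ms.
const cost = layerCost([10, 10, 10], [20, 20]);
```

Added milliseconds are the more honest number, since fps is not linear in work. Because `requestAnimationFrame` is tied to the display refresh, a layer that still fits within one refresh interval can show no fps change at all.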
Swap between five real heads — two photoreal (Avaturn, Avatar SDK), Ready Player Me, the high-detail Lee Perry-Smith scan, and the neutral scan — driven by one rig. Expressions, audible lipsync, and micro-motion on every face. Drop in your own MetaHuman GLB.