Splannequin

Freezing Monocular Mannequin-Challenge Footage with Dual-Detection Splatting

WACV 2026

1National Yang Ming Chiao Tung University, 2Ohio State University
Teaser

Splannequin transforms imperfect Mannequin-Challenge footage into perfectly frozen videos.

(Top) A monocular Mannequin-Challenge video is intended to depict a large-scale frozen scene, yet real-world recordings inevitably contain slight body movements. The red crops across successive frames (I_i) highlight these residual movements. (Bottom) After our processing, every crop (green boxes) across successive frames remains perfectly static. Splannequin analyzes the entire video and resynthesizes a temporally consistent sequence of views at a chosen instant t* while preserving overall visual fidelity.

Freezing Results of Our Mannequin-Challenge Dataset

Video coming soon.

Abstract

Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static view is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their last well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference.

Pipeline


The pipeline (1) extracts point clouds from the input video, (2) optimizes dynamic Gaussian splatting with dual-detection losses that anchor hidden Gaussians to earlier frames (t' < t) and defective Gaussians to later frames (t < t'), and (3) renders freeze-time videos at any timestamp t*. Temporal distance-based confidence weighting sets the regularization strength: closer reference frames provide stronger anchoring than distant ones, yielding robust temporal consistency and artifact suppression.
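The anchoring described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the state labels, attribute dictionaries, and the exponential proximity weight with scale `sigma` are all assumptions made for clarity.

```python
import numpy as np

def anchoring_loss(attrs_t, attrs_ref, t, t_ref, sigma=10.0):
    """Penalize deviation of a Gaussian's attributes at time t from its
    state at a supervised reference time t_ref, weighted by temporal
    proximity: closer references anchor more strongly."""
    w = np.exp(-abs(t - t_ref) / sigma)  # distance-based confidence weight
    return w * np.mean((attrs_t - attrs_ref) ** 2)

def dual_detection_loss(gaussians, t_star, sigma=10.0):
    """Sum anchoring terms over problematic Gaussians: hidden ones anchor
    to their last well-observed past state (t' < t*), defective ones to a
    better-supervised future state (t* < t')."""
    total = 0.0
    for g in gaussians:
        if g["state"] == "hidden":       # camera has already passed it
            total += anchoring_loss(g["attrs"], g["past_attrs"],
                                    t_star, g["past_t"], sigma)
        elif g["state"] == "defective":  # not yet well observed
            total += anchoring_loss(g["attrs"], g["future_attrs"],
                                    t_star, g["future_t"], sigma)
    return total
```

Because the weight decays with temporal distance, a reference frame near t* dominates the regularization, matching the caption's claim that closer references provide stronger anchoring.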

Time-Camera Conceptualization.

Our approach optimizes problematic Gaussians using pseudo ground truth taken from the same horizontal coordinates in supervised regions. The bird's-eye view shows a high-school hallway with frame samples and image planes.


Assuming forward camera motion, the diagonal dashed line represents standard dynamic rendering, while the horizontal line shows freeze-time rendering at a fixed timestamp t*. Along this freeze-time line, unsupervised Gaussians are either hidden (red points, as the camera has passed them) or defective (blue points, not yet well-observed). Our approach regularizes these problematic Gaussians by anchoring them to their supervised counterparts from other timestamps: hidden (red) Gaussians use past states, and defective (blue) Gaussians use future states. The right panel shows a bird's-eye view of a hallway, illustrating how the camera's path creates defective and hidden regions.
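The hidden/defective distinction above can be captured by a toy classifier over a Gaussian's observation times. This is a simplified sketch under the forward-motion assumption; the threshold `window` and the use of raw timestamps (rather than the visibility and supervision signals a real pipeline would use) are assumptions.

```python
def classify(observed_times, t_star, window=2.0):
    """Classify a Gaussian at freeze time t*, assuming forward camera
    motion. If some observation falls within `window` of t*, it is well
    supervised. Otherwise: only past observations -> 'hidden' (the camera
    has passed it); only future observations -> 'defective' (not yet
    well observed)."""
    if any(abs(t - t_star) <= window for t in observed_times):
        return "supervised"
    if all(t < t_star for t in observed_times):
        return "hidden"
    if all(t > t_star for t in observed_times):
        return "defective"
    return "mixed"
```

Hidden Gaussians would then be anchored to their latest past observation, and defective ones to their earliest future observation, mirroring the red and blue points along the freeze-time line.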

User-Selectable Freeze-Time Instants.


Splannequin empowers users to select the precise moment to freeze, allowing for artistic control over the final scene. Both rows show high-fidelity freeze-time videos generated from the same input sequence but frozen at two different, user-selected timestamps. (Top) At Timestamp 0, the subject in the inset is looking down. (Bottom) At Timestamp 80, captured seconds later, the subject has turned their head. Our method successfully reconstructs both moments with sharp detail and stability, preserving these subtle differences and enabling creative selection based on pose and expression.

Visual Comparison

Splannequin improves baseline dynamic methods, achieving higher quality and the frozen effect without relying on extra information or diffusion-based hallucination. The results remain faithful to the captured content, and any timestamp can be selected for freezing. The video comparison below shows one improved timestamp.


Each column shows freeze-time renderings from all methods at a viewpoint. Rows correspond to direct comparisons of identical viewpoints with baselines: 4DGaussians (top), D-3DGS (middle), and SC-GS (bottom). Adding Splannequin consistently produces sharper, more temporally coherent results, exhibiting reduced ghosting and artifact suppression compared to baseline methods.


Try selecting different methods and scenes!

Quantitative Results

Quantitative comparison on our real-world dataset. The values represent the percentage improvement Splannequin provides when added to each baseline method (higher is better). Our method consistently enhances all baselines, with the largest gains in technical artifact suppression (COVER Technical) and on the lowest-quality frames (IQA Bottom 25%). Methods are abbreviated as: (1) 4DGaussians+, (2) D-3DGS+, and (3) SC-GS+. W.F. denotes the worst frame.


Citation