CVPR 2026 3D Reconstruction Curated Relevance Audit

This is a relevance-curated pass over the earlier 864 strict candidates. It is not a quality ranking. The goal is to separate core reconstruction papers from strong system bridges, adjacent context, and likely keyword noise.

Rows

#	Relevance	Paper	Editorial bucket	Matched groups	Reason	Abstract
1	core_reconstruction high	Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation Autonomous Driving / Autonomous Driving	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	VGGT/feed-forward geometry lineage with direct geometry signal	abstract Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing self-supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consis
2	core_reconstruction high	Dynamic Visual SLAM using a General 3D Prior Robotics & Embodied AI / Embodied AI	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping	VGGT/feed-forward geometry lineage with direct geometry signal	abstract Reliable incremental estimation of camera poses and 3D reconstruction is key to enable various applications including robotics, interactive visualization, and augmented reality. However, this task is particularly challenging in dynamic natural environments, where scene dynamics can severely deteriorate camera pose estimation accuracy. In this work, we propose a novel monocular visual SLAM system that can robustly estimate camera poses in dynamic scenes. To this end, we leverage the complementary strengths of geometric patch-based online bundle adjustment and recent feed-forward reconstruction models. Specifically, we propose a feed-forward reconstruction model to precisely filter out dynamic regions, while also utilizing its depth prediction to enhance the robustness of the patch-based visual SLAM. By aligning depth prediction with estimated patches from bundle adjustment, we robustly ha
3	core_reconstruction high	DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving Autonomous Driving / Autonomous Driving	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping	VGGT/feed-forward geometry lineage with direct geometry signal	abstract Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that
4	core_reconstruction medium	E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training Learning Algorithms / Self-supervised	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Self-supervised pre-training has revolutionized foundation models for language, 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Exp
5	core_reconstruction high	Emergent Extreme-View Geometry in 3D Foundation Models 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract 3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with ded
6	core_reconstruction high	Emergent Outlier View Rejection in Visual Geometry Grounded Transformers 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reliable 3D reconstruction from in-the-wild image collections is often hindered by noisy images—irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that the existing feed-forward reconstruction model, e.g., VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effe
7	core_reconstruction high	FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract 3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences.In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as cro
8	core_reconstruction high	Flow3r: Factored Flow Prediction for Visual Geometry Learning 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We propose Flow3r, a scalable framework for visual geometry learning that leverages flow prediction to guide learning using unlabeled monocular videos. Current 3D/4D reconstruction systems primarily rely on dense geometry and pose supervision, and cannot easily generalize to diverse dynamic real-world scenes. In this work, we propose a mechanism to augment training directly from unlabeled videos, leveraging dense 2D correspondences (or ‘flow’) between arbitrary image pairs as supervision. Our key insight is that a factored flow prediction module that computes from two images using ‘geometry latents’ from one image and the ‘pose latent’ from the othercan guide visual geometry learning. We first highlight the benefits and scalability of flow supervision in controlled settings and then leverage large-scale unlabeled data to improve off-the-shelf visual geometry models. We evaluate Flow3r ac
9	core_reconstruction high	FRM: Linear-Time 3D Reconstruction via Test-Time Training 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Feed-forward transformer models such as VGGT and $\pi^3$ are highly accurate, but their computational cost grows quadratically with the number of input images, making them slow to evaluate on large collections. More efficient approaches ameliorate this cost at the expense of reconstruction quality. We introduce Fast Reconstruction Model, a stateful feed-forward reconstruction model that uses a bidirectional architecture that scales linearly in the number of input views, while matching or surpassing the reconstruction quality of quadratic-time methods. FRM employs test-time training layers to compress images into a compact hidden scene state during a single forward pass, enabling our model to reconstruct 3D scenes at speeds up to 75 FPS on a single H100 GPU---over 20 times faster than SOTA methods such as VGGT. This hidden state also serves as an implicit scene representation which can be
10	core_reconstruction high	GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians still remains a challenge without proper geometric priors. Existing methods have explored to predict point maps as geometric references for inferring Gaussian primitives, while the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain on novel views generated at non-preset camera poses iden
11	core_reconstruction high	Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, \method{} produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
12	core_reconstruction high	Generalizable Sparse-View 3D Reconstruction from Unconstrained Images 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization with appearance embeddings or dynamic masks, requiring extensive per-scene training and failing under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across div
13	core_reconstruction high	Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction Autonomous Driving / Autonomous Driving	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; gaussian_radiance; depth_correspondence; surface_occupancy; robotics_mapping	VGGT/feed-forward geometry lineage with direct geometry signal	abstract Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction.We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric s
14	core_reconstruction high	GGPT: Geometry-Grounded Point Transformer 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views.Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit sparse-geometry supervision using an optimised guidance encoding. Extensive experiments dem
15	core_reconstruction high	HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Visual Geometry Grounded Transformer (VGGT) has shown significant progress in 3D vision tasks. However, its global attention layers incur quadratic computational cost with respect to the number of input views, becoming a critical bottleneck for scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approxi
16	core_reconstruction high	HTTM: Head-wise Temporal Token Merging for Faster VGGT 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT.Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the un
17	core_reconstruction high	LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system byaligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($Sim(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignmen
18	core_reconstruction high	Learning 3D Reconstruction with Priors in Test Time 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks, without retraining or modifying the pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference. The optimization loss is composed of a self-supervised objective and prior penalty terms. The self-supervised objective is defined as the compatibility among multi-view predictions, implemented by the photometric or geometric loss between the renderings from other views and each view itself. Any available priors are turned into the penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method cons
19	core_reconstruction high	Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos Autonomous Driving / Autonomous Driving	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; pose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmark	VGGT/feed-forward geometry lineage with direct geometry signal	abstract Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals
20	core_reconstruction high	LongStream: Long-Sequence Streaming Autoregressive Visual Geometry 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They typically anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we solve Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination. We propose cache-consistent training combined with
21	core_reconstruction high	MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry mod
22	core_reconstruction high	MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks.In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations.However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability.Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation.In addition, it integrates dense sem
23	core_reconstruction high	MotionCrafter: Repurposing Video Generators for Dense Geometry and Motion Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We introduce MotionCrafter, the first video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. To represent them effectively in latent space, we propose a 4D VAE that encodes point maps and scene flows as a unified latent compatible with pretrained video generators. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents—despite their fundamentally different distributions—we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in joint 4D geometry reconstruction and dense scene flow estimation, d
24	core_reconstruction high	OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed,
25	core_reconstruction high	Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips alongside object-level semantics; and 2D–3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation — marking a step forward toward real-time, semantics-aware Spatial AI.
26	core_reconstruction high	Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; surface_occupancy; generation_editing; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete $360^\circ$ environments. To address these limitations, we design $\textbf{Pano3DComposer}$, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to $\textbf{Alignment-VGGT}$ by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geom
27	core_reconstruction high	PhysIR-Splat: Physically Consistent Thermal Infrared Radiative Transfer in 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Thermal infrared (TIR) 3D reconstruction provides geometry that is intrinsically coupled to the temperature field, even in low-light, nighttime, and smoke-obscured environments. TIR imaging measures self-emitted thermal radiation driven by object temperature and is largely independent of external illumination; therefore, simply carrying over visible-spectrum assumptions to TIR-based 3D reconstruction and novel view synthesis (NVS) often results in floating artifacts and blurred edges. In addition, radiometric inconsistency and low contrast in TIR weaken structure-from-motion (SfM) initialization, which in turn hinders subsequent 3D Gaussian Splatting (3DGS) optimization. We present PhysIR-Splat, a 3DGS framework that follows infrared radiative transfer: we explicitly model temperature, emissivity, and environmental irradiance on Gaussian primitives and, during rendering, jointly account
28	core_reconstruction high	Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Understanding how the 3D world evolves over time is a fundamental task in computer vision, essential for embodied settings, autonomous driving, etc. It requires not only the reconstruction of the observed scene but also the anticipation of how the scene dynamics will unfold in the future. While the area of 3D reconstruction has progressed rapidly with the advent of recent feed-forward neural networks, forecasting future dynamics in 3D, given the 2D frames of a video remains unexplored. We present Point4Cast, a unified framework that processes streaming 2D frame sequences of a video to estimate the past, present, and future of the underlying dynamic scene, in 3D. At the core of our approach lies a persistently evolving latent \emph{spacetime representation} that models the environment’s evolution across time. Upon receiving a new 2D frame, an update operation integrates the incoming evide
29	core_reconstruction high	Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; surface_occupancy; generation_editing; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, the visible-region point cloud are easy to obtain—from active sensors such as LiDAR or from feed-forward predictors like VGGT—offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on a latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with a point cloud priors tailored input formulation. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a st
30	core_reconstruction high	QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camer
31	core_reconstruction high	Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g. 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up.In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and image sparse correspondences given by zero-shot predictions of pretrained models.At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilita
32	core_reconstruction high	Scal3R: Scalable Test-Time Training for Feed-forward Large-Scale 3D Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lig
33	core_reconstruction high	Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapt
34	core_reconstruction high	SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model Autonomous Driving / Autonomous Driving	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	VGGT/feed-forward geometry lineage with direct geometry signal	abstract This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird’s eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state
35	core_reconstruction medium	STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction Learning Algorithms / Efficient Models	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. While causal VGGT transformers address this challenge through key-value (KV) cache mechanism, the linear growth of the cache introduces a significant memory bottleneck. When memory constraints trigger early eviction, reconstruction quality and temporal consistency deteriorate markedly. In this work, we observe that attention patterns in causal transformers for 3D reconstruction exhibit intrinsic spatio-temporal sparsity. Leveraging this insight, we propose STAC, a Spatio-Temporally Aware Cache compression framework specifically designed for streaming 3D reconstruction using large causal transformers. STAC incorporates three key components: a Working Temporal Token Caching mechanism that preserves long-term informative tokens based on decayed cumula
36	core_reconstruction high	TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; gaussian_radiance; dynamic_4d	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss.This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby _unbinding_ the number of predicted primitives from input image resolution and number of views. Our resulting method, __TokenGS__, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance o
37	core_reconstruction high	Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation Autonomous Driving / Autonomous Driving	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping	VGGT/feed-forward geometry lineage with direct geometry signal	abstract Recent advancements in feed-forward architectures for visual geometry estimation have achieved significant progress. Interestingly, per-frame visual geometry estimation approaches typically exhibit weaker multi-frame consistency but demonstrate superior per-frame accuracy compared to multi-frame algorithms. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals three key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furtherm
38	core_reconstruction high	V-DPM: Video Reconstruction with Dynamic Point Maps 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract New, powerful 3D representations such as DUSt3R’s invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feed-forward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend the concept to dynamic 3D content, also representing 3D scene motion.However, DPMs have so far been limited to image pairs and, like DUSt3R, require post-processing via optimization when more than two views are involved. We argue that DPMs are far more meaningful when applied to videos and introduce V-DPM to demonstrate this.First, we show how to set up DPMs for videos to optimize their representational power, ease of neural prediction, and reuse of pre-trained models. Second, we implement these ideas on top of VGGT, a recent state-of-the-art 3D reconstructor. Although VGGT was trained on static scenes, we show that a small amount of synthetic
39	core_reconstruction high	VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training.VGG-T$^3$ ($\mathbf{V}$isual $\mathbf{G}$eometry $\mathbf{G}$rounded $\mathbf{T}$est $\mathbf{T}$ime $\mathbf{T}$raining) scales linearly w.r.t. the number of input views, similar to online models, and achieves a $11.6\times$ speed-up over baselines that rely on softmax attention for reconstructing a $1k$ image collection in just $54$ seconds. Because our method retains global scene aggregation capability, our resulting point map reconstruction err
40	core_reconstruction high	VGGT-$\Omega$ 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present VGGT-Ω, a feed-forward model for 3D reconstruction that substantially advances the state of the art in accuracy, efficiency, and capability for both static and dynamic scenes. Prior models such as VGGT have shown that feed-forward 3D reconstruction can already be competitive with traditional optimization-based methods. Here, we further demonstrate that the accuracy and robustness of these models scale predictably with model capacity and data size. To enable training 3D reconstruction models at an unprecedented scale, we introduce a high-quality data annotation pipeline that handles dynamic scenes, a self-supervised learning protocol, and architectural changes that greatly reduce memory requirements. We significantly simplify VGGT’s architecture by replacing multiple dense prediction heads with loss-driven multitask learning, removing unstable DPT blocks, and introducing more e
41	core_reconstruction high	VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that together form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT’s perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (i
42	core_reconstruction high	VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-view Indoor 3D Object Detection 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain—i.e., precisely calibrated multi-view camera poses—to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth). Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key c
43	core_reconstruction high	VGGTracker: Fast Spatial Tracking with Visual Geometry Transformer 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Existing 3D point tracking methods mostly rely on heuristic designs or scene reconstruction, which incurs significant computational overhead and makes it difficult to meet the demands of real-time applications.To address this problem, in this work, we present VGGTracker, a novel spatial tracker that leverages a feed-forward visual geometry transformer to predict the trajectories of arbitrary query points from monocular videos in real time.Specifically, we employ a query initialization mechanism to maintain and update a global feature vector and a set of frame-level feature vectors for each query point.Then, we propose a new spatial tracking framework, which consists of a visual geometry transformer backbone, a global embedding branch, a frame-level embedding branch, and a tracking head.The key innovation lies in the dual-branch embedding design, where the global embedding branch integrat
44	core_reconstruction high	WildPose: A Unified Framework for Robust Pose Estimation in the Wild 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume inputs from static environments. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models sometimes perform badly on static-only scenes. We present Wildpose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect the two powerful paradigms in modern 3D vision: the rich perceptual frontend of feed-forward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this by enhancing the differentiable BA pipeline in two ways. First, we introduce a new 3D-aware update o
45	core_reconstruction high	DVGT: Visual Geometry Transformer for Autonomous Driving Autonomous Driving / Autonomous Driving	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; dynamic_4d; robotics_mapping	VGGT/feed-forward geometry lineage with direct geometry signal	abstract Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, it still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Visual Geometry Transformer specifically designed for autonomous Driving (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. Finally, we use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego pose for each frame. Our DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-
46	core_reconstruction high	OccAny: Generalized Unconstrained Urban 3D Occupancy Autonomous Driving / Autonomous Driving	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; gaussian_radiance; surface_occupancy; robotics_mapping	VGGT/feed-forward geometry lineage with direct geometry signal	abstract Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization.While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios.We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features.OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images.Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework
47	core_reconstruction high	VGA:Empowering Aerial-Ground Localization by Visual Geometry Alignment Remote Sensing & Earth / Remote Sensing	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; pose_calibration_localization; robotics_mapping	VGGT/feed-forward geometry lineage with direct geometry signal	abstract Aerial-ground visual localization is a challenging task due to the significant differences in scene scale and view point captured between two views. In this work, we explore the practical benefit of jointly learning camera calibration and bird’s-eye-view (BEV) projection for estimating full 6 Degrees-of-freedom relative camera pose between uncalibrated aerial and ground views. We present Visual Geometry Alignment (VGA), a unified framework that jointly learns a global gravity-alignment prior inferred from dense monocular perspective fields, and a planar alignment prior complementing the unobserved azimuth angle through Procrustes alignment in a shared BEV plane. At inference, we jointly refine the relative camera pose by integrating the predicted per-camera gravity alignment and relative planar azimuth angle, yielding improved orientation and translation alignment from visual input with
48	core_reconstruction high	AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance Multimodal & Language / Agentic AI	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction	VGGT/feed-forward geometry lineage with direct geometry signal	abstract Active 3D reconstruction enables an agent to autonomously select viewpoints to build accurate and complete scene geometry efficiently, rather than passively reconstructing scenes from pre-collected images. Existing active reconstruction methods often rely on geometric heuristics, which may result in redundant observations without improving reconstruction quality. To address this, we propose \textbf{AREA3D}, an active reconstruction agent for 3D reconstruction by leveraging feed-forward 3D models and vision-language guidance. The framework decouples view uncertainty modeling from feed-forward reconstruction, enabling precise uncertainty estimation without online optimization. Moreover, the integrated Vision-Language Model provides high-level semantic guidance that guides exploration beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks (Replic
49	core_reconstruction high	4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps.Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity.The resulting system e
50	core_reconstruction high	ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Existing hand-object interactions (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods of articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize object's metric scale and pose for grounding its normalized mesh i
51	core_reconstruction high	CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming ground truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporarily consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and
52	core_reconstruction high	Catch Me if You Can: Active Mapping of Moving 3D Objects 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object's motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target's trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding.
53	core_reconstruction high	Complet4R: Geometric Complete 4D Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We introduce Complet4R, a novel end-to-end framework for Geometric Complete 4D Reconstruction, which aims to recover temporally coherent and geometrically complete reconstruction for dynamic scenes. Our method formalizes the task of Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion, by directly accumulating full contexts onto each frame. Unlike previous approaches that rely on pairwise reconstruction or local motion estimation, Complet4R utilizes a decoder-only transformer to operate all context globally directly from sequential video input, reconstructing a complete geometry for every single time step, including occluded regions visible in other frames. Our method demonstrates the state-of-the-art performance on our proposed benchmark for Geometric Complete 4D Reconstruction and the 3D point tracking task. Code will be released to support futur
54	core_reconstruction high	Efficiently Reconstructing Dynamic Scenes one D4RT at a Time 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Understanding and reconstructing the complex geometry and motion of dynamic 4D scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward network designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our unified decoding interface allows the model to independently and efficiently probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state-of-the-art, outper
55	core_reconstruction high	EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability.To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that
56	core_reconstruction high	FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Single-image 3D human reconstruction holds significant promise due to its convenience and high demand in various applications. Previous methods have garnered tremendous progress by employing 2D multi-view diffusion models to generate auxiliary views as reconstruction priors, but they struggle with 3D inconsistencies and limited generalization capabilities. In this paper, we present FISHuman, which aims to generate fine-grained, high-fidelity, and content-wise diverse 3D humans from a single-view input, providing production-ready 3D assets. We propose an elaborately designed workflow that reconstructs dynamic 3D meshes from multi-view inconsistent guidance. Specifically, we adapt a dual-stream transformer-based video diffusion model to generate cross-modally aligned multi-view RGB and normal sequences. We find that naively employing static 3D reconstruction can lead to geometric distortio
57	core_reconstruction high	Inferring Compositional 4D Scenes without Ever Seeing One 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard.Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, a
58	core_reconstruction high	MoRe: Motion-aware Feed-forward 4D Reconstruction Transrformer 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruc
59	core_reconstruction high	Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract emporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast optimization method based on a novel preconditioned surface encoding that estimates coherent non-rigid deformations without sacrificing temporal stability or accuracy. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multi-layer percep
60	core_reconstruction high	PatchScene: Patch-based Voxel Diffusion Model for Large-Scale Scene Completion 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-th
61	core_reconstruction high	ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation.To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision.The core of ReFlow is a novel self-correction flow matching mechanism, c
62	core_reconstruction high	ReGenHOI: Unifying Reconstruction and Generation for 3D Human–Object Interaction Understanding 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Understanding 3D human–object interaction (HOI) involves two highly-related abilities: reconstruction, which perceives observed geometry, and generation, which imagines plausible future interactions. However, most existing methods treat these abilities as separate tasks, limiting their capacity to capture the unified nature of human spatial reasoning. To address this, we propose a unified framework that bridges reconstruction and generation through a shared semantic–geometric reasoning space. Specifically, a 3D Contact Reasoning mechanism enables direct reasoning in 3D space, jointly modeling geometric structure and semantic relationships, while a Reasoning Trace Refinement module iteratively refines contact predictions by integrating geometric and semantic cues. The framework builds a unified latent representation via explicit reasoning on human–object contact regions. To further enhanc
63	core_reconstruction medium	Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling 3D Vision & Geometry / Pose Estimation	A. thesis anchor: dynamic/4D recon	general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Visual SLAM is one of the most fundamental problems in computer vision, with direct applications to real-time localization tasks such as AR/VR, robotics, and 3D scene reconstruction. Although significant progress has been made in both sparse and dense approaches, real-time monocular SLAM remains challenging—particularly in the uncalibrated setting, where existing methods are often inefficient and lack modularity. In this paper, we present a new visual SLAM pipeline implemented from scratch in C++ that explicitly leverages the spatio-temporal structure of the scene for improved localization, and is designed to be modular so that off-the-shelf components can be easily integrated. We introduce a temporal representation based on a buffer of recent keyframes that preserves short-term scene continuity. To complement this, we incorporate a spatial representation based on a 3D cell-based scene m
64	core_reconstruction high	Vista4D: Video Reshooting with 4D Point Clouds 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quali
65	core_reconstruction high	WorldStereo: Bridging Controllable Video Generation and Scene Reconstruction via 3D Geometric Memories 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories.In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds.Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank.These components enable WorldStereo to generate multi-view-co
66	core_reconstruction medium	TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos 3D Vision & Geometry / Pose Estimation	A. thesis anchor: dynamic/4D recon	general_reconstruction; pose_calibration_localization; dynamic_4d	direct reconstruction/3DGS/4D title linked to core representation cluster	abstract Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human–scene–camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES--Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos--a unified framework tailored for this task. TROPHIES features a Human Branch that models human through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment
67	core_reconstruction high	Any4D: Unified Feed-Forward Metric 4D Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordi
68	core_reconstruction high	$\textit{4DSurf}$: High-Fidelity Dynamic Scene Surface Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract This paper addresses the problem of dynamic scene surface reconstruction using Gaussian Splatting (GS), aiming to recover temporally consistent geometry. While existing GS-based dynamic surface reconstruction methods can yield superior reconstruction, they are typically limited to either a single object or objects with only small deformations, struggling to maintain temporally consistent surface reconstruction of large deformations over time. We propose ``4DSurf'', a novel and unified framework for generic dynamic surface reconstruction that does not require specifying the number or types of objects in the scene, can handle large surface deformations and temporal inconsistency in reconstruction. The key innovation of our framework is the introduction of Gaussian deformations induced Signed Distance Function Flow Regularization that constrains the motion of Gaussians to align with the evo
69	core_reconstruction high	$L^{2}DGS$: Low-Light Dynamic Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Synthesizing novel spatiotemporal views of dynamic scenes is inherently challenging due to both object and camera motion, as well as sparsity of observations. Recent advances in Neural Radiance Fields (NeRFs) and Gaussian Splatting (GS) have enabled 4D dynamic scene reconstruction, but predominantly from well-lit images or videos. Some works address the problem of reconstructing a well-lit scene from low-light input, but these are limited to static scenes. Moreover, prior methods primarily emphasize improving illumination, while overlooking the underlying scene characteristics. Reconstructing well-lit dynamic scenes from inputs captured under low-light conditions is particularly challenging due to shadows, occlusions, and disocclusions caused by object motion, which makes the problem highly ambiguous and ill-posed. We propose $L^{2}DGS$ (Low-Light Dynamic Gaussian Splatting), a self-supe
70	core_reconstruction high	3D Gaussian Splatting from unposed Spike Stream 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has significantly advanced 3D reconstruction with its impressive performance. However, its reliance on sharp images and precise camera pose priors limits its effectiveness in high-speed scenarios. Recent advances have integrated spike camera, a bio-inspired sensor with a high temporal resolution, to enhance 3DGS in such conditions. Although spike-based methods reduce the need for sharp images, they still face challenges in achieving precise camera pose estimation due to unstable observations and visual texture deficiency.To address these challenges, we propose Nope-SGS, the first framework that reconstructs high-speed 3D scenes from unposed captures of the bio-inspired high-temporal-resolution spike camera. To achieve robust 3D reconstruction and pose estimation, we first reformulate the spike model from a probabilistic perspective and extend its applicat
71	core_reconstruction high	3D Gaussian Splatting with Self-Constrained Prior for High Fidelity Surface Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Rendering 3D surfaces has been revolutionized within the modeling of radiance fields through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constraining the movement of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is a TSDF grid fused by the rendered depth during the learning of 3D Gaussians. The prior measures a band on both sides of the estimated surface for imposing more specific constraints on the right 3D Gaussians, such as removing 3D Gaussians outside the band, encouraging larger opacity for Gaussians near the center of the band or smaller opacity for Gaussians near the boundary of the band. We regularly update the prior by fusing more rec
72	core_reconstruction high	4C4D: 4 Camera 4D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract This paper tackles the challenge of recovering 4D dynamic scenes from videos captured by as few as four portable cameras. Learning to model scene dynamics for temporally consistent novel-view rendering is a foundational task in computer graphics, where previous works often require dense multi-view captures using camera arrays of dozens or even hundreds of views. We propose 4C4D, a novel framework that enables high-fidelity 4D Gaussian Splatting from video captures of extremely sparse cameras. Our key insight lies that the geometric learning under sparse settings is substantially more difficult than modeling appearance. Driven by this observation, we introduce a Neural Decaying Function on Gaussian opacities for enhancing the geometric modeling capability of 4D Gaussians. This design mitigates the inherent imbalance between geometry and appearance modeling in 4DGS by encouraging the 4DGS
73	core_reconstruction high	4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract 4D reconstruction of equine family (e.g. horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPose
74	core_reconstruction high	ActivePolicy: Active Gaussian Reconstruction and Optimization Strategy Based on Global-Local Information Gain 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Active 3D Gaussian reconstruction achieves superior completeness and rendering quality by intelligently selecting viewpoints. However, existing methods suffer from two critical limitations: information gain metrics that prioritize geometric coverage while ignoring rendering quality, and overfitting to sparse view configurations that degrades novel view synthesis. We introduce ActivePolicy, a novel framework addressing both challenges through principled NBV selection and regularization. We propose \textbf{GL-Graph}, a graph-theoretic strategy that unifies geometric consistency, rendering quality, and observation redundancy into a single stability criterion. To counteract overfitting, we introduce \textbf{4D-Reg}, which identifies floaters through manifold discrepancies among three depth types (R-Depth, $\alpha$-Depth, C-Depth) and suppresses them via adaptive dropout. Extensive experiment
75	core_reconstruction high	AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Monocular UAV videos pose a fundamental challenge for 3D reconstruction: dynamic scene modeling requires accurate camera poses, yet recovering poses from long UAV trajectories often fails under texture-sparse regions and moving objects.Existing approaches typically handle either pose-free static reconstruction or dynamic reconstruction with known poses, but jointly solving both from casual aerial footage remains difficult due to motion coupling and severe scale variation.We introduce \modelname, a scale-aware Gaussian splatting framework that jointly recovers camera trajectories and reconstructs dynamic scenes from pose-free monocular videos.Central to our method are scale-aware spatio-temporal anchors (S$^2$A-Anchors), which enable a unified optimization via three key decoupling mechanisms:(i) separating ego-motion from object motion,(ii) isolating static geometry from temporal deformat
76	core_reconstruction high	AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Scene-level 3D reconstruction has attracted increasing attention, and feed-forward 3D Gaussian Splatting (3DGS) has emerged as a promising paradigm for novel view synthesis. However, most existing methods adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, making the number of Gaussians tightly coupled with the input images. This leads to several limitations: (i) reconstruction quality is sensitive to the quantity and viewpoint coverage of input images, often causing Gaussians to accumulate more densely in regions with frequent viewpoints; (ii) alignment errors become more pronounced under sparse-view conditions; and (iii) the lack of explicit geometric consistency can degrade depth estimation and downstream 3D tasks. In this paper, we propose AnchorSplat, a novel multi-view feed-forward 3DGS framework for scene-level reconstruction that departs from pixel-aligned
77	core_reconstruction high	ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects.To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while
78	core_reconstruction high	BA-GS: Bayesian Adaptive Gaussian Splatting for SFM-Free 3D Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has demonstrated exceptional performance in reconstruction and novel view synthesis tasks. However, its reliance on Structure-from-Motion preprocessing may lead to degraded performance under sparse-view scenarios. Recent works attempt to address this limitation by leveraging pre-trained image matching models to generate Gaussian primitives but overlook the probabilistic uncertainty embedded in both the initial primitive distribution and iterative position updates. This uncertainty can accumulate and degrade reconstruction fidelity. Hence, we propose BA-GS, a Bayesian framework that models both the global distribution and local uncertainty of Gaussian primitives. At global initialization, a Variational Bayesian Gaussian Mixture Model (VB-GMM) models the latent distribution of primitives, capturing region-wise density and gradient patterns. At local refinement,
79	core_reconstruction high	BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering B-rep representation from unstructured data is a challenging and valuable task of computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Ex
80	core_reconstruction high	CGHair: Compact Gaussian Hair Reconstruction with Card Clustering 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present a compact pipeline for high-fidelity hair reconstruction from multi-view images. While recent 3D Gaussian Splatting (3DGS) methods achieve realistic results, they often require millions of primitives, leading to high storage and rendering costs. Observing that hair exhibits structural and visual similarities across a hairstyle, we cluster strands into representative hair cards and group these into shared texture codebooks. Our approach integrates this structure with 3DGS rendering, significantly reducing reconstruction time and storage while maintaining comparable visual quality. In addition, we propose a generative prior accelerated method to reconstruct the initial strand geometry from a set of images. Our experiments demonstrate a 4-fold reduction in strand reconstruction time and achieve comparable rendering performance with over 200× lower memory footprint.
81	core_reconstruction high	ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length.We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with hig
82	core_reconstruction high	Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation; translation; scale), even when they are of different objects in the same category (e.g, different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g, the same car) and often must be given true scale as input, while we estimate it successfully. Our approach leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns models while keeping the 3DGS models fixed. First, we perform an iterative, feature-guided coarse registration that is robust to extremely poor initialization (e.g, 180° misalignment or a 10× scale gap), followed by a fine registration step enforcing multi-view feature consistency, inspired
83	core_reconstruction high	CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Single-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advancements, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, often yielding subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, laden with challenges such as: 1) extensive occlusions, 2) low clarity, and 3) numerous and various appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables the pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs.Furthermore, we in
84	core_reconstruction high	Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Novel view synthesis has been significantly advanced by NeRFs and 3D Gaussian Splatting (3DGS), which require ordering volumetric samples or primitives for correct color blending. While the recent Gaussian-Enhanced Surfels (GES) enable high-performance, sort-free rendering, they suffer from aliasing artifacts and suboptimal reconstruction. To address these limitations, we propose DP-GES, a novel representation that augments opaque surfels with semi-transparent boundaries and leverages Depth Peeling to establish accurate per-pixel ordering. This design enables sort-free Gaussian splatting with correct transmittance modulation, effectively eliminating aliasing and popping artifacts while facilitating a fully differentiable joint optimization. Extensive experiments demonstrate that our method achieves superior reconstruction quality and compares favorably against state-of-the-art techniques
85	core_reconstruction high	DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Radiance field reconstruction aims to recover high-quality 3D representations from multi-view RGB images. Recent advances, such as 3D Gaussian splatting, have achieved real-time rendering with high visual fidelity, given sufficiently powerful graphics hardware. However, drastic model simplification — i.e., reducing the number of primitives by several orders of magnitude — is required to enable efficient online transmission and rendering across diverse hardware platforms. We introduce DiffSoup, a radiance field representation that employs a soup (i.e., a highly unstructured primitives) of a small number of triangles with neural textures that have binary opacity. We show that the binary opacity representation is directly differentiable via stochastic opacity masking, enabling stable training without molifier (i.e., smooth rasterization). DiffSoup can be rasterized with a traditional depth-
86	core_reconstruction high	Disco-GS: Gaussian Splatting in Dynamic Color Lighting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent advances in Gaussian Splatting (GS) have significantly improved 3D scene reconstruction and novel view synthesis. However, most existing methods typically assume that training inputs are captured under stable lighting conditions and achromatic light. In contrast, scenes recorded under temporally varying color light, as in “disco lights” commonly seen in events, performances, and decorative settings, introduce severe ambiguities in both scene photometry and geometry. We propose Disco-GS, a framework that leverages GS for reconstructing the 3D scene while simultaneously recovering the underlying canonical appearance from videos captured under dynamic lighting conditions. Disco-GS estimates the effective per-pixel transient light, which, when applied to the canonical image, results in the observed color image of the scene, thereby enabling self-supervised learning. Disco-GS is an end
87	core_reconstruction high	Distilling Unsigned Distance Function for Surface Reconstruction from 3D Gaussian Splatting 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Unsigned distance fields (UDFs) are well suited for representing open surfaces, but learning them from multi-view images is challenging because ground-truth surfaces are unavailable for supervision in most cases and the gradient of a UDF is undefined on the underlying surface. Prior methods optimize UDFs with global objectives and apply gradient-based priors ignoring the non-differentiability for queries on the target surface, which leads to unstable training and over-smoothing on fine details. We address these issues by distilling a patch-based UDF prior, trained on synthetic ground truth algebraic surfaces with closed form expressions, into a lightweight student UDF inside Gaussian optimization process. We design band-limited knowledge distillation strategy that leverages a pretrained patch-based UDF predictor to provide reliable near-surface UDF supervision, enabling stable student t
88	core_reconstruction high	Dropping Anchor and Spherical Harmonics for Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent 3D Gaussian Splatting (3DGS) dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic coefficients (SH) to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-based dropout strategy. Rather than dropping Gaussians independently, our method randomly selects certain Gaussians as anchors and simultaneously removes their spatial neighbors. This effectively disrupts local redundancies and encourages the model to learn more robust, globally informed representations. Furthermore, we extend the dropout to color attributes by randomly dropping higher-degree SH coe
89	core_reconstruction high	DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present Compact 3D Reconstruction with Positive and Negative Primitives (DualPrim), a novel approach for reconstructing compact and topologically regular 3D meshes from multi-view images. Unlike traditional methods that rely on implicit representations such as signed distance functions, or explicit formats such as meshes and point clouds, our method models geometry using quadrics-based 3D primitives. Each primitive is defined by a positive-density superquadric that contributes to the shape, and a negative-density superquadric that carves out local volumes, enabling fine-grained geometric control and flexible topology. This dual-primitive representation yields compact, well-regularized, and efficiently parameterized mesh reconstructions. To infer primitive parameters from multi-view images, we design a differentiable rendering pipeline that jointly estimates positive and negative super
90	core_reconstruction high	E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract The emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods still rely on known poses or depend on depth estimation models and auxiliary modalities such as RGB-D. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS. To extract edges from noisy event streams, we
91	core_reconstruction high	EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Understanding a 3D scene immediately with its exploration is essential for embodied tasks, where an agent must construct and comprehend the 3D representation in an online and nearly real-time manner. In this study, we propose EmbodiedSplat, an online feed-forward 3DGS for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from the streaming images. Unlike existing open-vocabulary 3DGS methods, our objectives are two-fold: 1) Reconstructs the semantic-embedded 3DGS of the entire scene from over 300 streaming images in an online manner. 2) Highly generalizable to novel scenes with feed-forward design and supports nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook where it binds the 2D C
92	core_reconstruction medium	Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow 3D Vision & Geometry / Pose Estimation	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract High-quality 3D scene representation in radiance fields relies on accurate camera poses which are often difficult to acquire in real-world scenarios. An effective solution is to use RGB images for the joint optimization of radiance fields and camera poses, an approach that has been well explored in NeRF series methods. However, unlike NeRF, joint optimization in 3D Gaussian Splatting (3DGS) often requires additional regularization or prior spatial knowledge to reach comparable performance. To eliminate these dependencies, we introduce Energy-GS, a pose-aware Gaussian splatting framework that jointly optimizes scene representation and camera poses using only RGB images. We observe that pose gradients in joint optimization are unstable due to the point-based rendering mechanism. Furthermore, unlike NeRF’s spatial sampling framework that enables coarse-to-fine pose alignment, rasterization-
93	core_reconstruction high	FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract The demand for dynamic 3D assets in AR/VR has recently popularized Deformable Gaussian Splatting. However, traditional RGB cameras are limited in their ability to reconstruct high-speed scenes due to motion blur and low temporal resolution. While event cameras offer a promising alternative, reconstructing a complete scene from their sparse and noisy output is a significant challenge. Existing event-based methods rely on an auxiliary sensor, such as a frame camera, thereby inducing tedious hardware and calibration challenges.We introduce FastEventDGS, a novel Deformable Gaussian Splatting-based framework that leverages a single event camera for high-fidelity 4D reconstruction. Our method utilizes a continuous camera trajectory parametrization and integrates two event generation models to provide both photometric and geometric constraints. We further propose a local patch event motion loss
94	core_reconstruction high	FastGS: Training 3D Gaussian Splatting in 100 Seconds 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.29× training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45× acceleration
95	core_reconstruction high	FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present FHAvatar, a novel framework for reconstructing 3D Gaussian avatars with composable face and hair components from an arbitrary number of views. Unlike previous approaches that couple facial and hair representations within a unified modeling process, we explicitly decouples two components in texture space by representing the face with planar Gaussians and the hair with strand-based Gaussians. To overcome the limitations of existing methods that rely on dense multi-view captures or costly per-identity optimization, we propose an aggregated transformer backbone to learn geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling effective and efficient feature extraction and fusion from few casual captures. Extensive quantitative and qualitative experiments demonstrate that FHAvatar achieves state-of-the-art reconstruction quality from on
96	core_reconstruction high	FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchor to aggregate flexible input-number-agnostic, camera-pose-free and expression-free inputs into a robust canonical 3D representation.For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which can produce detailed expression-dependent deformations in real time. To better capture rare but critical expressions like wrinkles and bared teeth, we also adopt a data distribution adjustment strategy during training to balance the distribution of these expressions in the training set.Moreover, a lightweight 10-second refinemen
97	core_reconstruction high	FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Real objects inhabit a physical world and must behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. We consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object function, beyond visual cues? We propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to
98	core_reconstruction high	FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract The increasing need for augmented reality and robotics is urging for articulated object reconstruction with high scalability. However, the existing settings of reconstructing from discrete articulation states or casual monocular video need non-trivial axes alignment or suffer from insufficient coverage, limiting the applications. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under free-moving scenario, a new setting with a simpler setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, free-moving part segmentation discovers rigid parts from relative motion in unconstrained capture. The joint estimation module proposes a noise-resistant approa
99	core_reconstruction high	From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors.Given uncalibrated and unposed multi-view images, NAS3R reconstructs 3D Gaussian primitives from context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision.To ensure stable convergence, NAS3R integrates scene reconstruction and camera estimation within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization.The framework is compatible with state-of-the-art architectures and can incorporate pretrained priors or intrinsic information when available.Extensive experiments show that NAS3R achieves superior result
100	core_reconstruction medium	From Rays to Projections: Better Inputs for Feed-Forward View Synthesis 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting, but title/abstract signal is narrower	abstract Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity
101	core_reconstruction high	FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstruction from free sparse-view images often leads to poor surface due to limited overlap and overfitting.We introduce FSFSplatter for $\textbf{f}$ast geometrically accurate reconstruction from $\textbf{f}$ree $\textbf{s}$parse-view images. Our method integrates end-to-end dense Gaussian scene initialization and geometry-enhanced scene optimization.Specifically, FSFSplatter employs a large transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a batch based self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting by leveraging depth and multi-view f
102	core_reconstruction high	GaussFusion: Improving 3D Reconstruction in the Wild with Geometry-Informed Video Generator 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generaliz
103	core_reconstruction high	Gaussian Mapping for Evolving Scenes 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Mapping systems with novel view synthesis (NVS) capabilities are widely used in computer vision, as well as in various applications, including augmented reality, robotics, and autonomous driving. Most notably, 3D Gaussian Splatting-based systems show high NVS performance; however, many current approaches are limited to static scenes. While recent works have begun addressing short-term dynamics (motion within the camera's view), long-term dynamics (the scene evolving through changes out of view) remain less explored.To overcome this limitation, we introduce a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest changes. In addition, since maintaining geometric and semantic consistency remains challenging due to stale observations disrupting the reconstruction process, we propose a novel keyframe management mechanism that discards outdate
104	core_reconstruction high	GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs.To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution.To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks\&Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and r
105	core_reconstruction high	GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to distill strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are distilled into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-
106	core_reconstruction high	Geometric-Photometric Event-based 3D Gaussian Ray Tracing 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Event cameras offer a high temporal resolution over traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in the event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves the state-of-the-art performance on the real-world datasets and competitive performance on the synthetic datasets. Also, the proposed method works without prior information (e.g., pretrained image reconstruction mod
107	core_reconstruction high	GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian splatting (3DGS) has emerged as a promising approach for high-fidelity 3D scene representation. However, relighting and composition of Gaussian splatting remain challenging because path tracing is not directly applicable. Existing relighting methods for Gaussian splatting typically adopt either approximate rendering formulations or rely on Gaussian ray tracing, yielding low relighting performance and low rendering efficiency. To address these limitations, we propose Gaussian hybrid path tracing (GHPT), a three-stage framework to acquire relightable Gaussian splatting models. The first stage utilizes planar-based Gaussian splatting reconstruction representation (PGSR) to enable multi-view consistent depth rendering and reconstruct the surface mesh of a scene. The second stage performs physically-based differentiable rendering on the obtained mesh to reconstruct the material ma
108	core_reconstruction high	GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipeline: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine current 3D scene using rendering evidence, achieve favorable balance between efficie
109	core_reconstruction high	GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; generation_editing; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent advances in Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have made it standard practice to reconstruct 3D scenes from multi-view images. Removing objects from such 3D representations is a fundamental editing task that requires complete and seamless inpainting of occluded regions, ensuring consistency in geometry and appearance. Although existing methods have made notable progress in improving inpainting consistency, they often neglect global lighting effects, leading to physically implausible results. Moreover, these methods struggle with view-dependent non-Lambertian surfaces, where appearance varies across viewpoints, leading to unreliable inpainting. In this paper, we present 3D Gaussian Object Removal in the Intrinsic Space (GOR-IS), a novel framework for physically consistent and visually coherent 3D object removal. Our approach decompose
110	core_reconstruction high	GP-4DGS: Probabilistic Analysis of 4D Gaussian Splattings for Monocular Video Reconstruction via Variational Gaussian Processes 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present GP-4DGS, a probabilistic framework for monocular video reconstruction that models the motion of 4D Gaussian Splatting (GS) primitives using variational Gaussian Processes (GPs). In contrast to prior approaches that depend on manually designed motion priors, our kernel-based probabilistic formulation enables flexible, data-adaptive motion modeling while implicitly providing appropriate priors for unobserved regions. GP-4DGS employs variational GPs with spatial kernels to capture geometric correlations and periodic kernels to characterize temporal dynamics, achieving efficient scalability to large sets of primitives compared to standard GPs. To train GP-4DGS, we introduce an optimization strategy that jointly optimizes GS primitive parameters as well as GP hyperparameters, establishing a complementary relationship between probabilistic and geometric modeling. Beyond improved rec
111	core_reconstruction high	HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content-- artifacts inconsistent with the input views -- into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning
112	core_reconstruction high	Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recent advances in novel view synthesis have enabled differentiable rendering methods to reconstruct 3D scenes directly from images. Algorithms such as 3D Gaussian Splatting and RayGauss use local basis functions to represent radiance fields, enabling fast, high-quality rendering of real-world scenes. However, these methods lack an exact geometric representation of the scene. In this work, inspired by Hermite Radial Basis Function (HRBF) implicits, we introduce a global implicit function constructed from local RBFs and their derivatives to represent surfaces. The proposed formulation enables learning scene geometry through differentiable rendering of an implicit function. By leveraging local basis functions, it achieves both an efficient geometric representation and fast rendering, using a bounding volume hierarchy (BVH) to accelerate intersections with the local basis functions. The imp
113	core_reconstruction high	HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions—characterized by globally sparse coverage, blurred background, and distorted high-frequency areas.To address this, we propose HeroGS—Hierarchical Guidance for Robust 3D Gaussian Splatting—a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature leve
114	core_reconstruction high	IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Generalizable 3D Gaussian Splatting aims to directly predict Gaussian parameters using a feed-forward network for scene reconstruction. Among these parameters, Gaussian means are particularly difficult to predict, so depth is usually estimated first and then unprojected to obtain the Gaussian sphere centers. Existing methods typically rely solely on a single warp to estimate depth probability, which hinders their ability to fully leverage cross-view geometric cues, resulting in unstable and coarse depth maps. To address this limitation, we propose IDESplat, which iteratively applies warp operations to boost depth probability estimation for accurate Gaussian mean prediction. First, to eliminate the inherent instability of a single warp, we introduce a Depth Probability Boosting Unit (DPBU) that integrates multi-level epipolar attention maps in a multiplicative manner. Next, we construct a
115	core_reconstruction high	Illumination-Consistent Human-Scene Reconstruction from Monocular Video 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing 3D humans and scenes from monocular videos is a challenging task, particularly due to human motion, varying illumination, and dynamic scene shadows. While recent works have explored scene disentanglement by jointly modeling humans and their surrounding scenes, they often overlook illumination and shadow effects—resulting in inconsistent human appearance and degraded scene realism. To address this gap, we propose a photometrically consistent integration of human and scene reconstruction based on 3D Gaussian Splatting, with a key focus on modeling spatially-varying illumination and shadows. Central to our method is a learnable light volume that provides localized lighting cues to human Gaussians, enabling more realistic and consistent appearance synthesis. To further ensure accurate human geometry and alignment, we adopt a two-stage reconstruction strategy: we first optimize
116	core_reconstruction high	iLRM: An Iterative Large 3D Reconstruction Model 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input images to enabl
117	core_reconstruction high	iSplat: Iterative Learning for Fine-Grained Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent advances in feed-forward 3D Gaussian splatting have demonstrated remarkable efficiency by reconstructing scenes in a single pass. However, the reconstruction fidelity of these methods lags behind that of traditional optimization-based approaches, which gradually correct reconstruction flaws through a lengthy iterative process. In this paper, we leverage the strengths of both paradigms and introduce iSplat, a novel framework that reformulates reconstruction as an iterative feed-forward process involving multiple (typically three) passes.Central to iSplat is a recurrent GRU-based optimizer that refines both geometry and appearance in a synergistic loop. To address geometric inaccuracies, we propose an uncertainty-driven depth refinement strategy that progressively narrows the search space for each Gaussian based on its estimated uncertainty from the previous step. To further improve
118	core_reconstruction high	Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We address the challenge of reconstructing long dynamic scenes from multi-view videos in a storage-efficient manner. Recent advances in Gaussian Splatting and its extensions to dynamic scenes have demonstrated impressive visual quality, but remain limited to short duration (<10 s), large storage size (>500 MB), and high GPU VRAM usage.To overcome these limitations, we introduce Layered 4D-Rotor Gaussian Splatting (L4DRotorGS), a novel compressed representation designed for long dynamic scenes. Our approach integrates a layered 4D representation, efficient training, and effective compression into a unified framework. Specifically, 4D Gaussians are first organized into layers based on their temporal extents and then partitioned into discrete temporal buckets. This structure allows for selective access and rendering of only the necessary subsets of 4D Gaussians, substantially reducing GPU m
119	core_reconstruction high	Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mapping	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics.We introduce $\textbf{\textit{UniSplat}}$, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a $\textit{dual-masking strategy}$ that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural informatio
120	core_reconstruction high	Learning Compact 3D Representations from Feed-Forward Novel View Synthesis 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing and understanding 3D scenes from sparse views in a feed-forward manner remains challenging. While recent approaches use per-pixel 3D Gaussian Splatting for reconstruction and 2D-to-3D feature lifting for scene understanding, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation. We propose a feed-forward framework that estimates compact Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns to efficiently lift features. Extensive experiments on 3D open-vocabulary segmentation and view-invariant feature generation demon
121	core_reconstruction high	LumiMotion: Improving Gaussian Relighting with Scene Dynamics 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract In 3D reconstruction, the problem of inverse rendering, namely recovering the illumination of the scene and the material properties, is fundamental. Existing Gaussian Splatting-based methods primarily target static scenes and often assume simplified or moderate lighting to avoid entangling shadows with surface appearance. This limits their ability to accurately separate lighting effects from material properties, particularly in real-world conditions. We address this limitation by leveraging dynamic elements - regions of the scene that undergo motion - as a supervisory signal for inverse rendering. Motion reveals the same surfaces under varying lighting conditions, providing stronger cues for disentangling material and illumination. This thesis is supported by our experimental results which show we improve LPIPS by 23\% for albedo estimation and by 15% for scene relighting relative to nex
122	core_reconstruction medium	LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing; data_benchmark	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and
123	core_reconstruction high	MAPo : Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting, known for enabling high-quality static scene reconstruction with fast rendering, is increasingly being applied to multi-view dynamic scene reconstruction. A common strategy involves learning a deformation field to model the temporal changes of a canonical set of 3D Gaussians. However, these deformation-based methods often produce blurred renderings and lose fine motion details in highly dynamic regions due to the inherent limitations of a single, unified model in representing diverse motion patterns. To address these challenges, we introduce Motion-Aware Partitioning of Deformable 3D Gaussian Splatting (MAPo), a novel framework for high-fidelity dynamic scene reconstruction. Its core is a dynamic score-based partitioning strategy that distinguishes between high- and low-dynamic 3D Gaussians. For high-dynamic 3D Gaussians, we recursively partition them temporally an
124	core_reconstruction high	Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object’s complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object’s overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We
125	core_reconstruction high	MeshSplatting: Differentiable Rendering with Opaque Meshes 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Primitive-based splatting methods like 3D Gaussian Splatting (3DGS) have revolutionized novel view synthesis with real-time rendering.However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering.By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, MeshSplatting creates end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines.On Mip-NeRF360, it boosts PSNR by +0.69 dB over the current state-of-the-art MiLo for mesh-based novel view synthesis, while training 2x faster and using 2x less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction.
126	core_reconstruction high	MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation as the core foundation, serving as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that
127	core_reconstruction high	Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs. free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-base
128	core_reconstruction high	MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectioanl Blending with Hierarchical Densification 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes.However, one of the major remaining challenges lies in modeling long-range motion-contained dynamic videos, where a naïve extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes.Our method progressively constructs locally canonical anchor spaces at key-frame time index and models inter-frame deformations at the anchor level, enhancing temporal c
129	core_reconstruction high	MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that treats Gaussian movement as a core modeling object. Specifically, we efficiently leverage optical flow on a sparse set of key views as a lightweight motion cue to guide per-Gaussi
130	core_reconstruction high	MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting.Monocular reconstruction is inherently ill-posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage.Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings.To enable compact representations, fast training, and real-time rendering
131	core_reconstruction high	MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Realistic reconstruction of dynamic 4D scenes is essential for understanding the physical world.Despite recent progress in monocular view synthesis, existing methods still struggle to recover accurate 3D geometry and temporally consistent motion in complex environments.To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences, enabling faithful reconstruction of high-fidelity scene structures and coherent motion representation under complex dynamics.To handle motion with arbitrary flexibility and long-term variation, we introduce a scalable motion field built upon cluster-based bases that adaptively grow to capture diverse motion patterns over time.Moreover, we introduce a progressive optimization strategy that extends naturally to unseen frames. This strategy comprises two propagation modules: 1)
132	core_reconstruction high	MSCD-GS: Motion-Separated Cooperative Deblurring Dynamic Reconstruction via Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Although 4D reconstruction based on Gaussian Splatting has achieved many impressive results, reconstructing real-world images captured by a casual monocular camera remains a significant challenge. In dynamic scenes, as the camera and objects move during the exposure time, these input images inevitably contain a considerable amount of motion blur, which severely compromises the quality of reconstruction and new viewpoint synthesis. The existing deblurring 3D Gaussian models still cannot handle motion blur issues in real dynamic scenes. To address these challenges, we propose MSCD-GS—a novel method for motion-separated collaborative deblurring 4D reconstruction via Gaussian Splatting, capable of effectively handling motion-blurred inputs. Specifically, due to the distinct motion characteristics of static and dynamic Gaussians, we perform separate motion modeling to achieve dynamic scene re
133	core_reconstruction high	MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Generalizable Neural Radiance Fields (GeNeRF) enable high-quality scene reconstruction from a limited number of views and can generalize to unseen scenes. However, in real-world environments, transient distractors disrupt structural consistency across views, leading to deviated supervision signals and degraded reconstruction quality. Existing distractor-free NeRF methods rely on per-scene optimization and they estimate uncertainty from per-view reconstruction errors to remove distractors, but this is unreliable to GeNeRF, because it may misjudge inconsistent static structures from source views as distractors. To address this issue, we propose MUGeNeRF: a multi-view uncertainty-guided distractor-aware GeNeRF method, aim to effectively alleviate GeNeRF's robust modeling challenges in dynamic scenes with transient distractions. We explicitly decompose distractor awareness into two complemen
134	core_reconstruction high	Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent years have witnessed the rapid emergence of 3D Gaussian Splatting (3DGS) as a powerful approach for 3D reconstruction and novel view synthesis. Its explicit representation with Gaussian primitives enables fast training, real-time rendering, and convenient post-processing such as editing and surface reconstruction. However, 3DGS suffers from a critical drawback: the number of primitives grows drastically for scenes with high-frequency appearance details, since each primitive can represent only a single color, requiring multiple primitives for every sharp color transition.To overcome this limitation, we propose Neural Gabor splatting, which augments each Gaussian primitive with a lightweight multi-layer perceptron (MLP) that models a wide range of color variations within a single primitive. To further control primitive numbers, we introduce a frequency-aware densification strategy t
135	core_reconstruction high	No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: getting aligned RGB-X data, where most RGB-X prior work assumes such pairs exist and focuses on modality fusion, but it empirically requires huge engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate them in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for X-sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning by a scalable solution that breaks through the bottlene
136	core_reconstruction high	Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, ``Off The Grid" distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine de
137	core_reconstruction high	P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency.To address this limitation,
138	core_reconstruction high	PackUV: Packed Gaussian UV Maps for 4D Volumetric Video 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; generation_editing; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlas, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions,
139	core_reconstruction high	Part$^{2}$GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Articulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part$^{2}$GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part$^{2}$GS leverages a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometry. To ensure physically consistent motion, we propose a motion-aware canonical representation guided by physics-based constraints, including contact enforcement, velocity consistency, and vector-field alignment. Furthermore, we introduce a field of repel points to prevent part collisions and maintain stable articulation paths, significantly improving
140	core_reconstruction high	ParticleGS: Learning Neural Gaussian Particle Dynamics from Videos for Prior-free Physical Motion Extrapolation 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract The ability to extrapolate dynamic 3D scenes beyond the observed timeframe is fundamental to advancing physical world understanding and predictive modeling. Existing dynamic 3D reconstruction methods have achieved high-fidelity rendering of temporal interpolation, but typically lack physical consistency in predicting the future. To overcome this issue, we propose ParticleGS, a physics-based framework that reformulates dynamic 3D scenes as physically grounded systems. ParticleGS comprises three key components: 1) an encoder that decomposes the scene into static properties and initial dynamic physical fields; 2) an evolver based on Neural Ordinary Differential Equations (Neural ODEs) that learns continuous-time dynamics for motion extrapolation; and 3) a decoder that reconstructs 3D Gaussians from evolved particle states for rendering. Through this design, ParticleGS integrates physical re
141	core_reconstruction high	PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing - two tasks that are inherently challenging due to ambiguity in plausible explanations for the same input. At the heart of our approach lies our novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely-adopted low-level losses like LPIPS, SSIM or L1, we rely on deep visual understanding of images and the resulting generalized supervision signals. We show that our new loss can be a drop-in replacement for standard losses and used to improve visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view-consistency and in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in
142	core_reconstruction high	PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Understanding physical properties such as friction, stiffness, hardness, and material composition is essential for enabling robots to interact safely and effectively with their surroundings. However, existing 3D reconstruction methods focus on geometry and appearance and cannot infer these underlying physical properties. We present PhysGS, a Bayesian-inferred extension of 3D Gaussian Splatting that estimates dense, per-point physical properties from visual cues and vision--language priors. We formulate property estimation as Bayesian inference over Gaussian splats, where material and property beliefs are iteratively refined as new observations arrive. PhysGS also models aleatoric and epistemic uncertainties, enabling uncertainty-aware object and scene interpretation. Across object-scale (ABO-500), indoor, and outdoor real-world datasets, PhysGS improves accuracy of the mass estimation by
143	core_reconstruction high	Physically Inspired Gaussian Splatting for HDR Novel View Synthesis 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract High dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails in correcting abnormal HDR values, and results in limited gradients for Gaussians in under/over-exposed regions. To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision fo
144	core_reconstruction high	Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction by achieving high-quality novel view synthesis with fast rendering speed, introducing 3D Gaussian primitives to represent the scene. However, 3DGS encounters blurring and floaters when applied to complex scenes, caused by the reconstruction of redundant and ambiguous geometric structures. We attribute this issue to the unstable optimization of the Gaussians. To address this limitation, we present a plug-and-play PDE-based optimization method that overcomes the optimization constraints of 3DGS-based approaches in various tasks, such as novel view synthesis and surface reconstruction. Firstly, we theoretically derive that the 3DGS optimization procedure can be modeled as a PDE, and introduce a viscous term to ensure stable optimization. Secondly, we use the Material Point Method (MPM) to obtain a stable numerical
145	core_reconstruction medium	PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting 3D Vision & Geometry / Point Cloud	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mapping	direct reconstruction/3DGS/4D title linked to core representation cluster	abstract Unsupervised point cloud segmentation is critical for embodied intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. Integrating 2D pre-trained models such as SAM to supplement semantic information is a natural choice, yet this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations and achieve semantic-consistent segmentation, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds
146	core_reconstruction high	PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Polarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but face challenges in slow training, inefficient rendering, and strong dependencies on material/viewpoint assumptions. However, 3D Gaussian Splatting (3DGS) enables real-time rendering yet struggles with accurate reflection reconstruction from reflection-geometry entanglement, adding a deferred reflection module introduces environment map dependence. We address these limitations by proposing PolarGuide-GSDR, a polarization-forward-guided paradigm establishing a bidirectional coupling mechanism between polarization and 3DGS: first 3DGS’s geometric priors are leveraged to resolve polarization ambiguity, and then the refined polarization information cues are used to guide 3DGS’s normal and spherical harmonic representation. This process achieves high-fidelity reflection separation an
147	core_reconstruction high	Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Omnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse points priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D–3D correspondences between the reconstructed Gaussians and the unposed images using Gaussians' internal depth priors. Besides, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photoreali
148	core_reconstruction medium	PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting 3D Vision & Geometry / Pose Estimation	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract 6D pose estimation is a key technology in computer vision and robotic manipulation. However, many methods remain heavily dependent on CAD models that are difficult to obtain. Object-level 3D reconstruction provides an alternative route, and 3D Gaussian Splatting (3DGS) shows convincing potential owing to its training and rendering efficiency. Nevertheless, under sparse reference views, 3DGS is prone to floating artifacts and appearance overfitting, which weakens the stability of pose estimation. We present PoseGaussian, a method for sparse-view 6D pose estimation for unseen object that builds on improved 3DGS. First, we use sparse RGB-D views to inject a depth structure prior into the 3DGS initialization for stable structure, and we adopt adaptive density control, view-warping augmentation, and joint photometric–depth supervision to reduce floaters and appearance overfitting under sparse
149	core_reconstruction high	Radiance Meshes for Volumetric Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We introduce Radiance Meshes for representing radiance fields with constant density tetrahedral cells produced with a Delaunay tetrahedralization.Unlike a Voronoi diagram, a Delaunay tetrahedralization yields simple triangles that are natively supported by existing hardware. As such, our model is able to perform exact and fast volume rendering using both rasterization and ray-tracing. We introduce a new rasterization method that achieve faster rendering speeds than all prior radiance field representations (assuming an equivalent number of primitives and resolution) across a variety of platforms.Optimizing the positions of Delaunay vertices introduces topological discontinuities (edge flips). To solve this, we use a Zip-NeRF-style backbone which allows us to express a smoothly varying field even when the topology changes.Our rendering method exactly evaluates the volume rendering equation
150	core_reconstruction high	REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consisten
151	core_reconstruction high	RelightAnyone: A Generalized Relightable 3D Gaussian Head Model 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; robotics_mapping; generation_editing; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has become a standard approach to reconstruct and render photorealistic 3D head avatars. A major challenge is to relight the avatars to match any scene illumination. For high quality relighting, existing methods require subjects to be captured under complex time-multiplexed illumination, such as one-light-at-a-time (OLAT). We propose a new generalized relightable 3D Gaussian head model that can relight any subject observed in a single- or multi-view images without requiring OLAT data for that subject. Our core idea is to learn a mapping from flat-lit 3DGS avatars to corresponding relightable Gaussian parameters for that avatar. Our model consists of two stages: a first stage that models flat-lit 3DGS avatars without OLAT lighting, and a second stage that learns the mapping to physically-based reflectance parameters for high-quality relighting. This two-stage
152	core_reconstruction high	ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract High-quality 3D garment reconstruction plays a crucial role in mitigating the sim-to-real gap in applications such as digital avatars, virtual try-on and robotic manipulation. However, existing garment reconstruction methods, typically rely on the unstructured representations, such as 3D Gaussian Splats, which struggle to provide accurate reconstructions of garment topology and sewing structures. As a result, the reconstructed outputs are often unsuitable for high-fidelity physical simulation. We propose \textbf{ReWeaver}, a novel framework for topology-accurate 3D garment and sewing pattern reconstruction from \textit{sparse} multi-view RGB images. Given as few as four input views, ReWeaver predicts seams and panels as well as their connectivities in both the 2D UV space and the 3D space. The reconstructed seams and panels align precisely with the input images, and can be easily convert
153	core_reconstruction high	RF4D:Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Neural fields (NFs) have achieved remarkable success in scene reconstruction and novel view synthesis. However, existing NF approaches that rely on RGB or LiDAR inputs often struggle under adverse weather conditions, limiting their robustness in real-world outdoor environments such as autonomous driving. In contrast, millimeter-wave radar is inherently resilient to environmental variations, yet its integration with NFs remains largely underexplored. Moreover, outdoor driving scenes frequently involve dynamic objects, making spatiotemporal modeling crucial for temporally consistent novel view synthesis. To address these challenges, we present RF4D, a radar-based neural field framework tailored for novel view synthesis in outdoor dynamic scenes. RF4D explicitly incorporates temporal information into its representation, enabling more accurate modeling of object motion. A dedicated \textbf{s
154	core_reconstruction high	RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing people, objects, and their interactions in 3D is a long-standing and fundamental goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a novel three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and
155	core_reconstruction high	S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Explicit 3D representations have already become an essential medium for 3D simulation and understanding.However, the most commonly used point cloud and 3D Gaussian Splatting (3DGS) each suffer from non-photorealistic rendering and significant degradation under sparse inputs.In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs.Specifically, the S2D lifting is two-fold.We first present an efficient one-step diffusion model that lifts sparse point cloud for high-fidelity image artifact fixing.Meanwhile, to reconstruct 3D consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradient for robust model fitting from sparse input views to dense novel views.Extensive experiments show that S2D achieves the best consis
156	core_reconstruction medium	ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing; data_benchmark	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned by specifying inputs such as 3d bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images
157	core_reconstruction high	SDGS: Spatial Difference Guided Gaussian Splatting for Simultaneous Localization and 3D Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit 3D representation, enabling photorealistic and real-time novel view synthesis. However, most 3DGS pipelines still assume precomputed camera poses and offline optimization, which introduces latency and makes them brittle in fast-motion, real-world scenarios. Existing online 3DGS systems mostly fall into two camps: (1) hybrid systems that rely on a separate traditional SLAM system for camera poses and optimize Gaussians decoupled from tracking, increasing system complexity; and (2) purely Gaussian-based systems that estimate poses from dense photometric errors, requiring repeated rendering of a large number of Gaussians and thus incurring high computational cost. Moreover, current online methods are often sensitive to motion blur and high dynamic range scenes, limiting their applicability in practice.We address these
158	core_reconstruction high	Semantic Foam: Unifying Spatial and Semantic Scene Decomposition 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Current generation scene reconstruction methods like 3D Gaussian Splatting are capable of producing photo-realistic novel view synthesis at real-time speeds, yet see only limited adoption in many practical graphics applications.One significant contributing factor to this gap is the difficulty of interacting with and editing these representations in comparison to classic human-authored 3D assets.While work has been done to impose semantic decomposition onto these representations, there are still significant limitations in the quality and consistency of these segmentations.We address this by proposing a semantically decomposed variant of the recently introduced Radiant Foam method.Our approach, Semantic Foam, combines the natural spatial volumetric decomposition provided by Radiant Foam's Voronoi mesh with an explicit semantic feature field parameterized on the cells.The explicit mesh stru
159	core_reconstruction high	SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We presents SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. Unlike existing 3D Gaussian Splatting (3DGS) based methods that focus on object-centric reconstruction and fail to work under sparse view settings, our method allows to achieve high-quality geometry reconstruction and accurate disentanglement of material and illumination. The core idea is to construct a dense and geometry-consistent Gaussian semantic field guided by semantic and geometric priors, providing a reliable foundation for subsequent inverse rendering. Building upon this, we perform material–illumination disentanglement by combining a hybrid illumination model and material prior to effectively capture illumination–material interactions. To mitigate the impact of cast shadows and enhance the robustness of material recovery, we introduce illumination-invariant material constrai
160	core_reconstruction high	SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Novel view synthesis of dynamic scenes is fundamental to achieving photorealistic 4D reconstruction and immersive visual experiences. Recent progress in Gaussian-based representations has significantly improved real-time rendering quality, yet existing methods still struggle to maintain a balance between long-term static and short-term dynamic regions in both representation and optimization. To address this, we present SharpTimeGS, a lifespan-aware 4D Gaussian framework that achieves temporally adaptive modeling of both static and dynamic regions under a unified representation.Specifically, we introduce a learnable lifespan parameter that reformulates temporal visibility from a Gaussian-shaped decay into a flat-top profile, allowing primitives to remain consistently active over their intended duration and avoiding redundant densification. In addition, the learned lifespan modulates each
161	core_reconstruction high	Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms.Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then
162	core_reconstruction high	SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. Besides, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy b
163	core_reconstruction high	SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing dynamic fluids from sparse views is a long-standing and challenging problem, due to the severe lack of 3D information from insufficient view coverage. While several pioneering approaches have attempted to address this issue using differentiable rendering or novel view synthesis, they are often limited by time-consuming optimization under ill-posed conditions. We propose SmokeSVD, an efficient and effective framework to progressively reconstruct dynamic smoke from a single video by integrating the generative capabilities of diffusion models with physically guided consistency optimization. Specifically, we first propose a physically guided side-view synthesizer based on diffusion models, which explicitly incorporates velocity field constraints to generate spatio-temporally consistent side-view images frame by frame, significantly alleviating the ill-posedness of single-view
164	core_reconstruction high	SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive exp
165	core_reconstruction high	Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Neural Radiance Fields (NeRF) have shown remarkable success in image novel view synthesis (NVS), inspiring extensions to LiDAR NVS. However, most methods heavily rely on accurate camera poses for scene reconstruction. The sparsity and textureless nature of LiDAR data also present distinct challenges, leading to geometric holes and discontinuous surfaces. To address these issues, we propose SG-NLF, a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. Specifically, we design a hybrid representation based on spectral priors to reconstruct smooth geometry. For pose optimization, we construct a confidence-aware graph based on feature compatibility to achieve global alignment. In addition, an adversarial learning strategy is introduced to enforce cross-frame consistency, thereby enhancing reconstruction quality. Comprehensive experiments demonstrate
166	core_reconstruction high	Splatent: Splatting Diffusion Latents for Novel View Synthesis 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: The VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D spa
167	core_reconstruction high	SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively app
168	core_reconstruction high	SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D super-resolution (3DSR) aims to reconstruct high-resolution (HR) 3D scenes from low-resolution (LR) multi-view images. Existing methods rely on dense LR inputs and per-scene optimization, which restricts the high-frequency priors for constructing HR 3D Gaussian Splatting (3DGS) to those inherited from pretrained 2D super-resolution (2DSR) models. This severely limits reconstruction fidelity, cross-scene generalization, and real-time usability. We propose to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. This fundamentally changes how 3DSR acquires high-frequency knowledge and enables robust generalization to unseen scenes. Specifically, we introduce \textbf{SR3R}, a feed-forward framework that directly predi
169	core_reconstruction high	STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Reconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gaussians to mesh triangles and model deformations solely via Linear Blend Skinning, which results in rigid motion and limited expressiveness. Moreover, they struggle to reconstruct frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image- and FLAME-based priors to learn per-Gaussian feature offsets within the UV space. This UV representation supports dynamic resampling, ensuring full compatibility with Adaptive Density Control (ADC) and enhanced adaptability to geometric and textural variations. (2) a Temporal ADC strategy, which first clusters
170	core_reconstruction high	SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step.However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object’s motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven
171	core_reconstruction high	Tavatar: Topology-Aware Gaussian Attribute Derivation for Animatable Human Avatars 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Reconstructing high-fidelity, animatable human avatars from monocular videos remains a critical challenge. Existing 3DGS-based human animation methods constrain Gaussian parameters but exclude scale, which we argue is crucial for adapting human poses to challenging out-of-distribution poses. To achieve robust animation under unseen poses, we propose Tavatar, which derives key parameters such as scale, rotation, and other geometric attributes directly from the local mesh geometry, instead of learning them through unconstrained optimization. This paradigm shift enforces topological consistency by design, as each Gaussian is analytically anchored to the local mesh geometry, inheriting its spatial structure and deformation behavior. Specifically, we bind Gaussians to mesh faces and vertices, deriving their scales and orientations from triangle properties and local edge lengths to ensure cohe
172	core_reconstruction high	TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images.At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space.Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians.To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stab
173	core_reconstruction high	Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; generation_editing; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance.We present \emph{Track4DGen}, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that cur
174	core_reconstruction high	tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model’s capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforwar
175	core_reconstruction high	TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrat
176	core_reconstruction high	Uika: Universal Head Avatar from Pose-Free Images 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike the traditional avatar method, which requires a studio-level multi-view capture system and reconstructs a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise UV coordinate estimation. Such UV coordinate estimation allows us to project each valid pixel from screen space to UV space, which is independent of camera pose and character expression. We thus leverage this UV space to represent our Gaussian head avatar. To this end, we design learnable UV tokens on which the attent
177	core_reconstruction high	Unblur-SLAM: Dense Neural SLAM for Blurry Inputs 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We propose Unblur-SLAM, an RGB SLAM pipeline for sharp 3D reconstruction from blurred image inputs. In contrast to previous work, our approach is able to handle different types of blur and demonstrates state-of-the-art performance in the presence of both motion blur and defocus blur. Moreover, we adjust the computation effort with the amount of blur in the input image.As a first stage, our method uses a feed-forward image deblurring model for which we propose a suitable training scheme that can improve both tracking and mapping modules.Frames that are successfully deblurred by the feed-forward network obtain refined poses and depth through local-global multi-view optimization and loop closure. Frames that fail the first stage deblurring are directly modeled through the global 3DGS representation and an additional blur network to model multiple blurred sub-frames and simulate the blur for
178	core_reconstruction high	Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. The resulting visibility field is integrated into a Bayesian Network–based uncertainty-aware volume rendering process, enabling real-time (200 FPS) uncertainty quantification for synthesized views. Active mapping is further performed within a maximum information gain framework building on this formulation.Extensive experiments across diverse environments demonstrate that GAVIS consistently and signifi
179	core_reconstruction high	Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce a novel feed-forward framework that reconstructs 3D scenes from unposed multi-view images. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction—all within a single, feed-forward pass. Extensive experiments demonstrate this method establishes a new state-of-the-art across multiple benchmarks, including RE10K and ScanNet. Our work signifies a novel paradigm towards generalizable 3D scene reconstruction.
180	core_reconstruction high	VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones, incapable of reconstructing geometry from missing structures. To address these limitati
181	core_reconstruction high	VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Simultaneous Localization and Mapping (SLAM) with 3D Gaussian Splatting (3DGS) enables fast, differentiable rendering and high-fidelity reconstruction across diverse real-world scenes. However, existing 3DGS-SLAM approaches handle measurement reliability implicitly, making pose estimation and global alignment susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties. To this end, we introduce VarSplat, an uncertainty-aware 3DGS-SLAM system that explicitly learns per-splat appearance variance. By using the law of total variance with alpha compositing, we then compute corresponding differentiable per-pixel uncertainty map. This variance map guides tracking, submap registration, and loop detection toward focusing on reliable regions and contributes to more stable optimization. Experimental results on Replica (synthetic) and TUM-RGBD, Sca
182	core_reconstruction high	VDFE: Difference-Aware 3D Scene Editing with Non-Intrusive Video Diffusion Priors for Multi-View Consistency and Efficiency 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; generation_editing; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Text-driven 3D editing, enabled by advancements in 3D reconstruction techniques such as NeRF and 3D Gaussian Splatting, aims to provide intuitive scene customization. However, existing methods frequently exhibit limitations in controllability and consistency. To address these shortcomings, we propose \textbf{VDFE}, a difference-aware 3D scene editing method based on non-intrusive utilization of pre-trained video diffusion priors, which integrates Optimal Control Guided Flow Editing (FlowOCE), Decoupled Flow Difference (DFD), and Difference-Aware Gaussians Editing (DAGE). Specifically, FlowOCE treats the editing process as an optimal control problem, optimizing a noise-free editing trajectory to minimize unintended modifications in non-target region; DFD precisely locates editing region by analyzing flow differences, which supplies priors for the subsequent optimization process; and DAGE
183	core_reconstruction high	Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Existing single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow
184	core_reconstruction high	GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels, which are prevalent in everyday environments. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and separates outgoing radiance into reflection and transmission components according to its optical properties, enabling coherent Gaussian radiance transport. During the optimization, GLINT bootstraps transparency localization by utilizing geometry separation cues that emerge from our decomposition with the geometry and material priors from a pre-trained video relighting model. Extensive experime
185	core_reconstruction high	MatSpray: Fusing 2D Material World Knowledge on 3D Geometry 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Manual modeling of material parameters and 3D geometry is a time consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a
186	core_reconstruction high	Multi-view Pyramid Transformer: Look Coarser to See Broader 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of ``looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the un
187	core_reconstruction high	Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) represents scenes through primitives with coupled intrinsic properties: geometric attributes (position, covariance, opacity) and appearance attributes (view-dependent color). Faithful reconstruction requires intrinsic geometry-appearance consistency, where geometry accurately captures 3D structure while appearance reflects photometry. However, sparse observations lead to appearance overfitting and underconstrained geometry, causing severe novel-view artifacts.We present ICO-GS (Intrinsic Geometry-Appearance Consistency Optimization for 3DGS), a principled framework that enforces this consistency through tightly coupled geometric regularization and appearance learning. Our approach first regularizes geometry via feature-based multi-view photometric constraints by employing pixel-wise top-k selection to handle occlusions and edge-aware smoothness to preserve sh
188	core_reconstruction high	RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Human perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry un-modeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG n
189	core_reconstruction medium	Motion-Aware Animatable Gaussian Avatars Deblurring 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; data_benchmark	core genus=3D Gaussian Splatting, but title/abstract signal is narrower	abstract The creation of 3D human avatars from multi-view videos is a significant yet challenging task in computer vision. However, existing techniques rely on high-quality, sharp images as input, which are often impractical to obtain in real-world scenarios due to variations in human motion speed and intensity. This paper introduces a novel method for directly reconstructing sharp 3D human Gaussian avatars from blurry videos. The proposed approach incorporates a 3D-aware, physics-based model of blur formation caused by human motion, together with a 3D human motion model designed to resolve ambiguities in motion-induced blur. This framework enables the joint optimization of the avatar representation and motion parameters from a coarse initialization. Comprehensive benchmarks are established using both a synthetic dataset and a real-world dataset captured with a 360-degree synchronous hybrid-expos
190	core_reconstruction high	PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Multi-view 3D reconstruction methods remain highly sensitive to photometric inconsistencies arising from camera optical characteristics and variations in image signal processing (ISP). Existing mitigation strategies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly to novel views. We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. A dedicated PPISP controller, trained on the input views, predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This design enables realistic and fair evaluation on novel views without access to ground-truth images. PPISP achieves SoTA performance on standard benchmarks, while providing intuitive control a
191	core_reconstruction medium	High-Fidelity Mobile Avatars with Pruned Local Blendshapes 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract We propose a method to reconstruct high-fidelity human avatars from multi‑view video that can run on mobile devices. Many works can model high‑quality Gaussian-based full-body avatars from multi‑view video. However, these methods require heavy computation to obtain pose‑dependent appearance, making deployment on mobile devices very difficult. Recent methods distill from pretrained models and model pose‑dependent nonlinear Gaussian attributes by linearly combining global pose features with blendshapes. Although they can run on mobile devices, they suffer some loss of detail. We observe that nearby Gaussians are often highly correlated within a local region of the body, and can be linearly modeled with less error. Therefore, we use local linear blendshapes in small body parts to capture global nonlinear changes of Gaussian attributes. To further reduce computation and model size, we propos
192	core_reconstruction medium	Learning Convex Decomposition via Feature Fields 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract This work proposes a new formulation to the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world learning of convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications.The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity.Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets resulting in the first learned open-world for convex decompositio
193	core_reconstruction high	EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per-view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT) where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF) stage where our model learns rank primitives and adaptively adjust their param
194	core_reconstruction high	More Natural, More Real: Object-aware Gaussian Splatting for 3D Visual Decoding from Human Brain 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Exploring human visual perception and understanding of the stereoscopic world represents a significant topic in computational neuroscience. Recent studies have provided rich Brain-3D datasets, conducted preliminary explorations into 3D visual reconstruction. However, existing research struggles to capture the differences in dynamic changes of 3D stimulus views, and there remains room for improvement in high-fidelity reconstruction and rendering. 3D Gaussian Splatting (3DGS) has recently achieved significant progress in stereoscopic view synthesis. Inspired by it, we propose BrainGS -- an innovative framework for decoding more realistic 3D objects from the brain. BrainGS incorporates a Fusion Time-Spatial Network to achieve comprehensive encoding of the brain, combined with the Multi-Attribute Controller (MAC), it decouples features using visual, semantic, and color as anchors, effectivel
195	core_reconstruction high	CoRoGS: Contextual Gaussian Splatting for Robust Large-Deviation View Synthesis 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Novel view synthesis (NVS) under large view deviations remains an underexplored challenge for 3D Gaussian Splatting (3DGS). In urban scenes with limited training coverage, models often fail to maintain geometric consistency when extrapolating to unseen viewpoints, resulting in severe distortions and degraded rendering quality. We introduce Context-Aware Gaussian Splatting (CoRoGS), a $\textbf{Co}$ntext-aware framework for $\textbf{Ro}$bust large-deviation novel view synthesis (LD-NVS) that embeds contextual reasoning into 3DGS. Instead of treating Gaussians as independent primitives, CoRoGS adopts a contextual formulation that explicitly models inter-Gaussian dependencies. This representation is implemented by constructing a 3D Gaussian graph, which propagates relational geometry and semantics via message passing, resulting in context-aware Gaussian updates. To further maintain structura
196	core_reconstruction high	DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) Black borders at image edges cause information loss and negate the fisheye’s large FOV advantage; 2) Undistortion’s stretch‐and‐interpolate resampling spreads each pixel’s value over a larger area, diluting detail density— causes 3DGS overfitting these low‐frequency zones, producing blur and floating artifacts.In this work, we integrate fisheye camera model into the original 3DGS framework, enab
197	core_reconstruction high	Evidential Neural Radiance Fields 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare mult
198	core_reconstruction high	LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Novel View Synthesis has often relied on explicit 3D representations, which inject a strong 3D bias in the process; however, recent work has shown that network-based rendering can work better despite lacking 3D inductive biases. In this paper, we show that much better quality can be obtained by leveraging a strong 3D bias without a 3D representation. To do so, we introduce LagerNVS, an encoder-decoder network that uses 3D-aware features as a latent scene encoding. The encoder is initialized from a 3D reconstruction network, paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis results (including 31.1 PSNR on Re10k), with and without known cameras, renders in real-time, generalizes to in-the-wild data without known cameras, and can be paired with a diffusion decoder for generati
199	core_reconstruction high	Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we in this work go one step further beyond previous methods to explicitly model continuous position and orientation deformation of dynamic Gaussians, using an SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. Besides, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code and trained model will be made pub
200	core_reconstruction high	NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present NimbusGS, a unified framework for reconstructing high-quality 3D scenes from degraded multi-view inputs captured under diverse and mixed adverse weather conditions. Unlike existing methods that target specific weather types, NimbusGS addresses the broader challenge of generalization by modeling the dual nature of weather: a continuous, view-consistent medium that attenuates light, and dynamic, view-dependent particles that cause scattering and occlusion. To capture this structure, we decompose degradations into a global transmission field and per-view particulate residuals. The transmission field represents static atmospheric effects shared across views, while the residuals model transient disturbances unique to each input. To enable stable geometry learning under severe visibility degradation, we introduce a geometry-guided gradient scaling mechanism that mitigates gradient i
201	core_reconstruction high	RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has emerged as a leading technology for high-quality 3D scene reconstruction. However, the iterative refinement and densification process leads to the generation of a large number of primitives, each contributing to the reconstruction to a substantially different extent. Estimating primitive importance is thus crucial, both for removing redundancy during reconstruction and for enabling efficient compression and transmission.Existing methods typically rely on rendering-based analyses, where each primitive is evaluated through its contribution across multiple camera viewpoints. However, such methods are 1) sensitive to the number and selection of views; 2) rely on specialized differentiable rasterizers; and 3) have long calculation times that grow linearly with view count, making them difficult to integrate as plug-and-play modules, as well as resulting in limi
202	core_reconstruction high	ReLaGS: Relational Language Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language-derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated a
203	core_reconstruction high	ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract The ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that, the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby achieving results that outperform the performance of optimization-based
204	core_reconstruction high	PromptDepth: Efficient and Promptable Geometric 3D Vision Model \\ for Embodied Intelligence 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Vision models for embodied intelligence require efficient 3D comprehension and interaction with objects within the scene. Existing 3D reconstruction models either overlook instance-level perception or rely on time-consuming offline reasoning, showing a less adaptability in real-time embodied scenario. In this paper, we present PromptDepth, the first promptable vision model that features both geometric 3D understanding and instance-level interaction especially designed for embodied intelligence. PromptDepth is a feed-forward network that quickly yields panoptic, instanced, or tracked depth map from two corresponding frames, enabling the real-time infer sequences from embodied agents. Specifically, following the minimal prediction problem, we design a promptable Dense Prediction Transformer, making it flexible to interact with unified dense prediction according to a specific prompt. Consid
205	core_reconstruction medium	Representing 3D Faces with Learnable B-Spline Volumes 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract We present CUBE (Control-based Unified B-Splinie Encoding), a new geometric representation for digital humans that combines B-Spline volumes with learned features, and demonstrate its use as decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-Spline representations that use 3D control points, CUBE is parametrized by a lattice (e.g., $8 \times 8 \times 8$) of high-dimensional control features, increasing the models' expressivity. These control features define a continuous mapping from a 3D parametric domain to 3D Euclidean space through an intermediate feature space, which is evaluated in two stages. First, high-dimensional control features are locally blended using the B-Spline bases, yielding a high-dimensional feature vector, where the first three values are the 3D coordinates of a coarse base mesh. This feature vector is input to a small MLP to pre
206	core_reconstruction medium	SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings 3D Vision & Geometry / Pose Estimation	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustm
207	core_reconstruction high	SMVRT: Implicit Human 3D Modeling Using Sparse Multi-view Volumetric Reconstruction with Transformer Fusion 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recently, the community has witnessed significant progress in human modeling from a single view or multi-views, which often involves "guessing" the occluded parts using either generative models or template fitting. In this work, we address these challenges by exploring optimal fusion strategies from sparse views only. We propose an end-to-end implicit 3D reconstruction framework using a sparse multi-view setup. Specifically, we achieve this by exploring fusion blocks at three stages of the network. First, 2D feature encoders carrying out locally and globally, which produce enhanced features. Second, 3D feature grid, formed by attentional fusion of warped multi-view and multi-level 2D features, which follows 3D regularization of feature grids to aggregate spatially coherent multi-view features. Third, attentional 2D3D feature aggregation associated to query point generate enhanced latent
208	core_reconstruction high	OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	B. bridge: reconstruction becomes mapping/world model	gaussian_radiance; depth_correspondence; robotics_mapping; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build local consistency map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within sliding window into complete instances. Subsequently, to update the global map, we construct explicit spatial attribute grids for the
209	core_reconstruction high	Reconstructing Functional 3D Scenes from Egocentric Interaction Videos 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; surface_occupancy; robotics_mapping	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present FunREC, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunREC operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunREC surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5$-$10$\times$ lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation
210	core_reconstruction high	X-Part: High Fidelity And Structure Coherent Shape Decomposition And Completion 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; surface_occupancy; robotics_mapping	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Generating 3D shapes at part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and suffer from poor semantically meaningful decomposition. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part exploits the bounding box as prompts for the part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structural
211	core_reconstruction high	AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM 3D Vision & Geometry / 3D Gaussian Splatting	B. bridge: representation meets metric pose	gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian splatting (3DGS) has emerged as a revolutionary scene representation in simultaneous localization and mapping (SLAM) research. However, existing research on 3DGS-based SLAM fails to accurately address the appearance variations induced by camera auto-exposure in prevalent real-world scenarios, resulting in reduced localization and photorealistic mapping accuracy. To address this issue, we propose a stereo auto-exposure-robust Gaussian splatting SLAM (AERGS-SLAM), a framework robust to such variations and enables both reliable localization and exposure-controlled photorealistic mapping. Our key contributions are two fold. Firstly, we propose a camera exposure network to model the camera exposure process, which we integrate with Gaussian splatting to achieve exposure-controlled novel view synthesis. Secondly, we exploit an illumination-robust geometric feature for localization a
212	core_reconstruction high	Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM 3D Vision & Geometry / 3D Gaussian Splatting	B. bridge: representation meets metric pose	gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mapping	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Handling the dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling their temporal centers at keyframes. These centers are propagated usin
213	core_reconstruction high	ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM 3D Vision & Geometry / 3D Gaussian Splatting	B. bridge: representation meets metric pose	gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract This work presents ODGS-SLAM, an omnidirectional simultaneous localization and mapping (SLAM) system utilizing 3D Gaussian Splatting (3DGS) as the unified representation for tracking and mapping.Thus, it reconstructs scene geometry from panoramic image sequences (RGB or RGBD) via splats while also detecting the camera poses.Such a framework is important to understand the full surrounding, e.g., for augmented reality applications or autonomous systems.We extended existing 3DGS-SLAM methods to handle omnidirectional input by including closed-form gradients for mapping and camera pose estimation, utilizing an equirectangular projection model.To lower memory footprint, a key frame removal procedure based on graph analysis is proposed, enabling the application to handle larger input sizes.For evaluation, we provide a data set of controlled real-world and synthetic test scenes (indoor and ou
214	core_reconstruction high	SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM 3D Vision & Geometry / 3D Gaussian Splatting	B. bridge: representation meets metric pose	gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has made huge progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified for improving scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian function, and then use these points to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity.
215	core_reconstruction medium	ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting 3D Vision & Geometry / Pose Estimation	B. bridge: representation meets metric pose	gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Visual localization is a core technology for augmented reality and autonomous navigation. Recent methods combine the efficient rendering of 3D Gaussian Splatting (3DGS) with feature-based localization. These methods rely on direct matching between 2D query features and the 3D Gaussian feature field, but this often results in mismatches due to an inherent bias in the learned Gaussian feature. We theoretically analyze the feature learning process in 3DGS, revealing that the widely adopted $\alpha$-blending optimization inherently introduces bias into 3D point features. This bias stems from the entanglement between individual Gaussians and their neighboring Gaussians, making the learned features unsuitable for precise matching tasks. Motivated by these findings, we propose ULF-Loc, an unbiased landmark feature framework that replaces biased feature optimization with geometry-weighted featur
216	core_reconstruction high	Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty 3D Vision & Geometry / 3D Gaussian Splatting	B. bridge: representation meets metric pose	gaussian_radiance; pose_calibration_localization; depth_correspondence	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible.To address these uncertainties, we introduce a relocaliza
217	core_reconstruction high	Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	B. bridge: representation meets metric pose	gaussian_radiance; pose_calibration_localization; depth_correspondence	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Visual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera’s pose when it revisits a previously known scene. While point-based hierarchical localization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on
218	core_reconstruction high	GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension 3D Vision & Geometry / 3D Gaussian Splatting	B. bridge: representation meets metric pose	gaussian_radiance; pose_calibration_localization; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract In this paper, we propose GenSplat, a novel approach for language comprehension in 3D Gaussian Splatting (3DGS). Unlike previous methods that either achieve cross-scene generalization by being bounded to a predefined vocabulary or handle free-form language by overfitting to individual scenes, GenSplat is robust to free-form language queries and generalizable across 3DGS scene representations. Our key insight for this problem is to formulate a structured learning process to progressively align linguistic concepts with 3D Gaussians. It contains two novel technical contributions. First, we propose a Progressive Language Grounding Curriculum that structurally guides the model through learning category-level semantics to instance-level concepts and free-form language, preventing overfitting by building a generalizable language feature space. Second, we design a Multi-modal Large Language Mode
219	core_reconstruction high	GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	B. bridge: representation meets metric pose	gaussian_radiance; pose_calibration_localization	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS\textasciicircum2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing
220	core_reconstruction high	3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; pose_calibration_localization; surface_occupancy; generation_editing; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We introduce 3D-Fixer, a novel generalizable and efficient scheme for single-image to compositional 3D scene generation. Unlike existing feed-forward frameworks that lack generalization ability in open-set scenarios due to the limited dataset, or divide-and-conquer frameworks that suffer from slow inference or accumulated registration errors during layout alignment, 3D-Fixer extends pre-trained object-level 3D generation priors to perform in-place completion on the single-view estimated geometry, eliminating the need for pose alignment while preserving feed-forward efficiency. At its core, 3D-Fixer introduces a coarse-to-fine scheme to accurately determine the completion boundary and generate high quality completion 3D asset based on the single-view estimated fragmented geometry. Also, we design a dual-branch conditioning network that integrates 2D and 3D contextual information to guide
221	core_reconstruction high	AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction on a metric-scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation, 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.
222	core_reconstruction high	Cupid: Generative 3D Reconstruction via Joint Object and Pose Modeling 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We introduce Cupid, a generative 3D reconstruction framework that jointly models the full distribution over both canonical objects and camera poses. Our two-stage flow-based model first generates a coarse 3D structure and 2D-3D correspondences to estimate the camera pose robustly. Conditioned on this pose, a refinement stage injects pixel-aligned image features directly into the generative process, marrying the rich prior of a generative model with the geometric fidelity of reconstruction. This strategy achieves exceptional faithfulness, outperforming state-of-the-art reconstruction methods by over 3 dB PSNR and 10\% in Chamfer Distance. As a unified generative model that decouples the object and camera pose, Cupid naturally extends to multi-view and scene-level reconstruction tasks without requiring post-hoc optimization or fine-tuning.
223	core_reconstruction high	Edges Compete for Trust: Group Relative Edge Optimization for Building Reconstruction from Point Clouds 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Building reconstruction aims to extract compact wireframes from point clouds. Recent edge-based methods achieve impressive results but suffer from sparse supervision from one-to-one matching, which leaves most edge proposals under-optimized. In this paper, we present Group Relative Edge Optimization (GREO), the first attempt to incentivize dense supervision across edges proposals through reinforcement learning-style optimization in wireframe reconstruction. Specifically, GREO computes edge-level rewards based on geometric alignment quality and transforms them into target confidence distributions via group-wise normalization. In addition, we incorporate entropy regularization to maintain distributional stability and prevent confidence collapse. This joint optimization enables dense and discriminative supervision across all edge proposals through cross-entropy minimization. Experiments
224	core_reconstruction high	JRM: Joint Reconstruction Model for Multiple Objects without Alignment 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Object-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned
225	core_reconstruction high	Long-Tail Internet Photo Reconstruction 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed, while most real-world sites contain only sparse, noisy, and uneven imagery that defeats classical and learned 3D methods. Existing 3D foundation models generalize well to curated datasets but collapse under the sparsity, ambiguity, and irregularity of Internet photos. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large-scale, clean, and depth-refined dataset, together with a sparse-aware sampling strategy that mimic camera distributions in long-tai
226	core_reconstruction high	ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Jointly optimizing camera poses and object geometry from unposed images is a challenging task in neural surface reconstruction. Existing methods often suffer from pose drift and geometric distortion, stemming from the easy-view bias --- uniform view optimization favors easy-to-optimize views with abundant texture and good overlap that dominate gradient updates, while hard-to-optimize counterparts with weak texture or limited overlap yet critical for geometric completeness are progressively marginalized. To address this, we propose ManifoldNeuS, a novel framework that explicitly models and leverages per-view optimizability to guide pose-free neural surface reconstruction. Specifically, we introduce the manifold-aware view optimizability score (MaVOS), which jointly assesses immediate fitness (the ease of optimizing each view) and long-term coverage gain (the value of optimizing each view)
227	core_reconstruction high	Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract The 3D characterization of microstructures is crucial for understanding and designing functional materials. However, the scanning electron microscope (SEM), widely used in scientific research, captures only 2D electron intensity distributions. Existing SEM 3D reconstruction methods struggle with textureless regions, shadowing artifacts, and calibration dependencies, whereas advanced learning-based approaches fail to generalize to microscopic SEM domains due to the lack of physical priors and domain-specific data. To address these challenges, we introduce NFH-SEM, a neural field-based hybrid reconstruction framework that recovers high-fidelity 3D surfaces from multi-view, multi-detector SEM images. NFH-SEM integrates coarse multi-view geometry with photometric stereo cues from detector signals through a continuous neural field, incorporating a learnable forward model that embeds SEM imagi
228	core_reconstruction high	PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Panoramic imagery offers a full $360^\circ$ field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting.We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth a
229	core_reconstruction medium	Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Recent learning-based face reconstruction and registration frameworks such as ToFu and TEMPEH have shown that dense correspondence between facial scans and a common topology can be learned directly from images. However, these approaches still depend on precomputed registrations obtained through iterative optimization pipelines that often require manual verification and correction by human annotators. We introduce MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a fully differentiable and registration-free alternative. Instead of relying on optimization-based registrations, we employ a pseudo-linear inverse kinematic solver in conjunction with dense 2D keypoints produced by a tracker trained only on synthetic data to directly enforce a common face topology at the vertex level. We further find that the commonly used point-to-surface distance can lead to unstable training
230	core_reconstruction high	TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Topology-consistent dynamic model sequences are essential for applications such as animation and model editing. However, existing 4D reconstruction methods face challenges in generating high-quality topology-consistent meshes. To address this, we propose a topology-aware dynamic reconstruction framework based on Gaussian Splatting. We introduce a Gaussian topological structure that explicitly encodes spatial connectivity. This structure enables topology-aware densification and pruning, preserving the manifold consistency of the Gaussian representation. Temporal regularization terms further ensure topological coherence over time, while differentiable mesh rasterization improves mesh quality. Experimental results demonstrate that our method reconstructs topology-consistent mesh sequences with significantly higher accuracy than existing approaches. Moreover, the resulting meshes enable prec
231	core_reconstruction medium	TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; depth_correspondence; surface_occupancy; generation_editing	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract The dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE's reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the \textit{representation mismatch} between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (\eg, SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topologic
232	core_reconstruction high	UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real
233	core_reconstruction medium	ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; depth_correspondence; surface_occupancy; generation_editing	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Single-image-to-3D shape generation has seen remarkable progress, driven by latent diffusion models trained on the compressed latent space of 3D VAEs. However, the task remains intrinsically ill-posed: recovering complete 3D geometry—especially occluded surfaces—from a single view is inherently ambiguous. Existing VecSet-based approaches further exacerbate this challenge by treating shape tokens as an unordered set without explicit positional encoding. This design forces diffusion models to simultaneously learn visible correspondences from the input image and hallucinate invisible geometry within a large, permutation-invariant token space, where the lack of structure significantly hinders training efficiency and convergence stability.To address this, we propose \textit{Visibility Learning}, a training paradigm that injects visibility structure and positional inductive bias into the image
234	core_reconstruction high	ART: Articulated Reconstruction Transformer 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We introduce ART, Articulated Reconstruction Transformer—a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as a part-based prediction problem. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable to standard simulation formats. Trained o
235	core_reconstruction high	PE3R: Perception-Efficient 3D Reconstruction 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recent advances in 2D-to-3D perception have enabled the recovery of 3D scene semantics from unposed images. However, prevailing methods often suffer from limited generalization, reliance on per-scene optimization, and semantic inconsistencies across viewpoints. To address these limitations, we introduce PE3R, a tuning-free framework for efficient and generalizable 3D semantic reconstruction. By integrating multi-view geometry with 2D semantic priors in a feed-forward pipeline, PE3R achieves zero-shot generalization across diverse scenes and object categories without any scene-specific fine-tuning. Extensive evaluations on open-vocabulary segmentation and multi-view depth estimation show that PE3R not only achieves up to 9$\times$ faster inference but also sets new state-of-the-art accuracy in both semantic and geometric metrics. Our approach paves the way for scalable, language-driven 3D
236	core_reconstruction high	PhyGaP: Physically-Grounded Gaussians with Polarization Cues 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via deferred rendering (DR). However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of shape and material information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex o
237	core_reconstruction high	SASNet: Spatially-Adaptive Sinusoidal Networks for INRs 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; pose_calibration_localization; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Sinusoidal neural networks (SIRENs) are powerful implicit neural representations (INRs) for low-dimensional signals in vision and graphics. By encoding input coordinates with sinusoidal functions, they enable high-frequency image and surface reconstruction. However, training SIRENs is often unstable and highly sensitive to frequency initialization: small frequencies produce overly smooth reconstructions in detailed regions, whereas large ones introduce spurious high-frequency components that manifest as noise in smooth areas such as image backgrounds. To address these challenges, we propose $\textbf{SASNet}$, a $\textit{Spatially-Adaptive Sinusoidal Network}$ that couples a $\textit{frozen frequency embedding layer}$, which explicitly fixes the network’s frequency support, with $\textit{jointly learned spatial masks}$ that localize neuron influence across the domain. This pairing stabili
238	core_reconstruction high	Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We present \textbf{Upsample Anything}, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14×/16× (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across a
239	core_reconstruction medium	Particulate: Feed-Forward 3D Object Articulation 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; generation_editing; data_benchmark	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract We introduce Particulate, a feed-forward model that, given a single static 3D mesh of an everyday object, predicts its 3D parts, kinematic structure, and articulation parameters.Unlike prior work on articulated 3D object modeling that is limited by costly per-object optimization and small retrieval databases or requires large vision or language foundation models, our approach is based on a flexible, scalable and lightweight transformer architecture.Trained on a diverse collection of articulated 3D assets from public datasets, Particulate accurately infers the articulated structure of novel objects, including those generated by image-to-3D models, in a single feed-forward pass.We further introduce a benchmark for articulated 3D object estimation curated from high-quality public 3D assets.Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art
240	core_reconstruction high	SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Learning-based Multi-View Stereo (MVS) methods have become the mainstream in the field, relying on the construction of cost volumes through multi-view feature similarity computation and regularization. However, existing methods depend heavily on photometric consistency across views, leading to poor performance in challenging regions, such as weakly textured or non-Lambertian surfaces. To overcome this limitation, we propose SPE-MVS, a novel MVS framework enhanced with Spatial Position Encoding (SPE). The SPE represents the 3D positional information of pixels in each image within a unified metric space, constructed using monocular depth priors. We integrate the SPE alongside image data as input and introduce a Photometric-Spatial Hybrid Feature Extractor, along with an SPE-enhanced cost volume construction module. These components incorporate spatial position-based similarity computation,
241	core_reconstruction high	OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; generation_editing; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality and editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two major limitations: (1) Fragility due to reliance on 3D geometry priors, which are difficult to estimate accurately, especially under facial occlusions or in stylized domains; and (2) A lack of semantic disentanglement, inhibiting region-specific texture editing and style transfer. Our work addresses both challenges simultaneously.Our core innovation is a geometry-free pipeline that directly maps a 2D face image to its corresponding editable UV texture. We introduce two key techniques: First, to address the challenge of UV misalignment common in diffusion generation, we introduce a gradient-guided refinement strategy at inference time, which explicitly corrects structural consistency. Sec
242	core_reconstruction high	eRetinexGS: Retinex Modeling for Low-Light Scene Enhancement via Event Streams and 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Perception under low illumination remains a major challenge for computer vision systems, as RGB sensors often fail to capture sufficient structural and color information in extremely dark environments. Event cameras, with their high dynamic range and temporal resolution, provide complementary cues that are well suited for such conditions. In this work, we present eRetinexGS, a novel framework that jointly leverages event streams and low-light frames through 3D Gaussian Splatting for scene-level enhancement and reconstruction. Unlike previous approaches that operate on individual frames, eRetinexGS enforces geometric and photometric consistency across multiple views, bridging the gap between degraded images and noisy event signals. By introducing an event-assisted Retinex decomposition and a reflectance–illumination representation within the 3DGS pipeline, our method reconstructs normal-l
243	core_reconstruction high	Global Structure-from-Motion Meets Feedforward Reconstruction 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; pose_calibration_localization; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Structure-from-Motion -- the process of simultaneously estimating camera poses and 3D scene structure from a collection of images -- remains a central challenge in computer vision, with many open problems yet to be solved.Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited image overlap, and symmetries.However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, and robustness, and typically fall short of classical methods in standard reconstruction settings.In this work, we systematically analyze these limitations and propose a new state-of-the-art Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods.Exten
244	core_reconstruction high	RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit at discrete frame indices but struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow–guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large mo
245	core_reconstruction medium	ArtLLM: Generating Articulated Assets via 3D LLM 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a varia
246	core_reconstruction medium	Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; generation_editing	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Existing 3D editing methods rely on computationally intensive scene-by-scene iterative optimization and suffer from multi-view inconsistency. We propose an effective and fully feedforward 3D editing framework based on the TRELLIS generative backbone, capable of modifying 3D models from a single editing view. Our framework addresses two key issues: adapting training-free 2D editing to structured 3D representations, and overcoming the bottleneck of appearance fidelity in compressed 3D features. To ensure geometric consistency, we introduce Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that achieves globally consistent 3D deformation in a single pass. To restore photorealistic details, we develop a normal-guided single to multi-view generation module as an external appearance prior, successfully recovering high-frequency textures. Experiments demonstrate that our meth
247	core_reconstruction high	LaS-Comp: Zero-shot 3D Completion with Latent–Spatial Consistency 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and q
248	core_reconstruction high	Unified Primitive Proxies for Structured Shape Completion 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Structured shape completion recovers missing geometry as primitives rather than as unstructured points, which enables primitive-based surface reconstruction. Instead of following the prevailing cascade, we rethink how primitives and points should interact, and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which in a single feed-forward pass predicts a set of primitives with complete geometry, semantics, and inlier membership. To drive this unified representation, we introduce primitive proxies, learnable queries that are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently ou
249	core_reconstruction high	2D-LFM: Lifting Foundation Model without 3D supervision 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Recent vision foundation models give the impression that 3D reconstruction from RGB is largely solved. Yet these systems struggle with object-specific 3D structure: the fine-grained geometry implied by an object’s landmarks or skeleton. In this paper, we show that when a model is given only 2D landmarks, it can recover more accurate 3D structure than state-of-the-art depth-from-RGB foundation models. Classical lifting approaches such as PAUL demonstrate this principle but do not scale beyond single categories, while methods like 3D-LFM scale but require extensive 3D supervision. We present the first lifting foundation model that learns object-specific 3D geometry using only 2D supervision. The key idea is to inject correspondence structure into the model via a positional encoding inspired by classical structure-from-motion. This simple inductive bias enables robust, object-agnostic 3D li
250	core_reconstruction high	EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Few-shot methods enable instant personalization by reconstructing high-fidelity avatars from only a few seconds of video. However, achieving natural talking-head generation further requires strong emotion-aware motion modeling, and existing few-shot approaches exhibit geometric instability and audio-emotion mismatch under expressive facial motion. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, which introduces strong geometric priors for stable and interpretable motion. Building upon this, we propose a Gated Residual Motion Network (GRMN), w
251	core_reconstruction high	Dehallu3D: Hallucination-Mitigated 3D Generation from a Single Image via Cyclic View Consistency Refinement 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Just like other large models, large 3D reconstruction models suffer from hallucinations as well, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations majorly originate from that existing methods reconstruct 3D content from sparsely generated multi-view images which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint
252	core_reconstruction high	FAITHFUL CONTOURING: NEAR-LOSSLESS 3D VOXEL REPRESENTATION FREE FROM ISO-SURFACE 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiment
253	core_reconstruction medium	PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a "part-wise" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local gene
254	core_reconstruction high	PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We introduce PixARMesh, the first method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative modeling, we enrich a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream of context, pose, and mesh tokens, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing
255	core_reconstruction high	RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; surface_occupancy; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present RT-Splatting, a framework that disentangles each Gaussian's geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into
256	core_reconstruction high	GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; depth_correspondence; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than
257	core_reconstruction medium	Foundry: Distilling 3D Foundation Models for the Edge 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient `specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable.In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher’s token-l
258	core_reconstruction high	Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; depth_correspondence; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has emerged as an efficient approach for achieving photorealistic rendering. Recent MLP-based variants further improve visual fidelity but introduce substantial decoding overhead during rendering. To alleviate computation cost, several pruning strategies and level-of-detail (LOD) techniques have been introduced, aiming to effectively reduce the number of Gaussian primitives in large-scale scenes. However, our analysis reveals that significant redundancy still remains due to the lack of occlusion awareness. In this work, we propose Proxy-GS, a novel pipeline that exploits a proxy to introduce Gaussian occlusion awareness from any view.At the core of our approach is a fast proxy system capable of producing precise occlusion depth maps at resolution 1000$\times$1000 under 1 ms. This proxy serves two roles: first, it guides the culling of anchors and Gaussians to
259	core_reconstruction high	SAM 3D: 3Dfy Anything in Images 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a $5:1$ win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruc
260	core_reconstruction high	SimRecon: SimReady Compositional Scene Reconstruction from Real Videos 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which is natively applicable for simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize on visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a ''Perception-Generation-Simulation" pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging
261	core_reconstruction high	WorldGen: From Text to Traversable and Interactive 3D Worlds 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We introduce WorldGen, a method for generating large, fully formed, navigable 3D worlds from a single text prompt. Existing approaches to 3D scene generation often trade off scene diversity, completeness, and correctness in different ways. We push this envelope by producing large scenes explicitly decomposed into individual, high-quality 3D meshes, making them compatible with standard game engines. Our approach first uses a language-driven procedural generator to lay out the scene's basic volumes and navigable regions. An image generator then establishes the scene's theme, style, and details. Next, we obtain a high-quality, compositional 3D reconstruction of the planned scene. This step first uses an image-to-3D model to perform a holistic reconstruction that implicitly determines the shape and location of all scene objects, accounting for context and navigability. The reconstruction is
262	core_reconstruction high	Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; depth_correspondence; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Sparse-view 3D Gaussian Splatting (3DGS) reconstructs scenes using 3D Gaussians from sparse input views. Yet, this method is prone to overfitting, which is exacerbated at higher resolutions as the expanded dimensionality amplifies floating artifacts and reconstruction ambiguities. In this paper, we present a systematic study of 3DGS under sparse-view conditions and varying input resolutions. While prior work has overlooked resolution as a key factor in sparse-view performance, we identify and quantify a trade-off: lower-resolution inputs facilitate stable global geometry reconstruction, whereas higher-resolution inputs enable finer detail recovery but introduce high-frequency artifacts and instability. Building on this insight, we further propose CAGS, a Confidence-Guided Multi-Scale Aggregation that reconstructs scenes through a coarse-to-fine hierarchical optimization process‌. Our
263	core_reconstruction medium	Efficient unrolled networks for large-scale 3D inverse problems 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Deep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinder typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerate
264	core_reconstruction high	EI-Part：Explode for Completion and Implode for Refinement 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; generation_editing	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Part-level 3D generation is crucial for various downstream applications, including gaming, film production, and industrial design. However, decomposing a 3D shape into geometrically plausible and meaningful components remains a significant challenge. Previous part-based generation methods often struggle to produce well-constructed parts, exhibiting either poor structural coherence, geometric implausibility, inaccuracy, or inefficiency. To address these challenges, we introduce EI-Part, a novel framework specifically designed to generate high-quality 3D shapes with components distinguished by structural coherence, geometric plausibility, accuracy, and generation efficiency. We propose utilizing distinct representations at different stages: an Explode state for part completion and an Implode state for geometry refinement. This strategy allows us to fully leverage spatial resolution, enabli
265	core_reconstruction medium	Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; data_benchmark	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract We propose Fresco, a unified optimization paradigm designed to mitigate early over-sharpening, and cross-view drifting in head avatar reconstruction. Fresco combines a Laplacian-pyramid-based frequency curriculum with UV-space consistency regularization to progressively enhance reconstruction quality. The optimization begins by stabilizing low-frequency appearance in the image domain, which suppresses spurious details and promotes reliable convergence. As learning proceeds, consistency across different viewpoints is reinforced through pixel-level alignment on shared UV texture coordinates. Finally, high-frequency components are refined under explicit frequency-band constraints, and seam boundary regularization is applied to preserve local continuity. By optimizing in a frequency- and UV-aligned space, Fresco achieves robust convergence without pseudo high-frequency artifacts and yields c
266	core_reconstruction medium	LoST: Level of Semantics Tokenization for 3D Shapes 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy; generation_editing	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation.However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D s
267	core_reconstruction high	SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Dynamic extensions of 3D Gaussian Splatting (3DGS) achieve high-quality reconstructions through neural motion fields, but per-Gaussian neural inference makes these models computationally expensive. Building on DeformableGS, we introduce Speedy Deformable 3D Gaussian Splatting (SpeeDe3DGS), which bridges this efficiency–fidelity gap through three complementary modules: Temporal Sensitivity Pruning (TSP) removes low-impact Gaussians via temporally aggregated sensitivity analysis, Temporal Sensitivity Sampling (TSS) perturbs timestamps to suppress floaters and improve temporal coherence, and GroupFlow distills the learned deformation field into shared SE(3) transformations for efficient groupwise motion. On the 50 dynamic scenes in MonoDyGauBench, integrating TSP and TSS into DeformableGS accelerates rendering by 6.78$\times$ on average while maintaining neural-field fidelity and using 10$\
268	core_reconstruction high	GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios includi
269	core_reconstruction high	EMR-SM: Explicit Mesh Reconstruction with Dynamic Topology Adaptation 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose EMR-SM, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with re
270	core_reconstruction high	Faster-GS: Analyzing and Improving Gaussian Splatting Optimization 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison.In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation.The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5$\times$ faster training while ma
271	core_reconstruction high	Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We introduce a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, directly animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based te
272	core_reconstruction medium	PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language-model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes. We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D
273	core_reconstruction medium	Bringing Your Portrait to 3D Presence 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation.We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts.We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency.A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head
274	core_reconstruction high	Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure.We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a va
275	core_reconstruction high	CrossHOI: Learning Cross-View Representations for Monocular 3D Human-Object Interaction Reconstruction 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing 3D human-object interaction (HOI) from monocular images is highly challenging especially when human and object are mutually occluded. Existing methods primarily rely on single-view inputs, which fundamentally limit their ability to recover occluded regions and accurately estimate contact areas. To address these challenges, we for the first time, consider to introduce novel-view feature priors to enhance monocular 3D HOI reconstruction. We first design a cross-view generator that learns to infer novel-view image features from a single-view input, enriching spatial geometry at the feature level without requiring extra inputs during inference. Guided by both real and generated view features, a spatial cross-view feature fusion module adaptively aggregates complementary cues to enhance the initial reconstruction of human and object meshes. Built upon this reconstruction, we sa
276	core_reconstruction high	EDGS: Eliminating Densification for Efficient Convergence of 3DGS 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; depth_correspondence	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting reconstructs scenes by starting from a sparse Structure-from-Motion initialization and refiningunder-reconstructed regions. This process is slow, as it requires multiple densification steps where Gaussians arerepeatedly split and adjusted, following a lengthy optimization path. Moreover, this incremental approach often yieldssuboptimal renderings in high-frequency regions. We propose a fundamentally different approach: eliminate densification with a one-step approximation of scenegeometry using triangulated pixels from dense image correspondences. This dense initialization allows us to estimatethe rough geometry of the scene while preserving rich details from input RGB images, providing each Gaussian withwell-informed color, scale, and position. As a result, we dramatically shorten the optimization path and remove theneed for densification. Unlike methods that rely
277	core_reconstruction medium	FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our ``one-face-one-token'' strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder
278	core_reconstruction high	Human Interaction-Aware 3D Reconstruction from a Single Image 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group contex
279	core_reconstruction high	Intrinsic Image Fusion for Multi-View 3D Material Reconstruction 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images.Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view.To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions.We then propose a robust optimization framework using soft per-view prediction selection together with confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse pat
280	core_reconstruction medium	Learning to Infer Parameterized Representations of Plants from 3D Scans 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Plants frequently contain numerous organs, organized in 3D branching systems defining the plant's architecture. Reconstructing the architecture of plants from unstructured observations is challenging because of self-occlusion and spatial proximity between organs, which are often thin structures. To achieve the challenging task, we propose an approach that allows to infer a parameterized representation of the plant's architecture from a given 3D scan of a plant. In addition to the plant's branching structure, this representation contains parametric information for each plant organ, and can therefore be used directly in a variety of tasks. In this data-driven approach, we train a recursive neural network with virtual plants generated using a procedural model. After training, the network allows to infer a parametric tree-like representation based on an input 3D point cloud. Our method is ap
281	core_reconstruction medium	Learning to Solve PDEs on Neural Shape Representations 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance
282	core_reconstruction high	Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Reconstructing object geometry from radio frequency (RF) signals is fundamentally challenging due to the lensless imaging nature of RF sensing, which leads to low spatial resolution and high noise. Unlike light signals, RF signals can penetrate occlusions and thus capture information about hidden scenes. Existing Non-Line-of-Sight (NLoS) 3D neural reconstruction methods can recover coarse surfaces inside enclosed environments but often suffer from unstable optimization, noisy surface geometry, and surface ambiguity, failing to produce accurate zero-level sets from the signed distance field (SDF). These limitations largely stem from neglecting the role of Line-of-Sight (LoS) geometry outside the enclosed region, which provides valuable physical constraints for modeling signal propagation. In this paper, we introduce a Unified LoS and NLoS neural geometry reconstruction framework that leve
283	core_reconstruction high	SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; surface_occupancy	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent progress in feed-forward 3D Gaussian Splatting (3DGS) has notably improved rendering quality. However, the spatially uniform and highly redundant 3DGS map generated by previous feed-forward 3DGS methods limits their integration into downstream reconstruction tasks. We propose SparseSplat, the first feed-forward 3DGS model that adaptively adjusts Gaussian density according to scene structure and information richness of local regions, yielding highly compact 3DGS maps. To achieve this, we propose entropy-based probabilistic sampling, generating large, sparse Gaussians in textureless areas and assigning small, dense Gaussians to regions with rich information. Additionally, we designed a specialized point cloud network that efficiently encodes local context and decodes it into 3DGS attributes, addressing the receptive field mismatch between the general 3DGS optimization pipeline and f
284	core_reconstruction high	TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Hand mesh reconstruction has attracted growing attention in recent years.Despite significant progress, existing methods often struggle to balance reconstruction quality and inference efficiency.In this work, we propose TokenHand, a novel framework for single-view 3D hand mesh reconstruction that achieves both high accuracy and real-time inference.Our method represents a 3D hand model using $M$ discrete tokens, each describing a specific sub-structure of the hand.This compositional representation enables efficient modeling with minimal reconstruction error.Furthermore, we reformulate hand mesh reconstruction as a classification problem rather than a regression task.Specifically, a classifier predicts the categories of the $M$ tokens from an input image, and a pre-trained decoder network subsequently reconstructs the 3D hand mesh from the predicted tokens without any post-processing.Extens
285	core_reconstruction high	TouchDream: 3D Object Completion through Imagined Touch 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Point cloud completion is crucial for robust 3D perception but remains challenging due to its ill-posed nature. Coarse-to-fine methods can lead to unconstrained local guesses in the absence of key structures, whereas diffusion-based approaches may introduce geometric inconsistencies. To overcome these limitations, we present TouchDream, a novel framework that leverages a diffusion model to 'dream' of tactile sensing on object surfaces, which reformulates the sensing process as a learnable generative modeling task. Unlike visual cues, tactile data provides rich local geometry that can be directly converted into 3D space for point fusion, offering a powerful guide for detail-aware completion. Specifically, our approach generate compact tactile latent representations conditioned on coarse points and sampled touch poses. A touch-guided refinement module then leverages touch features to optim
286	core_reconstruction high	CaT-GS: Efficient 3DGS Rendering for Large Scale Scenes via Inter-frame Caching and Tile Scheduling 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent breakthroughs in 3D Gaussian Splatting (3DGS) have advanced neural rendering with high fidelity and speed. However, its performance degrades significantly in large-scale scenes due to the computational burden of tile-based rasterization. Existing optimization efforts either require costly scene re-training or focus on narrow aspects of the pipeline, overlooking critical inefficiencies in real-world deployments. Through a comprehensive analysis, we identify three primary sources of redundancy and low GPU utilization: redundant inter-frame pre-processing, viewpoint-based occlusion redundancy, and severe tile-level load imbalance. To address these issues, we propose CaT-GS, a novel and efficient 3DGS rendering pipeline. CaT-GS introduces a speculative multi-frame preprocessing method to eliminate redundant computations across consecutive frames, and an inter-frame caching mechanism t
287	core_reconstruction high	ASFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce ASFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that ASForm
288	core_reconstruction medium	Bidirectional Query-Driven Generation of Parametric CAD Sketch 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Learning-based CAD modeling shows great promise in automating parametric design, yet existing approaches often overlook the incremental and state-dependent nature of sketch construction. We present CADSketcher, a query-driven bidirectional framework for completing partial parametric sketches by internalizing the non-linear construction logic of interactive CAD processes. At the core of CADSketcher are two key innovations. First, a bidirectional sketch learner recovers both prior and posterior contexts from arbitrary-span partial sketches via a bidirectional query mechanism, enabling exploration of multiple plausible modeling trajectories. Second, a confidence-guided completion pipeline adaptively determines the expansion direction through a confidence gate and ensures executable instruction generation using a validity compiler, while a progressive context updater preserves sketch consist
289	core_reconstruction medium	BrepVGAE: Variational Graph Autoencoder with Unified Latent Representation for B-rep 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Due to the heterogeneity of faces and edges in B-rep, conventional graph-based representations is incapable of establishing a unified formulation for faces and edges, thereby constraining the capabilities of B-rep generative models. We propose a B-rep Variational Graph Auto Encoding (BrepVGAE), the first variational graph autoencoder framework capable of holistically encoding and decoding boundary representations of B-rep models.Firstly, we novelly represent both geometry faces and edges as nodes in a graph representation. We then design a sparse graph autoencoder to aggregate the complete B-rep structure into a compact global latent vector. We then construct a decoder that employs set-based generation, which uses bilinear layers to reconstruct adjacency relationships, i.e., topology, with a single latent vector. Afterwards, the same decoder generates node features for all faces and edge
290	core_reconstruction medium	Erasing Invisible Watermarks via Novel View Synthesis 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; depth_correspondence	core genus=3D Gaussian Splatting, but title/abstract signal is narrower	abstract Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative ``view" of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffu
291	core_reconstruction medium	LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Generating high-fidelity 3D contents remains a fundamental challenge due to the complexity of representing arbitrary topologies—such as open surfaces and intricate internal structures—while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs)—a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutio
292	core_reconstruction high	MatMart: Material Reconstruction of 3D Objects via Diffusion 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Applying diffusion models to physically-based material estimation and generation has recently gained prominence. In this paper, we propose MatMart, a novel material reconstruction framework for 3D objects, offering the following advantages. First, MatMart adopts a two-stage reconstruction, starting with accurate material prediction from inputs and followed by prior-guided material generation for unobserved views, yielding high-fidelity results. Second, by utilizing progressive inference alongside the proposed view-material cross-attention (VMCA), MatMart enables reconstruction from an arbitrary number of input images, demonstrating strong scalability and flexibility. Finally, MatMart achieves both material prediction and generation capabilities through end-to-end optimization of a single diffusion model, without relying on additional pre-trained models, thereby exhibiting enhanced stabil
293	core_reconstruction medium	NeAR: Coupled Neural Asset–Renderer Stack 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract Neural asset authoring and neural rendering have emerged as largely disjoint threads: one generates digital assets using neural networks for traditional graphics pipelines, while the other develops neural renderers that map conventional assets to images. However, the joint design of the asset representation and renderer remains largely unexplored. We argue that coupling them can unlock an end-to-end learnable graphics stack with benefits in fidelity, consistency, and efficiency. In this paper, we explore this possibility with NeAR: a Coupled Neural Asset–Renderer Stack. On the asset side, we build on Trellis-style Structured 3D Latents and introduce a lighting-homogenized neural asset: from a casually lit input, a rectified-flow backbone predicts a Lighting-Homogenized SLAT that encodes geometry and intrinsic material cues in a compact, view-agnostic latent. On the renderer s
294	core_reconstruction high	Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global cons
295	core_reconstruction high	Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract 3D reconstruction of transparent objects from multiple views has been a long-standing challenge. In contrast to opaque objects, transparent objects exhibit complex refraction that causes serious image distortion, resulting in a highly ill-posed problem. Existing reconstruction methods commonly depend on special capture devices or controlled environments, which provide more priors and simplify the modeling of refraction. More importantly, these methods lack the capability for reconstruction of mixed transparent and opaque objects, being confined to transparent or opaque materials. To address these challenges, we propose Opti-NeuS, a novel method for reconstructing transparent and opaque objects without controlled environments or additional input. Opti-NeuS incorporates a novel IoRNetwork to obtain spatially-varying IoR for tracing the refractive ray paths, which can finally model refracti
296	core_reconstruction high	PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; depth_correspondence	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We introduce Primitive-based Representations of Uncertainty (PRIMU), a post-hoc uncertainty estimation (UE) framework for Gaussian Splatting (GS).Reliable UE is essential for deploying GS in safety-critical domains such as robotics and medicine.Existing approaches typically estimate Gaussian-primitive variances and rely on the rendering process to obtain pixel-wise uncertainties.In contrast, we construct primitive-level representations of error and visibility/coverage from training views, capturing interpretable uncertainty information. These representations are obtained by projecting view-dependent training errors and coverage statistics onto the primitives. Uncertainties for novel views are inferred by rendering these primitive-level representations, producing uncertainty feature maps, which are aggregate through pixel-wise regression on holdout data. We analyze combinations of uncerta
297	core_reconstruction medium	Residual Primitive Fitting of 3D Shapes with SuperFrusta 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative inference algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is a analytical primitive that is simultaneously (1) expressive, being able to express various common solids such as cylinders, spheres, cones & their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a sign distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover
298	core_reconstruction high	Revisiting 3D Reconstruction Kernels as Low-Pass Filters 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract 3D reconstruction is to recover 3D signals from the sampled discrete 2D pixels, with the goal to converge continuous 3D spaces.In this paper, we revisit 3D reconstruction from the perspective of signal processing, identifying the periodic spectral extension induced by discrete sampling as the fundamental challenge.Previous 3D reconstruction kernels, such as Gaussians, Exponential functions, and Student's t distributions, serve as the low pass filters to isolate the baseband spectrum.However, their unideal low-pass property results in the overlap of high-frequency components with low-frequency components in the discrete-time signal’s spectrum.To this end, we introduce Jinc kernel with an instantaneous drop to zero magnitude exactly at the cutoff frequency, which is corresponding to the ideal low pass filters.As Jinc kernel suffers from low decay speed in the spatial domain, we further pro
299	core_reconstruction high	SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; depth_correspondence	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has received tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-lambertian or transparent materials. To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods propose to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based method is still underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the inter-independence among individual gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as active set method. To this end, we propose \textbf{SparseOIT}, an OIT-based 3DGS reconstruction algorithm that maintains an active set of g
300	core_reconstruction high	Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; depth_correspondence	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Ray-tracing-based 3D Gaussian splatting (3DGS) enjoys the generality of supporting non-pinhole camera models and relightable formulations. However, they are usually lacking in performance, partially due to the need for depth-based sorting of all intersecting Gaussians along the traced rays.In this paper, we introduce a sorting-free differentiable stochastic formulation for ray-traced 3DGS, enabling efficient reconstruction and rendering of both standard and relightable 3DGS scenes.For standard 3DGS, our method offers performance comparable to rasterization-based 3DGS and outperforms sorting-based ray tracing.For relightable 3DGS, our technique provides higher-quality reconstructions and renderings thanks to the accurate shadow and shading computation provided by fully ray-traced shadow and light rays.
301	core_reconstruction high	TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction with direct reconstruction/geometry signal	abstract Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human–object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human–object interactions to enforce semantic alignment between the 3D reconstruction an
302	core_reconstruction medium	Underground Plant Exploration: Non-Destructive 3D Root Assessment with GPR Based on Point Graph Neural Network 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; surface_occupancy	core genus=3D Reconstruction, but title/abstract signal is narrower	abstract This paper introduces a novel application of machine learning in agriculture for non-destructive 3D root structure reconstruction. Plant roots are critical for providing resources for the entire plant. Ground Penetrating Radar (GPR) is a key tool for identifying subterranean objects with easy and obvious shapes, such as large pipes, but remaining challenging to assess the 3D shapes of roots. In our study, we introduce a novel approach specifically designed based on GPR signal shape priors to detect target signals and perform curve parameter regression based on multiple B-scans from GPR. This process enables the derivation of a precise curve from the detection and regression outcomes. To achieve the reconstruction of a comprehensive 3D root structure, we have developed a shape reconstruction network that processes sparse sliced 3D points through a dedicated point graph network and an upsa
303	core_reconstruction high	A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract In this paper, we introduce Geometric Algebra–Informed 3D Gaussian Splatting (GAI-GS), a framework for wireless modeling that couples 3D Gaussian splatting with a geometric-algebra–based attention mechanism to explicitly model ray–object interactions in complex propagation environments. GAI-GS encodes joint spatial–electromagnetic (EM) relations into token representations, enabling scene-level aggregation within a unified, end-to-end neural architecture. This design renders ray tracing for wireless propagation physically grounded, with token interactions that respect EM constraints including multipath, path-dependent attenuation, and reflection/diffraction. Through extensive evaluations on on multiple real-world indoor datasets, GAI-GS consistently surpasses current baselines across various wireless tasks.
304	core_reconstruction high	Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian splatting (3DGS) has become a vital tool for learning a radiance field from multiple posed images. Although 3DGS shows great advantages over NeRF in terms of rendering quality and efficiency, it remains a research challenge to further improve the efficiency of learning 3D Gaussians. To overcome this challenge, we propose novel training strategies and losses to shorten each Gaussian list used to render a pixel, which speeds up the splatting by involving fewer Gaussians along a ray. Specifically, we shrink the size of each Gaussian by resetting their scales regularly, encouraging smaller Gaussians to cover fewer nearby pixels, which shortens the Gaussian lists of pixels. Additionally, we introduce an entropy constraint on the alpha blending procedure to sharpen the weight distribution of Gaussians along each ray, which drives dominant weights larger while making minor weights s
305	core_reconstruction medium	OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; depth_correspondence	core genus=3D Gaussian Splatting, but title/abstract signal is narrower	abstract We propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decom
306	core_reconstruction high	B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta–Bernoulli Bayesian Updates 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production.However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use.We propose \textbf{B$^3$-Seg (Beta--Bernoulli Bayesian Segmentation for 3DGS)}, a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under \textbf{camera-free} and \textbf{training-free} conditions.Our approach reformulates segmentation as sequential Beta--Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG).This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy.Experiments on multiple datasets show t
307	core_reconstruction high	BEA-GS : BEyond RAdiance Supervision in 3DGS for Precise Object Extraction 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene don't optimize the underlying 3D geometry of the scene. This makes object-level editing or asset extraction challenging. Recent methods, like COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the geometry of the scene to represent the underlying semantics. We go a step further and propose a novel solution that provides near perfect boundaries in object extraction. We do so by introducing two new losses in the optimization that take care of: 1. Modifying the geometry of visible Gaussians to respect semantic boundaries, and, 2. Modifying the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization to allow for seamless integration within the optimization of the Gaussian
308	core_reconstruction high	Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Understanding hand-object interaction from monocular videos is crucial for immersive and dexterous interactions in AR/VR and robotic applications. However, existing monocular reconstruction methods primarily assume rigid grasping and static object geometry. When applied to articulated manipulations, the continuous joint rotations and frequent component deformations introduce a strong coupling between shape and motion, leading to severe ambiguity and instability in articulation optimization under monocular observation. To address this challenge, we propose a Clay-to-Stone dual-phase framework, modeling the articulated manipulation at hierarchical granularities, enabling a progression from flexible semantic exploration to structured articulation recovery. In the CLAY phase, our method performs fine-grained control over geometric deformation, guided by inter-part semantic correlation learni
309	core_reconstruction high	ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. Mainstream methods, built on an embedding paradigm, suffer from three key flaws: (i) geometry-semantic inconsistency, where points, rather than objects, serve as the semantic basis, limiting semantic fidelity; (ii) semantic bloat from injecting gigabytes of feature data into the geometry; and (iii) semantic rigidity, as one feature per Gaussian struggles to capture rich polysemy. To overcome these limitations, we introduce ExtrinSplat, a framework built on the extrinsic paradigm that decouples geometry from semantics. Instead of embedding features, ExtrinSplat clusters Gaussians into multi-granularity, overlapping 3D object groups. A Vision-Language Model (VLM) then interprets these groups to generate lightweight textual hypotheses, creating an extrinsic index layer that natively su
310	core_reconstruction high	HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. While tremendous successes have been achieved for static faces, animatable avatars from dynamic videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on
311	core_reconstruction high	MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting (3DGS) has emerged as a mainstream representation for 3D scenes, drawing increasing research attention to its understanding, generation, and editing. However, existing studies remain limited to low-level perception, low-quality generation, and low-efficiency editing, lagging far behind their image counterparts in the era of Multimodal Large Language Models (MLLMs). To bridge this gap, we propose MLLMSplat, a novel framework that adapts 2D MLLMs to achieve high-level understanding, high-quality generation, and high-efficiency editing of 3DGS scenes. Specifically, our comprehensive framework consists of three core designs: (1) a 3DGS tokenizer that can be seamlessly integrated into existing MLLMs in a training-free manner; (2) a 3DGS de-tokenizer that non-intrusively extends the 2D latent diffusion model in MLLMs using a dual positional encoding space, while augmentin
312	core_reconstruction high	Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with photorealistic quality. 3DGS often requires a large number of primitives to achieve high fidelity, leading to redundant representations and high resource consumption, thereby limiting its scalability for complex or large-scale scenes. Consequently, effective pruning strategies and more expressive primitives that can reduce redundancy while preserving visual quality are crucial for practical deployment. We propose an efficient, integrated reconstruction-aware pruning strategy that adaptively determines pruning timing and refining intervals based on reconstruction quality, thus reducing model size while enhancing rendering quality. Moreover, we introduce a 3D Difference-of-Gaussians primitive that jointly models both positive and negative densi
313	core_reconstruction medium	Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; data_benchmark	core genus=3D Gaussian Splatting, but title/abstract signal is narrower	abstract Radiance field methods (e.g.~3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations.SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and critically fail to capture specular reflections -- a key component of realistic rendering. While alternatives like Spherical Gaussians offer improvements, they introduce significant optimization complexity.We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting.SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while maintaining simpler optimization compared to existing al
314	core_reconstruction high	FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinkage for Large-scale LoD 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; data_benchmark	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting has revolutionized neural rendering with real-time performance. However, scaling this approach to large scenes using Level-of-Detail methods faces critical challenges: inefficient serial traversal consuming over 60\% of rendering time, and redundant Gaussian-tile pairs that incur unnecessary processing overhead. To address these limitations, we propose FilterGS, featuring a parallel filtering mechanism with two complementary filters that enable efficient selection without tree traversal, coupled with a scene-adaptive Gaussian shrinkage strategy that minimizes redundancy through opacity-based scaling. Extensive experiments demonstrate that FilterGS achieves state-of-the-art rendering speeds while maintaining competitive visual quality across multiple large-scale datasets.
315	core_reconstruction high	Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; generation_editing	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting has recently enabled fast and photorealistic reconstruction of static 3D scenes. However, dynamic editing of such scenes remains a significant challenge. We introduce a novel framework, Physics-Guided Score Distillation, to address a fundamental conflict: physics simulation provides a strong motion prior that is insufficient for photorealism , while video-based Score Distillation Sampling (SDS) alone cannot generate coherent motion for complex, multi-particle scenarios. We resolve this through a unified optimization framework where physics simulation guides Score Distillation to jointly refine the motion prior for photorealism while simultaneously optimizing appearance. Specifically, we learn a neural dynamics model that predicts particle motion and appearance, optimized end-to-end via a combined loss integrating Video-SDS for photorealism with our physics-guidance
316	core_reconstruction high	3D Gaussian Splatting at Arbitrary Resolution with Compact Proxy Anchors 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Despite achieving high-quality rendering, 3D Gaussian Splatting suffers from aliasing when the rendering resolution changes, as it is typically trained at a fixed resolution. To address this limitation, we introduce a method that enables the model to generate resolution-adaptive 3D Gaussians under arbitrary resolution changes. In particular, built upon Scaffold-GS, we enhance the anchor feature representation by incorporating a resolution-embedding to encode continuous resolution information. From these enhanced anchor features, a pixel coverage gate dynamically forms resolution-adaptive 3D Gaussians. Furthermore, we drastically reduce storage requirements by selecting a compact subset of proxy anchors and designing a residual anchor predictor that reconstructs the unselected leaf anchors based on the proxy anchors, enabling faithful scene representation without compromising visual fidel
317	core_reconstruction high	IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Applying 3D Gaussian Splatting to inverse rendering, especially for relightable assets under high-illuminance conditions, remains challenging. Strong specular highlights and complex reflections complicate material-light disentanglement, often baking in shadows and losing specular detail. To address this, we introduce IR-HGP, a framework that achieves robust disentanglement using three synergistic modules: First, a Hybrid Visibility Decomposition module ensures physical visibility consistency. Second, a Generative Illumination Field Prior module infers detailed and high-dynamic range environmental lighting. Finally, a Physics-Aware Radiance Correction module stabilizes optimization and mitigates illumination artifacts. Our framework achieves SOTA material recovery and relighting performance, outperforming existing methods under challenging illumination conditions. It reconstructs the vi
318	core_reconstruction high	Learning Differentiable Hierarchies in 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Although 3D Gaussian Splatting (3DGS) has achieved impressive performance in real-time rendering, its unordered Gaussians make level-of-detail (LoD) construction and model compression highly challenging, limiting its applicability in customized scenarios.In this work, we propose a learning-based Gaussian hierarchy representation that ranks Gaussians by their contribution to the scene, enabling flexible LoD representations across arbitrary Gaussian counts.We first introduce a unified, continuous formulation and metric for Gaussian hierarchy. Then, we introduce a hierarchy-based modulated rendering method built upon a Differentiable Decreasing Step Function, which enables efficient hierarchy learning while maintaining approximately equivalent rendering. Moreover, we develop a PDF-Guided Active-Region Sampling strategy that encourages the learned hierarchy to become widely distributed withi
319	core_reconstruction high	NG-GS: NeRF-guided 3D Gaussian Splatting Segmentation 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring
320	core_reconstruction high	Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, enabling efficient and high-fidelity novel view synthesis. However, seamless integration of both aerial and street view images to model urban scenes remains a significant challenge for 3DGS. This joint setting suffers from extreme view coverage disparity, complex multi-scale details, and imbalanced viewpoint distributions.In this work, we present Urban-GS, a novel framework built upon Gaussian Splatting for the compact unified reconstruction and high-fidelity rendering of urban scenes from both aerial and street views. Specifically, we first develop an Aerial-Street Joint Adaptive Densification method to resolve the densification conflicts arising from large view coverage disparity. We then introduce a Contribution-based Anchor Pruning strategy to effectively mitigate the storage overhead from captu
321	core_reconstruction high	Write Where It Matters: Policy-Guided Watermarks for 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent advances in 3D Gaussian Splatting (3DGS) enable photorealistic real-time rendering but also increase the risks of unauthorized copying and redistribution. Existing 3DGS watermarking methods typically rely on handcrafted thresholds or globally fixed hyperparameters to balance invisibility and robustness, making their embedding behavior static and scene-agnostic. We instead formulate 3DGS watermarking as a goal-directed decision process and introduce Write Where It Matters (W2M), the first reinforcement learning-based framework that adaptively learns where and how much to embed. By modeling the embedding process as a Markov Decision Process, W2M uses a lightweight policy network to allocate precise Gaussian updates directly from immediate reward feedback, iteratively. The reward incentivizes both rendering-space invisibility and decoding robustness under various image- and model-lev
322	core_reconstruction high	Z-Order Transformer for Feed-Forward Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this paper, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively
323	core_reconstruction high	NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract 3D Gaussian Splatting can exploit frustum culling and level-of-detail strategies to accelerate rendering of scenes containing a large number of primitives. However, the semi-transparent nature of Gaussians prevents the application of another highly effective technique: occlusion culling. We address this limitation by proposing a novel method to learn the viewpoint-dependent visibility function of all Gaussians in a trained model using a small, shared MLP across instances of an asset in a scene. By querying it for Gaussians within the viewing frustum prior to rasterization, our method can discard occluded primitives during rendering. Leveraging tensor cores for efficient computation, we integrate these neural queries directly into a novel instanced software rasterizer. Our approach outperforms the current state of the art for composed scenes in terms of VRAM usage and image quality, utili
324	core_reconstruction high	SunFaded: Illumination-Aware Gaussian Splatting for Dark Scenes with Camera-Mounted Active Lighting 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance	core genus=3D Gaussian Splatting with direct reconstruction/geometry signal	abstract Gaussian Splatting has emerged as a popular 3D representation technique, but still struggles with appearance inconsistencies, especially in dark scenes that require active illumination (e.g., camera flashes or co-moving light sources) to capture usable images, leading to dramatic local appearance fluctuations.While existing methods mainly focus on modeling global appearance changes for in-the-wild scenes, such as those caused by different times of day or weather conditions, they fail to handle the severe variations present in dark scenes with moving light sources.In this paper, we propose a novel Gaussian Splatting–based approach for constructing scene representations in dark scenes where active light sources are rigidly attached to the camera and move together with it.Within this framework, we introduce an illumination-weighted loss function that drives the representation toward the und
325	core_reconstruction medium	Where, What, Why: Toward Explainable 3D-GS Watermarking 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance	core genus=3D Gaussian Splatting, but title/abstract signal is narrower	abstract As 3D Gaussian Splatting becomes the de facto representation for interactive 3D assets, robust yet imperceptible watermarking is critical. We present a representation-native framework that separates where to write from how to preserve quality. A Trio-Experts module operates directly on Gaussian primitives to derive priors for carrier selection, while a Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark carriers—optimized for bit resilience under perturbation and bitrate budgets—and to visual compensators that are insulated from watermark loss. To maintain fidelity, we introduce a channel-wise group mask that controls gradient propagation for carriers and compensators, thereby limiting Gaussian parameter updates, repairing local artifacts, and preserving high-frequency details without increasing runtime. Our design yields view-consistent watermark persistence and strong
326	core_reconstruction medium	Hyper-PCN: Hypergraph-based Point Cloud Completion via High-order Correlation Modeling 3D Vision & Geometry / Point Cloud	C. cluster representative	surface_occupancy	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Point cloud completion is an important yet challenging problem in 3D computer vision, which aims to reconstruct complete and dense 3D shapes from partial point clouds. Although transformer-based and geometry-based approaches have made significant progress, they often struggle to capture the complex, high-order correlations inherent in point clouds. To address this limitation, we propose Hyper-PCN, a point cloud completion framework that leverages hypergraphs to explicitly model complex, higher-order correlations within incomplete inputs for more accurate completion. It comprises two key modules: Hyper Refinement Stack, designed to progressively capture coarse-to-fine high-order correlations through a series of hypergraph learning stages, and Anchor-based Hypergraph Neural Network, which employs a two-stage sampling strategy to construct collaborative hypergraphs, ensuring robust modeling
327	core_reconstruction medium	3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects Data & Evaluation / Benchmark	D. adjacent but useful context	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains a significant challenge. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the reliance on distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, thereby offering limited insight into performance under real-world material complexities. In this paper, we introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 100,000 synthetic instances generated via physically-based rendering of more than 10,000 shapes, and over 1,000 real-world objects scanned using consumer RGB-D devices. Together, these data consist of more than 7 m
328	core_reconstruction medium	AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction Remote Sensing & Earth / Remote Sensing	D. adjacent but useful context	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Recent advances in 4D scene reconstruction have greatly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed.To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and traject
329	core_reconstruction medium	Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction Computational Imaging / Computational Imaging	D. adjacent but useful context	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract The task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years.However, most conventional cameras are bandwidth-limited to 30–60 FPS, restricting these methods to static or slowly evolving scenes.While overcoming bandwidth limitations is difficult in general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific scenarios (e.g., motion capture and particle image velocimetry).However, most of these methods require modifications to camera optics or the addition of mechanically moving components, limiting them to a single-view high-speed capture. Consequently, these cannot be readily used to capture a 3D representation of rapid scene motion. In this paper, we propose a novel method to capture and reconstruct a volumetric representation of a high-speed
330	core_reconstruction medium	DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves tem
331	core_reconstruction medium	DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when they were captured from different scenes. To overcome these limitations we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into photorealistic, temporally consistent outputs. At its core is a single-step temporally-conditioned enhancer that is converted from a pretrained multi-step imag
332	core_reconstruction medium	EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents Video & Motion / Human Motion	D. adjacent but useful context	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting.However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild.To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame.The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly.Based on the collected data, we empower three embodied AI tasks: monocular human-scene-reconstruct
333	core_reconstruction medium	EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy Medical & Scientific Imaging / Medical Imaging	D. adjacent but useful context	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Volume electron microscopy (vEM) enables nanoscale 3D imaging of biological structures but remains constrained by acquisition trade-offs, leading to anisotropic volumes with limited axial resolution. Existing deep learning methods seek to restore isotropy by leveraging lateral priors; yet their assumptions break down for morphologically anisotropic structures. We present EMGauss, a general framework for 3D reconstruction from planar scanned 2D slices with applications in vEM, which circumvents the inherent limitations of isotropy-based approaches. Our key innovation is to reframe slice-to-3D reconstruction as a 3D dynamic scene rendering problem based on Gaussian splatting, where the progression of axial slices is modeled as the temporal evolution of 2D Gaussian point clouds. To enhance fidelity in data-sparse regimes, we incorporate a Teacher–Student bootstrapping mechanism that
334	core_reconstruction medium	FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce FaithFusion, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on
335	core_reconstruction medium	FVGen: Scaling 3D Scene Datasets with Certainty-Aware Free-View Generation from Scene Geometry Reconstruction Generative Models / Video Generation	D. adjacent but useful context	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; generation_editing; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract The development of generalizable Novel View Synthesis (NVS) models is critically limited by the scarcity of large-scale training data with diverse and accurate camera trajectories. While real-world captures are photorealistic, they are typically sparse and discrete. Conversely, synthetic data scales but suffers from a domain gap and often lacks realistic semantics. We introduce FVGen, a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. Our key insight is that an imperfect reconstructed scene serves as a rich geometric proxy, but naively sampling from it amplifies artifacts. To this end, we propose a certainty-aware free-view sampling strategy that identifies novel viewpoints which are both semantically meaningful and minimally affected by reconstruction errors. We demonstr
336	core_reconstruction medium	Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark	direct reconstruction/3DGS/4D title linked to core representation cluster	abstract 3D semantic occupancy prediction is crucial for autonomous driving, yet vision-only approaches suffer from weak geometric cues, and existing multi-modal frameworks often depend on dense voxel or BEV tensors that impose heavy computational cost. We present Gau-Occ, a multi-modal framework that models the scene as a compact collection of semantic 3D Gaussians, enabling geometry-guided fusion without dense volumetric processing.To enhance geometric completeness, a learned LiDAR Completion Diffuser (LCD) trained on real-world priors recovers missing structures from sparse LiDAR, and the completed points are encoded as semantic Gaussian anchors.To further integrate multi-view image semantics, we introduce Gaussian Anchor Fusion (GAF), a geometry-aligned aggregation module that performs anchor-guided 2D sampling, local neighborhood encoding, and cross-modal alignment. By construc
337	core_reconstruction medium	ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Parking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS-denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, 3D reconstruction remains underexplored, which is crucial for capturing complex spatial geometry in parking scenarios. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slots perception module. To address these limitations, we curate the first benchmark named ParkRecon3D, specifically designed for parking scene reconstruction. It includes sensor data from four surround-view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking
338	core_reconstruction medium	PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems Robustness & Safety / Safety	D. adjacent but useful context	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Poisoning input views of 3D reconstruction systems has been recently studied.However, we identify that existing studies simply backpropagate adversarial gradients through the 3D reconstruction pipeline as a whole, without uncovering the new vulnerability rooted in specific modules of the 3D reconstruction pipeline.In this paper, we argue that the structure-from-motion (SfM) initialization, as the geometric core of many widely used reconstruction systems, can be targeted to achieve strong poisoning effects. To this end, we propose PoInit-of-View, which optimizes adversarial perturbations to intentionally introduce cross-view gradient inconsistencies at projections of corresponding 3D points. These inconsistencies disrupt keypoint detection and feature matching, thereby corrupting pose estimation and triangulation within SfM, eventually resulting in low-quality rendered views. We also pro
339	core_reconstruction medium	Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions—where over 93% of voxels are empty and foreground classes are rare—poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; (2) a Foreground Modulation Strategy combining Foreground Dro
340	core_reconstruction medium	TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Embodied 3D Semantic Scene Completion (SSC) infers dense geometry and semantics from continuous egocentric observations. Most existing Gaussian-based methods rely on random initialization of many primitives within predefined spatial bounds, resulting in redundancy and poor scalability to unbounded scenes. Recent depth-guided approach alleviates this issue but remains local, suffering from latency and memory overhead as scale increases.To overcome these challenges, we propose TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied SSC. It maintains a persistent Gaussian memory for temporal prediction, without relying on image coherence or frame caches.For temporal fusion, a Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention.Subsequently, a Confidence-aware Voxel Fusion module merges overlapping primiti
341	core_reconstruction medium	UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes Remote Sensing & Earth / Remote Sensing	D. adjacent but useful context	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable.However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness.We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along
342	core_reconstruction medium	Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping; generation_editing; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train a image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameter from unposed images. The predicted pose
343	core_reconstruction medium	Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Scalable robot learning is hindered by the high cost of acquiring diverse, high-quality embodied data. Existing data generation approaches partially mitigate this issue but typically depend on hard-to-access hardware and labor-intensive manual effort, with limited generalization to diverse scene configurations. To overcome these limitations, we propose Video2Robo, a framework that generates high-quality and diverse robot data directly from a single human demonstration video, enabling seamless deployment on physical robots. At its core, Video2Robo leverages 3D Gaussian Splatting (3DGS) as a powerful scene representation, enabling high-fidelity rendering and explicit 3D scene editing. The framework tracks temporally consistent motion trajectories of task-relevant objects from raw video footage and identifies key task skills, guiding robots to execute tasks kinematically plausibly under nov
344	core_reconstruction medium	VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has sparked interest in extending these models to 3D scenes, with the goal of human-like visual-spatial intelligence. However, achieving deep spatial understanding comparable to human capabilities remains challenging for both model design and data acquisition. Existing methods often rely on external depth sensors for geometry capture or off-the-shelf algorithms for pre-constructing 3D maps, which limits their scalability.In this work, we introduce VLM-3R, a framework for Vision-Language Models that couples 3D reconstructive instruction tuning with scalable training data curation and a new benchmark for temporal reasoning. Specifically, VLM-3R processes monocular video frames with a geometry encoder that derives implicit 3D tokens representing scene context (spatial tokens) and camera motion (view tokens). In
345	core_reconstruction medium	WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation. While image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over the weather effects. To address these limitations, we propose \textbf{WeatherCity}, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with
346	core_reconstruction medium	RecEdit-Drive: 3D Reconstruction-Guided Spatiotemporal Video Editing for Autonomous Driving Scenes Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract High-quality video editing and processing are crucial in domains such as filmmaking and autonomous driving, where accurate visual refinement and data preparation are essential. However, it is challenging to achieve precise control over dynamic objects while maintaining spatiotemporal consistency. Current approaches typically utilize text prompts or 2D structural priors for video editing to ensure consistency, yet they struggle to effectively constrain the spatial variations of dynamic 3D objects. In this paper, we introduce $\textbf{RecEdit-Drive}$, a framework that integrates $\textbf{Spatial Feature Warping}$ and $\textbf{Spatiotemporal Collaborative Modeling}$ to effectively control 3D object variations and enhance video consistency. The spatial feature warping enhances precise control over the edited foreground 3D objects, enhancing spatial consistency in the generated videos; and th
347	core_reconstruction medium	MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction.To address this limitation, we introduce MAGICIAN a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a predicted scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of overage gain for any novel viewpoint via fast volumetric rendering.The resulting speedup allows the integration of the gain metric into a tree-search algorithm for planning long-horizon paths.We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner.Our method achieves state-of-the-art performan
348	core_reconstruction medium	Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction Medical & Scientific Imaging / Medical Imaging	D. adjacent but useful context	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract We propose Exact-GS, a novel mathematically rigorous and accurate 3D Gaussian Splatting model designed to perform 3D X-ray computed tomography (CT) reconstruction and novel view synthesis. Recently, 3D Gaussian Splatting achieved considerable progress at 3D representation. Unfortunately, due to the affine approximation of the projective transformation, previous 3DGS-based methods inevitably suffer from artifacts and projection inconsistencies. To address this problem, some ray tracing based methods perform integration along the ray across Gaussians. However, these methods are computationally inefficient on the forward and backward pass. We introduce a novel closed-form splatting solution for this problem with mathematically rigorous derivation. Our model is the first to achieve the same exact rendering quality as ray tracing based methods without any approximation under a splatting-based
349	core_reconstruction medium	RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection Detection & Tracking / Detection	D. adjacent but useful context	gaussian_radiance; pose_calibration_localization; dynamic_4d; robotics_mapping	direct reconstruction/3DGS/4D title linked to core representation cluster	abstract 4D millimeter-wave radar is a promising sensing modality for autonomous driving, yet effective 3D object detection from 4D radar and monocular images remains challenging. Existing fusion approaches either rely on instance proposals lacking global context or dense BEV grids constrained by rigid structures, lacking a flexible and adaptive representation for diverse scenes. To address this, we propose RaGS, the first framework that leverages 3D Gaussian Splatting (GS) to fuse 4D radar and monocular cues for 3D object detection. 3D GS models the scene as a continuous field of Gaussians, enabling dynamic resource allocation to foreground objects while maintaining flexibility and efficiency. Moreover, the velocity dimension of 4D radar provides motion cues that help anchor and refine the spatial distribution of Gaussians. Specifically, RaGS adopts a cascaded pipeline to construct and progressi
350	core_reconstruction medium	ST4R-Splat: Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting Segmentation & Dense Prediction / Segmentation	D. adjacent but useful context	gaussian_radiance; dynamic_4d; robotics_mapping; data_benchmark	direct reconstruction/3DGS/4D title linked to core representation cluster	abstract Understanding and segmenting objects in dynamic 4D environments from natural language is crucial yet underexplored. Existing works either perform referring segmentation in static 3D scenes or build open-vocabulary 4D language fields, but none of them supports grounding complex spatio-temporal referring descriptions in explicit 4D reconstructions. Based on 4D Gaussian Splatting(4DGS), We formalize this missing setting as Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting (STRS-4DGS): given a 4DGS representation of a dynamic scene and a referring expression, the goal is to identify the target object and segment it across both space and time, resolving where the described instance is and when it exhibits the queried state. To tackle this challenge, we propose ST4R-Splat, the first framework for STRS-4DGS. ST4R-Splat builds on deformable 4D Gaussians and introduces an Instance-A
351	core_reconstruction medium	Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction Medical & Scientific Imaging / Medical Imaging	D. adjacent but useful context	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	direct reconstruction/3DGS/4D title linked to core representation cluster	abstract X-ray computed tomography (CT) reconstructs volumetric representations of objects from projection images obtained by transmitting X-rays through a target. Recent splat-based tomography, which represents a volume as a continuous distribution of 3D Gaussians, has demonstrated both high reconstruction quality and fast convergence in cone-beam sparse-view CT. However, when deployed in real CT systems with limited and non-uniform view distributions, we observe distinctive streak and strip artifacts that are far more pronounced than in conventional reconstruction methods. Through detailed analysis, we show that these artifacts primarily originate from pose inaccuracies in the acquisition geometry rather than from view sparsity itself. We revisit pose sensitivity in the splatting formulation and derive a stable gradient-based framework that jointly refines geometric parameters during reconstruc
352	core_reconstruction medium	DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging Computational Imaging / Computational Imaging	D. adjacent but useful context	general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark	direct reconstruction/3DGS/4D title linked to core representation cluster	abstract Video snapshot compressive imaging (SCI) offers a promising alternative to high-speed cameras by encoding multiple frames into a single 2D measurement. However, SCI requires algorithms to reconstruct the high-speed video and as resolution increases, reconstruction becomes computationally expensive and memory-intensive. Much of resource is wasted on recovering large background regions that contain little useful information, highlighting the need for selective, object-driven reconstruction. Existing object detectors struggle to perform accurately on SCI measurements due to the spatial–temporal aliasing introduced by coded exposure. To address this challenge, we proposes DetectSCI, the first framework enabling object-guided region-of-interest (ROI) reconstruction for high-resolution SCI. The inside detector comprises two key components: an encoder built from weight-sharing Mamba-Implicit Mo
353	core_reconstruction medium	Generative Diffusion Priors for 3D Mapping of the Dark Universe Medical & Scientific Imaging / Medical Imaging	D. adjacent but useful context	general_reconstruction; gaussian_radiance; robotics_mapping; data_benchmark	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Reconstructing the three-dimensional distribution of dark matter from weak-lensing observations is a central but highly ill-posed inverse problem in cosmology. Unlike standard 3D reconstruction with multiple viewpoints, we observe the universe from a single line of sight, through noisy shape distortions of galaxies with uncertain distances, so meaningful recovery of the 3D matter field requires strong prior assumptions. Existing methods either produce point estimates with handcrafted priors or use neural ensembles for approximate Bayesian uncertainty, and struggle to capture the non-Gaussian, filamentary structure of the cosmic web. With the advent of new high-resolution cosmological simulations, we now have an alternative source of prior knowledge that captures the nonlinear statistics of structure formation with far greater fidelity than analytic prescriptions. We leverage these simula
354	core_reconstruction medium	REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting Multimodal & Language / VLM / MLLM	D. adjacent but useful context	gaussian_radiance; pose_calibration_localization; surface_occupancy; generation_editing	direct reconstruction/3DGS/4D title linked to core representation cluster	abstract Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically
355	core_reconstruction medium	GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction Medical & Scientific Imaging / Medical Imaging	D. adjacent but useful context	general_reconstruction; gaussian_radiance; surface_occupancy	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Slice-based volumetric imaging is widely applied and it demands representations that compress aggressively while preserving internal structure for analysis. This paper introduces GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our new method introduces three key innovations: (i) a slice‑aware piling strategy that positions anisotropic 3D Gaussians to model through‑slice contributions, (ii) a differentiable projection operator that encodes the finite‑thickness point spread function of the imaging acquisition system, and (iii) a compact encoding and joint optimization pipeline that simultaneously reconstructs and compresses the Gaussian sets. Our CUDA-based design retains the compression and real‑time rendering efficiency of Gaussian primitives while preserving high‑frequency internal volumetric detail. Experiments on microsc
356	core_reconstruction medium	Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement Medical & Scientific Imaging / Medical Imaging	D. adjacent but useful context	general_reconstruction; dynamic_4d; surface_occupancy	direct reconstruction/3DGS/4D title linked to core representation cluster	abstract Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, an Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments
357	core_reconstruction medium	MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Driving Autonomous Driving / Autonomous Driving	D. adjacent but useful context	gaussian_radiance; depth_correspondence; robotics_mapping; data_benchmark	direct reconstruction/3DGS/4D title linked to core representation cluster	abstract Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance.To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We introduce ``MeanFlow Identity", which models the mean velocity fiel
358	core_reconstruction medium	Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance Medical & Scientific Imaging / Medical Imaging	D. adjacent but useful context	gaussian_radiance; surface_occupancy; robotics_mapping	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract Implicit neural representation (INR) based methods learn a continuous mapping from a low-resolution (LR) target magnetic resonance (MR) image and a high-resolution (HR) reference image to achieve arbitrary-scale super-resolution (SR). However, their inherent spectral bias favors learning low-frequency (LF) components, often failing to capture the sharp transitions at anatomical boundaries and resulting in the loss of high-frequency (HF) details. Inspired by 3D Gaussian splatting, we propose GaussM²ASR (Gaussian Multi-contrast MRI Arbitrary-scale Super-Resolution), which converts the challenging task of HF anatomical reconstruction into a smoother parameter optimization problem by learning the parameters of anisotropic 2D Gaussian kernels. To handle inter-contrast discrepancies, we introduce an anatomy-guided pipeline comprising three core modules: a Structure Prior Modulation Fusion (SPM
359	core_reconstruction medium	TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction Autonomous Driving / Autonomous Driving	D. adjacent but useful context	pose_calibration_localization; dynamic_4d; robotics_mapping	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract 3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully pl
360	core_reconstruction medium	DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures Multimodal & Language / Agentic AI	D. adjacent but useful context	general_reconstruction; gaussian_radiance	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract 3D Gaussian Splatting achieves real-time photo-realistic rendering but struggles when training images contain transient objects that violate multi-view consistency. Existing methods face a fundamental dilemma: accurate transient detection requires well-reconstructed static scenes, yet clean reconstruction depends on reliable transient masks. This circular dependency causes persistent artifacts when both components are jointly optimized from poor initialization. We present DualSplat, a two-stage framework which sidesteps this dilemma by first generating pseudo masks from reconstruction failures, then using them to guide clean scene optimization. We observe that transient objects manifest as incomplete fragments during initial training, since they appear in only a subset of views. We consolidate these failures into pseudo masks via instance-level thresholding and a feature-residual filter
361	core_reconstruction medium	RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks Robustness & Safety / Safety	D. adjacent but useful context	general_reconstruction; gaussian_radiance	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract As a mainstream technique for 3D reconstruction, 3D Gaussian splatting (3DGS) has been applied in a wide range of applications and services. Recent studies have revealed critical vulnerabilities in this pipeline and introduced computation cost attacks that lead to malicious resource occupancies and even denial-of-service (DoS) conditions, thereby hindering the reliable deployment of 3DGS. In this paper, we propose the first effective and comprehensive black-box defense framework, named RemedyGS, against such computation cost attacks, safeguarding 3DGS reconstruction systems and services. Our pipeline comprises two key components: a detector to identify the attacked input images with poisoned textures and a purifier to recover the benign images from their attacked counterparts, mitigating the adverse effects of these attacks. Moreover, we incorporate adversarial training into the purifier
362	core_reconstruction medium	Eulerian Gaussian Splatting using Hashed Probability Pyramids Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	gaussian_radiance; robotics_mapping	3D Vision & Geometry paper with direct reconstruction title and abstract signal	abstract We introduce a probabilistic splat-based radiance field framework that retains the fast rasterization and test-time efficiency of 3D Gaussian Splatting (3DGS) while replacing heuristic primitive manipulation with gradient-based optimization of a volumetric probability density. Rather than relocating, splitting, or culling Gaussians via hand-tuned densification (e.g., ADC), we treat primitive locations as samples drawn from a persistent, learnable density. We instantiate this density with a novel, memory-efficient multi-scale hierarchical grid that enables end-to-end gradient-based control over primitive population density. To stabilize stochastic training, we derive an unbiased gradient estimator with control variates that markedly reduces variance. By allowing probability mass to flow to where the loss demands, our method eliminates brittle priors and naturally explores the volume, achi
363	strong_bridge medium	Clone Deterministic 3D Worlds 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds, but neglect the need for high-fidelity modeling of deterministic scenarios (such as fixed-map mazes and static space robot navigation). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. 1) Through diagnostic experiment, we quantitatively demonstrate that high-fidelity cloning is feasible and the primary bottleneck for long-horizon fidelity is the geometric st
364	strong_bridge medium	NeuROK: Generative 4D Neural Object Kinematics 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	dynamic/4D paper with direct reconstruction signal	abstract Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics---realistic temporal deformations of static objects under various physical conditions---remains challenging and often ad hoc despite being critical for building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space of all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameteriza
365	strong_bridge medium	SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy	dynamic/4D paper with direct reconstruction signal	abstract Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency.We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors with this geodesic correspondence field supervision, (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space.This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing.SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy–efficiency trad
366	strong_bridge medium	Spatia: Video Generation with Updatable Spatial Memory 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; generation_editing	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory–aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic–static disentanglement design enhances spatial consistency throughout the generation process while preserving the model’s ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
367	strong_bridge medium	D-Prism: Differentiable Primitives for Structured Dynamic Modeling 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy	dynamic/4D paper with direct reconstruction signal	abstract Capturing both geometry and rigid motion for structured dynamic objects, like multi-part assemblies or jointed mechanisms, remains a key challenge. Existing dynamic methods, such as deformable meshes or 3DGS, rely on unstructured representations and fail to jointly model suitable geometry and articulated motion. Primitive-based methods excel at structured static scenes, but their dynamic potential is still unexplored. We propose D-Prism, the first framework to achieve high-fidelity structured dynamic modeling by extending differentiable primitives to the dynamic domain.Specifically, we bind 3DGS to primitive surfaces, leveraging their respective strengths in appearance and geometry. We introduce a deformation network to control primitive motion, ensuring it accurately matches the object's movement. Furthermore, we design a novel adaptive control strategy to dynamically adjust primitive c
368	strong_bridge medium	Dark3R: Learning Structure from Motion in the Dark 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark	Gaussian/radiance representation linked to pose/mapping/metric bridge	abstract We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below $-4$ dB—a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher–student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy--clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson–Gaussian noise model applied to well-exposed raw images.To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes $\sim$42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structur
369	strong_bridge medium	Perceptual 3D Simulation With Physical World Modeling 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Predicting how a scene will evolve after a desired 3D transformation from images is a central goal in vision, graphics, and robotics. Yet unlike ideal simulators with full access to 3D geometry and dynamics, real world systems must rely on perceptual inputs and local actions that are inherently partial and incomplete. In this work, we present P3Sim, a physical world modeling system that simulates future scene states under both partial observations and incomplete 3D transformation signals. P3Sim is composed of three interacting components: a learned physical world model, a geometric conditioning module, and a persistent scene memory. The world model interprets perception as probabilistic inference over multimodal scene variables, providing predictions of the distributions of any scene variable conditioned on any combination of others. The geometric conditioning module provides a partial 3
370	strong_bridge medium	Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark	dynamic/4D paper with direct reconstruction signal	abstract Existing dynamic scene rendering methods often adopt rigid-body or direction-limited assumptions, yet real-world motion and contact routinely violate these, producing artifacts near occlusion boundaries. To address this, we introduce a unified, source-aware framework for dynamic rendering that enforces the consistency of Gaussian primitives under an explicit manifold constraints. We project predicted velocities onto physically grounded priors via efficient, parallel inner solves: (i) a Helmholtz parameterization that separates divergence-free and potential-flow motion components; (ii) an anisotropic, compressible directional prior; and (iii) an affine family that disentangles rotation from isotropic scaling. Experiments on extensive benchmarks show consistent improvements over state-of-the-art methods in reconstruction fidelity and temporal coherence. Our approach ensures physically real
371	strong_bridge medium	SceneTok: A Compressed, Diffusable Token Space for 3D Scenes 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing	Gaussian/radiance representation linked to pose/mapping/metric bridge	abstract We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a light-weight rectified flow decoder. A diffusion transformer enables scene generation on the compressed token space. We show that the compression is two orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including
372	strong_bridge medium	VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretr
373	strong_bridge medium	ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes 3D Vision & Geometry / 3D Gaussian Splatting	B. bridge: reconstruction becomes mapping/world model	gaussian_radiance; dynamic_4d; robotics_mapping; data_benchmark	dynamic/4D paper with direct reconstruction signal	abstract Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It‘s also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we re
374	strong_bridge high	DROID-SLAM in the Wild 3D Vision & Geometry / Pose Estimation	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmark	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 8 FPS. The source code will be publicly
375	strong_bridge high	HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps 3D Vision & Geometry / Pose Estimation	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmark	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusi
376	strong_bridge medium	StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation 3D Vision & Geometry / Pose Estimation	B. bridge: reconstruction becomes mapping/world model	pose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mapping	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract We propose StreamVLO, a streaming visual–LiDAR odometry framework that performs unified spatio-temporal correlation with Mamba models and tackles the long-standing cumulative drift problem via an online Cumulative Drift Compensation scheme for localization in 4D dynamic environments. Specifically, StreamVLO introduces a unified spatio-temporal correlation module built on Mamba to fuse heterogeneous visual and LiDAR cues across multi-frame clips, overcoming the limited temporal exploration of prior pairwise methods. Furthermore, a Cumulative Drift Compensation module minimizes cumulative drift by iteratively learning residual corrections from multiple historical frames in a causal manner. To strengthen spatial feature representation on salient regions, we adopt a Keypoint-Aware Auxiliary Loss with a winner-takes-all strategy. StreamVLO achieves state-of-the-art performance on two commonly
377	strong_bridge high	VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models 3D Vision & Geometry / Pose Estimation	B. bridge: reconstruction becomes mapping/world model	pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird’s-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mecha
378	strong_bridge high	VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation 3D Vision & Geometry / Pose Estimation	B. bridge: reconstruction becomes mapping/world model	pose_calibration_localization; depth_correspondence; robotics_mapping	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract Accurate global localization is crucial for autonomous driving and robotics, especially in dense urban environments where GNSS is often unreliable due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. To address this challenge, we propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to r
379	strong_bridge high	Similarity-Consistent Likelihood Diffusion enables Hidden Person Detection from Wall Reflections 3D Vision & Geometry / Pose Estimation	B. bridge: reconstruction becomes mapping/world model	pose_calibration_localization; robotics_mapping	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract This paper studies passive non-line-of-sight corner-camera detection and human localization using faint indirect reflections on a visible wall. The challenge is twofold: multi-exposure wall observations are unstable and entangled with sensor nonlinearities, and mapping these observations to a hidden-view RGB image is severely underdetermined, making purely discriminative regressors brittle and unconstrained diffusion priors stochastic. To address these challenges, we introduce the Similarity-Likelihood Diffusion Network (SLD-Net), a two-stage framework that produces measurement-consistent, deterministic reconstructions. First, DeLi-Inversion forms an exposure-aware differential representation and jointly predicts an initial reconstruction and a pixel-wise precision map, yielding a heteroscedastic pseudo-likelihood. Second, SiCo-Diffusion injects this likelihood as precision-weighted ener
380	strong_bridge high	Uncertainty-Aware Modality Fusion for Unaligned RGB-T Salient Object Detection 3D Vision & Geometry / Pose Estimation	B. bridge: representation meets metric pose	gaussian_radiance; pose_calibration_localization; surface_occupancy	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract Unaligned RGB-T salient object detection (SOD) remains challenging due to severe cross-modal spatial discrepancies and unreliable feature fusion. Existing methods often assume perfect alignment or rely on geometric registration, which is computationally demanding and sensitive to cross-modal inconsistencies. To address these limitations, we propose an uncertainty-aware modality fusion network (UMFNet) that reformulates RGB-T SOD as an uncertainty-aware representation learning problem. Specifically, the proposed uncertainty alignment module (UAM) models pixel-wise features as Gaussian latent distributions to estimate local uncertainty and identify cross-modal consistency regions within the feature space, thereby achieving implicit alignment without explicit registration. Furthermore, the confidence-guided global modulation (CGM) mechanism leverages confidence maps derived from uncertainty
381	strong_bridge high	AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment 3D Vision & Geometry / Pose Estimation	C. cluster representative	general_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmark	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions.First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation.Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-flyrendered object features and obser
382	strong_bridge high	CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization 3D Vision & Geometry / Pose Estimation	C. cluster representative	general_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmark	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract Scene Coordinate Regression (SCR) has emerged as a memory-efficient paradigm for visual localization.While SCR has demonstrated performance comparable to classic feature matching based approaches in small-scale scenes, it has consistently underperformed in large-scale environments.Large-scale localization is hampered by two challenges: sparse co-visibility and local appearance ambiguity.In this work, we propose CoLoR, a novel training framework tailored for large-scale SCR.First, we explicitly and efficiently partition scene points into multi-view and single-view sets and introduce a two-stage bootstrapping paradigm to provide complete and strong supervision for all points.Second, we propose a multi-granularity retrieval feature, which unifies the conventional global and local features as retrieval-oriented representations at the image and pixel levels, respectively, to enforce featu
383	strong_bridge medium	Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d; generation_editing; data_benchmark	dynamic/4D paper with direct reconstruction signal	abstract We introduce Diff4Splat, a feed-forward framework for dynamic scene generation from a single image. Our method synergizes the powerful generative priors of video diffusion models with geometric and motion constraints learned from a large-scale 4D dataset. Given a single image, a camera trajectory, and an optional text prompt, our model directly predicts a dynamic scene represented by a deformable 3D Gaussian field. This approach captures appearance, geometry, and motion in a single pass, eliminating the need for test-time optimization or post-hoc processing. At the core of our framework is a video latent transformer that enhances existing video diffusion models, enabling them to jointly model spatio-temporal dependencies and predict 3D Gaussian Primitives over time. Supervised by objectives targeting appearance fidelity, geometric accuracy, and motion consistency, Diff4Splat generates hi
384	strong_bridge medium	Event6D: Event-based Novel Object 6D Pose Tracking 3D Vision & Geometry / Pose Estimation	C. cluster representative	pose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmark	dynamic/4D paper with direct reconstruction signal	abstract Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclus
385	strong_bridge high	PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization 3D Vision & Geometry / Pose Estimation	C. cluster representative	general_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmark	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity.PiLoT breaks this paradigm by directly registering live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from core localization thread, ensuring both low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps). This dataset enables the training of a
386	strong_bridge medium	ShapeR: Robust Conditional 3D Shape Generation from Casual Captures 3D Vision & Geometry / 3D Reconstruction	C. cluster representative	general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given a image sequence, we leverage off-the-shelf visual-inertial SLAM,3D detection algorithms and VLMs to extract for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strateg
387	strong_bridge medium	SpatialVID: A Large-Scale Video Dataset with Spatial Annotations 3D Vision & Geometry / Pose Estimation	C. cluster representative	pose_calibration_localization; depth_correspondence; dynamic_4d; generation_editing; data_benchmark	dynamic/4D paper with direct reconstruction signal	abstract Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion.To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions.Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subseq
388	strong_bridge high	Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization 3D Vision & Geometry / Pose Estimation	C. cluster representative	general_reconstruction; pose_calibration_localization; surface_occupancy	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract Learning-based structure-from-motion methods such as ACE-Zero have demonstrated strong performance in estimating camera poses and scene coordinates from unordered image collections without requiring ground truth supervision. However, the lack of global and multi-view consistency constraints in ACE-Zero can lead to pose drift and misalignment, particularly in complex or ambiguous scenes. In this work, we propose a hybrid framework that integrates pose graph optimization (PGO) into ACE-Zero to refine camera poses and suppress incorrect refinements. We construct pose graphs directly from ACE-Zero outputs by extracting relative pose constraints from predicted scene coordinates. Furthermore, we introduce an uncertainty-aware optimization strategy by estimating confidence scores using geometric priors, including epipolar and optical flow consistencies across views. Our approach improves the ro
389	strong_bridge high	Sparse–View Localization via Online Neural 3D Regression 3D Vision & Geometry / Pose Estimation	C. cluster representative	general_reconstruction; pose_calibration_localization; depth_correspondence	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract We present ON3R, an online-trained neural regressor addressing sparse-view structureless localization, where database images have limited visual overlap and no prebuilt 3D map. Given any sparse matches between a query and a $K$-tuple of posed database views, ON3R predicts 3D coordinates for matched query keypoints, supervised by database reprojection residuals and a monocular depth prior. Afterwards, the absolute pose of the query is estimated via P3P-RANSAC and refined with lightweight bundle adjustment. Across MegaDepth, Cambridge Landmarks, and a sparsified version of Aachen Day-Night, ON3R outperforms existing methods. ON3R is particularly effective when the data is extremely sparse -- we focus on $K\leq10$ database images. The code, data splits, and SfM models will be made available for full reproducibility.
390	strong_bridge high	JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction 3D Vision & Geometry / Pose Estimation	C. cluster representative	general_reconstruction; pose_calibration_localization; surface_occupancy	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract In this paper, JUMP-Hand is proposed as a novel method for multi-view 3D hand reconstruction, which is the first to introduce probabilistic joint-wise uncertainty as an explicit gating mechanism to fuse multi-view information.Existing approaches usually fuse multi-view information by naïve pooling or implicit attention.However, they overlook that each hand joint exhibits varying visibility and reliability across views, which may degrade performance by indiscriminately aggregating noisy or unreliable information.For instance, one joint may be clearly visible in one view, while another joint is occluded in that view but visible in a different view.In contrast, JUMP-Hand addresses this by introducing the core insight of Mixture of Experts (MoE) and regard each 2D view as an expert.The key idea is that the reliability of each view expert is quantified through joint-wise uncertainty modeling,
391	strong_bridge medium	MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d; data_benchmark	dynamic/4D paper with direct reconstruction signal	abstract We present MoVieS, a motion-aware view synthesis model that reconstruct 4D dynamic scenes from monocular videos in one second. It represents dynamic 3D scenes with pixel-aligned Gaussian primitives and explicitly supervises their time-varying motions. This allows, for the first time, the unified modeling of appearance, geometry and motion from monocular videos, and enables reconstruction, view synthesis and 3D point tracking within a single learning-based framework. By bridging view synthesis with geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive p
392	strong_bridge medium	AvatarPointillist: AutoRegressive 4D Gaussian Avatarization 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d; surface_occupancy	dynamic/4D paper with direct reconstruction signal	abstract We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject's complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality,
393	strong_bridge medium	Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d; data_benchmark	dynamic/4D paper with direct reconstruction signal	abstract Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution.We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation.MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations.Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution.Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-ter
394	strong_bridge medium	EmoDiffTalk：Emotion-aware Diffusion for Editable 3D Gaussian Talking Head 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d; generation_editing	dynamic/4D paper with direct reconstruction signal	abstract Recent photo-realistic 3D talking head via 3D Gaussian Splatting still has significant shortcoming in emotional expression manipulation, especially for fine-grained and expansive dynamics emotional editing using multi-modal control. This paper introduces a new editable 3D Gaussian talking head, i.e. EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which includes an action unit (AU) prompt Gaussian diffusion process for fine-grained facial animator, and moreover an accurate text-to-AU emotion controller to provide accurate and expansive dynamic emotional editing using text input. Experiments on public EmoTalk3D and RenderMe-360 datasets demonstrate superior emotional subtlety, lip-sync fidelity, and controllability of our EmoDiffTalk over previous works, establishing a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synth
395	strong_bridge high	SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching 3D Vision & Geometry / Pose Estimation	C. cluster representative	pose_calibration_localization; depth_correspondence; surface_occupancy	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract Image-to-point cloud matching (2D-3D matching) establishes accurate correspondences between image keypoints and 3D points for 6-DoF camera pose estimation. Existing methods either suffer from poor generalization due to scene-specific coordinate regression requiring per-scene retraining, or incur high storage and maintenance costs from descriptor-based matching that relies on large descriptor sets. Consequently, descriptor-free approaches have gained attention by avoiding heavy storage while improving generalizability; however, most rely only on low-level geometric cues, which limits performance. Leveraging the benefits of semantics in providing context, resolving ambiguities, and enhancing robustness in challenging scenes, we propose the Semantic-Aware Guided Graph Neural Network (SAG-GNN), integrating high-level semantics into descriptor-free 2D-3D matching. Specifically, we design a co
396	strong_bridge high	HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling 3D Vision & Geometry / Pose Estimation	C. cluster representative	general_reconstruction; pose_calibration_localization	pose/localization bridge genus=Pose Estimation with reconstruction/map signal	abstract Recovering global human and camera motion from monocular video is essential for world-coordinate human reconstruction but remains challenging due to entangled motions in image space. Traditional SLAM methods estimate monocular camera motion but fail in scenes dominated by foreground objects such as humans. A common workaround is to mask out dynamic objects, yet this approach becomes brittle when humans occupy most of the view or the background is too noisy, leading to unstable tracking and loss of constraints. This paper takes the opposite stance and reintegrates human motion as informative landmarks. We introduce HumanBA, a human-aware bundle adjustment framework that transforms dynamic humans into usable constraints via motion decoupling. HumanBA subtracts the human-induced component from observed joint trajectories, isolating a camera-induced (pseudo-static) component that can be saf
397	strong_bridge medium	PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video 3D Vision & Geometry / 3D Gaussian Splatting	C. cluster representative	gaussian_radiance; dynamic_4d	dynamic/4D paper with direct reconstruction signal	abstract Physically plausible reconstruction of human–object dynamics from a single video remains under-explored in physics-based methods. Most prior approaches omit human-generated internal actuation by assuming motion driven solely by gravity and simple contacts. They also rely on idealized constitutive laws that underfit heterogeneous and anisotropic materials. We introduce PhysHO, which tightly couples SMPL-driven Linear Blend Skinning (LBS) with a Material Point Method (MPM) simulator to address these gaps. Our key insight is to use LBS as an interpretable actuation prior and MPM to propagate those forces through contact under physical constraints. Concretely, we derive targeted actuation with a PD controller guided by LBS trajectories and gate it per particle via a learnable LBS-impact factor so that only particles inside the SMPL volume are directly actuated. We model real materials with r
398	strong_bridge medium	CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Autonomous driving must operate reliably across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields ~500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other publi
399	strong_bridge medium	Dexterous World Models Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static—limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), an scene-action-conditioned video diffusion model enabling embodied interaction within static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human–scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues in the egocentric view to model action-conditioned dynamics directly. We train our model on a synthetic h
400	strong_bridge medium	DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Vision-based autonomous driving has gained much attention due to its low costs and excellent performance.Compared with dense BEV (Bird’s Eye View) or sparse query models, Gaussian-centric method is a comprehensive yet sparse representation by describing scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic gaussian-centric pre-training in autonomous driving using two stages. In the first stage, DLWM predicts 3D Gaussians from queries by self-supervised reconstructing multi-view semantic and depth images. Equipped with fine-grained contextual features, in the second stage, two latent world models are trained separately for temporal feature learning, including Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent pr
401	strong_bridge medium	GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We furth
402	strong_bridge medium	GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Reliable 3D perception from multi-view roadside sensors hinges on the robust fusion of camera and LiDAR data, a task complicated by geometric misalignments and sensor calibration errors. This paper presents GSV2X, a fusion framework that tackles these challenges through two core contributions. First, to achieve robustness against spatial uncertainty, we lift 2D image features into a unified Bird's-Eye-View (BEV) space by representing them as 3D Gaussian distributions. By incorporating learnable perturbations guided by camera geometry, our model explicitly accounts for potential calibration inaccuracies. Second, to maximize the synergy between modalities, we propose a new orthogonal fusion module. This module employs constrained attention to enforce orthogonality between camera and LiDAR features, effectively disentangling redundant information and promoting the learning of complementary
403	strong_bridge medium	NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos Generative Models / Video Generation	D. adjacent but useful context	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks.
404	strong_bridge medium	ORV: 4D Occupancy-centric Robot Video Generation Generative Models / Video Generation	D. adjacent but useful context	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Recent embodied intelligence suffers from data scarcity, while conventional simulators lack visual realism. Controllable video generation is emerging as a promising data engine, yet current action-conditioned methods still fall short: generated videos are limited in fidelity and temporal consistency, poorly aligned with controls, and often constrained to singleview settings. We attribute these issues to the representational gap between sparse control inputs and dense pixel outputs. Thus, we introduce ORV, a 4D occupancy-centric framework for robot video generation that couples action priors with occupancy-derived visual priors. Concretely, we align chunked 7-DoF actions with video latents via an Action-Expert AdaLN modulation, and inject 2D renderings of 4D semantic occupancy into the generation process as soft guidance. Meanwhile, a central obstacle is the lack of occupancy data for emb
405	strong_bridge medium	Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; generation_editing; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally,
406	strong_bridge medium	Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	Gaussian/radiance representation linked to pose/mapping/metric bridge	abstract Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for system validation and training purposes. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite that we refer to as the AV log, which includes multi-view camera images and LiDAR point
407	strong_bridge medium	Stereo World Model Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation.Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower com
408	strong_bridge medium	U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining a
409	strong_bridge medium	Unified Camera Positional Encoding for Controlled Video Generation Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; dynamic_4d; robotics_mapping; generation_editing; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full con
410	strong_bridge medium	UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; robotics_mapping; generation_editing; data_benchmark	Gaussian/radiance representation linked to pose/mapping/metric bridge	abstract Recent progress in 3D reconstruction, such as NeRFs and 3D Gaussian Splatting, has made it easy to recover geometry and appearance from images. However, these static representations remain blind to the physics that govern how objects deform and respond to forces. Building interactive 3D worlds therefore requires predicting not only shape but the underlying material properties. Prior approaches either rely on slow test-time optimization or, more recently, a fast feed-forward predictor such as Pixie. However, these models produce only a single point estimate of physical parameters and are limited to a single simulation backend, restricting both expressiveness and portability. We introduce UniPixie, a generative physics-from-pixels framework that overcomes both limitations. UniPixie predicts a controllable, continuous soft-to-stiff distribution of plausible material properties from a single
411	strong_bridge medium	WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery Remote Sensing & Earth / Remote Sensing	D. adjacent but useful context	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth–based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task—which lacks suitable benchmarks—we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivin
412	strong_bridge medium	HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles Autonomous Driving / Autonomous Driving	D. adjacent but useful context	gaussian_radiance; dynamic_4d; robotics_mapping; generation_editing; data_benchmark	Gaussian/radiance representation linked to pose/mapping/metric bridge	abstract Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce \textbf{HorizonForge}, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose \textbf{HorizonSuite}, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian Splatting delivers substantially higher fidelity than a
413	strong_bridge medium	GEM: Generating LiDAR World Model via Deformable Mamba Autonomous Driving / Autonomous Driving	D. adjacent but useful context	dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose GEM: a Generative LiDAR world model that leverages dEformable Mamba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separ
414	strong_bridge medium	An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving Data & Evaluation / Benchmark	D. adjacent but useful context	general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Panoptic occupancy prediction aims to jointly infer voxel-wise semantics and instance identities within a unified 3D scene representation. Nevertheless, progress in this field remains constrained by the absence of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets. Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. Specifically, we introduce ADMesh, the first unified 3D mesh library tailored for autonomous driving, which integrates over 15K high-quality 3D models with diverse textures and rich semantic
415	strong_bridge medium	Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	general_reconstruction; gaussian_radiance; robotics_mapping; data_benchmark	Gaussian/radiance representation linked to pose/mapping/metric bridge	abstract Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning
416	strong_bridge medium	Multi-Scale Gaussian-Language Map for Embodied Navigation and Reasoning Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mapping	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Understanding the geometric and semantic structure of environments is essential for embodied agents. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics,and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance and region level concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian par
417	strong_bridge medium	Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras Computational Imaging / Computational Imaging	D. adjacent but useful context	pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Conventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further intro
418	strong_bridge medium	DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Although multimodal large language models (MLLMs) have shown remarkable capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs within a unified framework remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework for autonomous driving, performing spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through joint optimization. We term it 4D MLLM as it outputs both 3D occupancy and flow, capturing fine-grained spatial-temporal dynamics. Specifically, to capture both precise geometric information and rich appearance, our approach integrates point clouds, multi-view images and language instructions within a single MLLM architecture. Remarkably, desp
419	strong_bridge medium	OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective Data & Evaluation / Benchmark	D. adjacent but useful context	general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, indu
420	strong_bridge medium	Spatial Retrieval Augmented Autonomous Driving Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Existing autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under limited view scope, occlusion or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this "recall" ability, we propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. These images are easy to obtain from offline caches (e.g, Google Maps or stored autonomous driving datasets) without requiring additional sensors, making it a plug-and-play extension for existing AD stacks.For experiments, we first extend the nuScenes dataset with geographic images retrieved via Google Maps APIs and align the new data with ego-vehicle trajector
421	strong_bridge medium	LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; depth_correspondence; dynamic_4d; robotics_mapping	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Accurate metric depth is critical for autonomous driving perception and simulation, yet current approaches struggle to achieve high metric accuracy, multi-view and temporal consistency, and cross-domain generalization.To address these challenges, we present MVS-Pro, a novel multi-view stereo framework that reconciles these competing objectives through two key insights: (1) Sparse but metrically accurate LiDAR observations can serve as geometric prompts to anchor depth estimation in absolute scale, and (2) deep fusion of diverse cues is essential for resolving ambiguities and enhancing robustness, while a spatio-temporal decoder ensures consistency across frames.Built upon these principles, MVS-Pro embeds the LiDAR prompt in two ways: as a hard geometric prior anchoring the cost volume, and as soft feature-wise guidance fused by a triple cues combiner.As for temporal consistency, MVS-Pro
422	strong_bridge medium	Scene Reconstruction as Mapping Priors for 3D Detection Detection & Tracking / Detection	D. adjacent but useful context	general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract In autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks like 3D object detection. Maps can provide robust structural priors of the static environment, suited to resolving ambiguities and correcting for sensor data sparsity or noise — issues especially prevalent for distant objects or during adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for achieving efficient, large-scale deployment. In this paper, we propose a scalable solution to systemically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Prior Augmented 3D det
423	strong_bridge medium	URScenes: A Multi-scenario Dataset for Unstructured Road Environments Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract As autonomous driving technology transitions from small-scale validation to large-scale deployment, its development in unstructured road environments has become a critical and inevitable trend. Autonomous vehicles increasingly rely on high-quality and diverse datasets for perception systems. However, existing public datasets predominantly focus on clear-weather and urban-road scenarios, leaving a significant gap in the coverage of unstructured road environments. To bridge this gap, we construct URScenes, the first multi-scenario, open-source perception dataset for unstructured road environments. The dataset consists of 472 scenes, each lasting 30 seconds, and provides over 28K annotated samples and 119K sweeps. URScenes, for the first time, covers eight typical scenarios, including rainy, snowy, foggy, dusty, glare, night, cloudy, and sunny conditions. Additionally, URScenes supports mul
424	strong_bridge medium	QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy Autonomous Driving / Autonomous Driving	D. adjacent but useful context	dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving.Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels.Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability.We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames.The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data.To enable long-range supervision and reasoning under constant memory, we intr
425	strong_bridge medium	UniGeoRS: A Unified Benchmark for Tri-view Geo-Localization Remote Sensing & Earth / Remote Sensing	D. adjacent but useful context	pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Cross-view geo-localization (CVGL) aims to estimate an image’s geographic location by matching it with geo-referenced images from different viewpoints, supporting applications such as autonomous driving, UAV navigation, and visual surveillance. However, due to the high cost of image collection, current CVGL datasets often suffer from limited diversity in both drone and ground imagery, which constrains model generalization. Furthermore, existing methods primarily focus on either ground-to-satellite or drone-to-satellite matching, lacking a unified framework capable of handling image matching across all three platforms: satellite, drone, and ground. To this end, we introduce the Unified Geo-localization dataset with Real-world and Synthetic imagery (UniGeoRS), a comprehensive benchmark featuring satellite, drone, and ground-view images, with a particular emphasis on the richness and divers
426	strong_bridge medium	NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction Robotics & Embodied AI / Embodied AI	D. adjacent but useful context	dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single, unified framework.Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-t
427	strong_bridge medium	OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; surface_occupancy; robotics_mapping	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Offline vectorized maps constitute critical infrastructure for high-precision autonomous driving and mapping services. Existing approaches rely predominantly on single ego-vehicle trajectories, which fundamentally suffer from viewpoint insufficiency: while memory-based methods extend observation time by aggregating ego-trajectory frames, they lack the spatial diversity needed to reveal occluded regions. Incorporating views from surrounding vehicles offers complementary perspectives, yet naive fusion introduces three key challenges: computational cost from large candidate pools, redundancy from near-collinear viewpoints, and noise from pose errors and occlusion artifacts.We present OptiMVMap, which reformulates multi-vehicle mapping as a select-then-fuse problem to address these challenges systematically. An Optimal Vehicle Selection (OVS) module strategically identifies a compact subset
428	strong_bridge medium	Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving Autonomous Driving / Autonomous Driving	D. adjacent but useful context	depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract 3D occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to lack of pixel-level accurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D$^2$-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R$^2$-EFormer) that adaptively allocate
429	strong_bridge medium	Complementary Prototype Mapping for Efficient Multimodal Anomaly Detection Detection & Tracking / Detection	D. adjacent but useful context	pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Multimodal unsupervised anomaly detection has garnered increasing attention for robust defect localization.Recent approaches rely on establishing cross-modal matching relationships under normal conditions without explicit guidance.However, in practice, a single modality may have multiple distinct representations corresponding to another modality, and such unconditional mappings struggle to adaptively capture these variations, resulting in mapping ambiguity and the misclassification of diverse yet normal variations as anomalies.Moreover, existing methods suffer from slow inference speed and high memory overhead, hindering their deployment in real-world production lines.To address these issues, we propose an efficient and effective Complementary Prototype Mapping (\textbf{CPMAD}) framework, which dynamically extracts consensus and supplementary prototypes to serve as complementary prior
430	strong_bridge medium	PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence Recognition & Classification / Retrieval	D. adjacent but useful context	pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Cross-view geo-localization is a critical task for UAV navigation, event detection, and aerial surveying, which establish correspondence between drone-captured and satellite imagery. Most existing approaches embed cross-view data into a joint feature space to maximize similarity between paired images. However, these methods typically assume perfect alignment of image pairs in training data, an assumption that rarely holds in practical scenarios. In real-world conditions, factors such as urban canyon effects, electromagnetic interference, and adverse weather frequently induce GPS drift, resulting in systematic alignment shifts where only partial correspondences exist between image pairs. Despite its prevalence, this source of noisy correspondence has received limited attention in current research.To our best knowledge, this work presents the first systematic investigation of the **Noi
431	strong_bridge medium	DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving Autonomous Driving / Autonomous Driving	D. adjacent but useful context	general_reconstruction; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensor data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that expert specialization enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. First, we introduce Drive-π0, a Vision-Language-Action (VLA) baseline adapted from Embodied AI for autonomous driving, which serves as the foundation model for DriveMoE. Building on this, we strengthen perception through a carefully designed Vision MoE, where a router adaptively selects context-relevant camera views. This mechanism is inspired by human driving cognition, in which
432	strong_bridge medium	Think Before You Drive: World Model-Inspired Multimodal Grounding Autonomous Driving / Autonomous Driving	D. adjacent but useful context	pose_calibration_localization; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods in AD struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we presen
433	strong_bridge medium	NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks Learning Algorithms / Optimization	D. adjacent but useful context	depth_correspondence; surface_occupancy; robotics_mapping	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Partially Supervised Multi-Task Learning (PS-MTL) aims to leverage knowledge across tasks when annotations are incomplete. Existing approaches, however, have largely focused on the simpler setting of homogeneous, dense prediction tasks, leaving the more realistic challenge of learning from structurally diverse tasks unexplored. To this end, we introduce NexusFlow, a novel, lightweight, and plug-and-play framework effective in both settings. NexusFlow introduces a set of surrogate networks with invertible coupling layers to align the latent feature distributions of tasks, creating a unified representation that enables effective knowledge transfer. The coupling layers are bijective, preserving information while mapping features into a shared canonical space. This invertibility avoids representational collapse and enables alignment across structurally different tasks without reducing expres
434	strong_bridge medium	ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction Autonomous Driving / Autonomous Driving	D. adjacent but useful context	depth_correspondence; surface_occupancy; robotics_mapping	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract 3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it
435	strong_bridge medium	Lipschitz Optimization for Formal Verification of Homographies Robustness & Safety / Safety	D. adjacent but useful context	pose_calibration_localization; robotics_mapping; data_benchmark	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, aerospace, and autonomous vehicles. However, current approaches are confined to incomplete statistical verification, or robustness to $\ell_p$-norm or affine transforms which represent a limited subset of perturbations to the image formation process.In this paper, we present a formal verification approach when the capturing camera undergoes 3D motion perturbations. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. While our formulae are grounded in the vision-based landing problem, they gene
436	strong_bridge medium	WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration Autonomous Driving / Autonomous Driving	D. adjacent but useful context	pose_calibration_localization; robotics_mapping	system bridge signal: pose/localization/mapping/world-model plus reconstruction representation	abstract Collaborative perception is vital for autonomous driving yet remains constrained by tight communication budgets. Earlier work reduced bandwidth by compressing full feature maps with fixed-rate encoders, which adapts poorly to a changing environment, and it further evolved into spatial selection methods that improve efficiency by focusing on salient regions, but this object-centric approach often sacrifices global context, weakening holistic scene understanding. To overcome these limitations, we introduce WhisperNet, a bandwidth-aware framework that proposes a novel, receiver-centric paradigm for global coordination across agents. Senders generate lightweight saliency metadata, while the receiver formulates a global request plan that dynamically budgets feature contributions across agents and features, retrieving only the most informative features. A collaborative feature routing module t
437	adjacent_context low	AVGGT: Rethinking Global Attention for Accelerating VGGT Learning Algorithms / Efficient Models	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy	adjacent genus=Efficient Models; useful only if manually connected to reconstruction	abstract Since DUSt3R, models such as VGGT and $\pi^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $\pi^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by sub
438	adjacent_context low	CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation Generative Models / Video Generation	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark	adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title	abstract Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: By encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional con
439	adjacent_context low	Group Editing: Edit Multiple Images in One Go Generative Models / Image Editing	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark	adjacent genus=Image Editing with no direct reconstruction/SLAM/map signal in title	abstract In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two t
440	adjacent_context low	MuM: Multi-View Masked Image Modeling for 3D Vision Learning Algorithms / Self-supervised	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence	adjacent genus=Self-supervised with no direct reconstruction/SLAM/map signal in title	abstract Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensive
441	adjacent_context low	VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation Segmentation & Dense Prediction / Segmentation	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark	adjacent genus=Segmentation; useful only if manually connected to reconstruction	abstract Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in t
442	adjacent_context medium	Any Resolution Any Geometry: From Multi-View To Multi-Patch Robustness & Safety / Robustness	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. We address this challenge by adapting the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth--normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sa
443	adjacent_context low	MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation Multimodal & Language / Grounding	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; surface_occupancy; data_benchmark	adjacent genus=Grounding; useful only if manually connected to reconstruction	abstract Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier,
444	adjacent_context low	Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis Recognition & Classification / Retrieval	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence	adjacent genus=Retrieval; useful only if manually connected to reconstruction	abstract Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform geo-spatial tasks, CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effect
445	adjacent_context low	Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers Learning Algorithms / Efficient Models	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; depth_correspondence; data_benchmark	adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title	abstract Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, $\pi^3$ and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method acc
446	adjacent_context low	G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning Multimodal & Language / VLM / MLLM	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; surface_occupancy; generation_editing	adjacent genus=VLM / MLLM; useful only if manually connected to reconstruction	abstract Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations.Experim
447	adjacent_context low	LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging Learning Algorithms / Efficient Models	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; data_benchmark	adjacent genus=Efficient Models; useful only if manually connected to reconstruction	abstract 3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However it is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10× speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: 1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; 2) token similarity acroses adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging . We analyze each token’s geometric importance, optimizing anchor token selection to better pr
448	adjacent_context low	Rewis3d: Reconstruction for Weakly-Supervised Semantic Segmentation Segmentation & Dense Prediction / Segmentation	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; surface_occupancy	adjacent genus=Segmentation; useful only if manually connected to reconstruction	abstract We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student–teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geo
449	adjacent_context medium	Sky2Ground: A Benchmark for Site Modeling under Varying Altitude Remote Sensing & Earth / Remote Sensing	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; depth_correspondence; surface_occupancy; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract In this work, we propose the problem of localizing cameras and producing renders of a scene, given multiple images captured from ground/aerial/satellite viewpoints. We introduce a dataset called Sky2Ground, which contains synthetic/real images across all 3 viewpoints, along with camera parameters, and dense depth-maps/surface-normals. Recent works have shown that transformer-based nets like VGGT are capable of inferring scene-parameters in a single-forward pass. However, we formally reveal that simply fine-tuning such models reduces performance, and can't be solved simply by bruteforce-scaling. We find the culprit to be satellite images, which inject too much noise during the learning process. Therefore, we propose SkyNet to enable learning using satellite-images. SkyNet is a two-stream neural-net, with one stream explicitly processing satellite, and another processing all modalities tog
450	adjacent_context medium	3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds 3D Vision & Geometry / Point Cloud	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; surface_occupancy; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds from unlabeled videos. We first introduce \data, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D
451	adjacent_context low	GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation Segmentation & Dense Prediction / Segmentation	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction; surface_occupancy	adjacent genus=Segmentation with no direct reconstruction/SLAM/map signal in title	abstract We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation that casts the task as multi-view 2D mask prediction. Given a textureless object, we render normal and point maps from predefined viewpoints and accept simple 2D prompts—clicks or boxes—to guide part selection. These prompts are processed by a shared SAM2 backbone augmented with LoRA and residual geometry fusion, enabling view-specific reasoning while preserving pretrained priors. The predicted masks are back-projected to the object, aggregated across views.Our method enables fine-grained, part-specific control without requiring text prompts, per-shape optimization, or full 3D labels. In contrast to global clustering or scale-based methods, prompts are explicit, spatially grounded, and interpretable. We achieve state-of-the-art class-agnostic performance on PartObjaverse-Tiny and PartNetE, outperforming both s
452	adjacent_context medium	SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead Robotics & Embodied AI / Embodied AI	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; dynamic_4d; robotics_mapping	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Vision–Language–Action (VLA) models built on pretrained Vision–Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that incrementally extracts 4D features from 2D images. Then, to enhance the VLM’s ability to exploit both 2D images and 4D features, we introduce \textit{Fusion Tokens}, a set of learnable tokens
453	adjacent_context low	Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers Learning Algorithms / Efficient Models	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; general_reconstruction	adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title	abstract We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me employs a light-weight distilled confidence predictor to rank tokens and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
454	adjacent_context low	Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models Learning Algorithms / Efficient Models	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; data_benchmark	adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title	abstract With the emergence of 3D foundation models, such as DUSt3R, VGGT, and their variants, there is a growing interest in fine-tuning them for various downstream tasks, where using LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in geometry, texture, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA sub-spaces associated with each type of variation? 2) Are these sub-spaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA sub-space associated with each type of variation. We show that these sub-spaces are approximately disentangled. Integrating them leads to a reduced LoRA
455	adjacent_context low	Towards Hierarchical 3D Spatial Understanding in Vision-Language Models Multimodal & Language / VLM / MLLM	A. thesis anchor: VGGT/feed-forward geometry	vggt_lineage; data_benchmark	adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title	abstract Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex stages, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that generates over 1 billion 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised finetuning. We also develop an RGB-D VLM that incorporates metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoni
456	adjacent_context medium	Captain Safari: A Real-time World Engine 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers.To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dyn
457	adjacent_context medium	ESAM++: Efficient Online 3D Perception on the Edge 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and gen- eralized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently capture
458	adjacent_context medium	Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; generation_editing	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization pattern
459	adjacent_context medium	Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract One of the most exciting applications of vision models involve pixel-level reasoning.Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level.Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction.We present a framework that learns pixel-accurate feature descriptors from videos, LILA.The core element of our training framework is linear in-context learning.LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks.Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner.We dem
460	adjacent_context medium	GeoWorld: Geometric World Models 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive expe
461	adjacent_context medium	Order Matters: 3D Shape Generation from Sequential VR Sketches 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD software. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for 3D shape generation from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates ordered VR sketches from arbitrary shapes, (ii) a dataset comprising over 20k synthetic and 900 hand-drawn sketch–shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work and generalizes effectively from synthetic to real sketches with minimal supervision. All data and models will
462	adjacent_context medium	RenderFlow: Single-Step Neural Rendering via Flow Matching 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Conventional physically-based rendering (PBR) pipelines generate photorealistic images through computationally expensive light transport simulations. Although recent deep learning approaches leverage diffusion model priors with geometry buffers (G-buffers) to produce visually compelling results without explicit scene geometry or light simulation, they remain constrained by two major limitations. First, the iterative nature of the diffusion process introduces substantial latency. Second, the inherent stochasticity of these generative models compromises physical accuracy and temporal consistency. In response to these challenges, we propose a novel, end-to-end, deterministic single-step neural rendering framework \textit{RenderFlow} built upon a flow matching paradigm. To further strengthen both rendering quality and generalization, we propose an efficient and effective module for sparse ke
463	adjacent_context medium	Spatial Matters: Position-Guided 3D Referring Expression Segmentation 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract 3D Referring Expression segmentation (3D-RES) is an emerging field that segments 3D objects in point cloud scenes based on given referring expressions. Although existing methods have achieved substantial progress, they primarily focus on semantic cues and often overlook spatial relations, which are essential for segmenting the referred objects in complex 3D scenes, especially those containing multiple visually similar instances. In this paper, we propose Position3D, a novel approach that explicitly incorporates spatial relation modeling into 3D-RES. Specifically, we introduce a spatial-aware query generation module that constructs point proxies by aggregating local context and incorporating spatial relations, from which the most text-relevant are selected as queries. Furthermore, we design a position-guided deformable attention in the decoder, which progressively refines attention to con
464	adjacent_context medium	SRGCD: Stability-Driven Region Growth Framework for 3D Change Detection 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract With the growing accessibility of large-scale 3D point clouds from LiDAR and photogrammetric techniques, 3D change detection (3DCD) has become essential for understanding dynamic scenes. Existing methods typically formulate this as segmentation, treating each point independently for binary classification. This leads to isolated misclassified noise points inside regions. Meanwhile, feature similarity at boundaries causes boundary ambiguity. The more severe class imbalance inherent to change detection further exacerbates this issue. To address these challenges, we propose SRGCD, a Stability-Driven Region Growth Framework that redefines 3DCD as region growing rather than segmentation. Our key insight is that progressively expanding from highly confident seeds avoids pitfalls of point-wise classification while elegantly alleviating class imbalance. Specifically, we first apply strict constra
465	adjacent_context medium	StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone.To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity.A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis.To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD).Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generat
466	adjacent_context medium	TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the acitons of all 11 players in one team in the given broadcast footage of Premier League matches under a single broadcast view. Under a offensive or defensive broadcast footage, TacSIm projects the beginning positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactics style imitation is measured by using spatial occupancy similarity and movement vector similarity in defined time, supporting the evaluation of spatial and temporal similarities for on
467	adjacent_context medium	Tokenizing Vector Animation for Autoregresive Generation 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides , 3D meshes , LEGO sequences , and indoor layouts , suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometr
468	adjacent_context medium	Towards Visual Query Localization in the 3D World 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Visual query localization (VQL) aims to predict a spatial-temporal response of the most recent occurrence from a sequence given a query. Currently, most research focuses on visual query localization from 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities including point clouds (PC), RGB and depth images to support flexible research. To ensure high-quality annotation, each sequence is manually annotated with multiple rounds of verification and refinement. To our best knowledge, 3DVQL is the first benchmark towards 3D multimodal visual query loc
469	adjacent_context medium	VABench: A Comprehensive Benchmark for Audio-Video Generation 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization,
470	adjacent_context medium	VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy; generation_editing	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Though rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployments are challenged by slow inference speeds. Previous acceleration methods rely on caching and reusing, neglecting the growing mismatch between static cached values and evolving input, leading to reduced generated content fidelity.This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating.VDE periodically anchors the model’s state with a full forward pass and estimates subsequent outputs analytically. VDE first decomposes the model’s velocity output into components parallel and orthogonal to the input, then exploiting the temporal predictability of the components' coefficients and the consistency of the orthogonal direction for precise, input-ad
471	adjacent_context medium	Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet truly dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies with high fidelity, while sidestepping the quadratic overhead of full attention and reducing network total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally c
472	adjacent_context medium	MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: dynamic/4D recon	general_reconstruction; dynamic_4d; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract 3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing
473	adjacent_context medium	Dynamic-Static Decomposition for Novel View Synthesis of Dynamic Scenes with Spiking Neurons 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Novel view synthesis for dynamic scenes remains challenging due to complex motion variations.Recent methods represent dynamic and static regions with separate Gaussians to improve efficiency and accuracy, but inaccurate assignment of static and dynamic Gaussian primitive still limits performance.We identify two key issues, namely inaccurate mask priors and improper tag representations, which lead to boundary artifacts, loss of fine-grained motion details, and overfitting on input views, resulting in degraded side-view synthesis.To address these problems, we propose a spatio-temporally fine-grained mask field and a discontinuous dynamic–static tagging field to achieve accurate assignment of dynamic and static Gaussian primitives, enabling high-quality novel view synthesis, especially in fine-grained motions, motion boundary regions, and side viewpoints.Experiments show that our method ac
474	adjacent_context medium	Feed-forward Gaussian Registration for Head Avatar Creation and Editing 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; generation_editing	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatars require time-consuming head tracking, which is followed by an expensive avatar optimization, often resulting in a total creation time that exceeds one day. MATCH instead directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in 0.5 seconds per frame. While the learned intra-subject correspondence across frames allows us to quickly build personalized head avatars, correspondence across subjects enables various applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We learn to establish such correspondences end-to-end, with a transformer-based model that predicts textures of Gaussian
475	adjacent_context medium	FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract We present a novel formulation for mesh-free, reduced-order simulation of deformable hyperelastic objects. Existing work in reduced-order elastodynamic simulation represents the input geometry by either meshes, which can be difficult to obtain due to challenges in scanning and triangulating complex shapes, or by neural fields that require per-shape optimization. We propose to adopt a Reproducing Kernel Particle Method (RKPM) representation, which enables the construction of reduced-order skinning weights by solving a generalized eigensystem on the Hessian matrix of the elastic energy. We demonstrate that this formulation not only leads to a 40$\times$ training speedup compared with the per-shape optimization of neural fields, but also achieves lower simulation error when evaluated against the converged results of finite element method. We show our simulation results on a wide variety of
476	adjacent_context medium	LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; robotics_mapping; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain impractically slow, memory-intensive, and overly complex due to iterative optimization and dense feature assignments for every Gaussian. To address these limitations, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantics only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. To further streamline inference and ensure semantic consistency, we cluster Gaussians in a single step by linking geometrically and semantically related masks in 3D. In evaluation, we assess our method on diverse benchm
477	adjacent_context medium	MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. Existing single-view approaches often ignore cross-view relationships, leading to inconsistent results, while multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallicity, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models traine
478	adjacent_context medium	OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation 3D Vision & Geometry / Pose Estimation	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Estimating the 3D pose of unseen objects from a single image remains a fundamental yet challenging problem in computer vision, especially under a CAD model-free setting.Pioneering attempts address this issue by matching templates generated through Novel View Synthesis (NVS), which essentially aims to learn the geometric transformation from a reference to a target view. While promising, these methods can only approximate this transformation under pixel-level supervision, as the starting orientation remains undefined. In the absence of explicit geometric constraints to verify the correctness of the predicted transformation, existing methods often synthesize novel views with geometry-distorted structures or severely blurred local textures, leading to unreliable template matching and suboptimal pose estimation results. To this end, we propose OrienPose, a novel object pose estimation framew
479	adjacent_context medium	Photo3D: Advancing Photorealistic 3D Generation through Structure‑Aligned Detail Enhancement 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; generation_editing; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Although recent 3D‑native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich surface details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non‑rigid motions of objects, and the limited precision of scanners.We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT‑4o‑Image model.Considering that the generated images can distort 3D structures due to their lack of multi‑view consistency, we design a structure‑aligned multi‑view synthesis pipeline and construct a detail‑enhanced multi‑view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages percep
480	adjacent_context medium	PhysHead: Simulation-Ready Gaussian Head Avatars 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Realistic digital avatars require expressive and dynamic hair motion, yet most existing head avatar methods assume rigid hair movement.These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods
481	adjacent_context medium	PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality wit
482	adjacent_context medium	Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract We present _Relightable Holoported Characters_ (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse h
483	adjacent_context medium	REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Recent generative models have shown strong performance in generating diverse 3D assets from 2D images, a fundamental research topic in computer vision and graphics. However, these models still struggle to generate voluminous 3D assets when the input is a flat image that provides limited 3D cues. We introduce REVIVE 3D, a two-stage, plug-and-play pipeline for generating voluminous 3D assets from flat images. In Stage 1, we construct an Inflated Prior by inflating the foreground silhouette to recover global volume and superimposing part-aware details to capture local structure. In Stage 2, 3D Latent Refinement injects Gaussian noise into the Inflated Prior's latent and then denoises it, guided by the prior's geometric cues and the backbone's pretrained 3D knowledge. By initializing the process with the encoded latent of a source mesh instead of the prior, the framework also supports 3D edi
484	adjacent_context medium	Scaling View Synthesis Transformers 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Recently, geometry-free view synthesis transformers have achieved state-of-the-art results in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. However, the specific factors that govern how their performance scales with compute remain poorly understood. In this work, we conduct a rigorous analysis of the scaling laws for view synthesis transformers and elucidate a series of design choices for training compute-optimal NVS models. Most significantly, we find that an encoder–decoder architecture, which was previously found to be less scalable, can in fact be compute-optimal. We attribute the previously inferior performance of previous encoder–decoder methods to certain architectural choices and inconsistent training compute across comparisons. Across several compute levels, we demonstrate that our encoder–decoder architecture, which we
485	adjacent_context medium	Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Video Novel View Synthesis (VNVS) aims to render arbitrary novel viewpoints of dynamic scenes from a single-view video, but its algorithmic training faces a major challenge: the lack of large-scale multi-view video datasets. Prior methods often train on monocular data by framing it as an inpainting task, which typically leads to a train-inference gap and visual artifacts. While synthetic multi-view data can partially alleviate the data scarcity issue, its high acquisition costs and limited diversity restrict scalability. To address these problems, we propose Scaling4D, a novel strategy that theoretically avoids the train-inference gap while leveraging large-scale monocular videos for training. Specifically, we take a higher-level perspective on the problem, reformulating VNVS into a general correspondence-guided generation task. Furthermore, in conjunction with extensive real-world data,
486	adjacent_context medium	Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised globa
487	adjacent_context medium	ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract We introduce ELITE, an Efficient Gaussian head avatar synthesis from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic
488	adjacent_context medium	DiffBMP: Differentiable Rendering with Bitmap Primitives 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract We introduce DiffBMP, a scalable and efficient differentiable rendering engine for a collection of bitmap images. Our work addresses a limitation that traditional differentiable renderers are constrained to vector graphics, given that most images in the world are bitmaps. Our core contribution is a highly parallelized rendering pipeline, featuring a custom CUDA implementation for calculating gradients. This system can, for example, optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives all in under 1 min using a consumer GPU. We employ and validate several techniques to facilitate the optimization: soft rasterization via Gaussian blur, structure-aware initialization, noisy canvas, and specialized losses/heuristics for videos or spatially constrained images. We demonstrate DiffBMP is not just an isolated tool, but a practical one designed to integ
489	adjacent_context medium	WonderZoom: Multi-Scale 3D World Generation 3D Vision & Geometry / 3D Reconstruction	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; surface_occupancy	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to ``zoom into'' a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly
490	adjacent_context medium	Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation 3D Vision & Geometry / 3D Gaussian Splatting	A. thesis anchor: representation shift	general_reconstruction; gaussian_radiance; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Generating large-scale 3D head avatars of non-existent identities with high-fidelity and strong multi-view consistency (MVC) is essential for applications such as synthetic crowds, digital twins, and large asset libraries. For high scalability, avatars must be generated from minimal resources, without costly MV studio captures or any 3D data. In this work, we target this challenging minimal-resource setting for 3D head generation. Second, we argue that the common strategy of enforcing MVC via intermediate MV image generation is both expensive and fundamentally fragile. Instead, we analyze how MVC can be induced by design, showing that intermediate view synthesis is unnecessary. To this end, we introduce MVCHead — a fast, single-shot state space model that directly predicts Gaussians, without intermediate generation. At its core, we propose a Hierarchical State Space (HiSS) block that enf
491	adjacent_context medium	Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual–Inertial Odometry 3D Vision & Geometry / Pose Estimation	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Visual-Inertial Odometry (VIO) is a critical component for robust ego-motion estimation, enabling foundational capabilities such as autonomous navigation in robotics and real-time 6-DoF tracking for augmented reality.Existing methods face a well-known trade-off: filter-based approaches are efficient but prone to drift, while optimization-based methods, though accurate, rely on computationally prohibitive Visual-Inertial Bundle Adjustment (VIBA) that is difficult to run on resource-constrained platforms.Rather than removing VIBA altogether, we aim to reduce how often and how heavily it must be invoked. To this end, we cast two key design choices in modern VIO, when to run the visual frontend and how strongly to trust its output, as sequential decision problems, and solve them with lightweight reinforcement learning (RL) agents. Our framework introduces a lightweight, dual-pronged RL polic
492	adjacent_context medium	Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract We address the problem of making 3D scenes interactive by asking: what would objects feel like if touched in a virtual environment? State-of-the-art 3D rendering methods provide compelling visual realism, but they fall short in modeling physical interactions, such as haptic feedback. We propose a framework that learns the correspondence between user actions and tactile responses, enabling the generation of touch-based signals directly from simulated interactions in 3D scenes. Our approach leverages a neural field representation conditioned on geometry and action to synthesize material-specific tactile signals. Experiments show that the generated signals reliably convey material properties and interaction dynamics. This paves the way toward interactive, touch-aware virtual environments with realistic haptic feedback.
493	adjacent_context medium	MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must satisfy both stylistic diversity, fine-grained, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language–driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorea
494	adjacent_context medium	PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract 3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce \textbf{PhysX-Anything}, the first \textbf{simulation-ready} physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by \textbf{193$\times$}, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning a
495	adjacent_context medium	SAGE: Scalable Agentic 3D Scene Generation for Embodied AI 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simula
496	adjacent_context medium	SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation 3D Vision & Geometry / Pose Estimation	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Object pose estimation is a key task for embodied robots, enabling them to interact with objects effectively. Category-level object pose estimation provides a way for robots to estimate the pose of unknown objects. However, estimating object pose from point clouds alone remains challenging. In this paper, we introduce SEGPose, a novel category-level object pose estimation method based on point clouds. Unlike previous methods, SEGPose leverages geometric, topological information, and SE(3)-equivariance, enhancing the network's accuracy in pose prediction. To utilize geometric and topological features, we propose a constraint-based feature extraction and 3D reconstruction method, enabling effective object shape reconstruction. We also design an SE(3)-equivariance feature prediction network to handle pose transformations consistently across viewpoints, improving pose accuracy. Experimental
497	adjacent_context medium	Volumetric Functional Maps 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract The computation of volumetric correspondences between 3D shapes is a prominent tool for medical and industrial applications. In this work, we pave the way for spectral volume mapping, extending for the first time the functional maps framework from the surface to the volumetric setting. We show that the eigenfunctions of the volumetric Laplace operator define a functional space that is suitable for high-quality signal transfer. We also experiment with various techniques that edit this functional space, porting them to volume domains. We validate our method on novel volumetric datasets and on tetrahedralizations of well-established surface datasets, also showcasing practical applications involving both discrete and continuous signal mapping, for segmentation transfer, mesh connectivity transfer, and solid texturing. Last but not least, we show that considering the volumetric spectrum great
498	adjacent_context medium	Deep Feature Deformation Weights 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; surface_occupancy; robotics_mapping; generation_editing	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Handle-based mesh deformation has been a long-standing paradigm in computer graphics, enabling intuitive shape edits from sparse controls. Classic techniques offer precise and rapid deformation control. However, they solve an optimization problem with constraints defined by the choice of control handles, requiring a user to know apriori the ideal distribution of handles on the shape to accomplish the desired edit. The mapping from handle set to deformation behavior is often unintuitive and, importantly, non-semantic. Modern data-driven methods, on the other hand, leverage the data prior to obtain semantic edits, at the cost of fine-grained control and speed. We propose a technique that achieves the best of both worlds by leveraging the semantic prior of data and the precise control and speed of traditional frameworks. Our approach is surprisingly simple yet effective: deep feature proxim
499	adjacent_context medium	HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; surface_occupancy; robotics_mapping; generation_editing	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract The 3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires extensive and tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for automatic 3D scene synthesis. We present HOG-Layout that enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with ex
500	adjacent_context medium	SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation 3D Vision & Geometry / 3D Reconstruction	B. bridge: reconstruction becomes mapping/world model	general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark	editorial thesis/bridge bucket but weaker direct reconstruction signal	abstract Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer t

Relevance

Paper

Editorial bucket

Matched groups

Reason

Abstract

core_reconstruction

high

Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation

Autonomous Driving / Autonomous Driving

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

VGGT/feed-forward geometry lineage with direct geometry signal

abstract

Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing self-supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consis

core_reconstruction

high

Dynamic Visual SLAM using a General 3D Prior

Robotics & Embodied AI / Embodied AI

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping

VGGT/feed-forward geometry lineage with direct geometry signal

abstract

Reliable incremental estimation of camera poses and 3D reconstruction is key to enable various applications including robotics, interactive visualization, and augmented reality. However, this task is particularly challenging in dynamic natural environments, where scene dynamics can severely deteriorate camera pose estimation accuracy. In this work, we propose a novel monocular visual SLAM system that can robustly estimate camera poses in dynamic scenes. To this end, we leverage the complementary strengths of geometric patch-based online bundle adjustment and recent feed-forward reconstruction models. Specifically, we propose a feed-forward reconstruction model to precisely filter out dynamic regions, while also utilizing its depth prediction to enhance the robustness of the patch-based visual SLAM. By aligning depth prediction with estimated patches from bundle adjustment, we robustly ha

core_reconstruction

high

DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving

Autonomous Driving / Autonomous Driving

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping

VGGT/feed-forward geometry lineage with direct geometry signal

abstract

Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that

core_reconstruction

medium

E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

Learning Algorithms / Self-supervised

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Self-supervised pre-training has revolutionized foundation models for language, 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Exp

core_reconstruction

high

Emergent Extreme-View Geometry in 3D Foundation Models

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with ded

core_reconstruction

high

Emergent Outlier View Rejection in Visual Geometry Grounded Transformers

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reliable 3D reconstruction from in-the-wild image collections is often hindered by noisy images—irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that the existing feed-forward reconstruction model, e.g., VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effe

core_reconstruction

high

FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences.In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of **descriptor tokens**. Global attention is then computed as cro

core_reconstruction

high

Flow3r: Factored Flow Prediction for Visual Geometry Learning

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We propose Flow3r, a scalable framework for visual geometry learning that leverages flow prediction to guide learning using unlabeled monocular videos. Current 3D/4D reconstruction systems primarily rely on dense geometry and pose supervision, and cannot easily generalize to diverse dynamic real-world scenes. In this work, we propose a mechanism to augment training directly from unlabeled videos, leveraging dense 2D correspondences (or ‘flow’) between arbitrary image pairs as supervision. Our key insight is that a factored flow prediction module that computes from two images using ‘geometry latents’ from one image and the ‘pose latent’ from the othercan guide visual geometry learning. We first highlight the benefits and scalability of flow supervision in controlled settings and then leverage large-scale unlabeled data to improve off-the-shelf visual geometry models. We evaluate Flow3r ac

core_reconstruction

high

FRM: Linear-Time 3D Reconstruction via Test-Time Training

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Feed-forward transformer models such as VGGT and $\pi^3$ are highly accurate, but their computational cost grows quadratically with the number of input images, making them slow to evaluate on large collections. More efficient approaches ameliorate this cost at the expense of reconstruction quality. We introduce Fast Reconstruction Model, a stateful feed-forward reconstruction model that uses a bidirectional architecture that scales linearly in the number of input views, while matching or surpassing the reconstruction quality of quadratic-time methods. FRM employs test-time training layers to compress images into a compact hidden scene state during a single forward pass, enabling our model to reconstruct 3D scenes at speeds up to 75 FPS on a single H100 GPU---over 20 times faster than SOTA methods such as VGGT. This hidden state also serves as an implicit scene representation which can be

core_reconstruction

high

GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians still remains a challenge without proper geometric priors. Existing methods have explored to predict point maps as geometric references for inferring Gaussian primitives, while the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain on novel views generated at non-preset camera poses iden

core_reconstruction

high

Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, \method{} produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.

core_reconstruction

high

Generalizable Sparse-View 3D Reconstruction from Unconstrained Images

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization with appearance embeddings or dynamic masks, requiring extensive per-scene training and failing under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across div

core_reconstruction

high

Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction

Autonomous Driving / Autonomous Driving

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; gaussian_radiance; depth_correspondence; surface_occupancy; robotics_mapping

VGGT/feed-forward geometry lineage with direct geometry signal

abstract

Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction.We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric s

core_reconstruction

high

GGPT: Geometry-Grounded Point Transformer

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views.Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit sparse-geometry supervision using an optimised guidance encoding. Extensive experiments dem

core_reconstruction

high

HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Visual Geometry Grounded Transformer (VGGT) has shown significant progress in 3D vision tasks. However, its global attention layers incur quadratic computational cost with respect to the number of input views, becoming a critical bottleneck for scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approxi

core_reconstruction

high

HTTM: Head-wise Temporal Token Merging for Faster VGGT

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT.Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the un

core_reconstruction

high

LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system byaligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($Sim(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignmen

core_reconstruction

high

Learning 3D Reconstruction with Priors in Test Time

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks, without retraining or modifying the pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference. The optimization loss is composed of a self-supervised objective and prior penalty terms. The self-supervised objective is defined as the compatibility among multi-view predictions, implemented by the photometric or geometric loss between the renderings from other views and each view itself. Any available priors are turned into the penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method cons

core_reconstruction

high

Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos

Autonomous Driving / Autonomous Driving

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; pose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmark

VGGT/feed-forward geometry lineage with direct geometry signal

abstract

Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals

core_reconstruction

high

LongStream: Long-Sequence Streaming Autoregressive Visual Geometry

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They typically anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we solve Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination. We propose cache-consistent training combined with

core_reconstruction

high

MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry mod

core_reconstruction

high

MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks.In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations.However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability.Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation.In addition, it integrates dense sem

core_reconstruction

high

MotionCrafter: Repurposing Video Generators for Dense Geometry and Motion Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We introduce MotionCrafter, the first video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. To represent them effectively in latent space, we propose a 4D VAE that encodes point maps and scene flows as a unified latent compatible with pretrained video generators. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents—despite their fundamentally different distributions—we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in joint 4D geometry reconstruction and dense scene flow estimation, d

core_reconstruction

high

OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed,

core_reconstruction

high

Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips alongside object-level semantics; and 2D–3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation — marking a step forward toward real-time, semantics-aware Spatial AI.

core_reconstruction

high

Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; surface_occupancy; generation_editing; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete $360^\circ$ environments. To address these limitations, we design $\textbf{Pano3DComposer}$, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to $\textbf{Alignment-VGGT}$ by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geom

core_reconstruction

high

PhysIR-Splat: Physically Consistent Thermal Infrared Radiative Transfer in 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Thermal infrared (TIR) 3D reconstruction provides geometry that is intrinsically coupled to the temperature field, even in low-light, nighttime, and smoke-obscured environments. TIR imaging measures self-emitted thermal radiation driven by object temperature and is largely independent of external illumination; therefore, simply carrying over visible-spectrum assumptions to TIR-based 3D reconstruction and novel view synthesis (NVS) often results in floating artifacts and blurred edges. In addition, radiometric inconsistency and low contrast in TIR weaken structure-from-motion (SfM) initialization, which in turn hinders subsequent 3D Gaussian Splatting (3DGS) optimization. We present PhysIR-Splat, a 3DGS framework that follows infrared radiative transfer: we explicitly model temperature, emissivity, and environmental irradiance on Gaussian primitives and, during rendering, jointly account

core_reconstruction

high

Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Understanding how the 3D world evolves over time is a fundamental task in computer vision, essential for embodied settings, autonomous driving, etc. It requires not only the reconstruction of the observed scene but also the anticipation of how the scene dynamics will unfold in the future. While the area of 3D reconstruction has progressed rapidly with the advent of recent feed-forward neural networks, forecasting future dynamics in 3D, given the 2D frames of a video remains unexplored. We present Point4Cast, a unified framework that processes streaming 2D frame sequences of a video to estimate the past, present, and future of the underlying dynamic scene, in 3D. At the core of our approach lies a persistently evolving latent \emph{spacetime representation} that models the environment’s evolution across time. Upon receiving a new 2D frame, an update operation integrates the incoming evide

core_reconstruction

high

Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; surface_occupancy; generation_editing; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, the visible-region point cloud are easy to obtain—from active sensors such as LiDAR or from feed-forward predictors like VGGT—offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on a latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with a point cloud priors tailored input formulation. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a st

core_reconstruction

high

QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camer

core_reconstruction

high

Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g. 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up.In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and image sparse correspondences given by zero-shot predictions of pretrained models.At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilita

core_reconstruction

high

Scal3R: Scalable Test-Time Training for Feed-forward Large-Scale 3D Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lig

core_reconstruction

high

Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapt

core_reconstruction

high

SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

Autonomous Driving / Autonomous Driving

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

VGGT/feed-forward geometry lineage with direct geometry signal

abstract

This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird’s eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state

core_reconstruction

medium

STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

Learning Algorithms / Efficient Models

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. While causal VGGT transformers address this challenge through key-value (KV) cache mechanism, the linear growth of the cache introduces a significant memory bottleneck. When memory constraints trigger early eviction, reconstruction quality and temporal consistency deteriorate markedly. In this work, we observe that attention patterns in causal transformers for 3D reconstruction exhibit intrinsic spatio-temporal sparsity. Leveraging this insight, we propose **STAC**, a *S*patio-**T**emporally **A**ware **C**ache compression framework specifically designed for streaming 3D reconstruction using large causal transformers. STAC incorporates three key components: a **Working Temporal Token Caching** mechanism that preserves long-term informative tokens based on decayed cumula

core_reconstruction

high

TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; gaussian_radiance; dynamic_4d

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss.This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby _unbinding_ the number of predicted primitives from input image resolution and number of views. Our resulting method, __TokenGS__, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance o

core_reconstruction

high

Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

Autonomous Driving / Autonomous Driving

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping

VGGT/feed-forward geometry lineage with direct geometry signal

abstract

Recent advancements in feed-forward architectures for visual geometry estimation have achieved significant progress. Interestingly, per-frame visual geometry estimation approaches typically exhibit weaker multi-frame consistency but demonstrate superior per-frame accuracy compared to multi-frame algorithms. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals three key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furtherm

core_reconstruction

high

V-DPM: Video Reconstruction with Dynamic Point Maps

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

New, powerful 3D representations such as DUSt3R’s invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feed-forward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend the concept to dynamic 3D content, also representing 3D scene motion.However, DPMs have so far been limited to image pairs and, like DUSt3R, require post-processing via optimization when more than two views are involved. We argue that DPMs are far more meaningful when applied to videos and introduce V-DPM to demonstrate this.First, we show how to set up DPMs for videos to optimize their representational power, ease of neural prediction, and reuse of pre-trained models. Second, we implement these ideas on top of VGGT, a recent state-of-the-art 3D reconstructor. Although VGGT was trained on static scenes, we show that a small amount of synthetic

core_reconstruction

high

VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training.VGG-T$^3$ ($\mathbf{V}$isual $\mathbf{G}$eometry $\mathbf{G}$rounded $\mathbf{T}$est $\mathbf{T}$ime $\mathbf{T}$raining) scales linearly w.r.t. the number of input views, similar to online models, and achieves a $11.6\times$ speed-up over baselines that rely on softmax attention for reconstructing a $1k$ image collection in just $54$ seconds. Because our method retains global scene aggregation capability, our resulting point map reconstruction err

core_reconstruction

high

VGGT-$\Omega$

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present VGGT-Ω, a feed-forward model for 3D reconstruction that substantially advances the state of the art in accuracy, efficiency, and capability for both static and dynamic scenes. Prior models such as VGGT have shown that feed-forward 3D reconstruction can already be competitive with traditional optimization-based methods. Here, we further demonstrate that the accuracy and robustness of these models scale predictably with model capacity and data size. To enable training 3D reconstruction models at an unprecedented scale, we introduce a high-quality data annotation pipeline that handles dynamic scenes, a self-supervised learning protocol, and architectural changes that greatly reduce memory requirements. We significantly simplify VGGT’s architecture by replacing multiple dense prediction heads with loss-driven multitask learning, removing unstable DPT blocks, and introducing more e

core_reconstruction

high

VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that together form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT’s perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (i

core_reconstruction

high

VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-view Indoor 3D Object Detection

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain—i.e., precisely calibrated multi-view camera poses—to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth). Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key c

core_reconstruction

high

VGGTracker: Fast Spatial Tracking with Visual Geometry Transformer

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Existing 3D point tracking methods mostly rely on heuristic designs or scene reconstruction, which incurs significant computational overhead and makes it difficult to meet the demands of real-time applications.To address this problem, in this work, we present VGGTracker, a novel spatial tracker that leverages a feed-forward visual geometry transformer to predict the trajectories of arbitrary query points from monocular videos in real time.Specifically, we employ a query initialization mechanism to maintain and update a global feature vector and a set of frame-level feature vectors for each query point.Then, we propose a new spatial tracking framework, which consists of a visual geometry transformer backbone, a global embedding branch, a frame-level embedding branch, and a tracking head.The key innovation lies in the dual-branch embedding design, where the global embedding branch integrat

core_reconstruction

high

WildPose: A Unified Framework for Robust Pose Estimation in the Wild

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume inputs from static environments. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models sometimes perform badly on static-only scenes. We present Wildpose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect the two powerful paradigms in modern 3D vision: the rich perceptual frontend of feed-forward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this by enhancing the differentiable BA pipeline in two ways. First, we introduce a new 3D-aware update o

core_reconstruction

high

DVGT: Visual Geometry Transformer for Autonomous Driving

Autonomous Driving / Autonomous Driving

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; dynamic_4d; robotics_mapping

VGGT/feed-forward geometry lineage with direct geometry signal

abstract

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, it still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Visual Geometry Transformer specifically designed for autonomous Driving (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. Finally, we use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego pose for each frame. Our DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-

core_reconstruction

high

OccAny: Generalized Unconstrained Urban 3D Occupancy

Autonomous Driving / Autonomous Driving

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; gaussian_radiance; surface_occupancy; robotics_mapping

VGGT/feed-forward geometry lineage with direct geometry signal

abstract

Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization.While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios.We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features.OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images.Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework

core_reconstruction

high

VGA:Empowering Aerial-Ground Localization by Visual Geometry Alignment

Remote Sensing & Earth / Remote Sensing

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; pose_calibration_localization; robotics_mapping

VGGT/feed-forward geometry lineage with direct geometry signal

abstract

Aerial-ground visual localization is a challenging task due to the significant differences in scene scale and view point captured between two views. In this work, we explore the practical benefit of jointly learning camera calibration and bird’s-eye-view (BEV) projection for estimating full 6 Degrees-of-freedom relative camera pose between uncalibrated aerial and ground views. We present Visual Geometry Alignment (VGA), a unified framework that jointly learns a global gravity-alignment prior inferred from dense monocular perspective fields, and a planar alignment prior complementing the unobserved azimuth angle through Procrustes alignment in a shared BEV plane. At inference, we jointly refine the relative camera pose by integrating the predicted per-camera gravity alignment and relative planar azimuth angle, yielding improved orientation and translation alignment from visual input with

core_reconstruction

high

AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance

Multimodal & Language / Agentic AI

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction

VGGT/feed-forward geometry lineage with direct geometry signal

abstract

Active 3D reconstruction enables an agent to autonomously select viewpoints to build accurate and complete scene geometry efficiently, rather than passively reconstructing scenes from pre-collected images. Existing active reconstruction methods often rely on geometric heuristics, which may result in redundant observations without improving reconstruction quality. To address this, we propose \textbf{AREA3D}, an active reconstruction agent for 3D reconstruction by leveraging feed-forward 3D models and vision-language guidance. The framework decouples view uncertainty modeling from feed-forward reconstruction, enabling precise uncertainty estimation without online optimization. Moreover, the integrated Vision-Language Model provides high-level semantic guidance that guides exploration beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks (Replic

core_reconstruction

high

4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps.Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity.The resulting system e

core_reconstruction

high

ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Existing hand-object interactions (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods of articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize object's metric scale and pose for grounding its normalized mesh i

core_reconstruction

high

CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming ground truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporarily consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and

core_reconstruction

high

Catch Me if You Can: Active Mapping of Moving 3D Objects

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object's motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target's trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding.

core_reconstruction

high

Complet4R: Geometric Complete 4D Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We introduce Complet4R, a novel end-to-end framework for Geometric Complete 4D Reconstruction, which aims to recover temporally coherent and geometrically complete reconstruction for dynamic scenes. Our method formalizes the task of Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion, by directly accumulating full contexts onto each frame. Unlike previous approaches that rely on pairwise reconstruction or local motion estimation, Complet4R utilizes a decoder-only transformer to operate all context globally directly from sequential video input, reconstructing a complete geometry for every single time step, including occluded regions visible in other frames. Our method demonstrates the state-of-the-art performance on our proposed benchmark for Geometric Complete 4D Reconstruction and the 3D point tracking task. Code will be released to support futur

core_reconstruction

high

Efficiently Reconstructing Dynamic Scenes one D4RT at a Time

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Understanding and reconstructing the complex geometry and motion of dynamic 4D scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward network designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our unified decoding interface allows the model to independently and efficiently probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state-of-the-art, outper

core_reconstruction

high

EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability.To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that

core_reconstruction

high

FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Single-image 3D human reconstruction holds significant promise due to its convenience and high demand in various applications. Previous methods have garnered tremendous progress by employing 2D multi-view diffusion models to generate auxiliary views as reconstruction priors, but they struggle with 3D inconsistencies and limited generalization capabilities. In this paper, we present FISHuman, which aims to generate fine-grained, high-fidelity, and content-wise diverse 3D humans from a single-view input, providing production-ready 3D assets. We propose an elaborately designed workflow that reconstructs dynamic 3D meshes from multi-view inconsistent guidance. Specifically, we adapt a dual-stream transformer-based video diffusion model to generate cross-modally aligned multi-view RGB and normal sequences. We find that naively employing static 3D reconstruction can lead to geometric distortio

core_reconstruction

high

Inferring Compositional 4D Scenes without Ever Seeing One

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard.Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, a

core_reconstruction

high

MoRe: Motion-aware Feed-forward 4D Reconstruction Transrformer

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruc

core_reconstruction

high

Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

emporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast optimization method based on a novel preconditioned surface encoding that estimates coherent non-rigid deformations without sacrificing temporal stability or accuracy. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multi-layer percep

core_reconstruction

high

PatchScene: Patch-based Voxel Diffusion Model for Large-Scale Scene Completion

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-th

core_reconstruction

high

ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation.To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision.The core of ReFlow is a novel self-correction flow matching mechanism, c

core_reconstruction

high

ReGenHOI: Unifying Reconstruction and Generation for 3D Human–Object Interaction Understanding

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Understanding 3D human–object interaction (HOI) involves two highly-related abilities: reconstruction, which perceives observed geometry, and generation, which imagines plausible future interactions. However, most existing methods treat these abilities as separate tasks, limiting their capacity to capture the unified nature of human spatial reasoning. To address this, we propose a unified framework that bridges reconstruction and generation through a shared semantic–geometric reasoning space. Specifically, a 3D Contact Reasoning mechanism enables direct reasoning in 3D space, jointly modeling geometric structure and semantic relationships, while a Reasoning Trace Refinement module iteratively refines contact predictions by integrating geometric and semantic cues. The framework builds a unified latent representation via explicit reasoning on human–object contact regions. To further enhanc

core_reconstruction

medium

Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling

3D Vision & Geometry / Pose Estimation

A. thesis anchor: dynamic/4D recon

general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Visual SLAM is one of the most fundamental problems in computer vision, with direct applications to real-time localization tasks such as AR/VR, robotics, and 3D scene reconstruction. Although significant progress has been made in both sparse and dense approaches, real-time monocular SLAM remains challenging—particularly in the uncalibrated setting, where existing methods are often inefficient and lack modularity. In this paper, we present a new visual SLAM pipeline implemented from scratch in C++ that explicitly leverages the spatio-temporal structure of the scene for improved localization, and is designed to be modular so that off-the-shelf components can be easily integrated. We introduce a temporal representation based on a buffer of recent keyframes that preserves short-term scene continuity. To complement this, we incorporate a spatial representation based on a 3D cell-based scene m

core_reconstruction

high

Vista4D: Video Reshooting with 4D Point Clouds

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present **Vista4D**, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quali

core_reconstruction

high

WorldStereo: Bridging Controllable Video Generation and Scene Reconstruction via 3D Geometric Memories

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories.In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds.Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank.These components enable WorldStereo to generate multi-view-co

core_reconstruction

medium

TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos

3D Vision & Geometry / Pose Estimation

A. thesis anchor: dynamic/4D recon

general_reconstruction; pose_calibration_localization; dynamic_4d

direct reconstruction/3DGS/4D title linked to core representation cluster

abstract

Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human–scene–camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES--Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos--a unified framework tailored for this task. TROPHIES features a Human Branch that models human through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment

core_reconstruction

high

Any4D: Unified Feed-Forward Metric 4D Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordi

core_reconstruction

high

$\textit{4DSurf}$: High-Fidelity Dynamic Scene Surface Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

This paper addresses the problem of dynamic scene surface reconstruction using Gaussian Splatting (GS), aiming to recover temporally consistent geometry. While existing GS-based dynamic surface reconstruction methods can yield superior reconstruction, they are typically limited to either a single object or objects with only small deformations, struggling to maintain temporally consistent surface reconstruction of large deformations over time. We propose ``4DSurf'', a novel and unified framework for generic dynamic surface reconstruction that does not require specifying the number or types of objects in the scene, can handle large surface deformations and temporal inconsistency in reconstruction. The key innovation of our framework is the introduction of Gaussian deformations induced Signed Distance Function Flow Regularization that constrains the motion of Gaussians to align with the evo

core_reconstruction

high

$L^{2}DGS$: Low-Light Dynamic Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Synthesizing novel spatiotemporal views of dynamic scenes is inherently challenging due to both object and camera motion, as well as sparsity of observations. Recent advances in Neural Radiance Fields (NeRFs) and Gaussian Splatting (GS) have enabled 4D dynamic scene reconstruction, but predominantly from well-lit images or videos. Some works address the problem of reconstructing a well-lit scene from low-light input, but these are limited to static scenes. Moreover, prior methods primarily emphasize improving illumination, while overlooking the underlying scene characteristics. Reconstructing well-lit dynamic scenes from inputs captured under low-light conditions is particularly challenging due to shadows, occlusions, and disocclusions caused by object motion, which makes the problem highly ambiguous and ill-posed. We propose $L^{2}DGS$ (Low-Light Dynamic Gaussian Splatting), a self-supe

core_reconstruction

high

3D Gaussian Splatting from unposed Spike Stream

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has significantly advanced 3D reconstruction with its impressive performance. However, its reliance on sharp images and precise camera pose priors limits its effectiveness in high-speed scenarios. Recent advances have integrated spike camera, a bio-inspired sensor with a high temporal resolution, to enhance 3DGS in such conditions. Although spike-based methods reduce the need for sharp images, they still face challenges in achieving precise camera pose estimation due to unstable observations and visual texture deficiency.To address these challenges, we propose Nope-SGS, the first framework that reconstructs high-speed 3D scenes from **unposed captures** of the bio-inspired high-temporal-resolution spike camera. To achieve robust 3D reconstruction and pose estimation, we first reformulate the spike model from a probabilistic perspective and extend its applicat

core_reconstruction

high

3D Gaussian Splatting with Self-Constrained Prior for High Fidelity Surface Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Rendering 3D surfaces has been revolutionized within the modeling of radiance fields through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constraining the movement of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is a TSDF grid fused by the rendered depth during the learning of 3D Gaussians. The prior measures a band on both sides of the estimated surface for imposing more specific constraints on the right 3D Gaussians, such as removing 3D Gaussians outside the band, encouraging larger opacity for Gaussians near the center of the band or smaller opacity for Gaussians near the boundary of the band. We regularly update the prior by fusing more rec

core_reconstruction

high

4C4D: 4 Camera 4D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

This paper tackles the challenge of recovering 4D dynamic scenes from videos captured by as few as four portable cameras. Learning to model scene dynamics for temporally consistent novel-view rendering is a foundational task in computer graphics, where previous works often require dense multi-view captures using camera arrays of dozens or even hundreds of views. We propose 4C4D, a novel framework that enables high-fidelity 4D Gaussian Splatting from video captures of extremely sparse cameras. Our key insight lies that the geometric learning under sparse settings is substantially more difficult than modeling appearance. Driven by this observation, we introduce a Neural Decaying Function on Gaussian opacities for enhancing the geometric modeling capability of 4D Gaussians. This design mitigates the inherent imbalance between geometry and appearance modeling in 4DGS by encouraging the 4DGS

core_reconstruction

high

4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

4D reconstruction of equine family (e.g. horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPose

core_reconstruction

high

ActivePolicy: Active Gaussian Reconstruction and Optimization Strategy Based on Global-Local Information Gain

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Active 3D Gaussian reconstruction achieves superior completeness and rendering quality by intelligently selecting viewpoints. However, existing methods suffer from two critical limitations: information gain metrics that prioritize geometric coverage while ignoring rendering quality, and overfitting to sparse view configurations that degrades novel view synthesis. We introduce ActivePolicy, a novel framework addressing both challenges through principled NBV selection and regularization. We propose \textbf{GL-Graph}, a graph-theoretic strategy that unifies geometric consistency, rendering quality, and observation redundancy into a single stability criterion. To counteract overfitting, we introduce \textbf{4D-Reg}, which identifies floaters through manifold discrepancies among three depth types (R-Depth, $\alpha$-Depth, C-Depth) and suppresses them via adaptive dropout. Extensive experiment

core_reconstruction

high

AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Monocular UAV videos pose a fundamental challenge for 3D reconstruction: dynamic scene modeling requires accurate camera poses, yet recovering poses from long UAV trajectories often fails under texture-sparse regions and moving objects.Existing approaches typically handle either pose-free static reconstruction or dynamic reconstruction with known poses, but jointly solving both from casual aerial footage remains difficult due to motion coupling and severe scale variation.We introduce \modelname, a scale-aware Gaussian splatting framework that jointly recovers camera trajectories and reconstructs dynamic scenes from pose-free monocular videos.Central to our method are scale-aware spatio-temporal anchors (S$^2$A-Anchors), which enable a unified optimization via three key decoupling mechanisms:(i) separating ego-motion from object motion,(ii) isolating static geometry from temporal deformat

core_reconstruction

high

AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Scene-level 3D reconstruction has attracted increasing attention, and feed-forward 3D Gaussian Splatting (3DGS) has emerged as a promising paradigm for novel view synthesis. However, most existing methods adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, making the number of Gaussians tightly coupled with the input images. This leads to several limitations: (i) reconstruction quality is sensitive to the quantity and viewpoint coverage of input images, often causing Gaussians to accumulate more densely in regions with frequent viewpoints; (ii) alignment errors become more pronounced under sparse-view conditions; and (iii) the lack of explicit geometric consistency can degrade depth estimation and downstream 3D tasks. In this paper, we propose AnchorSplat, a novel multi-view feed-forward 3DGS framework for scene-level reconstruction that departs from pixel-aligned

core_reconstruction

high

ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects.To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while

core_reconstruction

high

BA-GS: Bayesian Adaptive Gaussian Splatting for SFM-Free 3D Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has demonstrated exceptional performance in reconstruction and novel view synthesis tasks. However, its reliance on Structure-from-Motion preprocessing may lead to degraded performance under sparse-view scenarios. Recent works attempt to address this limitation by leveraging pre-trained image matching models to generate Gaussian primitives but overlook the probabilistic uncertainty embedded in both the initial primitive distribution and iterative position updates. This uncertainty can accumulate and degrade reconstruction fidelity. Hence, we propose BA-GS, a Bayesian framework that models both the global distribution and local uncertainty of Gaussian primitives. At global initialization, a Variational Bayesian Gaussian Mixture Model (VB-GMM) models the latent distribution of primitives, capturing region-wise density and gradient patterns. At local refinement,

core_reconstruction

high

BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering B-rep representation from unstructured data is a challenging and valuable task of computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Ex

core_reconstruction

high

CGHair: Compact Gaussian Hair Reconstruction with Card Clustering

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present a compact pipeline for high-fidelity hair reconstruction from multi-view images. While recent 3D Gaussian Splatting (3DGS) methods achieve realistic results, they often require millions of primitives, leading to high storage and rendering costs. Observing that hair exhibits structural and visual similarities across a hairstyle, we cluster strands into representative hair cards and group these into shared texture codebooks. Our approach integrates this structure with 3DGS rendering, significantly reducing reconstruction time and storage while maintaining comparable visual quality. In addition, we propose a generative prior accelerated method to reconstruct the initial strand geometry from a set of images. Our experiments demonstrate a 4-fold reduction in strand reconstruction time and achieve comparable rendering performance with over 200× lower memory footprint.

core_reconstruction

high

ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length.We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with hig

core_reconstruction

high

Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation; translation; scale), even when they are of different objects in the same category (e.g, different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g, the same car) and often must be given true scale as input, while we estimate it successfully. Our approach leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns models while keeping the 3DGS models fixed. First, we perform an iterative, feature-guided coarse registration that is robust to extremely poor initialization (e.g, 180° misalignment or a 10× scale gap), followed by a fine registration step enforcing multi-view feature consistency, inspired

core_reconstruction

high

CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Single-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advancements, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, often yielding subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, laden with challenges such as: 1) extensive occlusions, 2) low clarity, and 3) numerous and various appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables the pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs.Furthermore, we in

core_reconstruction

high

Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Novel view synthesis has been significantly advanced by NeRFs and 3D Gaussian Splatting (3DGS), which require ordering volumetric samples or primitives for correct color blending. While the recent Gaussian-Enhanced Surfels (GES) enable high-performance, sort-free rendering, they suffer from aliasing artifacts and suboptimal reconstruction. To address these limitations, we propose DP-GES, a novel representation that augments opaque surfels with semi-transparent boundaries and leverages Depth Peeling to establish accurate per-pixel ordering. This design enables sort-free Gaussian splatting with correct transmittance modulation, effectively eliminating aliasing and popping artifacts while facilitating a fully differentiable joint optimization. Extensive experiments demonstrate that our method achieves superior reconstruction quality and compares favorably against state-of-the-art techniques

core_reconstruction

high

DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Radiance field reconstruction aims to recover high-quality 3D representations from multi-view RGB images. Recent advances, such as 3D Gaussian splatting, have achieved real-time rendering with high visual fidelity, given sufficiently powerful graphics hardware. However, drastic model simplification — i.e., reducing the number of primitives by several orders of magnitude — is required to enable efficient online transmission and rendering across diverse hardware platforms. We introduce DiffSoup, a radiance field representation that employs a soup (i.e., a highly unstructured primitives) of a small number of triangles with neural textures that have binary opacity. We show that the binary opacity representation is directly differentiable via stochastic opacity masking, enabling stable training without molifier (i.e., smooth rasterization). DiffSoup can be rasterized with a traditional depth-

core_reconstruction

high

Disco-GS: Gaussian Splatting in Dynamic Color Lighting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent advances in Gaussian Splatting (GS) have significantly improved 3D scene reconstruction and novel view synthesis. However, most existing methods typically assume that training inputs are captured under stable lighting conditions and achromatic light. In contrast, scenes recorded under temporally varying color light, as in “disco lights” commonly seen in events, performances, and decorative settings, introduce severe ambiguities in both scene photometry and geometry. We propose Disco-GS, a framework that leverages GS for reconstructing the 3D scene while simultaneously recovering the underlying canonical appearance from videos captured under dynamic lighting conditions. Disco-GS estimates the effective per-pixel transient light, which, when applied to the canonical image, results in the observed color image of the scene, thereby enabling self-supervised learning. Disco-GS is an end

core_reconstruction

high

Distilling Unsigned Distance Function for Surface Reconstruction from 3D Gaussian Splatting

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Unsigned distance fields (UDFs) are well suited for representing open surfaces, but learning them from multi-view images is challenging because ground-truth surfaces are unavailable for supervision in most cases and the gradient of a UDF is undefined on the underlying surface. Prior methods optimize UDFs with global objectives and apply gradient-based priors ignoring the non-differentiability for queries on the target surface, which leads to unstable training and over-smoothing on fine details. We address these issues by distilling a patch-based UDF prior, trained on synthetic ground truth algebraic surfaces with closed form expressions, into a lightweight student UDF inside Gaussian optimization process. We design band-limited knowledge distillation strategy that leverages a pretrained patch-based UDF predictor to provide reliable near-surface UDF supervision, enabling stable student t

core_reconstruction

high

Dropping Anchor and Spherical Harmonics for Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent 3D Gaussian Splatting (3DGS) dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic coefficients (SH) to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-based dropout strategy. Rather than dropping Gaussians independently, our method randomly selects certain Gaussians as anchors and simultaneously removes their spatial neighbors. This effectively disrupts local redundancies and encourages the model to learn more robust, globally informed representations. Furthermore, we extend the dropout to color attributes by randomly dropping higher-degree SH coe

core_reconstruction

high

DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present Compact 3D Reconstruction with Positive and Negative Primitives (DualPrim), a novel approach for reconstructing compact and topologically regular 3D meshes from multi-view images. Unlike traditional methods that rely on implicit representations such as signed distance functions, or explicit formats such as meshes and point clouds, our method models geometry using quadrics-based 3D primitives. Each primitive is defined by a positive-density superquadric that contributes to the shape, and a negative-density superquadric that carves out local volumes, enabling fine-grained geometric control and flexible topology. This dual-primitive representation yields compact, well-regularized, and efficiently parameterized mesh reconstructions. To infer primitive parameters from multi-view images, we design a differentiable rendering pipeline that jointly estimates positive and negative super

core_reconstruction

high

E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

The emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods still rely on known poses or depend on depth estimation models and auxiliary modalities such as RGB-D. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS. To extract edges from noisy event streams, we

core_reconstruction

high

EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Understanding a 3D scene immediately with its exploration is essential for embodied tasks, where an agent must construct and comprehend the 3D representation in an online and nearly real-time manner. In this study, we propose **EmbodiedSplat**, an online feed-forward 3DGS for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from the streaming images. Unlike existing open-vocabulary 3DGS methods, our objectives are two-fold: 1) Reconstructs the semantic-embedded 3DGS of the entire scene from over 300 streaming images in an online manner. 2) Highly generalizable to novel scenes with feed-forward design and supports nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook where it binds the 2D C

core_reconstruction

medium

Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow

3D Vision & Geometry / Pose Estimation

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

High-quality 3D scene representation in radiance fields relies on accurate camera poses which are often difficult to acquire in real-world scenarios. An effective solution is to use RGB images for the joint optimization of radiance fields and camera poses, an approach that has been well explored in NeRF series methods. However, unlike NeRF, joint optimization in 3D Gaussian Splatting (3DGS) often requires additional regularization or prior spatial knowledge to reach comparable performance. To eliminate these dependencies, we introduce Energy-GS, a pose-aware Gaussian splatting framework that jointly optimizes scene representation and camera poses using only RGB images. We observe that pose gradients in joint optimization are unstable due to the point-based rendering mechanism. Furthermore, unlike NeRF’s spatial sampling framework that enables coarse-to-fine pose alignment, rasterization-

core_reconstruction

high

FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

The demand for dynamic 3D assets in AR/VR has recently popularized Deformable Gaussian Splatting. However, traditional RGB cameras are limited in their ability to reconstruct high-speed scenes due to motion blur and low temporal resolution. While event cameras offer a promising alternative, reconstructing a complete scene from their sparse and noisy output is a significant challenge. Existing event-based methods rely on an auxiliary sensor, such as a frame camera, thereby inducing tedious hardware and calibration challenges.We introduce FastEventDGS, a novel Deformable Gaussian Splatting-based framework that leverages a single event camera for high-fidelity 4D reconstruction. Our method utilizes a continuous camera trajectory parametrization and integrates two event generation models to provide both photometric and geometric constraints. We further propose a local patch event motion loss

core_reconstruction

high

FastGS: Training 3D Gaussian Splatting in 100 Seconds

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.29× training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45× acceleration

core_reconstruction

high

FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present FHAvatar, a novel framework for reconstructing 3D Gaussian avatars with composable face and hair components from an arbitrary number of views. Unlike previous approaches that couple facial and hair representations within a unified modeling process, we explicitly decouples two components in texture space by representing the face with planar Gaussians and the hair with strand-based Gaussians. To overcome the limitations of existing methods that rely on dense multi-view captures or costly per-identity optimization, we propose an aggregated transformer backbone to learn geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling effective and efficient feature extraction and fusion from few casual captures. Extensive quantitative and qualitative experiments demonstrate that FHAvatar achieves state-of-the-art reconstruction quality from on

core_reconstruction

high

FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchor to aggregate flexible input-number-agnostic, camera-pose-free and expression-free inputs into a robust canonical 3D representation.For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which can produce detailed expression-dependent deformations in real time. To better capture rare but critical expressions like wrinkles and bared teeth, we also adopt a data distribution adjustment strategy during training to balance the distribution of these expressions in the training set.Moreover, a lightweight 10-second refinemen

core_reconstruction

high

FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Real objects inhabit a physical world and must behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. We consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object function, beyond visual cues? We propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to

core_reconstruction

high

FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

The increasing need for augmented reality and robotics is urging for articulated object reconstruction with high scalability. However, the existing settings of reconstructing from discrete articulation states or casual monocular video need non-trivial axes alignment or suffer from insufficient coverage, limiting the applications. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under free-moving scenario, a new setting with a simpler setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, free-moving part segmentation discovers rigid parts from relative motion in unconstrained capture. The joint estimation module proposes a noise-resistant approa

core_reconstruction

high

From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors.Given uncalibrated and unposed multi-view images, NAS3R reconstructs 3D Gaussian primitives from context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision.To ensure stable convergence, NAS3R integrates scene reconstruction and camera estimation within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization.The framework is compatible with state-of-the-art architectures and can incorporate pretrained priors or intrinsic information when available.Extensive experiments show that NAS3R achieves superior result

100

core_reconstruction

medium

From Rays to Projections: Better Inputs for Feed-Forward View Synthesis

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting, but title/abstract signal is narrower

abstract

Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity

101

core_reconstruction

high

FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstruction from free sparse-view images often leads to poor surface due to limited overlap and overfitting.We introduce FSFSplatter for $\textbf{f}$ast geometrically accurate reconstruction from $\textbf{f}$ree $\textbf{s}$parse-view images. Our method integrates end-to-end dense Gaussian scene initialization and geometry-enhanced scene optimization.Specifically, FSFSplatter employs a large transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a batch based self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting by leveraging depth and multi-view f

102

core_reconstruction

high

GaussFusion: Improving 3D Reconstruction in the Wild with Geometry-Informed Video Generator

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generaliz

103

core_reconstruction

high

Gaussian Mapping for Evolving Scenes

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Mapping systems with novel view synthesis (NVS) capabilities are widely used in computer vision, as well as in various applications, including augmented reality, robotics, and autonomous driving. Most notably, 3D Gaussian Splatting-based systems show high NVS performance; however, many current approaches are limited to static scenes. While recent works have begun addressing short-term dynamics (motion within the camera's view), long-term dynamics (the scene evolving through changes out of view) remain less explored.To overcome this limitation, we introduce a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest changes. In addition, since maintaining geometric and semantic consistency remains challenging due to stale observations disrupting the reconstruction process, we propose a novel keyframe management mechanism that discards outdate

104

core_reconstruction

high

GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs.To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution.To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks\&Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and r

105

core_reconstruction

high

GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to distill strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are distilled into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-

106

core_reconstruction

high

Geometric-Photometric Event-based 3D Gaussian Ray Tracing

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Event cameras offer a high temporal resolution over traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in the event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves the state-of-the-art performance on the real-world datasets and competitive performance on the synthetic datasets. Also, the proposed method works without prior information (e.g., pretrained image reconstruction mod

107

core_reconstruction

high

GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian splatting (3DGS) has emerged as a promising approach for high-fidelity 3D scene representation. However, relighting and composition of Gaussian splatting remain challenging because path tracing is not directly applicable. Existing relighting methods for Gaussian splatting typically adopt either approximate rendering formulations or rely on Gaussian ray tracing, yielding low relighting performance and low rendering efficiency. To address these limitations, we propose Gaussian hybrid path tracing (GHPT), a three-stage framework to acquire relightable Gaussian splatting models. The first stage utilizes planar-based Gaussian splatting reconstruction representation (PGSR) to enable multi-view consistent depth rendering and reconstruct the surface mesh of a scene. The second stage performs physically-based differentiable rendering on the obtained mesh to reconstruct the material ma

108

core_reconstruction

high

GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipeline: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine current 3D scene using rendering evidence, achieve favorable balance between efficie

109

core_reconstruction

high

GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; generation_editing; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent advances in Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have made it standard practice to reconstruct 3D scenes from multi-view images. Removing objects from such 3D representations is a fundamental editing task that requires complete and seamless inpainting of occluded regions, ensuring consistency in geometry and appearance. Although existing methods have made notable progress in improving inpainting consistency, they often neglect global lighting effects, leading to physically implausible results. Moreover, these methods struggle with view-dependent non-Lambertian surfaces, where appearance varies across viewpoints, leading to unreliable inpainting. In this paper, we present 3D **G**aussian **O**bject **R**emoval in the **I**ntrinsic **S**pace (GOR-IS), a novel framework for physically consistent and visually coherent 3D object removal. Our approach decompose

110

core_reconstruction

high

GP-4DGS: Probabilistic Analysis of 4D Gaussian Splattings for Monocular Video Reconstruction via Variational Gaussian Processes

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present GP-4DGS, a probabilistic framework for monocular video reconstruction that models the motion of 4D Gaussian Splatting (GS) primitives using variational Gaussian Processes (GPs). In contrast to prior approaches that depend on manually designed motion priors, our kernel-based probabilistic formulation enables flexible, data-adaptive motion modeling while implicitly providing appropriate priors for unobserved regions. GP-4DGS employs variational GPs with spatial kernels to capture geometric correlations and periodic kernels to characterize temporal dynamics, achieving efficient scalability to large sets of primitives compared to standard GPs. To train GP-4DGS, we introduce an optimization strategy that jointly optimizes GS primitive parameters as well as GP hyperparameters, establishing a complementary relationship between probabilistic and geometric modeling. Beyond improved rec

111

core_reconstruction

high

HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content-- artifacts inconsistent with the input views -- into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning

112

core_reconstruction

high

Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recent advances in novel view synthesis have enabled differentiable rendering methods to reconstruct 3D scenes directly from images. Algorithms such as 3D Gaussian Splatting and RayGauss use local basis functions to represent radiance fields, enabling fast, high-quality rendering of real-world scenes. However, these methods lack an exact geometric representation of the scene. In this work, inspired by Hermite Radial Basis Function (HRBF) implicits, we introduce a global implicit function constructed from local RBFs and their derivatives to represent surfaces. The proposed formulation enables learning scene geometry through differentiable rendering of an implicit function. By leveraging local basis functions, it achieves both an efficient geometric representation and fast rendering, using a bounding volume hierarchy (BVH) to accelerate intersections with the local basis functions. The imp

113

core_reconstruction

high

HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions—characterized by globally sparse coverage, blurred background, and distorted high-frequency areas.To address this, we propose HeroGS—Hierarchical Guidance for Robust 3D Gaussian Splatting—a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature leve

114

core_reconstruction

high

IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Generalizable 3D Gaussian Splatting aims to directly predict Gaussian parameters using a feed-forward network for scene reconstruction. Among these parameters, Gaussian means are particularly difficult to predict, so depth is usually estimated first and then unprojected to obtain the Gaussian sphere centers. Existing methods typically rely solely on a single warp to estimate depth probability, which hinders their ability to fully leverage cross-view geometric cues, resulting in unstable and coarse depth maps. To address this limitation, we propose IDESplat, which iteratively applies warp operations to boost depth probability estimation for accurate Gaussian mean prediction. First, to eliminate the inherent instability of a single warp, we introduce a Depth Probability Boosting Unit (DPBU) that integrates multi-level epipolar attention maps in a multiplicative manner. Next, we construct a

115

core_reconstruction

high

Illumination-Consistent Human-Scene Reconstruction from Monocular Video

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing 3D humans and scenes from monocular videos is a challenging task, particularly due to human motion, varying illumination, and dynamic scene shadows. While recent works have explored scene disentanglement by jointly modeling humans and their surrounding scenes, they often overlook illumination and shadow effects—resulting in inconsistent human appearance and degraded scene realism. To address this gap, we propose a photometrically consistent integration of human and scene reconstruction based on 3D Gaussian Splatting, with a key focus on modeling spatially-varying illumination and shadows. Central to our method is a learnable light volume that provides localized lighting cues to human Gaussians, enabling more realistic and consistent appearance synthesis. To further ensure accurate human geometry and alignment, we adopt a two-stage reconstruction strategy: we first optimize

116

core_reconstruction

high

iLRM: An Iterative Large 3D Reconstruction Model

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input images to enabl

117

core_reconstruction

high

iSplat: Iterative Learning for Fine-Grained Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent advances in feed-forward 3D Gaussian splatting have demonstrated remarkable efficiency by reconstructing scenes in a single pass. However, the reconstruction fidelity of these methods lags behind that of traditional optimization-based approaches, which gradually correct reconstruction flaws through a lengthy iterative process. In this paper, we leverage the strengths of both paradigms and introduce iSplat, a novel framework that reformulates reconstruction as an iterative feed-forward process involving multiple (typically three) passes.Central to iSplat is a recurrent GRU-based optimizer that refines both geometry and appearance in a synergistic loop. To address geometric inaccuracies, we propose an uncertainty-driven depth refinement strategy that progressively narrows the search space for each Gaussian based on its estimated uncertainty from the previous step. To further improve

118

core_reconstruction

high

Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We address the challenge of reconstructing long dynamic scenes from multi-view videos in a storage-efficient manner. Recent advances in Gaussian Splatting and its extensions to dynamic scenes have demonstrated impressive visual quality, but remain limited to short duration (<10 s), large storage size (>500 MB), and high GPU VRAM usage.To overcome these limitations, we introduce Layered 4D-Rotor Gaussian Splatting (L4DRotorGS), a novel compressed representation designed for long dynamic scenes. Our approach integrates a layered 4D representation, efficient training, and effective compression into a unified framework. Specifically, 4D Gaussians are first organized into layers based on their temporal extents and then partitioned into discrete temporal buckets. This structure allows for selective access and rendering of only the necessary subsets of 4D Gaussians, substantially reducing GPU m

119

core_reconstruction

high

Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mapping

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics.We introduce $\textbf{\textit{UniSplat}}$, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a $\textit{dual-masking strategy}$ that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural informatio

120

core_reconstruction

high

Learning Compact 3D Representations from Feed-Forward Novel View Synthesis

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing and understanding 3D scenes from sparse views in a feed-forward manner remains challenging. While recent approaches use per-pixel 3D Gaussian Splatting for reconstruction and 2D-to-3D feature lifting for scene understanding, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation. We propose a feed-forward framework that estimates compact Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns to efficiently lift features. Extensive experiments on 3D open-vocabulary segmentation and view-invariant feature generation demon

121

core_reconstruction

high

LumiMotion: Improving Gaussian Relighting with Scene Dynamics

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

In 3D reconstruction, the problem of inverse rendering, namely recovering the illumination of the scene and the material properties, is fundamental. Existing Gaussian Splatting-based methods primarily target static scenes and often assume simplified or moderate lighting to avoid entangling shadows with surface appearance. This limits their ability to accurately separate lighting effects from material properties, particularly in real-world conditions. We address this limitation by leveraging dynamic elements - regions of the scene that undergo motion - as a supervisory signal for inverse rendering. Motion reveals the same surfaces under varying lighting conditions, providing stronger cues for disentangling material and illumination. This thesis is supported by our experimental results which show we improve LPIPS by 23\% for albedo estimation and by 15% for scene relighting relative to nex

122

core_reconstruction

medium

LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing; data_benchmark

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and

123

core_reconstruction

high

MAPo : Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting, known for enabling high-quality static scene reconstruction with fast rendering, is increasingly being applied to multi-view dynamic scene reconstruction. A common strategy involves learning a deformation field to model the temporal changes of a canonical set of 3D Gaussians. However, these deformation-based methods often produce blurred renderings and lose fine motion details in highly dynamic regions due to the inherent limitations of a single, unified model in representing diverse motion patterns. To address these challenges, we introduce Motion-Aware Partitioning of Deformable 3D Gaussian Splatting (MAPo), a novel framework for high-fidelity dynamic scene reconstruction. Its core is a dynamic score-based partitioning strategy that distinguishes between high- and low-dynamic 3D Gaussians. For high-dynamic 3D Gaussians, we recursively partition them temporally an

124

core_reconstruction

high

Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object’s complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object’s overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We

125

core_reconstruction

high

MeshSplatting: Differentiable Rendering with Opaque Meshes

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Primitive-based splatting methods like 3D Gaussian Splatting (3DGS) have revolutionized novel view synthesis with real-time rendering.However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering.By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, MeshSplatting creates end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines.On Mip-NeRF360, it boosts PSNR by +0.69 dB over the current state-of-the-art MiLo for mesh-based novel view synthesis, while training 2x faster and using 2x less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction.

126

core_reconstruction

high

MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation as the core foundation, serving as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that

127

core_reconstruction

high

Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs. free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-base

128

core_reconstruction

high

MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectioanl Blending with Hierarchical Densification

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes.However, one of the major remaining challenges lies in modeling long-range motion-contained dynamic videos, where a naïve extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes.Our method progressively constructs locally canonical anchor spaces at key-frame time index and models inter-frame deformations at the anchor level, enhancing temporal c

129

core_reconstruction

high

MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Online reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that treats Gaussian movement as a core modeling object. Specifically, we efficiently leverage optical flow on a sparse set of key views as a lightweight motion cue to guide per-Gaussi

130

core_reconstruction

high

MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting.Monocular reconstruction is inherently ill-posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage.Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings.To enable compact representations, fast training, and real-time rendering

131

core_reconstruction

high

MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Realistic reconstruction of dynamic 4D scenes is essential for understanding the physical world.Despite recent progress in monocular view synthesis, existing methods still struggle to recover accurate 3D geometry and temporally consistent motion in complex environments.To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences, enabling faithful reconstruction of high-fidelity scene structures and coherent motion representation under complex dynamics.To handle motion with arbitrary flexibility and long-term variation, we introduce a scalable motion field built upon cluster-based bases that adaptively grow to capture diverse motion patterns over time.Moreover, we introduce a progressive optimization strategy that extends naturally to unseen frames. This strategy comprises two propagation modules: 1)

132

core_reconstruction

high

MSCD-GS: Motion-Separated Cooperative Deblurring Dynamic Reconstruction via Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Although 4D reconstruction based on Gaussian Splatting has achieved many impressive results, reconstructing real-world images captured by a casual monocular camera remains a significant challenge. In dynamic scenes, as the camera and objects move during the exposure time, these input images inevitably contain a considerable amount of motion blur, which severely compromises the quality of reconstruction and new viewpoint synthesis. The existing deblurring 3D Gaussian models still cannot handle motion blur issues in real dynamic scenes. To address these challenges, we propose MSCD-GS—a novel method for motion-separated collaborative deblurring 4D reconstruction via Gaussian Splatting, capable of effectively handling motion-blurred inputs. Specifically, due to the distinct motion characteristics of static and dynamic Gaussians, we perform separate motion modeling to achieve dynamic scene re

133

core_reconstruction

high

MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Generalizable Neural Radiance Fields (GeNeRF) enable high-quality scene reconstruction from a limited number of views and can generalize to unseen scenes. However, in real-world environments, transient distractors disrupt structural consistency across views, leading to deviated supervision signals and degraded reconstruction quality. Existing distractor-free NeRF methods rely on per-scene optimization and they estimate uncertainty from per-view reconstruction errors to remove distractors, but this is unreliable to GeNeRF, because it may misjudge inconsistent static structures from source views as distractors. To address this issue, we propose MUGeNeRF: a multi-view uncertainty-guided distractor-aware GeNeRF method, aim to effectively alleviate GeNeRF's robust modeling challenges in dynamic scenes with transient distractions. We explicitly decompose distractor awareness into two complemen

134

core_reconstruction

high

Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent years have witnessed the rapid emergence of 3D Gaussian Splatting (3DGS) as a powerful approach for 3D reconstruction and novel view synthesis. Its explicit representation with Gaussian primitives enables fast training, real-time rendering, and convenient post-processing such as editing and surface reconstruction. However, 3DGS suffers from a critical drawback: the number of primitives grows drastically for scenes with high-frequency appearance details, since each primitive can represent only a single color, requiring multiple primitives for every sharp color transition.To overcome this limitation, we propose Neural Gabor splatting, which augments each Gaussian primitive with a lightweight multi-layer perceptron (MLP) that models a wide range of color variations within a single primitive. To further control primitive numbers, we introduce a frequency-aware densification strategy t

135

core_reconstruction

high

No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: getting aligned RGB-X data, where most RGB-X prior work assumes such pairs exist and focuses on modality fusion, but it empirically requires huge engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate them in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for X-sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning by a scalable solution that breaks through the bottlene

136

core_reconstruction

high

Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, ``Off The Grid" distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine de

137

core_reconstruction

high

P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency.To address this limitation,

138

core_reconstruction

high

PackUV: Packed Gaussian UV Maps for 4D Volumetric Video

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; generation_editing; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlas, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions,

139

core_reconstruction

high

Part$^{2}$GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Articulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part$^{2}$GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part$^{2}$GS leverages a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometry. To ensure physically consistent motion, we propose a motion-aware canonical representation guided by physics-based constraints, including contact enforcement, velocity consistency, and vector-field alignment. Furthermore, we introduce a field of repel points to prevent part collisions and maintain stable articulation paths, significantly improving

140

core_reconstruction

high

ParticleGS: Learning Neural Gaussian Particle Dynamics from Videos for Prior-free Physical Motion Extrapolation

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

The ability to extrapolate dynamic 3D scenes beyond the observed timeframe is fundamental to advancing physical world understanding and predictive modeling. Existing dynamic 3D reconstruction methods have achieved high-fidelity rendering of temporal interpolation, but typically lack physical consistency in predicting the future. To overcome this issue, we propose ParticleGS, a physics-based framework that reformulates dynamic 3D scenes as physically grounded systems. ParticleGS comprises three key components: 1) an encoder that decomposes the scene into static properties and initial dynamic physical fields; 2) an evolver based on Neural Ordinary Differential Equations (Neural ODEs) that learns continuous-time dynamics for motion extrapolation; and 3) a decoder that reconstructs 3D Gaussians from evolved particle states for rendering. Through this design, ParticleGS integrates physical re

141

core_reconstruction

high

PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing - two tasks that are inherently challenging due to ambiguity in plausible explanations for the same input. At the heart of our approach lies our novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely-adopted low-level losses like LPIPS, SSIM or L1, we rely on deep visual understanding of images and the resulting generalized supervision signals. We show that our new loss can be a drop-in replacement for standard losses and used to improve visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view-consistency and in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in

142

core_reconstruction

high

PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Understanding physical properties such as friction, stiffness, hardness, and material composition is essential for enabling robots to interact safely and effectively with their surroundings. However, existing 3D reconstruction methods focus on geometry and appearance and cannot infer these underlying physical properties. We present PhysGS, a Bayesian-inferred extension of 3D Gaussian Splatting that estimates dense, per-point physical properties from visual cues and vision--language priors. We formulate property estimation as Bayesian inference over Gaussian splats, where material and property beliefs are iteratively refined as new observations arrive. PhysGS also models aleatoric and epistemic uncertainties, enabling uncertainty-aware object and scene interpretation. Across object-scale (ABO-500), indoor, and outdoor real-world datasets, PhysGS improves accuracy of the mass estimation by

143

core_reconstruction

high

Physically Inspired Gaussian Splatting for HDR Novel View Synthesis

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

High dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails in correcting abnormal HDR values, and results in limited gradients for Gaussians in under/over-exposed regions. To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision fo

144

core_reconstruction

high

Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction by achieving high-quality novel view synthesis with fast rendering speed, introducing 3D Gaussian primitives to represent the scene. However, 3DGS encounters blurring and floaters when applied to complex scenes, caused by the reconstruction of redundant and ambiguous geometric structures. We attribute this issue to the unstable optimization of the Gaussians. To address this limitation, we present a plug-and-play PDE-based optimization method that overcomes the optimization constraints of 3DGS-based approaches in various tasks, such as novel view synthesis and surface reconstruction. Firstly, we theoretically derive that the 3DGS optimization procedure can be modeled as a PDE, and introduce a viscous term to ensure stable optimization. Secondly, we use the Material Point Method (MPM) to obtain a stable numerical

145

core_reconstruction

medium

PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting

3D Vision & Geometry / Point Cloud

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mapping

direct reconstruction/3DGS/4D title linked to core representation cluster

abstract

Unsupervised point cloud segmentation is critical for embodied intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. Integrating 2D pre-trained models such as SAM to supplement semantic information is a natural choice, yet this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations and achieve semantic-consistent segmentation, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds

146

core_reconstruction

high

PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Polarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but face challenges in slow training, inefficient rendering, and strong dependencies on material/viewpoint assumptions. However, 3D Gaussian Splatting (3DGS) enables real-time rendering yet struggles with accurate reflection reconstruction from reflection-geometry entanglement, adding a deferred reflection module introduces environment map dependence. We address these limitations by proposing PolarGuide-GSDR, a polarization-forward-guided paradigm establishing a bidirectional coupling mechanism between polarization and 3DGS: first 3DGS’s geometric priors are leveraged to resolve polarization ambiguity, and then the refined polarization information cues are used to guide 3DGS’s normal and spherical harmonic representation. This process achieves high-fidelity reflection separation an

147

core_reconstruction

high

Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Omnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse points priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D–3D correspondences between the reconstructed Gaussians and the unposed images using Gaussians' internal depth priors. Besides, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photoreali

148

core_reconstruction

medium

PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting

3D Vision & Geometry / Pose Estimation

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

6D pose estimation is a key technology in computer vision and robotic manipulation. However, many methods remain heavily dependent on CAD models that are difficult to obtain. Object-level 3D reconstruction provides an alternative route, and 3D Gaussian Splatting (3DGS) shows convincing potential owing to its training and rendering efficiency. Nevertheless, under sparse reference views, 3DGS is prone to floating artifacts and appearance overfitting, which weakens the stability of pose estimation. We present PoseGaussian, a method for sparse-view 6D pose estimation for unseen object that builds on improved 3DGS. First, we use sparse RGB-D views to inject a depth structure prior into the 3DGS initialization for stable structure, and we adopt adaptive density control, view-warping augmentation, and joint photometric–depth supervision to reduce floaters and appearance overfitting under sparse

149

core_reconstruction

high

Radiance Meshes for Volumetric Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We introduce Radiance Meshes for representing radiance fields with constant density tetrahedral cells produced with a Delaunay tetrahedralization.Unlike a Voronoi diagram, a Delaunay tetrahedralization yields simple triangles that are natively supported by existing hardware. As such, our model is able to perform exact and fast volume rendering using both rasterization and ray-tracing. We introduce a new rasterization method that achieve faster rendering speeds than all prior radiance field representations (assuming an equivalent number of primitives and resolution) across a variety of platforms.Optimizing the positions of Delaunay vertices introduces topological discontinuities (edge flips). To solve this, we use a Zip-NeRF-style backbone which allows us to express a smoothly varying field even when the topology changes.Our rendering method exactly evaluates the volume rendering equation

150

core_reconstruction

high

REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consisten

151

core_reconstruction

high

RelightAnyone: A Generalized Relightable 3D Gaussian Head Model

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; robotics_mapping; generation_editing; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has become a standard approach to reconstruct and render photorealistic 3D head avatars. A major challenge is to relight the avatars to match any scene illumination. For high quality relighting, existing methods require subjects to be captured under complex time-multiplexed illumination, such as one-light-at-a-time (OLAT). We propose a new generalized relightable 3D Gaussian head model that can relight any subject observed in a single- or multi-view images without requiring OLAT data for that subject. Our core idea is to learn a mapping from flat-lit 3DGS avatars to corresponding relightable Gaussian parameters for that avatar. Our model consists of two stages: a first stage that models flat-lit 3DGS avatars without OLAT lighting, and a second stage that learns the mapping to physically-based reflectance parameters for high-quality relighting. This two-stage

152

core_reconstruction

high

ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

High-quality 3D garment reconstruction plays a crucial role in mitigating the sim-to-real gap in applications such as digital avatars, virtual try-on and robotic manipulation. However, existing garment reconstruction methods, typically rely on the unstructured representations, such as 3D Gaussian Splats, which struggle to provide accurate reconstructions of garment topology and sewing structures. As a result, the reconstructed outputs are often unsuitable for high-fidelity physical simulation. We propose \textbf{ReWeaver}, a novel framework for topology-accurate 3D garment and sewing pattern reconstruction from \textit{sparse} multi-view RGB images. Given as few as four input views, ReWeaver predicts seams and panels as well as their connectivities in both the 2D UV space and the 3D space. The reconstructed seams and panels align precisely with the input images, and can be easily convert

153

core_reconstruction

high

RF4D:Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Neural fields (NFs) have achieved remarkable success in scene reconstruction and novel view synthesis. However, existing NF approaches that rely on RGB or LiDAR inputs often struggle under adverse weather conditions, limiting their robustness in real-world outdoor environments such as autonomous driving. In contrast, millimeter-wave radar is inherently resilient to environmental variations, yet its integration with NFs remains largely underexplored. Moreover, outdoor driving scenes frequently involve dynamic objects, making spatiotemporal modeling crucial for temporally consistent novel view synthesis. To address these challenges, we present RF4D, a radar-based neural field framework tailored for novel view synthesis in outdoor dynamic scenes. RF4D explicitly incorporates temporal information into its representation, enabling more accurate modeling of object motion. A dedicated \textbf{s

154

core_reconstruction

high

RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing people, objects, and their interactions in 3D is a long-standing and fundamental goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a novel three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and

155

core_reconstruction

high

S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Explicit 3D representations have already become an essential medium for 3D simulation and understanding.However, the most commonly used point cloud and 3D Gaussian Splatting (3DGS) each suffer from non-photorealistic rendering and significant degradation under sparse inputs.In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs.Specifically, the S2D lifting is two-fold.We first present an efficient one-step diffusion model that lifts sparse point cloud for high-fidelity image artifact fixing.Meanwhile, to reconstruct 3D consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradient for robust model fitting from sparse input views to dense novel views.Extensive experiments show that S2D achieves the best consis

156

core_reconstruction

medium

ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing; data_benchmark

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned by specifying inputs such as 3d bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images

157

core_reconstruction

high

SDGS: Spatial Difference Guided Gaussian Splatting for Simultaneous Localization and 3D Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit 3D representation, enabling photorealistic and real-time novel view synthesis. However, most 3DGS pipelines still assume precomputed camera poses and offline optimization, which introduces latency and makes them brittle in fast-motion, real-world scenarios. Existing online 3DGS systems mostly fall into two camps: (1) hybrid systems that rely on a separate traditional SLAM system for camera poses and optimize Gaussians decoupled from tracking, increasing system complexity; and (2) purely Gaussian-based systems that estimate poses from dense photometric errors, requiring repeated rendering of a large number of Gaussians and thus incurring high computational cost. Moreover, current online methods are often sensitive to motion blur and high dynamic range scenes, limiting their applicability in practice.We address these

158

core_reconstruction

high

Semantic Foam: Unifying Spatial and Semantic Scene Decomposition

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Current generation scene reconstruction methods like 3D Gaussian Splatting are capable of producing photo-realistic novel view synthesis at real-time speeds, yet see only limited adoption in many practical graphics applications.One significant contributing factor to this gap is the difficulty of interacting with and editing these representations in comparison to classic human-authored 3D assets.While work has been done to impose semantic decomposition onto these representations, there are still significant limitations in the quality and consistency of these segmentations.We address this by proposing a semantically decomposed variant of the recently introduced Radiant Foam method.Our approach, Semantic Foam, combines the natural spatial volumetric decomposition provided by Radiant Foam's Voronoi mesh with an explicit semantic feature field parameterized on the cells.The explicit mesh stru

159

core_reconstruction

high

SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We presents SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. Unlike existing 3D Gaussian Splatting (3DGS) based methods that focus on object-centric reconstruction and fail to work under sparse view settings, our method allows to achieve high-quality geometry reconstruction and accurate disentanglement of material and illumination. The core idea is to construct a dense and geometry-consistent Gaussian semantic field guided by semantic and geometric priors, providing a reliable foundation for subsequent inverse rendering. Building upon this, we perform material–illumination disentanglement by combining a hybrid illumination model and material prior to effectively capture illumination–material interactions. To mitigate the impact of cast shadows and enhance the robustness of material recovery, we introduce illumination-invariant material constrai

160

core_reconstruction

high

SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Novel view synthesis of dynamic scenes is fundamental to achieving photorealistic 4D reconstruction and immersive visual experiences. Recent progress in Gaussian-based representations has significantly improved real-time rendering quality, yet existing methods still struggle to maintain a balance between long-term static and short-term dynamic regions in both representation and optimization. To address this, we present SharpTimeGS, a lifespan-aware 4D Gaussian framework that achieves temporally adaptive modeling of both static and dynamic regions under a unified representation.Specifically, we introduce a learnable lifespan parameter that reformulates temporal visibility from a Gaussian-shaped decay into a flat-top profile, allowing primitives to remain consistently active over their intended duration and avoiding redundant densification. In addition, the learned lifespan modulates each

161

core_reconstruction

high

Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms.Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then

162

core_reconstruction

high

SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. Besides, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy b

163

core_reconstruction

high

SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing dynamic fluids from sparse views is a long-standing and challenging problem, due to the severe lack of 3D information from insufficient view coverage. While several pioneering approaches have attempted to address this issue using differentiable rendering or novel view synthesis, they are often limited by time-consuming optimization under ill-posed conditions. We propose SmokeSVD, an efficient and effective framework to progressively reconstruct dynamic smoke from a single video by integrating the generative capabilities of diffusion models with physically guided consistency optimization. Specifically, we first propose a physically guided side-view synthesizer based on diffusion models, which explicitly incorporates velocity field constraints to generate spatio-temporally consistent side-view images frame by frame, significantly alleviating the ill-posedness of single-view

164

core_reconstruction

high

SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Articulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive exp

165

core_reconstruction

high

Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Neural Radiance Fields (NeRF) have shown remarkable success in image novel view synthesis (NVS), inspiring extensions to LiDAR NVS. However, most methods heavily rely on accurate camera poses for scene reconstruction. The sparsity and textureless nature of LiDAR data also present distinct challenges, leading to geometric holes and discontinuous surfaces. To address these issues, we propose SG-NLF, a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. Specifically, we design a hybrid representation based on spectral priors to reconstruct smooth geometry. For pose optimization, we construct a confidence-aware graph based on feature compatibility to achieve global alignment. In addition, an adversarial learning strategy is introduced to enforce cross-frame consistency, thereby enhancing reconstruction quality. Comprehensive experiments demonstrate

166

core_reconstruction

high

Splatent: Splatting Diffusion Latents for Novel View Synthesis

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: The VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D spa

167

core_reconstruction

high

SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively app

168

core_reconstruction

high

SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D super-resolution (3DSR) aims to reconstruct high-resolution (HR) 3D scenes from low-resolution (LR) multi-view images. Existing methods rely on dense LR inputs and per-scene optimization, which restricts the high-frequency priors for constructing HR 3D Gaussian Splatting (3DGS) to those inherited from pretrained 2D super-resolution (2DSR) models. This severely limits reconstruction fidelity, cross-scene generalization, and real-time usability. We propose to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. This fundamentally changes how 3DSR acquires high-frequency knowledge and enables robust generalization to unseen scenes. Specifically, we introduce \textbf{SR3R}, a feed-forward framework that directly predi

169

core_reconstruction

high

STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Reconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gaussians to mesh triangles and model deformations solely via Linear Blend Skinning, which results in rigid motion and limited expressiveness. Moreover, they struggle to reconstruct frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image- and FLAME-based priors to learn per-Gaussian feature offsets within the UV space. This UV representation supports dynamic resampling, ensuring full compatibility with Adaptive Density Control (ADC) and enhanced adaptability to geometric and textural variations. (2) a Temporal ADC strategy, which first clusters

170

core_reconstruction

high

SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step.However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object’s motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven

171

core_reconstruction

high

Tavatar: Topology-Aware Gaussian Attribute Derivation for Animatable Human Avatars

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Reconstructing high-fidelity, animatable human avatars from monocular videos remains a critical challenge. Existing 3DGS-based human animation methods constrain Gaussian parameters but exclude scale, which we argue is crucial for adapting human poses to challenging out-of-distribution poses. To achieve robust animation under unseen poses, we propose Tavatar, which derives key parameters such as scale, rotation, and other geometric attributes directly from the local mesh geometry, instead of learning them through unconstrained optimization. This paradigm shift enforces topological consistency by design, as each Gaussian is analytically anchored to the local mesh geometry, inheriting its spatial structure and deformation behavior. Specifically, we bind Gaussians to mesh faces and vertices, deriving their scales and orientations from triangle properties and local edge lengths to ensure cohe

172

core_reconstruction

high

TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present **TokenSplat**, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images.At its core, TokenSplat introduces a **Token-aligned Gaussian Prediction** module that aligns semantically corresponding information across views directly in the feature space.Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians.To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an **Asymmetric Dual-Flow Decoder (ADF-Decoder)** that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stab

173

core_reconstruction

high

Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; generation_editing; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance.We present \emph{Track4DGen}, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that cur

174

core_reconstruction

high

tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model’s capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforwar

175

core_reconstruction

high

TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrat

176

core_reconstruction

high

Uika: Universal Head Avatar from Pose-Free Images

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike the traditional avatar method, which requires a studio-level multi-view capture system and reconstructs a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise UV coordinate estimation. Such UV coordinate estimation allows us to project each valid pixel from screen space to UV space, which is independent of camera pose and character expression. We thus leverage this UV space to represent our Gaussian head avatar. To this end, we design learnable UV tokens on which the attent

177

core_reconstruction

high

Unblur-SLAM: Dense Neural SLAM for Blurry Inputs

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We propose Unblur-SLAM, an RGB SLAM pipeline for sharp 3D reconstruction from blurred image inputs. In contrast to previous work, our approach is able to handle different types of blur and demonstrates state-of-the-art performance in the presence of both motion blur and defocus blur. Moreover, we adjust the computation effort with the amount of blur in the input image.As a first stage, our method uses a feed-forward image deblurring model for which we propose a suitable training scheme that can improve both tracking and mapping modules.Frames that are successfully deblurred by the feed-forward network obtain refined poses and depth through local-global multi-view optimization and loop closure. Frames that fail the first stage deblurring are directly modeled through the global 3DGS representation and an additional blur network to model multiple blurred sub-frames and simulate the blur for

178

core_reconstruction

high

Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. The resulting visibility field is integrated into a Bayesian Network–based uncertainty-aware volume rendering process, enabling real-time (200 FPS) uncertainty quantification for synthesized views. Active mapping is further performed within a maximum information gain framework building on this formulation.Extensive experiments across diverse environments demonstrate that GAVIS consistently and signifi

179

core_reconstruction

high

Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce a novel feed-forward framework that reconstructs 3D scenes from unposed multi-view images. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction—all within a single, feed-forward pass. Extensive experiments demonstrate this method establishes a new state-of-the-art across multiple benchmarks, including RE10K and ScanNet. Our work signifies a novel paradigm towards generalizable 3D scene reconstruction.

180

core_reconstruction

high

VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones, incapable of reconstructing geometry from missing structures. To address these limitati

181

core_reconstruction

high

VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Simultaneous Localization and Mapping (SLAM) with 3D Gaussian Splatting (3DGS) enables fast, differentiable rendering and high-fidelity reconstruction across diverse real-world scenes. However, existing 3DGS-SLAM approaches handle measurement reliability implicitly, making pose estimation and global alignment susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties. To this end, we introduce VarSplat, an uncertainty-aware 3DGS-SLAM system that explicitly learns per-splat appearance variance. By using the law of total variance with alpha compositing, we then compute corresponding differentiable per-pixel uncertainty map. This variance map guides tracking, submap registration, and loop detection toward focusing on reliable regions and contributes to more stable optimization. Experimental results on Replica (synthetic) and TUM-RGBD, Sca

182

core_reconstruction

high

VDFE: Difference-Aware 3D Scene Editing with Non-Intrusive Video Diffusion Priors for Multi-View Consistency and Efficiency

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; generation_editing; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Text-driven 3D editing, enabled by advancements in 3D reconstruction techniques such as NeRF and 3D Gaussian Splatting, aims to provide intuitive scene customization. However, existing methods frequently exhibit limitations in controllability and consistency. To address these shortcomings, we propose \textbf{VDFE}, a difference-aware 3D scene editing method based on non-intrusive utilization of pre-trained video diffusion priors, which integrates Optimal Control Guided Flow Editing (FlowOCE), Decoupled Flow Difference (DFD), and Difference-Aware Gaussians Editing (DAGE). Specifically, FlowOCE treats the editing process as an optimal control problem, optimizing a noise-free editing trajectory to minimize unintended modifications in non-target region; DFD precisely locates editing region by analyzing flow differences, which supplies priors for the subsequent optimization process; and DAGE

183

core_reconstruction

high

Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Existing single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow

184

core_reconstruction

high

GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels, which are prevalent in everyday environments. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and separates outgoing radiance into reflection and transmission components according to its optical properties, enabling coherent Gaussian radiance transport. During the optimization, GLINT bootstraps transparency localization by utilizing geometry separation cues that emerge from our decomposition with the geometry and material priors from a pre-trained video relighting model. Extensive experime

185

core_reconstruction

high

MatSpray: Fusing 2D Material World Knowledge on 3D Geometry

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Manual modeling of material parameters and 3D geometry is a time consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a

186

core_reconstruction

high

Multi-view Pyramid Transformer: Look Coarser to See Broader

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of ``looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the un

187

core_reconstruction

high

Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) represents scenes through primitives with coupled intrinsic properties: geometric attributes (position, covariance, opacity) and appearance attributes (view-dependent color). Faithful reconstruction requires intrinsic geometry-appearance consistency, where geometry accurately captures 3D structure while appearance reflects photometry. However, sparse observations lead to appearance overfitting and underconstrained geometry, causing severe novel-view artifacts.We present ICO-GS (Intrinsic Geometry-Appearance Consistency Optimization for 3DGS), a principled framework that enforces this consistency through tightly coupled geometric regularization and appearance learning. Our approach first regularizes geometry via feature-based multi-view photometric constraints by employing pixel-wise top-k selection to handle occlusions and edge-aware smoothness to preserve sh

188

core_reconstruction

high

RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Human perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry un-modeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG n

189

core_reconstruction

medium

Motion-Aware Animatable Gaussian Avatars Deblurring

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; data_benchmark

core genus=3D Gaussian Splatting, but title/abstract signal is narrower

abstract

The creation of 3D human avatars from multi-view videos is a significant yet challenging task in computer vision. However, existing techniques rely on high-quality, sharp images as input, which are often impractical to obtain in real-world scenarios due to variations in human motion speed and intensity. This paper introduces a novel method for directly reconstructing sharp 3D human Gaussian avatars from blurry videos. The proposed approach incorporates a 3D-aware, physics-based model of blur formation caused by human motion, together with a 3D human motion model designed to resolve ambiguities in motion-induced blur. This framework enables the joint optimization of the avatar representation and motion parameters from a coarse initialization. Comprehensive benchmarks are established using both a synthetic dataset and a real-world dataset captured with a 360-degree synchronous hybrid-expos

190

core_reconstruction

high

PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Multi-view 3D reconstruction methods remain highly sensitive to photometric inconsistencies arising from camera optical characteristics and variations in image signal processing (ISP). Existing mitigation strategies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly to novel views. We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. A dedicated PPISP controller, trained on the input views, predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This design enables realistic and fair evaluation on novel views without access to ground-truth images. PPISP achieves SoTA performance on standard benchmarks, while providing intuitive control a

191

core_reconstruction

medium

High-Fidelity Mobile Avatars with Pruned Local Blendshapes

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

We propose a method to reconstruct high-fidelity human avatars from multi‑view video that can run on mobile devices. Many works can model high‑quality Gaussian-based full-body avatars from multi‑view video. However, these methods require heavy computation to obtain pose‑dependent appearance, making deployment on mobile devices very difficult. Recent methods distill from pretrained models and model pose‑dependent nonlinear Gaussian attributes by linearly combining global pose features with blendshapes. Although they can run on mobile devices, they suffer some loss of detail. We observe that nearby Gaussians are often highly correlated within a local region of the body, and can be linearly modeled with less error. Therefore, we use local linear blendshapes in small body parts to capture global nonlinear changes of Gaussian attributes. To further reduce computation and model size, we propos

192

core_reconstruction

medium

Learning Convex Decomposition via Feature Fields

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

This work proposes a new formulation to the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world learning of convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications.The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity.Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets resulting in the first learned open-world for convex decompositio

193

core_reconstruction

high

EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per-view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT) where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF) stage where our model learns rank primitives and adaptively adjust their param

194

core_reconstruction

high

More Natural, More Real: Object-aware Gaussian Splatting for 3D Visual Decoding from Human Brain

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Exploring human visual perception and understanding of the stereoscopic world represents a significant topic in computational neuroscience. Recent studies have provided rich Brain-3D datasets, conducted preliminary explorations into 3D visual reconstruction. However, existing research struggles to capture the differences in dynamic changes of 3D stimulus views, and there remains room for improvement in high-fidelity reconstruction and rendering. 3D Gaussian Splatting (3DGS) has recently achieved significant progress in stereoscopic view synthesis. Inspired by it, we propose BrainGS -- an innovative framework for decoding more realistic 3D objects from the brain. BrainGS incorporates a Fusion Time-Spatial Network to achieve comprehensive encoding of the brain, combined with the Multi-Attribute Controller (MAC), it decouples features using visual, semantic, and color as anchors, effectivel

195

core_reconstruction

high

CoRoGS: Contextual Gaussian Splatting for Robust Large-Deviation View Synthesis

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Novel view synthesis (NVS) under large view deviations remains an underexplored challenge for 3D Gaussian Splatting (3DGS). In urban scenes with limited training coverage, models often fail to maintain geometric consistency when extrapolating to unseen viewpoints, resulting in severe distortions and degraded rendering quality. We introduce Context-Aware Gaussian Splatting (CoRoGS), a $\textbf{Co}$ntext-aware framework for $\textbf{Ro}$bust large-deviation novel view synthesis (LD-NVS) that embeds contextual reasoning into 3DGS. Instead of treating Gaussians as independent primitives, CoRoGS adopts a contextual formulation that explicitly models inter-Gaussian dependencies. This representation is implemented by constructing a 3D Gaussian graph, which propagates relational geometry and semantics via message passing, resulting in context-aware Gaussian updates. To further maintain structura

196

core_reconstruction

high

DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) Black borders at image edges cause information loss and negate the fisheye’s large FOV advantage; 2) Undistortion’s stretch‐and‐interpolate resampling spreads each pixel’s value over a larger area, diluting detail density— causes 3DGS overfitting these low‐frequency zones, producing blur and floating artifacts.In this work, we integrate fisheye camera model into the original 3DGS framework, enab

197

core_reconstruction

high

Evidential Neural Radiance Fields

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare mult

198

core_reconstruction

high

LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Novel View Synthesis has often relied on explicit 3D representations, which inject a strong 3D bias in the process; however, recent work has shown that network-based rendering can work better despite lacking 3D inductive biases. In this paper, we show that much better quality can be obtained by leveraging a strong 3D bias without a 3D representation. To do so, we introduce LagerNVS, an encoder-decoder network that uses 3D-aware features as a latent scene encoding. The encoder is initialized from a 3D reconstruction network, paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis results (including 31.1 PSNR on Re10k), with and without known cameras, renders in real-time, generalizes to in-the-wild data without known cameras, and can be paired with a diffusion decoder for generati

199

core_reconstruction

high

Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we in this work go one step further beyond previous methods to explicitly model continuous position and orientation deformation of dynamic Gaussians, using an SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. Besides, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code and trained model will be made pub

200

core_reconstruction

high

NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present NimbusGS, a unified framework for reconstructing high-quality 3D scenes from degraded multi-view inputs captured under diverse and mixed adverse weather conditions. Unlike existing methods that target specific weather types, NimbusGS addresses the broader challenge of generalization by modeling the dual nature of weather: a continuous, view-consistent medium that attenuates light, and dynamic, view-dependent particles that cause scattering and occlusion. To capture this structure, we decompose degradations into a global transmission field and per-view particulate residuals. The transmission field represents static atmospheric effects shared across views, while the residuals model transient disturbances unique to each input. To enable stable geometry learning under severe visibility degradation, we introduce a geometry-guided gradient scaling mechanism that mitigates gradient i

201

core_reconstruction

high

RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has emerged as a leading technology for high-quality 3D scene reconstruction. However, the iterative refinement and densification process leads to the generation of a large number of primitives, each contributing to the reconstruction to a substantially different extent. Estimating primitive importance is thus crucial, both for removing redundancy during reconstruction and for enabling efficient compression and transmission.Existing methods typically rely on rendering-based analyses, where each primitive is evaluated through its contribution across multiple camera viewpoints. However, such methods are 1) sensitive to the number and selection of views; 2) rely on specialized differentiable rasterizers; and 3) have long calculation times that grow linearly with view count, making them difficult to integrate as plug-and-play modules, as well as resulting in limi

202

core_reconstruction

high

ReLaGS: Relational Language Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language-derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated a

203

core_reconstruction

high

ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

The ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that, the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby achieving results that outperform the performance of optimization-based

204

core_reconstruction

high

PromptDepth: Efficient and Promptable Geometric 3D Vision Model \\ for Embodied Intelligence

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Vision models for embodied intelligence require efficient 3D comprehension and interaction with objects within the scene. Existing 3D reconstruction models either overlook instance-level perception or rely on time-consuming offline reasoning, showing a less adaptability in real-time embodied scenario. In this paper, we present PromptDepth, the first promptable vision model that features both geometric 3D understanding and instance-level interaction especially designed for embodied intelligence. PromptDepth is a feed-forward network that quickly yields panoptic, instanced, or tracked depth map from two corresponding frames, enabling the real-time infer sequences from embodied agents. Specifically, following the minimal prediction problem, we design a promptable Dense Prediction Transformer, making it flexible to interact with unified dense prediction according to a specific prompt. Consid

205

core_reconstruction

medium

Representing 3D Faces with Learnable B-Spline Volumes

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

We present CUBE (Control-based Unified B-Splinie Encoding), a new geometric representation for digital humans that combines B-Spline volumes with learned features, and demonstrate its use as decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-Spline representations that use 3D control points, CUBE is parametrized by a lattice (e.g., $8 \times 8 \times 8$) of high-dimensional control features, increasing the models' expressivity. These control features define a continuous mapping from a 3D parametric domain to 3D Euclidean space through an intermediate feature space, which is evaluated in two stages. First, high-dimensional control features are locally blended using the B-Spline bases, yielding a high-dimensional feature vector, where the first three values are the 3D coordinates of a coarse base mesh. This feature vector is input to a small MLP to pre

206

core_reconstruction

medium

SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustm

207

core_reconstruction

high

SMVRT: Implicit Human 3D Modeling Using Sparse Multi-view Volumetric Reconstruction with Transformer Fusion

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recently, the community has witnessed significant progress in human modeling from a single view or multi-views, which often involves "guessing" the occluded parts using either generative models or template fitting. In this work, we address these challenges by exploring optimal fusion strategies from sparse views only. We propose an end-to-end implicit 3D reconstruction framework using a sparse multi-view setup. Specifically, we achieve this by exploring fusion blocks at three stages of the network. First, 2D feature encoders carrying out locally and globally, which produce enhanced features. Second, 3D feature grid, formed by attentional fusion of warped multi-view and multi-level 2D features, which follows 3D regularization of feature grids to aggregate spatially coherent multi-view features. Third, attentional 2D3D feature aggregation associated to query point generate enhanced latent

208

core_reconstruction

high

OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

B. bridge: reconstruction becomes mapping/world model

gaussian_radiance; depth_correspondence; robotics_mapping; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build local consistency map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within sliding window into complete instances. Subsequently, to update the global map, we construct explicit spatial attribute grids for the

209

core_reconstruction

high

Reconstructing Functional 3D Scenes from Egocentric Interaction Videos

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present FunREC, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunREC operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunREC surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5$-$10$\times$ lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation

210

core_reconstruction

high

X-Part: High Fidelity And Structure Coherent Shape Decomposition And Completion

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Generating 3D shapes at part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and suffer from poor semantically meaningful decomposition. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part exploits the bounding box as prompts for the part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structural

211

core_reconstruction

high

AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM

3D Vision & Geometry / 3D Gaussian Splatting

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian splatting (3DGS) has emerged as a revolutionary scene representation in simultaneous localization and mapping (SLAM) research. However, existing research on 3DGS-based SLAM fails to accurately address the appearance variations induced by camera auto-exposure in prevalent real-world scenarios, resulting in reduced localization and photorealistic mapping accuracy. To address this issue, we propose a stereo auto-exposure-robust Gaussian splatting SLAM (AERGS-SLAM), a framework robust to such variations and enables both reliable localization and exposure-controlled photorealistic mapping. Our key contributions are two fold. Firstly, we propose a camera exposure network to model the camera exposure process, which we integrate with Gaussian splatting to achieve exposure-controlled novel view synthesis. Secondly, we exploit an illumination-robust geometric feature for localization a

212

core_reconstruction

high

Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM

3D Vision & Geometry / 3D Gaussian Splatting

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mapping

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Handling the dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling their temporal centers at keyframes. These centers are propagated usin

213

core_reconstruction

high

ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM

3D Vision & Geometry / 3D Gaussian Splatting

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

This work presents ODGS-SLAM, an omnidirectional simultaneous localization and mapping (SLAM) system utilizing 3D Gaussian Splatting (3DGS) as the unified representation for tracking and mapping.Thus, it reconstructs scene geometry from panoramic image sequences (RGB or RGBD) via splats while also detecting the camera poses.Such a framework is important to understand the full surrounding, *e.g.*, for augmented reality applications or autonomous systems.We extended existing 3DGS-SLAM methods to handle omnidirectional input by including closed-form gradients for mapping and camera pose estimation, utilizing an equirectangular projection model.To lower memory footprint, a key frame removal procedure based on graph analysis is proposed, enabling the application to handle larger input sizes.For evaluation, we provide a data set of controlled real-world and synthetic test scenes (indoor and ou

214

core_reconstruction

high

SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM

3D Vision & Geometry / 3D Gaussian Splatting

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has made huge progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified for improving scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian function, and then use these points to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity.

215

core_reconstruction

medium

ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting

3D Vision & Geometry / Pose Estimation

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Visual localization is a core technology for augmented reality and autonomous navigation. Recent methods combine the efficient rendering of 3D Gaussian Splatting (3DGS) with feature-based localization. These methods rely on direct matching between 2D query features and the 3D Gaussian feature field, but this often results in mismatches due to an inherent bias in the learned Gaussian feature. We theoretically analyze the feature learning process in 3DGS, revealing that the widely adopted $\alpha$-blending optimization inherently introduces bias into 3D point features. This bias stems from the entanglement between individual Gaussians and their neighboring Gaussians, making the learned features unsuitable for precise matching tasks. Motivated by these findings, we propose ULF-Loc, an unbiased landmark feature framework that replaces biased feature optimization with geometry-weighted featur

216

core_reconstruction

high

Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty

3D Vision & Geometry / 3D Gaussian Splatting

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization; depth_correspondence

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible.To address these uncertainties, we introduce a relocaliza

217

core_reconstruction

high

Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization; depth_correspondence

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Visual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera’s pose when it revisits a previously known scene. While point-based hierarchical localization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on

218

core_reconstruction

high

GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension

3D Vision & Geometry / 3D Gaussian Splatting

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

In this paper, we propose GenSplat, a novel approach for language comprehension in 3D Gaussian Splatting (3DGS). Unlike previous methods that either achieve cross-scene generalization by being bounded to a predefined vocabulary or handle free-form language by overfitting to individual scenes, GenSplat is robust to free-form language queries and generalizable across 3DGS scene representations. Our key insight for this problem is to formulate a structured learning process to progressively align linguistic concepts with 3D Gaussians. It contains two novel technical contributions. First, we propose a Progressive Language Grounding Curriculum that structurally guides the model through learning category-level semantics to instance-level concepts and free-form language, preventing overfitting by building a generalizable language feature space. Second, we design a Multi-modal Large Language Mode

219

core_reconstruction

high

GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS\textasciicircum2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing

220

core_reconstruction

high

3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; surface_occupancy; generation_editing; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We introduce 3D-Fixer, a novel generalizable and efficient scheme for single-image to compositional 3D scene generation. Unlike existing feed-forward frameworks that lack generalization ability in open-set scenarios due to the limited dataset, or divide-and-conquer frameworks that suffer from slow inference or accumulated registration errors during layout alignment, 3D-Fixer extends pre-trained object-level 3D generation priors to perform in-place completion on the single-view estimated geometry, eliminating the need for pose alignment while preserving feed-forward efficiency. At its core, 3D-Fixer introduces a coarse-to-fine scheme to accurately determine the completion boundary and generate high quality completion 3D asset based on the single-view estimated fragmented geometry. Also, we design a dual-branch conditioning network that integrates 2D and 3D contextual information to guide

221

core_reconstruction

high

AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction on a metric-scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation, 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.

222

core_reconstruction

high

Cupid: Generative 3D Reconstruction via Joint Object and Pose Modeling

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We introduce Cupid, a generative 3D reconstruction framework that jointly models the full distribution over both canonical objects and camera poses. Our two-stage flow-based model first generates a coarse 3D structure and 2D-3D correspondences to estimate the camera pose robustly. Conditioned on this pose, a refinement stage injects pixel-aligned image features directly into the generative process, marrying the rich prior of a generative model with the geometric fidelity of reconstruction. This strategy achieves exceptional faithfulness, outperforming state-of-the-art reconstruction methods by over 3 dB PSNR and 10\% in Chamfer Distance. As a unified generative model that decouples the object and camera pose, Cupid naturally extends to multi-view and scene-level reconstruction tasks without requiring post-hoc optimization or fine-tuning.

223

core_reconstruction

high

Edges Compete for Trust: Group Relative Edge Optimization for Building Reconstruction from Point Clouds

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Building reconstruction aims to extract compact wireframes from point clouds. Recent edge-based methods achieve impressive results but suffer from sparse supervision from one-to-one matching, which leaves most edge proposals under-optimized. In this paper, we present Group Relative Edge Optimization (GREO), the first attempt to incentivize dense supervision across edges proposals through reinforcement learning-style optimization in wireframe reconstruction. Specifically, GREO computes edge-level rewards based on geometric alignment quality and transforms them into target confidence distributions via group-wise normalization. In addition, we incorporate entropy regularization to maintain distributional stability and prevent confidence collapse. This joint optimization enables dense and discriminative supervision across all edge proposals through cross-entropy minimization. Experiments

224

core_reconstruction

high

JRM: Joint Reconstruction Model for Multiple Objects without Alignment

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Object-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned

225

core_reconstruction

high

Long-Tail Internet Photo Reconstruction

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed, while most real-world sites contain only sparse, noisy, and uneven imagery that defeats classical and learned 3D methods. Existing 3D foundation models generalize well to curated datasets but collapse under the sparsity, ambiguity, and irregularity of Internet photos. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large-scale, clean, and depth-refined dataset, together with a sparse-aware sampling strategy that mimic camera distributions in long-tai

226

core_reconstruction

high

ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Jointly optimizing camera poses and object geometry from unposed images is a challenging task in neural surface reconstruction. Existing methods often suffer from pose drift and geometric distortion, stemming from the easy-view bias --- uniform view optimization favors easy-to-optimize views with abundant texture and good overlap that dominate gradient updates, while hard-to-optimize counterparts with weak texture or limited overlap yet critical for geometric completeness are progressively marginalized. To address this, we propose ManifoldNeuS, a novel framework that explicitly models and leverages per-view optimizability to guide pose-free neural surface reconstruction. Specifically, we introduce the manifold-aware view optimizability score (MaVOS), which jointly assesses immediate fitness (the ease of optimizing each view) and long-term coverage gain (the value of optimizing each view)

227

core_reconstruction

high

Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

The 3D characterization of microstructures is crucial for understanding and designing functional materials. However, the scanning electron microscope (SEM), widely used in scientific research, captures only 2D electron intensity distributions. Existing SEM 3D reconstruction methods struggle with textureless regions, shadowing artifacts, and calibration dependencies, whereas advanced learning-based approaches fail to generalize to microscopic SEM domains due to the lack of physical priors and domain-specific data. To address these challenges, we introduce NFH-SEM, a neural field-based hybrid reconstruction framework that recovers high-fidelity 3D surfaces from multi-view, multi-detector SEM images. NFH-SEM integrates coarse multi-view geometry with photometric stereo cues from detector signals through a continuous neural field, incorporating a learnable forward model that embeds SEM imagi

228

core_reconstruction

high

PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Panoramic imagery offers a full $360^\circ$ field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting.We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth a

229

core_reconstruction

medium

Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Recent learning-based face reconstruction and registration frameworks such as ToFu and TEMPEH have shown that dense correspondence between facial scans and a common topology can be learned directly from images. However, these approaches still depend on precomputed registrations obtained through iterative optimization pipelines that often require manual verification and correction by human annotators. We introduce MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a fully differentiable and registration-free alternative. Instead of relying on optimization-based registrations, we employ a pseudo-linear inverse kinematic solver in conjunction with dense 2D keypoints produced by a tracker trained only on synthetic data to directly enforce a common face topology at the vertex level. We further find that the commonly used point-to-surface distance can lead to unstable training

230

core_reconstruction

high

TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Topology-consistent dynamic model sequences are essential for applications such as animation and model editing. However, existing 4D reconstruction methods face challenges in generating high-quality topology-consistent meshes. To address this, we propose a topology-aware dynamic reconstruction framework based on Gaussian Splatting. We introduce a Gaussian topological structure that explicitly encodes spatial connectivity. This structure enables topology-aware densification and pruning, preserving the manifold consistency of the Gaussian representation. Temporal regularization terms further ensure topological coherence over time, while differentiable mesh rasterization improves mesh quality. Experimental results demonstrate that our method reconstructs topology-consistent mesh sequences with significantly higher accuracy than existing approaches. Moreover, the resulting meshes enable prec

231

core_reconstruction

medium

TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; generation_editing

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

The dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE's reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the \textit{representation mismatch} between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (\eg, SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topologic

232

core_reconstruction

high

UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real

233

core_reconstruction

medium

ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; generation_editing

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Single-image-to-3D shape generation has seen remarkable progress, driven by latent diffusion models trained on the compressed latent space of 3D VAEs. However, the task remains intrinsically ill-posed: recovering complete 3D geometry—especially occluded surfaces—from a single view is inherently ambiguous. Existing VecSet-based approaches further exacerbate this challenge by treating shape tokens as an unordered set without explicit positional encoding. This design forces diffusion models to simultaneously learn visible correspondences from the input image and hallucinate invisible geometry within a large, permutation-invariant token space, where the lack of structure significantly hinders training efficiency and convergence stability.To address this, we propose \textit{Visibility Learning}, a training paradigm that injects visibility structure and positional inductive bias into the image

234

core_reconstruction

high

ART: Articulated Reconstruction Transformer

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We introduce ART, Articulated Reconstruction Transformer—a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as a part-based prediction problem. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable to standard simulation formats. Trained o

235

core_reconstruction

high

PE3R: Perception-Efficient 3D Reconstruction

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recent advances in 2D-to-3D perception have enabled the recovery of 3D scene semantics from unposed images. However, prevailing methods often suffer from limited generalization, reliance on per-scene optimization, and semantic inconsistencies across viewpoints. To address these limitations, we introduce PE3R, a tuning-free framework for efficient and generalizable 3D semantic reconstruction. By integrating multi-view geometry with 2D semantic priors in a feed-forward pipeline, PE3R achieves zero-shot generalization across diverse scenes and object categories without any scene-specific fine-tuning. Extensive evaluations on open-vocabulary segmentation and multi-view depth estimation show that PE3R not only achieves up to 9$\times$ faster inference but also sets new state-of-the-art accuracy in both semantic and geometric metrics. Our approach paves the way for scalable, language-driven 3D

236

core_reconstruction

high

PhyGaP: Physically-Grounded Gaussians with Polarization Cues

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via **deferred rendering (DR)**. However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of **shape and material** information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex o

237

core_reconstruction

high

SASNet: Spatially-Adaptive Sinusoidal Networks for INRs

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Sinusoidal neural networks (SIRENs) are powerful implicit neural representations (INRs) for low-dimensional signals in vision and graphics. By encoding input coordinates with sinusoidal functions, they enable high-frequency image and surface reconstruction. However, training SIRENs is often unstable and highly sensitive to frequency initialization: small frequencies produce overly smooth reconstructions in detailed regions, whereas large ones introduce spurious high-frequency components that manifest as noise in smooth areas such as image backgrounds. To address these challenges, we propose $\textbf{SASNet}$, a $\textit{Spatially-Adaptive Sinusoidal Network}$ that couples a $\textit{frozen frequency embedding layer}$, which explicitly fixes the network’s frequency support, with $\textit{jointly learned spatial masks}$ that localize neuron influence across the domain. This pairing stabili

238

core_reconstruction

high

Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We present \textbf{Upsample Anything}, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14×/16× (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across a

239

core_reconstruction

medium

Particulate: Feed-Forward 3D Object Articulation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing; data_benchmark

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

We introduce Particulate, a feed-forward model that, given a single static 3D mesh of an everyday object, predicts its 3D parts, kinematic structure, and articulation parameters.Unlike prior work on articulated 3D object modeling that is limited by costly per-object optimization and small retrieval databases or requires large vision or language foundation models, our approach is based on a flexible, scalable and lightweight transformer architecture.Trained on a diverse collection of articulated 3D assets from public datasets, Particulate accurately infers the articulated structure of novel objects, including those generated by image-to-3D models, in a single feed-forward pass.We further introduce a benchmark for articulated 3D object estimation curated from high-quality public 3D assets.Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art

240

core_reconstruction

high

SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Learning-based Multi-View Stereo (MVS) methods have become the mainstream in the field, relying on the construction of cost volumes through multi-view feature similarity computation and regularization. However, existing methods depend heavily on photometric consistency across views, leading to poor performance in challenging regions, such as weakly textured or non-Lambertian surfaces. To overcome this limitation, we propose SPE-MVS, a novel MVS framework enhanced with Spatial Position Encoding (SPE). The SPE represents the 3D positional information of pixels in each image within a unified metric space, constructed using monocular depth priors. We integrate the SPE alongside image data as input and introduce a Photometric-Spatial Hybrid Feature Extractor, along with an SPE-enhanced cost volume construction module. These components incorporate spatial position-based similarity computation,

241

core_reconstruction

high

OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality and editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two major limitations: (1) Fragility due to reliance on 3D geometry priors, which are difficult to estimate accurately, especially under facial occlusions or in stylized domains; and (2) A lack of semantic disentanglement, inhibiting region-specific texture editing and style transfer. Our work addresses both challenges simultaneously.Our core innovation is a geometry-free pipeline that directly maps a 2D face image to its corresponding editable UV texture. We introduce two key techniques: First, to address the challenge of UV misalignment common in diffusion generation, we introduce a gradient-guided refinement strategy at inference time, which explicitly corrects structural consistency. Sec

242

core_reconstruction

high

eRetinexGS: Retinex Modeling for Low-Light Scene Enhancement via Event Streams and 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Perception under low illumination remains a major challenge for computer vision systems, as RGB sensors often fail to capture sufficient structural and color information in extremely dark environments. Event cameras, with their high dynamic range and temporal resolution, provide complementary cues that are well suited for such conditions. In this work, we present eRetinexGS, a novel framework that jointly leverages event streams and low-light frames through 3D Gaussian Splatting for scene-level enhancement and reconstruction. Unlike previous approaches that operate on individual frames, eRetinexGS enforces geometric and photometric consistency across multiple views, bridging the gap between degraded images and noisy event signals. By introducing an event-assisted Retinex decomposition and a reflectance–illumination representation within the 3DGS pipeline, our method reconstructs normal-l

243

core_reconstruction

high

Global Structure-from-Motion Meets Feedforward Reconstruction

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Structure-from-Motion -- the process of simultaneously estimating camera poses and 3D scene structure from a collection of images -- remains a central challenge in computer vision, with many open problems yet to be solved.Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited image overlap, and symmetries.However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, and robustness, and typically fall short of classical methods in standard reconstruction settings.In this work, we systematically analyze these limitations and propose a new state-of-the-art Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods.Exten

244

core_reconstruction

high

RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit at discrete frame indices but struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow–guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large mo

245

core_reconstruction

medium

ArtLLM: Generating Articulated Assets via 3D LLM

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a varia

246

core_reconstruction

medium

Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Existing 3D editing methods rely on computationally intensive scene-by-scene iterative optimization and suffer from multi-view inconsistency. We propose an effective and fully feedforward 3D editing framework based on the TRELLIS generative backbone, capable of modifying 3D models from a single editing view. Our framework addresses two key issues: adapting training-free 2D editing to structured 3D representations, and overcoming the bottleneck of appearance fidelity in compressed 3D features. To ensure geometric consistency, we introduce Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that achieves globally consistent 3D deformation in a single pass. To restore photorealistic details, we develop a normal-guided single to multi-view generation module as an external appearance prior, successfully recovering high-frequency textures. Experiments demonstrate that our meth

247

core_reconstruction

high

LaS-Comp: Zero-shot 3D Completion with Latent–Spatial Consistency

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and q

248

core_reconstruction

high

Unified Primitive Proxies for Structured Shape Completion

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Structured shape completion recovers missing geometry as primitives rather than as unstructured points, which enables primitive-based surface reconstruction. Instead of following the prevailing cascade, we rethink how primitives and points should interact, and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which in a single feed-forward pass predicts a set of primitives with complete geometry, semantics, and inlier membership. To drive this unified representation, we introduce primitive proxies, learnable queries that are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently ou

249

core_reconstruction

high

2D-LFM: Lifting Foundation Model without 3D supervision

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Recent vision foundation models give the impression that 3D reconstruction from RGB is largely solved. Yet these systems struggle with object-specific 3D structure: the fine-grained geometry implied by an object’s landmarks or skeleton. In this paper, we show that when a model is given only 2D landmarks, it can recover more accurate 3D structure than state-of-the-art depth-from-RGB foundation models. Classical lifting approaches such as PAUL demonstrate this principle but do not scale beyond single categories, while methods like 3D-LFM scale but require extensive 3D supervision. We present the first lifting foundation model that learns object-specific 3D geometry using only 2D supervision. The key idea is to inject correspondence structure into the model via a positional encoding inspired by classical structure-from-motion. This simple inductive bias enables robust, object-agnostic 3D li

250

core_reconstruction

high

EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Few-shot methods enable instant personalization by reconstructing high-fidelity avatars from only a few seconds of video. However, achieving natural talking-head generation further requires strong emotion-aware motion modeling, and existing few-shot approaches exhibit geometric instability and audio-emotion mismatch under expressive facial motion. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, which introduces strong geometric priors for stable and interpretable motion. Building upon this, we propose a Gated Residual Motion Network (GRMN), w

251

core_reconstruction

high

Dehallu3D: Hallucination-Mitigated 3D Generation from a Single Image via Cyclic View Consistency Refinement

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Just like other large models, large 3D reconstruction models suffer from hallucinations as well, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations majorly originate from that existing methods reconstruct 3D content from sparsely generated multi-view images which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint

252

core_reconstruction

high

FAITHFUL CONTOURING: NEAR-LOSSLESS 3D VOXEL REPRESENTATION FREE FROM ISO-SURFACE

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiment

253

core_reconstruction

medium

PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a "part-wise" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local gene

254

core_reconstruction

high

PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We introduce PixARMesh, the first method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative modeling, we enrich a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream of context, pose, and mesh tokens, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing

255

core_reconstruction

high

RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; surface_occupancy; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present **RT-Splatting**, a framework that disentangles each Gaussian's geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into

256

core_reconstruction

high

GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: **GeoRelight**. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than

257

core_reconstruction

medium

Foundry: Distilling 3D Foundation Models for the Edge

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient `specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable.In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher’s token-l

258

core_reconstruction

high

Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has emerged as an efficient approach for achieving photorealistic rendering. Recent MLP-based variants further improve visual fidelity but introduce substantial decoding overhead during rendering. To alleviate computation cost, several pruning strategies and level-of-detail (LOD) techniques have been introduced, aiming to effectively reduce the number of Gaussian primitives in large-scale scenes. However, our analysis reveals that significant redundancy still remains due to the lack of occlusion awareness. In this work, we propose Proxy-GS, a novel pipeline that exploits a proxy to introduce Gaussian occlusion awareness from any view.At the core of our approach is a fast proxy system capable of producing precise occlusion depth maps at resolution 1000$\times$1000 under 1 ms. This proxy serves two roles: first, it guides the culling of anchors and Gaussians to

259

core_reconstruction

high

SAM 3D: 3Dfy Anything in Images

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a $5:1$ win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruc

260

core_reconstruction

high

SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Compositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which is natively applicable for simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize on visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a ''Perception-Generation-Simulation" pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging

261

core_reconstruction

high

WorldGen: From Text to Traversable and Interactive 3D Worlds

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We introduce WorldGen, a method for generating large, fully formed, navigable 3D worlds from a single text prompt. Existing approaches to 3D scene generation often trade off scene diversity, completeness, and correctness in different ways. We push this envelope by producing large scenes explicitly decomposed into individual, high-quality 3D meshes, making them compatible with standard game engines. Our approach first uses a language-driven procedural generator to lay out the scene's basic volumes and navigable regions. An image generator then establishes the scene's theme, style, and details. Next, we obtain a high-quality, compositional 3D reconstruction of the planned scene. This step first uses an image-to-3D model to perform a holistic reconstruction that implicitly determines the shape and location of all scene objects, accounting for context and navigability. The reconstruction is

262

core_reconstruction

high

Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Sparse-view 3D Gaussian Splatting (3DGS) reconstructs scenes using 3D Gaussians from sparse input views. Yet, this method is prone to overfitting, which is exacerbated at higher resolutions as the expanded dimensionality amplifies floating artifacts and reconstruction ambiguities. In this paper, we present a systematic study of 3DGS under sparse-view conditions and varying input resolutions. While prior work has overlooked resolution as a key factor in sparse-view performance, we identify and quantify a trade-off: lower-resolution inputs facilitate stable global geometry reconstruction, whereas higher-resolution inputs enable finer detail recovery but introduce high-frequency artifacts and instability. Building on this insight, we further propose **CAGS**, a Confidence-Guided Multi-Scale Aggregation that reconstructs scenes through a coarse-to-fine hierarchical optimization process‌. Our

263

core_reconstruction

medium

Efficient unrolled networks for large-scale 3D inverse problems

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Deep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinder typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerate

264

core_reconstruction

high

EI-Part：Explode for Completion and Implode for Refinement

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Part-level 3D generation is crucial for various downstream applications, including gaming, film production, and industrial design. However, decomposing a 3D shape into geometrically plausible and meaningful components remains a significant challenge. Previous part-based generation methods often struggle to produce well-constructed parts, exhibiting either poor structural coherence, geometric implausibility, inaccuracy, or inefficiency. To address these challenges, we introduce EI-Part, a novel framework specifically designed to generate high-quality 3D shapes with components distinguished by structural coherence, geometric plausibility, accuracy, and generation efficiency. We propose utilizing distinct representations at different stages: an Explode state for part completion and an Implode state for geometry refinement. This strategy allows us to fully leverage spatial resolution, enabli

265

core_reconstruction

medium

Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

We propose Fresco, a unified optimization paradigm designed to mitigate early over-sharpening, and cross-view drifting in head avatar reconstruction. Fresco combines a Laplacian-pyramid-based frequency curriculum with UV-space consistency regularization to progressively enhance reconstruction quality. The optimization begins by stabilizing low-frequency appearance in the image domain, which suppresses spurious details and promotes reliable convergence. As learning proceeds, consistency across different viewpoints is reinforced through pixel-level alignment on shared UV texture coordinates. Finally, high-frequency components are refined under explicit frequency-band constraints, and seam boundary regularization is applied to preserve local continuity. By optimizing in a frequency- and UV-aligned space, Fresco achieves robust convergence without pseudo high-frequency artifacts and yields c

266

core_reconstruction

medium

LoST: Level of Semantics Tokenization for 3D Shapes

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation.However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D s

267

core_reconstruction

high

SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Dynamic extensions of 3D Gaussian Splatting (3DGS) achieve high-quality reconstructions through neural motion fields, but per-Gaussian neural inference makes these models computationally expensive. Building on DeformableGS, we introduce Speedy Deformable 3D Gaussian Splatting (SpeeDe3DGS), which bridges this efficiency–fidelity gap through three complementary modules: Temporal Sensitivity Pruning (TSP) removes low-impact Gaussians via temporally aggregated sensitivity analysis, Temporal Sensitivity Sampling (TSS) perturbs timestamps to suppress floaters and improve temporal coherence, and GroupFlow distills the learned deformation field into shared SE(3) transformations for efficient groupwise motion. On the 50 dynamic scenes in MonoDyGauBench, integrating TSP and TSS into DeformableGS accelerates rendering by 6.78$\times$ on average while maintaining neural-field fidelity and using 10$\

268

core_reconstruction

high

GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios includi

269

core_reconstruction

high

EMR-SM: Explicit Mesh Reconstruction with Dynamic Topology Adaptation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose EMR-SM, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with re

270

core_reconstruction

high

Faster-GS: Analyzing and Improving Gaussian Splatting Optimization

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison.In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation.The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5$\times$ faster training while ma

271

core_reconstruction

high

Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We introduce a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, directly animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based te

272

core_reconstruction

medium

PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language-model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes. We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D

273

core_reconstruction

medium

Bringing Your Portrait to 3D Presence

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation.We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts.We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency.A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head

274

core_reconstruction

high

Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure.We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a va

275

core_reconstruction

high

CrossHOI: Learning Cross-View Representations for Monocular 3D Human-Object Interaction Reconstruction

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing 3D human-object interaction (HOI) from monocular images is highly challenging especially when human and object are mutually occluded. Existing methods primarily rely on single-view inputs, which fundamentally limit their ability to recover occluded regions and accurately estimate contact areas. To address these challenges, we for the first time, consider to introduce novel-view feature priors to enhance monocular 3D HOI reconstruction. We first design a cross-view generator that learns to infer novel-view image features from a single-view input, enriching spatial geometry at the feature level without requiring extra inputs during inference. Guided by both real and generated view features, a spatial cross-view feature fusion module adaptively aggregates complementary cues to enhance the initial reconstruction of human and object meshes. Built upon this reconstruction, we sa

276

core_reconstruction

high

EDGS: Eliminating Densification for Efficient Convergence of 3DGS

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting reconstructs scenes by starting from a sparse Structure-from-Motion initialization and refiningunder-reconstructed regions. This process is slow, as it requires multiple densification steps where Gaussians arerepeatedly split and adjusted, following a lengthy optimization path. Moreover, this incremental approach often yieldssuboptimal renderings in high-frequency regions. We propose a fundamentally different approach: eliminate densification with a one-step approximation of scenegeometry using triangulated pixels from dense image correspondences. This dense initialization allows us to estimatethe rough geometry of the scene while preserving rich details from input RGB images, providing each Gaussian withwell-informed color, scale, and position. As a result, we dramatically shorten the optimization path and remove theneed for densification. Unlike methods that rely

277

core_reconstruction

medium

FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our ``one-face-one-token'' strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder

278

core_reconstruction

high

Human Interaction-Aware 3D Reconstruction from a Single Image

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group contex

279

core_reconstruction

high

Intrinsic Image Fusion for Multi-View 3D Material Reconstruction

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images.Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view.To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions.We then propose a robust optimization framework using soft per-view prediction selection together with confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse pat

280

core_reconstruction

medium

Learning to Infer Parameterized Representations of Plants from 3D Scans

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Plants frequently contain numerous organs, organized in 3D branching systems defining the plant's architecture. Reconstructing the architecture of plants from unstructured observations is challenging because of self-occlusion and spatial proximity between organs, which are often thin structures. To achieve the challenging task, we propose an approach that allows to infer a parameterized representation of the plant's architecture from a given 3D scan of a plant. In addition to the plant's branching structure, this representation contains parametric information for each plant organ, and can therefore be used directly in a variety of tasks. In this data-driven approach, we train a recursive neural network with virtual plants generated using a procedural model. After training, the network allows to infer a parametric tree-like representation based on an input 3D point cloud. Our method is ap

281

core_reconstruction

medium

Learning to Solve PDEs on Neural Shape Representations

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance

282

core_reconstruction

high

Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Reconstructing object geometry from radio frequency (RF) signals is fundamentally challenging due to the lensless imaging nature of RF sensing, which leads to low spatial resolution and high noise. Unlike light signals, RF signals can penetrate occlusions and thus capture information about hidden scenes. Existing Non-Line-of-Sight (NLoS) 3D neural reconstruction methods can recover coarse surfaces inside enclosed environments but often suffer from unstable optimization, noisy surface geometry, and surface ambiguity, failing to produce accurate zero-level sets from the signed distance field (SDF). These limitations largely stem from neglecting the role of Line-of-Sight (LoS) geometry outside the enclosed region, which provides valuable physical constraints for modeling signal propagation. In this paper, we introduce a Unified LoS and NLoS neural geometry reconstruction framework that leve

283

core_reconstruction

high

SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; surface_occupancy

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent progress in feed-forward 3D Gaussian Splatting (3DGS) has notably improved rendering quality. However, the spatially uniform and highly redundant 3DGS map generated by previous feed-forward 3DGS methods limits their integration into downstream reconstruction tasks. We propose SparseSplat, the first feed-forward 3DGS model that adaptively adjusts Gaussian density according to scene structure and information richness of local regions, yielding highly compact 3DGS maps. To achieve this, we propose entropy-based probabilistic sampling, generating large, sparse Gaussians in textureless areas and assigning small, dense Gaussians to regions with rich information. Additionally, we designed a specialized point cloud network that efficiently encodes local context and decodes it into 3DGS attributes, addressing the receptive field mismatch between the general 3DGS optimization pipeline and f

284

core_reconstruction

high

TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Hand mesh reconstruction has attracted growing attention in recent years.Despite significant progress, existing methods often struggle to balance reconstruction quality and inference efficiency.In this work, we propose TokenHand, a novel framework for single-view 3D hand mesh reconstruction that achieves both high accuracy and real-time inference.Our method represents a 3D hand model using $M$ discrete tokens, each describing a specific sub-structure of the hand.This compositional representation enables efficient modeling with minimal reconstruction error.Furthermore, we reformulate hand mesh reconstruction as a classification problem rather than a regression task.Specifically, a classifier predicts the categories of the $M$ tokens from an input image, and a pre-trained decoder network subsequently reconstructs the 3D hand mesh from the predicted tokens without any post-processing.Extens

285

core_reconstruction

high

TouchDream: 3D Object Completion through Imagined Touch

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Point cloud completion is crucial for robust 3D perception but remains challenging due to its ill-posed nature. Coarse-to-fine methods can lead to unconstrained local guesses in the absence of key structures, whereas diffusion-based approaches may introduce geometric inconsistencies. To overcome these limitations, we present TouchDream, a novel framework that leverages a diffusion model to 'dream' of tactile sensing on object surfaces, which reformulates the sensing process as a learnable generative modeling task. Unlike visual cues, tactile data provides rich local geometry that can be directly converted into 3D space for point fusion, offering a powerful guide for detail-aware completion. Specifically, our approach generate compact tactile latent representations conditioned on coarse points and sampled touch poses. A touch-guided refinement module then leverages touch features to optim

286

core_reconstruction

high

CaT-GS: Efficient 3DGS Rendering for Large Scale Scenes via Inter-frame Caching and Tile Scheduling

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent breakthroughs in 3D Gaussian Splatting (3DGS) have advanced neural rendering with high fidelity and speed. However, its performance degrades significantly in large-scale scenes due to the computational burden of tile-based rasterization. Existing optimization efforts either require costly scene re-training or focus on narrow aspects of the pipeline, overlooking critical inefficiencies in real-world deployments. Through a comprehensive analysis, we identify three primary sources of redundancy and low GPU utilization: redundant inter-frame pre-processing, viewpoint-based occlusion redundancy, and severe tile-level load imbalance. To address these issues, we propose CaT-GS, a novel and efficient 3DGS rendering pipeline. CaT-GS introduces a speculative multi-frame preprocessing method to eliminate redundant computations across consecutive frames, and an inter-frame caching mechanism t

287

core_reconstruction

high

ASFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Indoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce ASFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that ASForm

288

core_reconstruction

medium

Bidirectional Query-Driven Generation of Parametric CAD Sketch

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Learning-based CAD modeling shows great promise in automating parametric design, yet existing approaches often overlook the incremental and state-dependent nature of sketch construction. We present CADSketcher, a query-driven bidirectional framework for completing partial parametric sketches by internalizing the non-linear construction logic of interactive CAD processes. At the core of CADSketcher are two key innovations. First, a bidirectional sketch learner recovers both prior and posterior contexts from arbitrary-span partial sketches via a bidirectional query mechanism, enabling exploration of multiple plausible modeling trajectories. Second, a confidence-guided completion pipeline adaptively determines the expansion direction through a confidence gate and ensures executable instruction generation using a validity compiler, while a progressive context updater preserves sketch consist

289

core_reconstruction

medium

BrepVGAE: Variational Graph Autoencoder with Unified Latent Representation for B-rep

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Due to the heterogeneity of faces and edges in B-rep, conventional graph-based representations is incapable of establishing a unified formulation for faces and edges, thereby constraining the capabilities of B-rep generative models. We propose a B-rep Variational Graph Auto Encoding (BrepVGAE), the first variational graph autoencoder framework capable of holistically encoding and decoding boundary representations of B-rep models.Firstly, we novelly represent both geometry faces and edges as nodes in a graph representation. We then design a sparse graph autoencoder to aggregate the complete B-rep structure into a compact global latent vector. We then construct a decoder that employs set-based generation, which uses bilinear layers to reconstruct adjacency relationships, i.e., topology, with a single latent vector. Afterwards, the same decoder generates node features for all faces and edge

290

core_reconstruction

medium

Erasing Invisible Watermarks via Novel View Synthesis

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence

core genus=3D Gaussian Splatting, but title/abstract signal is narrower

abstract

Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative ``view" of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffu

291

core_reconstruction

medium

LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Generating high-fidelity 3D contents remains a fundamental challenge due to the complexity of representing arbitrary topologies—such as open surfaces and intricate internal structures—while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs)—a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutio

292

core_reconstruction

high

MatMart: Material Reconstruction of 3D Objects via Diffusion

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Applying diffusion models to physically-based material estimation and generation has recently gained prominence. In this paper, we propose MatMart, a novel material reconstruction framework for 3D objects, offering the following advantages. First, MatMart adopts a two-stage reconstruction, starting with accurate material prediction from inputs and followed by prior-guided material generation for unobserved views, yielding high-fidelity results. Second, by utilizing progressive inference alongside the proposed view-material cross-attention (VMCA), MatMart enables reconstruction from an arbitrary number of input images, demonstrating strong scalability and flexibility. Finally, MatMart achieves both material prediction and generation capabilities through end-to-end optimization of a single diffusion model, without relying on additional pre-trained models, thereby exhibiting enhanced stabil

293

core_reconstruction

medium

NeAR: Coupled Neural Asset–Renderer Stack

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

Neural asset authoring and neural rendering have emerged as largely disjoint threads: one generates digital assets using neural networks for traditional graphics pipelines, while the other develops neural renderers that map conventional assets to images. However, the joint design of the asset representation and renderer remains largely unexplored. We argue that coupling them can unlock an end-to-end learnable graphics stack with benefits in fidelity, consistency, and efficiency. In this paper, we explore this possibility with **NeAR**: a Coupled Neural Asset–Renderer Stack. On the **asset** side, we build on Trellis-style Structured 3D Latents and introduce a lighting-homogenized neural asset: from a casually lit input, a rectified-flow backbone predicts a Lighting-Homogenized SLAT that encodes geometry and intrinsic material cues in a compact, view-agnostic latent. On the **renderer** s

294

core_reconstruction

high

Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global cons

295

core_reconstruction

high

Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

3D reconstruction of transparent objects from multiple views has been a long-standing challenge. In contrast to opaque objects, transparent objects exhibit complex refraction that causes serious image distortion, resulting in a highly ill-posed problem. Existing reconstruction methods commonly depend on special capture devices or controlled environments, which provide more priors and simplify the modeling of refraction. More importantly, these methods lack the capability for reconstruction of mixed transparent and opaque objects, being confined to transparent or opaque materials. To address these challenges, we propose Opti-NeuS, a novel method for reconstructing transparent and opaque objects without controlled environments or additional input. Opti-NeuS incorporates a novel IoRNetwork to obtain spatially-varying IoR for tracing the refractive ray paths, which can finally model refracti

296

core_reconstruction

high

PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We introduce Primitive-based Representations of Uncertainty (PRIMU), a post-hoc uncertainty estimation (UE) framework for Gaussian Splatting (GS).Reliable UE is essential for deploying GS in safety-critical domains such as robotics and medicine.Existing approaches typically estimate Gaussian-primitive variances and rely on the rendering process to obtain pixel-wise uncertainties.In contrast, we construct primitive-level representations of error and visibility/coverage from training views, capturing interpretable uncertainty information. These representations are obtained by projecting view-dependent training errors and coverage statistics onto the primitives. Uncertainties for novel views are inferred by rendering these primitive-level representations, producing uncertainty feature maps, which are aggregate through pixel-wise regression on holdout data. We analyze combinations of uncerta

297

core_reconstruction

medium

Residual Primitive Fitting of 3D Shapes with SuperFrusta

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative inference algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is a analytical primitive that is simultaneously (1) expressive, being able to express various common solids such as cylinders, spheres, cones & their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a sign distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover

298

core_reconstruction

high

Revisiting 3D Reconstruction Kernels as Low-Pass Filters

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

3D reconstruction is to recover 3D signals from the sampled discrete 2D pixels, with the goal to converge continuous 3D spaces.In this paper, we revisit 3D reconstruction from the perspective of signal processing, identifying the periodic spectral extension induced by discrete sampling as the fundamental challenge.Previous 3D reconstruction kernels, such as Gaussians, Exponential functions, and Student's t distributions, serve as the low pass filters to isolate the baseband spectrum.However, their unideal low-pass property results in the overlap of high-frequency components with low-frequency components in the discrete-time signal’s spectrum.To this end, we introduce Jinc kernel with an instantaneous drop to zero magnitude exactly at the cutoff frequency, which is corresponding to the ideal low pass filters.As Jinc kernel suffers from low decay speed in the spatial domain, we further pro

299

core_reconstruction

high

SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has received tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-lambertian or transparent materials. To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods propose to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based method is still underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the inter-independence among individual gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as active set method. To this end, we propose \textbf{SparseOIT}, an OIT-based 3DGS reconstruction algorithm that maintains an active set of g

300

core_reconstruction

high

Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Ray-tracing-based 3D Gaussian splatting (3DGS) enjoys the generality of supporting non-pinhole camera models and relightable formulations. However, they are usually lacking in performance, partially due to the need for depth-based sorting of all intersecting Gaussians along the traced rays.In this paper, we introduce a sorting-free differentiable stochastic formulation for ray-traced 3DGS, enabling efficient reconstruction and rendering of both standard and relightable 3DGS scenes.For standard 3DGS, our method offers performance comparable to rasterization-based 3DGS and outperforms sorting-based ray tracing.For relightable 3DGS, our technique provides higher-quality reconstructions and renderings thanks to the accurate shadow and shading computation provided by fully ray-traced shadow and light rays.

301

core_reconstruction

high

TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction with direct reconstruction/geometry signal

abstract

Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human–object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human–object interactions to enforce semantic alignment between the 3D reconstruction an

302

core_reconstruction

medium

Underground Plant Exploration: Non-Destructive 3D Root Assessment with GPR Based on Point Graph Neural Network

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

core genus=3D Reconstruction, but title/abstract signal is narrower

abstract

This paper introduces a novel application of machine learning in agriculture for non-destructive 3D root structure reconstruction. Plant roots are critical for providing resources for the entire plant. Ground Penetrating Radar (GPR) is a key tool for identifying subterranean objects with easy and obvious shapes, such as large pipes, but remaining challenging to assess the 3D shapes of roots. In our study, we introduce a novel approach specifically designed based on GPR signal shape priors to detect target signals and perform curve parameter regression based on multiple B-scans from GPR. This process enables the derivation of a precise curve from the detection and regression outcomes. To achieve the reconstruction of a comprehensive 3D root structure, we have developed a shape reconstruction network that processes sparse sliced 3D points through a dedicated point graph network and an upsa

303

core_reconstruction

high

A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

In this paper, we introduce Geometric Algebra–Informed 3D Gaussian Splatting (GAI-GS), a framework for wireless modeling that couples 3D Gaussian splatting with a geometric-algebra–based attention mechanism to explicitly model ray–object interactions in complex propagation environments. GAI-GS encodes joint spatial–electromagnetic (EM) relations into token representations, enabling scene-level aggregation within a unified, end-to-end neural architecture. This design renders ray tracing for wireless propagation physically grounded, with token interactions that respect EM constraints including multipath, path-dependent attenuation, and reflection/diffraction. Through extensive evaluations on on multiple real-world indoor datasets, GAI-GS consistently surpasses current baselines across various wireless tasks.

304

core_reconstruction

high

Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian splatting (3DGS) has become a vital tool for learning a radiance field from multiple posed images. Although 3DGS shows great advantages over NeRF in terms of rendering quality and efficiency, it remains a research challenge to further improve the efficiency of learning 3D Gaussians. To overcome this challenge, we propose novel training strategies and losses to shorten each Gaussian list used to render a pixel, which speeds up the splatting by involving fewer Gaussians along a ray. Specifically, we shrink the size of each Gaussian by resetting their scales regularly, encouraging smaller Gaussians to cover fewer nearby pixels, which shortens the Gaussian lists of pixels. Additionally, we introduce an entropy constraint on the alpha blending procedure to sharpen the weight distribution of Gaussians along each ray, which drives dominant weights larger while making minor weights s

305

core_reconstruction

medium

OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence

core genus=3D Gaussian Splatting, but title/abstract signal is narrower

abstract

We propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decom

306

core_reconstruction

high

B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta–Bernoulli Bayesian Updates

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production.However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use.We propose \textbf{B$^3$-Seg (Beta--Bernoulli Bayesian Segmentation for 3DGS)}, a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under \textbf{camera-free} and \textbf{training-free} conditions.Our approach reformulates segmentation as sequential Beta--Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG).This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy.Experiments on multiple datasets show t

307

core_reconstruction

high

BEA-GS : BEyond RAdiance Supervision in 3DGS for Precise Object Extraction

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene don't optimize the underlying 3D geometry of the scene. This makes object-level editing or asset extraction challenging. Recent methods, like COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the geometry of the scene to represent the underlying semantics. We go a step further and propose a novel solution that provides near perfect boundaries in object extraction. We do so by introducing two new losses in the optimization that take care of: 1. Modifying the geometry of visible Gaussians to respect semantic boundaries, and, 2. Modifying the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization to allow for seamless integration within the optimization of the Gaussian

308

core_reconstruction

high

Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Understanding hand-object interaction from monocular videos is crucial for immersive and dexterous interactions in AR/VR and robotic applications. However, existing monocular reconstruction methods primarily assume rigid grasping and static object geometry. When applied to articulated manipulations, the continuous joint rotations and frequent component deformations introduce a strong coupling between shape and motion, leading to severe ambiguity and instability in articulation optimization under monocular observation. To address this challenge, we propose a Clay-to-Stone dual-phase framework, modeling the articulated manipulation at hierarchical granularities, enabling a progression from flexible semantic exploration to structured articulation recovery. In the CLAY phase, our method performs fine-grained control over geometric deformation, guided by inter-part semantic correlation learni

309

core_reconstruction

high

ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. Mainstream methods, built on an embedding paradigm, suffer from three key flaws: (i) geometry-semantic inconsistency, where points, rather than objects, serve as the semantic basis, limiting semantic fidelity; (ii) semantic bloat from injecting gigabytes of feature data into the geometry; and (iii) semantic rigidity, as one feature per Gaussian struggles to capture rich polysemy. To overcome these limitations, we introduce ExtrinSplat, a framework built on the extrinsic paradigm that decouples geometry from semantics. Instead of embedding features, ExtrinSplat clusters Gaussians into multi-granularity, overlapping 3D object groups. A Vision-Language Model (VLM) then interprets these groups to generate lightweight textual hypotheses, creating an extrinsic index layer that natively su

310

core_reconstruction

high

HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. While tremendous successes have been achieved for static faces, animatable avatars from dynamic videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on

311

core_reconstruction

high

MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting (3DGS) has emerged as a mainstream representation for 3D scenes, drawing increasing research attention to its understanding, generation, and editing. However, existing studies remain limited to low-level perception, low-quality generation, and low-efficiency editing, lagging far behind their image counterparts in the era of Multimodal Large Language Models (MLLMs). To bridge this gap, we propose MLLMSplat, a novel framework that adapts 2D MLLMs to achieve high-level understanding, high-quality generation, and high-efficiency editing of 3DGS scenes. Specifically, our comprehensive framework consists of three core designs: (1) a 3DGS tokenizer that can be seamlessly integrated into existing MLLMs in a training-free manner; (2) a 3DGS de-tokenizer that non-intrusively extends the 2D latent diffusion model in MLLMs using a dual positional encoding space, while augmentin

312

core_reconstruction

high

Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with photorealistic quality. 3DGS often requires a large number of primitives to achieve high fidelity, leading to redundant representations and high resource consumption, thereby limiting its scalability for complex or large-scale scenes. Consequently, effective pruning strategies and more expressive primitives that can reduce redundancy while preserving visual quality are crucial for practical deployment. We propose an efficient, integrated reconstruction-aware pruning strategy that adaptively determines pruning timing and refining intervals based on reconstruction quality, thus reducing model size while enhancing rendering quality. Moreover, we introduce a 3D Difference-of-Gaussians primitive that jointly models both positive and negative densi

313

core_reconstruction

medium

Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; data_benchmark

core genus=3D Gaussian Splatting, but title/abstract signal is narrower

abstract

Radiance field methods (e.g.~3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations.SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and critically fail to capture specular reflections -- a key component of realistic rendering. While alternatives like Spherical Gaussians offer improvements, they introduce significant optimization complexity.We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting.SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while maintaining simpler optimization compared to existing al

314

core_reconstruction

high

FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinkage for Large-scale LoD 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; data_benchmark

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting has revolutionized neural rendering with real-time performance. However, scaling this approach to large scenes using Level-of-Detail methods faces critical challenges: inefficient serial traversal consuming over 60\% of rendering time, and redundant Gaussian-tile pairs that incur unnecessary processing overhead. To address these limitations, we propose FilterGS, featuring a parallel filtering mechanism with two complementary filters that enable efficient selection without tree traversal, coupled with a scene-adaptive Gaussian shrinkage strategy that minimizes redundancy through opacity-based scaling. Extensive experiments demonstrate that FilterGS achieves state-of-the-art rendering speeds while maintaining competitive visual quality across multiple large-scale datasets.

315

core_reconstruction

high

Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; generation_editing

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting has recently enabled fast and photorealistic reconstruction of static 3D scenes. However, dynamic editing of such scenes remains a significant challenge. We introduce a novel framework, Physics-Guided Score Distillation, to address a fundamental conflict: physics simulation provides a strong motion prior that is insufficient for photorealism , while video-based Score Distillation Sampling (SDS) alone cannot generate coherent motion for complex, multi-particle scenarios. We resolve this through a unified optimization framework where physics simulation guides Score Distillation to jointly refine the motion prior for photorealism while simultaneously optimizing appearance. Specifically, we learn a neural dynamics model that predicts particle motion and appearance, optimized end-to-end via a combined loss integrating Video-SDS for photorealism with our physics-guidance

316

core_reconstruction

high

3D Gaussian Splatting at Arbitrary Resolution with Compact Proxy Anchors

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Despite achieving high-quality rendering, 3D Gaussian Splatting suffers from aliasing when the rendering resolution changes, as it is typically trained at a fixed resolution. To address this limitation, we introduce a method that enables the model to generate resolution-adaptive 3D Gaussians under arbitrary resolution changes. In particular, built upon Scaffold-GS, we enhance the anchor feature representation by incorporating a resolution-embedding to encode continuous resolution information. From these enhanced anchor features, a pixel coverage gate dynamically forms resolution-adaptive 3D Gaussians. Furthermore, we drastically reduce storage requirements by selecting a compact subset of proxy anchors and designing a residual anchor predictor that reconstructs the unselected leaf anchors based on the proxy anchors, enabling faithful scene representation without compromising visual fidel

317

core_reconstruction

high

IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Applying 3D Gaussian Splatting to inverse rendering, especially for relightable assets under high-illuminance conditions, remains challenging. Strong specular highlights and complex reflections complicate material-light disentanglement, often baking in shadows and losing specular detail. To address this, we introduce IR-HGP, a framework that achieves robust disentanglement using three synergistic modules: First, a Hybrid Visibility Decomposition module ensures physical visibility consistency. Second, a Generative Illumination Field Prior module infers detailed and high-dynamic range environmental lighting. Finally, a Physics-Aware Radiance Correction module stabilizes optimization and mitigates illumination artifacts. Our framework achieves SOTA material recovery and relighting performance, outperforming existing methods under challenging illumination conditions. It reconstructs the vi

318

core_reconstruction

high

Learning Differentiable Hierarchies in 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Although 3D Gaussian Splatting (3DGS) has achieved impressive performance in real-time rendering, its unordered Gaussians make level-of-detail (LoD) construction and model compression highly challenging, limiting its applicability in customized scenarios.In this work, we propose a learning-based Gaussian hierarchy representation that ranks Gaussians by their contribution to the scene, enabling flexible LoD representations across arbitrary Gaussian counts.We first introduce a unified, continuous formulation and metric for Gaussian hierarchy. Then, we introduce a hierarchy-based modulated rendering method built upon a Differentiable Decreasing Step Function, which enables efficient hierarchy learning while maintaining approximately equivalent rendering. Moreover, we develop a PDF-Guided Active-Region Sampling strategy that encourages the learned hierarchy to become widely distributed withi

319

core_reconstruction

high

NG-GS: NeRF-guided 3D Gaussian Splatting Segmentation

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring

320

core_reconstruction

high

Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, enabling efficient and high-fidelity novel view synthesis. However, seamless integration of both aerial and street view images to model urban scenes remains a significant challenge for 3DGS. This joint setting suffers from extreme view coverage disparity, complex multi-scale details, and imbalanced viewpoint distributions.In this work, we present Urban-GS, a novel framework built upon Gaussian Splatting for the compact unified reconstruction and high-fidelity rendering of urban scenes from both aerial and street views. Specifically, we first develop an Aerial-Street Joint Adaptive Densification method to resolve the densification conflicts arising from large view coverage disparity. We then introduce a Contribution-based Anchor Pruning strategy to effectively mitigate the storage overhead from captu

321

core_reconstruction

high

Write Where It Matters: Policy-Guided Watermarks for 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent advances in 3D Gaussian Splatting (3DGS) enable photorealistic real-time rendering but also increase the risks of unauthorized copying and redistribution. Existing 3DGS watermarking methods typically rely on handcrafted thresholds or globally fixed hyperparameters to balance invisibility and robustness, making their embedding behavior static and scene-agnostic. We instead formulate 3DGS watermarking as a goal-directed decision process and introduce Write Where It Matters (W2M), the first reinforcement learning-based framework that adaptively learns where and how much to embed. By modeling the embedding process as a Markov Decision Process, W2M uses a lightweight policy network to allocate precise Gaussian updates directly from immediate reward feedback, iteratively. The reward incentivizes both rendering-space invisibility and decoding robustness under various image- and model-lev

322

core_reconstruction

high

Z-Order Transformer for Feed-Forward Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this paper, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively

323

core_reconstruction

high

NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

3D Gaussian Splatting can exploit frustum culling and level-of-detail strategies to accelerate rendering of scenes containing a large number of primitives. However, the semi-transparent nature of Gaussians prevents the application of another highly effective technique: occlusion culling. We address this limitation by proposing a novel method to learn the viewpoint-dependent visibility function of all Gaussians in a trained model using a small, shared MLP across instances of an asset in a scene. By querying it for Gaussians within the viewing frustum prior to rasterization, our method can discard occluded primitives during rendering. Leveraging tensor cores for efficient computation, we integrate these neural queries directly into a novel instanced software rasterizer. Our approach outperforms the current state of the art for composed scenes in terms of VRAM usage and image quality, utili

324

core_reconstruction

high

SunFaded: Illumination-Aware Gaussian Splatting for Dark Scenes with Camera-Mounted Active Lighting

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance

core genus=3D Gaussian Splatting with direct reconstruction/geometry signal

abstract

Gaussian Splatting has emerged as a popular 3D representation technique, but still struggles with appearance inconsistencies, especially in dark scenes that require active illumination (e.g., camera flashes or co-moving light sources) to capture usable images, leading to dramatic local appearance fluctuations.While existing methods mainly focus on modeling global appearance changes for in-the-wild scenes, such as those caused by different times of day or weather conditions, they fail to handle the severe variations present in dark scenes with moving light sources.In this paper, we propose a novel Gaussian Splatting–based approach for constructing scene representations in dark scenes where active light sources are rigidly attached to the camera and move together with it.Within this framework, we introduce an illumination-weighted loss function that drives the representation toward the und

325

core_reconstruction

medium

Where, What, Why: Toward Explainable 3D-GS Watermarking

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance

core genus=3D Gaussian Splatting, but title/abstract signal is narrower

abstract

As 3D Gaussian Splatting becomes the de facto representation for interactive 3D assets, robust yet imperceptible watermarking is critical. We present a representation-native framework that separates where to write from how to preserve quality. A Trio-Experts module operates directly on Gaussian primitives to derive priors for carrier selection, while a Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark carriers—optimized for bit resilience under perturbation and bitrate budgets—and to visual compensators that are insulated from watermark loss. To maintain fidelity, we introduce a channel-wise group mask that controls gradient propagation for carriers and compensators, thereby limiting Gaussian parameter updates, repairing local artifacts, and preserving high-frequency details without increasing runtime. Our design yields view-consistent watermark persistence and strong

326

core_reconstruction

medium

Hyper-PCN: Hypergraph-based Point Cloud Completion via High-order Correlation Modeling

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Point cloud completion is an important yet challenging problem in 3D computer vision, which aims to reconstruct complete and dense 3D shapes from partial point clouds. Although transformer-based and geometry-based approaches have made significant progress, they often struggle to capture the complex, high-order correlations inherent in point clouds. To address this limitation, we propose Hyper-PCN, a point cloud completion framework that leverages hypergraphs to explicitly model complex, higher-order correlations within incomplete inputs for more accurate completion. It comprises two key modules: Hyper Refinement Stack, designed to progressively capture coarse-to-fine high-order correlations through a series of hypergraph learning stages, and Anchor-based Hypergraph Neural Network, which employs a two-stage sampling strategy to construct collaborative hypergraphs, ensuring robust modeling

327

core_reconstruction

medium

3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains a significant challenge. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the reliance on distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, thereby offering limited insight into performance under real-world material complexities. In this paper, we introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 100,000 synthetic instances generated via physically-based rendering of more than 10,000 shapes, and over 1,000 real-world objects scanned using consumer RGB-D devices. Together, these data consist of more than 7 m

328

core_reconstruction

medium

AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction

Remote Sensing & Earth / Remote Sensing

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Recent advances in 4D scene reconstruction have greatly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed.To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and traject

329

core_reconstruction

medium

Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction

Computational Imaging / Computational Imaging

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

The task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years.However, most conventional cameras are bandwidth-limited to 30–60 FPS, restricting these methods to static or slowly evolving scenes.While overcoming bandwidth limitations is difficult in general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific scenarios (e.g., motion capture and particle image velocimetry).However, most of these methods require modifications to camera optics or the addition of mechanically moving components, limiting them to a single-view high-speed capture. Consequently, these cannot be readily used to capture a 3D representation of rapid scene motion. In this paper, we propose a novel method to capture and reconstruct a volumetric representation of a high-speed

330

core_reconstruction

medium

DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves tem

331

core_reconstruction

medium

DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when they were captured from different scenes. To overcome these limitations we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into photorealistic, temporally consistent outputs. At its core is a single-step temporally-conditioned enhancer that is converted from a pretrained multi-step imag

332

core_reconstruction

medium

EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

Video & Motion / Human Motion

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting.However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild.To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame.The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly.Based on the collected data, we empower three embodied AI tasks: monocular human-scene-reconstruct

333

core_reconstruction

medium

EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Volume electron microscopy (vEM) enables nanoscale 3D imaging of biological structures but remains constrained by acquisition trade-offs, leading to anisotropic volumes with limited axial resolution. Existing deep learning methods seek to restore isotropy by leveraging lateral priors; yet their assumptions break down for morphologically anisotropic structures. We present **EMGauss**, a general framework for 3D reconstruction from planar scanned 2D slices with applications in vEM, which circumvents the inherent limitations of isotropy-based approaches. Our key innovation is to reframe slice-to-3D reconstruction as a 3D dynamic scene rendering problem based on Gaussian splatting, where the progression of axial slices is modeled as the temporal evolution of 2D Gaussian point clouds. To enhance fidelity in data-sparse regimes, we incorporate a **Teacher–Student bootstrapping mechanism** that

334

core_reconstruction

medium

FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce **FaithFusion**, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on

335

core_reconstruction

medium

FVGen: Scaling 3D Scene Datasets with Certainty-Aware Free-View Generation from Scene Geometry Reconstruction

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

The development of generalizable Novel View Synthesis (NVS) models is critically limited by the scarcity of large-scale training data with diverse and accurate camera trajectories. While real-world captures are photorealistic, they are typically sparse and discrete. Conversely, synthetic data scales but suffers from a domain gap and often lacks realistic semantics. We introduce FVGen, a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. Our key insight is that an imperfect reconstructed scene serves as a rich geometric proxy, but naively sampling from it amplifies artifacts. To this end, we propose a certainty-aware free-view sampling strategy that identifies novel viewpoints which are both semantically meaningful and minimally affected by reconstruction errors. We demonstr

336

core_reconstruction

medium

Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

direct reconstruction/3DGS/4D title linked to core representation cluster

abstract

3D semantic occupancy prediction is crucial for autonomous driving, yet vision-only approaches suffer from weak geometric cues, and existing multi-modal frameworks often depend on dense voxel or BEV tensors that impose heavy computational cost. We present **Gau-Occ**, a multi-modal framework that models the scene as a compact collection of semantic 3D Gaussians, enabling geometry-guided fusion without dense volumetric processing.To enhance geometric completeness, a learned **LiDAR Completion Diffuser (LCD)** trained on real-world priors recovers missing structures from sparse LiDAR, and the completed points are encoded as semantic Gaussian anchors.To further integrate multi-view image semantics, we introduce **Gaussian Anchor Fusion (GAF)**, a geometry-aligned aggregation module that performs anchor-guided 2D sampling, local neighborhood encoding, and cross-modal alignment. By construc

337

core_reconstruction

medium

ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Parking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS-denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, 3D reconstruction remains underexplored, which is crucial for capturing complex spatial geometry in parking scenarios. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slots perception module. To address these limitations, we curate the first benchmark named ParkRecon3D, specifically designed for parking scene reconstruction. It includes sensor data from four surround-view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking

338

core_reconstruction

medium

PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems

Robustness & Safety / Safety

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Poisoning input views of 3D reconstruction systems has been recently studied.However, we identify that existing studies simply backpropagate adversarial gradients through the 3D reconstruction pipeline as a whole, without uncovering the new vulnerability rooted in specific modules of the 3D reconstruction pipeline.In this paper, we argue that the structure-from-motion (SfM) initialization, as the geometric core of many widely used reconstruction systems, can be targeted to achieve strong poisoning effects. To this end, we propose PoInit-of-View, which optimizes adversarial perturbations to intentionally introduce cross-view gradient inconsistencies at projections of corresponding 3D points. These inconsistencies disrupt keypoint detection and feature matching, thereby corrupting pose estimation and triangulation within SfM, eventually resulting in low-quality rendered views. We also pro

339

core_reconstruction

medium

Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions—where over 93% of voxels are empty and foreground classes are rare—poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; (2) a Foreground Modulation Strategy combining Foreground Dro

340

core_reconstruction

medium

TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Embodied 3D Semantic Scene Completion (SSC) infers dense geometry and semantics from continuous egocentric observations. Most existing Gaussian-based methods rely on random initialization of many primitives within predefined spatial bounds, resulting in redundancy and poor scalability to unbounded scenes. Recent depth-guided approach alleviates this issue but remains local, suffering from latency and memory overhead as scale increases.To overcome these challenges, we propose TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied SSC. It maintains a persistent Gaussian memory for temporal prediction, without relying on image coherence or frame caches.For temporal fusion, a Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention.Subsequently, a Confidence-aware Voxel Fusion module merges overlapping primiti

341

core_reconstruction

medium

UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes

Remote Sensing & Earth / Remote Sensing

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable.However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness.We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along

342

core_reconstruction

medium

Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train a image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameter from unposed images. The predicted pose

343

core_reconstruction

medium

Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Scalable robot learning is hindered by the high cost of acquiring diverse, high-quality embodied data. Existing data generation approaches partially mitigate this issue but typically depend on hard-to-access hardware and labor-intensive manual effort, with limited generalization to diverse scene configurations. To overcome these limitations, we propose Video2Robo, a framework that generates high-quality and diverse robot data directly from a single human demonstration video, enabling seamless deployment on physical robots. At its core, Video2Robo leverages 3D Gaussian Splatting (3DGS) as a powerful scene representation, enabling high-fidelity rendering and explicit 3D scene editing. The framework tracks temporally consistent motion trajectories of task-relevant objects from raw video footage and identifies key task skills, guiding robots to execute tasks kinematically plausibly under nov

344

core_reconstruction

medium

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has sparked interest in extending these models to 3D scenes, with the goal of human-like visual-spatial intelligence. However, achieving deep spatial understanding comparable to human capabilities remains challenging for both model design and data acquisition. Existing methods often rely on external depth sensors for geometry capture or off-the-shelf algorithms for pre-constructing 3D maps, which limits their scalability.In this work, we introduce VLM-3R, a framework for Vision-Language Models that couples 3D reconstructive instruction tuning with scalable training data curation and a new benchmark for temporal reasoning. Specifically, VLM-3R processes monocular video frames with a geometry encoder that derives implicit 3D tokens representing scene context (spatial tokens) and camera motion (view tokens). In

345

core_reconstruction

medium

WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation. While image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over the weather effects. To address these limitations, we propose \textbf{WeatherCity}, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with

346

core_reconstruction

medium

RecEdit-Drive: 3D Reconstruction-Guided Spatiotemporal Video Editing for Autonomous Driving Scenes

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

High-quality video editing and processing are crucial in domains such as filmmaking and autonomous driving, where accurate visual refinement and data preparation are essential. However, it is challenging to achieve precise control over dynamic objects while maintaining spatiotemporal consistency. Current approaches typically utilize text prompts or 2D structural priors for video editing to ensure consistency, yet they struggle to effectively constrain the spatial variations of dynamic 3D objects. In this paper, we introduce $\textbf{RecEdit-Drive}$, a framework that integrates $\textbf{Spatial Feature Warping}$ and $\textbf{Spatiotemporal Collaborative Modeling}$ to effectively control 3D object variations and enhance video consistency. The spatial feature warping enhances precise control over the edited foreground 3D objects, enhancing spatial consistency in the generated videos; and th

347

core_reconstruction

medium

MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction.To address this limitation, we introduce MAGICIAN a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a predicted scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of overage gain for any novel viewpoint via fast volumetric rendering.The resulting speedup allows the integration of the gain metric into a tree-search algorithm for planning long-horizon paths.We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner.Our method achieves state-of-the-art performan

348

core_reconstruction

medium

Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

We propose Exact-GS, a novel mathematically rigorous and accurate 3D Gaussian Splatting model designed to perform 3D X-ray computed tomography (CT) reconstruction and novel view synthesis. Recently, 3D Gaussian Splatting achieved considerable progress at 3D representation. Unfortunately, due to the affine approximation of the projective transformation, previous 3DGS-based methods inevitably suffer from artifacts and projection inconsistencies. To address this problem, some ray tracing based methods perform integration along the ray across Gaussians. However, these methods are computationally inefficient on the forward and backward pass. We introduce a novel closed-form splatting solution for this problem with mathematically rigorous derivation. Our model is the first to achieve the same exact rendering quality as ray tracing based methods without any approximation under a splatting-based

349

core_reconstruction

medium

RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection

Detection & Tracking / Detection

D. adjacent but useful context

gaussian_radiance; pose_calibration_localization; dynamic_4d; robotics_mapping

direct reconstruction/3DGS/4D title linked to core representation cluster

abstract

4D millimeter-wave radar is a promising sensing modality for autonomous driving, yet effective 3D object detection from 4D radar and monocular images remains challenging. Existing fusion approaches either rely on instance proposals lacking global context or dense BEV grids constrained by rigid structures, lacking a flexible and adaptive representation for diverse scenes. To address this, we propose RaGS, the first framework that leverages 3D Gaussian Splatting (GS) to fuse 4D radar and monocular cues for 3D object detection. 3D GS models the scene as a continuous field of Gaussians, enabling dynamic resource allocation to foreground objects while maintaining flexibility and efficiency. Moreover, the velocity dimension of 4D radar provides motion cues that help anchor and refine the spatial distribution of Gaussians. Specifically, RaGS adopts a cascaded pipeline to construct and progressi

350

core_reconstruction

medium

ST4R-Splat: Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting

Segmentation & Dense Prediction / Segmentation

D. adjacent but useful context

gaussian_radiance; dynamic_4d; robotics_mapping; data_benchmark

direct reconstruction/3DGS/4D title linked to core representation cluster

abstract

Understanding and segmenting objects in dynamic 4D environments from natural language is crucial yet underexplored. Existing works either perform referring segmentation in static 3D scenes or build open-vocabulary 4D language fields, but none of them supports grounding complex spatio-temporal referring descriptions in explicit 4D reconstructions. Based on 4D Gaussian Splatting(4DGS), We formalize this missing setting as Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting (STRS-4DGS): given a 4DGS representation of a dynamic scene and a referring expression, the goal is to identify the target object and segment it across both space and time, resolving where the described instance is and when it exhibits the queried state. To tackle this challenge, we propose ST4R-Splat, the first framework for STRS-4DGS. ST4R-Splat builds on deformable 4D Gaussians and introduces an Instance-A

351

core_reconstruction

medium

Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

direct reconstruction/3DGS/4D title linked to core representation cluster

abstract

X-ray computed tomography (CT) reconstructs volumetric representations of objects from projection images obtained by transmitting X-rays through a target. Recent splat-based tomography, which represents a volume as a continuous distribution of 3D Gaussians, has demonstrated both high reconstruction quality and fast convergence in cone-beam sparse-view CT. However, when deployed in real CT systems with limited and non-uniform view distributions, we observe distinctive streak and strip artifacts that are far more pronounced than in conventional reconstruction methods. Through detailed analysis, we show that these artifacts primarily originate from pose inaccuracies in the acquisition geometry rather than from view sparsity itself. We revisit pose sensitivity in the splatting formulation and derive a stable gradient-based framework that jointly refines geometric parameters during reconstruc

352

core_reconstruction

medium

DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging

Computational Imaging / Computational Imaging

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

direct reconstruction/3DGS/4D title linked to core representation cluster

abstract

Video snapshot compressive imaging (SCI) offers a promising alternative to high-speed cameras by encoding multiple frames into a single 2D measurement. However, SCI requires algorithms to reconstruct the high-speed video and as resolution increases, reconstruction becomes computationally expensive and memory-intensive. Much of resource is wasted on recovering large background regions that contain little useful information, highlighting the need for selective, object-driven reconstruction. Existing object detectors struggle to perform accurately on SCI measurements due to the spatial–temporal aliasing introduced by coded exposure. To address this challenge, we proposes DetectSCI, the first framework enabling object-guided region-of-interest (ROI) reconstruction for high-resolution SCI. The inside detector comprises two key components: an encoder built from weight-sharing Mamba-Implicit Mo

353

core_reconstruction

medium

Generative Diffusion Priors for 3D Mapping of the Dark Universe

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

general_reconstruction; gaussian_radiance; robotics_mapping; data_benchmark

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Reconstructing the three-dimensional distribution of dark matter from weak-lensing observations is a central but highly ill-posed inverse problem in cosmology. Unlike standard 3D reconstruction with multiple viewpoints, we observe the universe from a single line of sight, through noisy shape distortions of galaxies with uncertain distances, so meaningful recovery of the 3D matter field requires strong prior assumptions. Existing methods either produce point estimates with handcrafted priors or use neural ensembles for approximate Bayesian uncertainty, and struggle to capture the non-Gaussian, filamentary structure of the cosmic web. With the advent of new high-resolution cosmological simulations, we now have an alternative source of prior knowledge that captures the nonlinear statistics of structure formation with far greater fidelity than analytic prescriptions. We leverage these simula

354

core_reconstruction

medium

REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting

Multimodal & Language / VLM / MLLM

D. adjacent but useful context

gaussian_radiance; pose_calibration_localization; surface_occupancy; generation_editing

direct reconstruction/3DGS/4D title linked to core representation cluster

abstract

Bridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically

355

core_reconstruction

medium

GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

general_reconstruction; gaussian_radiance; surface_occupancy

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Slice-based volumetric imaging is widely applied and it demands representations that compress aggressively while preserving internal structure for analysis. This paper introduces GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our new method introduces three key innovations: (i) a slice‑aware piling strategy that positions anisotropic 3D Gaussians to model through‑slice contributions, (ii) a differentiable projection operator that encodes the finite‑thickness point spread function of the imaging acquisition system, and (iii) a compact encoding and joint optimization pipeline that simultaneously reconstructs and compresses the Gaussian sets. Our CUDA-based design retains the compression and real‑time rendering efficiency of Gaussian primitives while preserving high‑frequency internal volumetric detail. Experiments on microsc

356

core_reconstruction

medium

Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy

direct reconstruction/3DGS/4D title linked to core representation cluster

abstract

Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, an Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments

357

core_reconstruction

medium

MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Driving

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

gaussian_radiance; depth_correspondence; robotics_mapping; data_benchmark

direct reconstruction/3DGS/4D title linked to core representation cluster

abstract

Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance.To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We introduce ``MeanFlow Identity", which models the mean velocity fiel

358

core_reconstruction

medium

Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

gaussian_radiance; surface_occupancy; robotics_mapping

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

Implicit neural representation (INR) based methods learn a continuous mapping from a low-resolution (LR) target magnetic resonance (MR) image and a high-resolution (HR) reference image to achieve arbitrary-scale super-resolution (SR). However, their inherent spectral bias favors learning low-frequency (LF) components, often failing to capture the sharp transitions at anatomical boundaries and resulting in the loss of high-frequency (HF) details. Inspired by 3D Gaussian splatting, we propose GaussM²ASR (Gaussian Multi-contrast MRI Arbitrary-scale Super-Resolution), which converts the challenging task of HF anatomical reconstruction into a smoother parameter optimization problem by learning the parameters of anisotropic 2D Gaussian kernels. To handle inter-contrast discrepancies, we introduce an anatomy-guided pipeline comprising three core modules: a Structure Prior Modulation Fusion (SPM

359

core_reconstruction

medium

TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

pose_calibration_localization; dynamic_4d; robotics_mapping

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully pl

360

core_reconstruction

medium

DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures

Multimodal & Language / Agentic AI

D. adjacent but useful context

general_reconstruction; gaussian_radiance

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

3D Gaussian Splatting achieves real-time photo-realistic rendering but struggles when training images contain transient objects that violate multi-view consistency. Existing methods face a fundamental dilemma: accurate transient detection requires well-reconstructed static scenes, yet clean reconstruction depends on reliable transient masks. This circular dependency causes persistent artifacts when both components are jointly optimized from poor initialization. We present DualSplat, a two-stage framework which sidesteps this dilemma by first generating pseudo masks from reconstruction failures, then using them to guide clean scene optimization. We observe that transient objects manifest as incomplete fragments during initial training, since they appear in only a subset of views. We consolidate these failures into pseudo masks via instance-level thresholding and a feature-residual filter

361

core_reconstruction

medium

RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks

Robustness & Safety / Safety

D. adjacent but useful context

general_reconstruction; gaussian_radiance

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

As a mainstream technique for 3D reconstruction, 3D Gaussian splatting (3DGS) has been applied in a wide range of applications and services. Recent studies have revealed critical vulnerabilities in this pipeline and introduced computation cost attacks that lead to malicious resource occupancies and even denial-of-service (DoS) conditions, thereby hindering the reliable deployment of 3DGS. In this paper, we propose the first effective and comprehensive black-box defense framework, named RemedyGS, against such computation cost attacks, safeguarding 3DGS reconstruction systems and services. Our pipeline comprises two key components: a detector to identify the attacked input images with poisoned textures and a purifier to recover the benign images from their attacked counterparts, mitigating the adverse effects of these attacks. Moreover, we incorporate adversarial training into the purifier

362

core_reconstruction

medium

Eulerian Gaussian Splatting using Hashed Probability Pyramids

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

gaussian_radiance; robotics_mapping

3D Vision & Geometry paper with direct reconstruction title and abstract signal

abstract

We introduce a probabilistic splat-based radiance field framework that retains the fast rasterization and test-time efficiency of 3D Gaussian Splatting (3DGS) while replacing heuristic primitive manipulation with gradient-based optimization of a volumetric probability density. Rather than relocating, splitting, or culling Gaussians via hand-tuned densification (e.g., ADC), we treat primitive locations as samples drawn from a persistent, learnable density. We instantiate this density with a novel, memory-efficient multi-scale hierarchical grid that enables end-to-end gradient-based control over primitive population density. To stabilize stochastic training, we derive an unbiased gradient estimator with control variates that markedly reduces variance. By allowing probability mass to flow to where the loss demands, our method eliminates brittle priors and naturally explores the volume, achi

363

strong_bridge

medium

Clone Deterministic 3D Worlds

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds, but neglect the need for high-fidelity modeling of deterministic scenarios (such as fixed-map mazes and static space robot navigation). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. 1) Through diagnostic experiment, we quantitatively demonstrate that high-fidelity cloning is feasible and the primary bottleneck for long-horizon fidelity is the geometric st

364

strong_bridge

medium

NeuROK: Generative 4D Neural Object Kinematics

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

dynamic/4D paper with direct reconstruction signal

abstract

Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics---realistic temporal deformations of static objects under various physical conditions---remains challenging and often ad hoc despite being critical for building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space of all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameteriza

365

strong_bridge

medium

SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy

dynamic/4D paper with direct reconstruction signal

abstract

Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency.We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors with this geodesic correspondence field supervision, (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space.This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing.SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy–efficiency trad

366

strong_bridge

medium

Spatia: Video Generation with Updatable Spatial Memory

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; generation_editing

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory–aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic–static disentanglement design enhances spatial consistency throughout the generation process while preserving the model’s ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.

367

strong_bridge

medium

D-Prism: Differentiable Primitives for Structured Dynamic Modeling

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy

dynamic/4D paper with direct reconstruction signal

abstract

Capturing both geometry and rigid motion for structured dynamic objects, like multi-part assemblies or jointed mechanisms, remains a key challenge. Existing dynamic methods, such as deformable meshes or 3DGS, rely on unstructured representations and fail to jointly model suitable geometry and articulated motion. Primitive-based methods excel at structured static scenes, but their dynamic potential is still unexplored. We propose D-Prism, the first framework to achieve high-fidelity structured dynamic modeling by extending differentiable primitives to the dynamic domain.Specifically, we bind 3DGS to primitive surfaces, leveraging their respective strengths in appearance and geometry. We introduce a deformation network to control primitive motion, ensuring it accurately matches the object's movement. Furthermore, we design a novel adaptive control strategy to dynamically adjust primitive c

368

strong_bridge

medium

Dark3R: Learning Structure from Motion in the Dark

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

Gaussian/radiance representation linked to pose/mapping/metric bridge

abstract

We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below $-4$ dB—a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher–student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy--clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson–Gaussian noise model applied to well-exposed raw images.To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes $\sim$42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structur

369

strong_bridge

medium

Perceptual 3D Simulation With Physical World Modeling

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Predicting how a scene will evolve after a desired 3D transformation from images is a central goal in vision, graphics, and robotics. Yet unlike ideal simulators with full access to 3D geometry and dynamics, real world systems must rely on perceptual inputs and local actions that are inherently partial and incomplete. In this work, we present P3Sim, a physical world modeling system that simulates future scene states under both partial observations and incomplete 3D transformation signals. P3Sim is composed of three interacting components: a learned physical world model, a geometric conditioning module, and a persistent scene memory. The world model interprets perception as probabilistic inference over multimodal scene variables, providing predictions of the distributions of any scene variable conditioned on any combination of others. The geometric conditioning module provides a partial 3

370

strong_bridge

medium

Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark

dynamic/4D paper with direct reconstruction signal

abstract

Existing dynamic scene rendering methods often adopt rigid-body or direction-limited assumptions, yet real-world motion and contact routinely violate these, producing artifacts near occlusion boundaries. To address this, we introduce a unified, source-aware framework for dynamic rendering that enforces the consistency of Gaussian primitives under an explicit manifold constraints. We project predicted velocities onto physically grounded priors via efficient, parallel inner solves: (i) a Helmholtz parameterization that separates divergence-free and potential-flow motion components; (ii) an anisotropic, compressible directional prior; and (iii) an affine family that disentangles rotation from isotropic scaling. Experiments on extensive benchmarks show consistent improvements over state-of-the-art methods in reconstruction fidelity and temporal coherence. Our approach ensures physically real

371

strong_bridge

medium

SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing

Gaussian/radiance representation linked to pose/mapping/metric bridge

abstract

We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a light-weight rectified flow decoder. A diffusion transformer enables scene generation on the compressed token space. We show that the compression is two orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including

372

strong_bridge

medium

VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretr

373

strong_bridge

medium

ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes

3D Vision & Geometry / 3D Gaussian Splatting

B. bridge: reconstruction becomes mapping/world model

gaussian_radiance; dynamic_4d; robotics_mapping; data_benchmark

dynamic/4D paper with direct reconstruction signal

abstract

Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It‘s also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we re

374

strong_bridge

high

DROID-SLAM in the Wild

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmark

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 8 FPS. The source code will be publicly

375

strong_bridge

high

HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmark

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusi

376

strong_bridge

medium

StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

pose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mapping

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

We propose StreamVLO, a streaming visual–LiDAR odometry framework that performs unified spatio-temporal correlation with Mamba models and tackles the long-standing cumulative drift problem via an online Cumulative Drift Compensation scheme for localization in 4D dynamic environments. Specifically, StreamVLO introduces a unified spatio-temporal correlation module built on Mamba to fuse heterogeneous visual and LiDAR cues across multi-frame clips, overcoming the limited temporal exploration of prior pairwise methods. Furthermore, a Cumulative Drift Compensation module minimizes cumulative drift by iteratively learning residual corrections from multiple historical frames in a causal manner. To strengthen spatial feature representation on salient regions, we adopt a Keypoint-Aware Auxiliary Loss with a winner-takes-all strategy. StreamVLO achieves state-of-the-art performance on two commonly

377

strong_bridge

high

VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird’s-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mecha

378

strong_bridge

high

VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

pose_calibration_localization; depth_correspondence; robotics_mapping

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

Accurate global localization is crucial for autonomous driving and robotics, especially in dense urban environments where GNSS is often unreliable due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. To address this challenge, we propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to r

379

strong_bridge

high

Similarity-Consistent Likelihood Diffusion enables Hidden Person Detection from Wall Reflections

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

pose_calibration_localization; robotics_mapping

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

This paper studies passive non-line-of-sight corner-camera detection and human localization using faint indirect reflections on a visible wall. The challenge is twofold: multi-exposure wall observations are unstable and entangled with sensor nonlinearities, and mapping these observations to a hidden-view RGB image is severely underdetermined, making purely discriminative regressors brittle and unconstrained diffusion priors stochastic. To address these challenges, we introduce the Similarity-Likelihood Diffusion Network (SLD-Net), a two-stage framework that produces measurement-consistent, deterministic reconstructions. First, DeLi-Inversion forms an exposure-aware differential representation and jointly predicts an initial reconstruction and a pixel-wise precision map, yielding a heteroscedastic pseudo-likelihood. Second, SiCo-Diffusion injects this likelihood as precision-weighted ener

380

strong_bridge

high

Uncertainty-Aware Modality Fusion for Unaligned RGB-T Salient Object Detection

3D Vision & Geometry / Pose Estimation

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization; surface_occupancy

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

Unaligned RGB-T salient object detection (SOD) remains challenging due to severe cross-modal spatial discrepancies and unreliable feature fusion. Existing methods often assume perfect alignment or rely on geometric registration, which is computationally demanding and sensitive to cross-modal inconsistencies. To address these limitations, we propose an uncertainty-aware modality fusion network (UMFNet) that reformulates RGB-T SOD as an uncertainty-aware representation learning problem. Specifically, the proposed uncertainty alignment module (UAM) models pixel-wise features as Gaussian latent distributions to estimate local uncertainty and identify cross-modal consistency regions within the feature space, thereby achieving implicit alignment without explicit registration. Furthermore, the confidence-guided global modulation (CGM) mechanism leverages confidence maps derived from uncertainty

381

strong_bridge

high

AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmark

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions.First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation.Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-flyrendered object features and obser

382

strong_bridge

high

CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmark

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

Scene Coordinate Regression (SCR) has emerged as a memory-efficient paradigm for visual localization.While SCR has demonstrated performance comparable to classic feature matching based approaches in small-scale scenes, it has consistently underperformed in large-scale environments.Large-scale localization is hampered by two challenges: sparse co-visibility and local appearance ambiguity.In this work, we propose **CoLoR**, a novel training framework tailored for large-scale SCR.First, we explicitly and efficiently partition scene points into multi-view and single-view sets and introduce a two-stage bootstrapping paradigm to provide complete and strong supervision for all points.Second, we propose a multi-granularity retrieval feature, which unifies the conventional global and local features as retrieval-oriented representations at the image and pixel levels, respectively, to enforce featu

383

strong_bridge

medium

Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d; generation_editing; data_benchmark

dynamic/4D paper with direct reconstruction signal

abstract

We introduce Diff4Splat, a feed-forward framework for dynamic scene generation from a single image. Our method synergizes the powerful generative priors of video diffusion models with geometric and motion constraints learned from a large-scale 4D dataset. Given a single image, a camera trajectory, and an optional text prompt, our model directly predicts a dynamic scene represented by a deformable 3D Gaussian field. This approach captures appearance, geometry, and motion in a single pass, eliminating the need for test-time optimization or post-hoc processing. At the core of our framework is a video latent transformer that enhances existing video diffusion models, enabling them to jointly model spatio-temporal dependencies and predict 3D Gaussian Primitives over time. Supervised by objectives targeting appearance fidelity, geometric accuracy, and motion consistency, Diff4Splat generates hi

384

strong_bridge

medium

Event6D: Event-based Novel Object 6D Pose Tracking

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmark

dynamic/4D paper with direct reconstruction signal

abstract

Event cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclus

385

strong_bridge

high

PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmark

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity.PiLoT breaks this paradigm by directly registering live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from core localization thread, ensuring both low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps). This dataset enables the training of a

386

strong_bridge

medium

ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given a image sequence, we leverage off-the-shelf visual-inertial SLAM,3D detection algorithms and VLMs to extract for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strateg

387

strong_bridge

medium

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; dynamic_4d; generation_editing; data_benchmark

dynamic/4D paper with direct reconstruction signal

abstract

Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion.To this end, we collect **SpatialVID**, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions.Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subseq

388

strong_bridge

high

Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; surface_occupancy

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

Learning-based structure-from-motion methods such as ACE-Zero have demonstrated strong performance in estimating camera poses and scene coordinates from unordered image collections without requiring ground truth supervision. However, the lack of global and multi-view consistency constraints in ACE-Zero can lead to pose drift and misalignment, particularly in complex or ambiguous scenes. In this work, we propose a hybrid framework that integrates pose graph optimization (PGO) into ACE-Zero to refine camera poses and suppress incorrect refinements. We construct pose graphs directly from ACE-Zero outputs by extracting relative pose constraints from predicted scene coordinates. Furthermore, we introduce an uncertainty-aware optimization strategy by estimating confidence scores using geometric priors, including epipolar and optical flow consistencies across views. Our approach improves the ro

389

strong_bridge

high

Sparse–View Localization via Online Neural 3D Regression

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

We present ON3R, an online-trained neural regressor addressing sparse-view structureless localization, where database images have limited visual overlap and no prebuilt 3D map. Given any sparse matches between a query and a $K$-tuple of posed database views, ON3R predicts 3D coordinates for matched query keypoints, supervised by database reprojection residuals and a monocular depth prior. Afterwards, the absolute pose of the query is estimated via P3P-RANSAC and refined with lightweight bundle adjustment. Across MegaDepth, Cambridge Landmarks, and a sparsified version of Aachen Day-Night, ON3R outperforms existing methods. ON3R is particularly effective when the data is extremely sparse -- we focus on $K\leq10$ database images. The code, data splits, and SfM models will be made available for full reproducibility.

390

strong_bridge

high

JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; surface_occupancy

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

In this paper, JUMP-Hand is proposed as a novel method for multi-view 3D hand reconstruction, which is the first to introduce probabilistic joint-wise uncertainty as an explicit gating mechanism to fuse multi-view information.Existing approaches usually fuse multi-view information by naïve pooling or implicit attention.However, they overlook that each hand joint exhibits varying visibility and reliability across views, which may degrade performance by indiscriminately aggregating noisy or unreliable information.For instance, one joint may be clearly visible in one view, while another joint is occluded in that view but visible in a different view.In contrast, JUMP-Hand addresses this by introducing the core insight of Mixture of Experts (MoE) and regard each 2D view as an expert.The key idea is that the reliability of each view expert is quantified through joint-wise uncertainty modeling,

391

strong_bridge

medium

MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d; data_benchmark

dynamic/4D paper with direct reconstruction signal

abstract

We present MoVieS, a motion-aware view synthesis model that reconstruct 4D dynamic scenes from monocular videos in one second. It represents dynamic 3D scenes with pixel-aligned Gaussian primitives and explicitly supervises their time-varying motions. This allows, for the first time, the unified modeling of appearance, geometry and motion from monocular videos, and enables reconstruction, view synthesis and 3D point tracking within a single learning-based framework. By bridging view synthesis with geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive p

392

strong_bridge

medium

AvatarPointillist: AutoRegressive 4D Gaussian Avatarization

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d; surface_occupancy

dynamic/4D paper with direct reconstruction signal

abstract

We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject's complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality,

393

strong_bridge

medium

Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d; data_benchmark

dynamic/4D paper with direct reconstruction signal

abstract

Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution.We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation.MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations.Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution.Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-ter

394

strong_bridge

medium

EmoDiffTalk：Emotion-aware Diffusion for Editable 3D Gaussian Talking Head

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d; generation_editing

dynamic/4D paper with direct reconstruction signal

abstract

Recent photo-realistic 3D talking head via 3D Gaussian Splatting still has significant shortcoming in emotional expression manipulation, especially for fine-grained and expansive dynamics emotional editing using multi-modal control. This paper introduces a new editable 3D Gaussian talking head, i.e. EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which includes an action unit (AU) prompt Gaussian diffusion process for fine-grained facial animator, and moreover an accurate text-to-AU emotion controller to provide accurate and expansive dynamic emotional editing using text input. Experiments on public EmoTalk3D and RenderMe-360 datasets demonstrate superior emotional subtlety, lip-sync fidelity, and controllability of our EmoDiffTalk over previous works, establishing a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synth

395

strong_bridge

high

SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; surface_occupancy

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

Image-to-point cloud matching (2D-3D matching) establishes accurate correspondences between image keypoints and 3D points for 6-DoF camera pose estimation. Existing methods either suffer from poor generalization due to scene-specific coordinate regression requiring per-scene retraining, or incur high storage and maintenance costs from descriptor-based matching that relies on large descriptor sets. Consequently, descriptor-free approaches have gained attention by avoiding heavy storage while improving generalizability; however, most rely only on low-level geometric cues, which limits performance. Leveraging the benefits of semantics in providing context, resolving ambiguities, and enhancing robustness in challenging scenes, we propose the Semantic-Aware Guided Graph Neural Network (SAG-GNN), integrating high-level semantics into descriptor-free 2D-3D matching. Specifically, we design a co

396

strong_bridge

high

HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization

pose/localization bridge genus=Pose Estimation with reconstruction/map signal

abstract

Recovering global human and camera motion from monocular video is essential for world-coordinate human reconstruction but remains challenging due to entangled motions in image space. Traditional SLAM methods estimate monocular camera motion but fail in scenes dominated by foreground objects such as humans. A common workaround is to mask out dynamic objects, yet this approach becomes brittle when humans occupy most of the view or the background is too noisy, leading to unstable tracking and loss of constraints. This paper takes the opposite stance and reintegrates human motion as informative landmarks. We introduce HumanBA, a human-aware bundle adjustment framework that transforms dynamic humans into usable constraints via motion decoupling. HumanBA subtracts the human-induced component from observed joint trajectories, isolating a camera-induced (pseudo-static) component that can be saf

397

strong_bridge

medium

PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d

dynamic/4D paper with direct reconstruction signal

abstract

Physically plausible reconstruction of human–object dynamics from a single video remains under-explored in physics-based methods. Most prior approaches omit human-generated internal actuation by assuming motion driven solely by gravity and simple contacts. They also rely on idealized constitutive laws that underfit heterogeneous and anisotropic materials. We introduce PhysHO, which tightly couples SMPL-driven Linear Blend Skinning (LBS) with a Material Point Method (MPM) simulator to address these gaps. Our key insight is to use LBS as an interpretable actuation prior and MPM to propagate those forces through contact under physical constraints. Concretely, we derive targeted actuation with a PD controller guided by LBS trajectories and gate it per particle via a learnable LBS-impact factor so that only particles inside the SMPL volume are directly actuated. We model real materials with r

398

strong_bridge

medium

CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Autonomous driving must operate reliably across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields ~500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other publi

399

strong_bridge

medium

Dexterous World Models

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static—limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), an scene-action-conditioned video diffusion model enabling embodied interaction within static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human–scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues in the egocentric view to model action-conditioned dynamics directly. We train our model on a synthetic h

400

strong_bridge

medium

DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Vision-based autonomous driving has gained much attention due to its low costs and excellent performance.Compared with dense BEV (Bird’s Eye View) or sparse query models, Gaussian-centric method is a comprehensive yet sparse representation by describing scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic gaussian-centric pre-training in autonomous driving using two stages. In the first stage, DLWM predicts 3D Gaussians from queries by self-supervised reconstructing multi-view semantic and depth images. Equipped with fine-grained contextual features, in the second stage, two latent world models are trained separately for temporal feature learning, including Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent pr

401

strong_bridge

medium

GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We furth

402

strong_bridge

medium

GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Reliable 3D perception from multi-view roadside sensors hinges on the robust fusion of camera and LiDAR data, a task complicated by geometric misalignments and sensor calibration errors. This paper presents GSV2X, a fusion framework that tackles these challenges through two core contributions. First, to achieve robustness against spatial uncertainty, we lift 2D image features into a unified Bird's-Eye-View (BEV) space by representing them as 3D Gaussian distributions. By incorporating learnable perturbations guided by camera geometry, our model explicitly accounts for potential calibration inaccuracies. Second, to maximize the synergy between modalities, we propose a new orthogonal fusion module. This module employs constrained attention to enforce orthogonality between camera and LiDAR features, effectively disentangling redundant information and promoting the learning of complementary

403

strong_bridge

medium

NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

In this paper, we propose **NeoVerse**, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks.

404

strong_bridge

medium

ORV: 4D Occupancy-centric Robot Video Generation

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Recent embodied intelligence suffers from data scarcity, while conventional simulators lack visual realism. Controllable video generation is emerging as a promising data engine, yet current action-conditioned methods still fall short: generated videos are limited in fidelity and temporal consistency, poorly aligned with controls, and often constrained to singleview settings. We attribute these issues to the representational gap between sparse control inputs and dense pixel outputs. Thus, we introduce ORV, a 4D occupancy-centric framework for robot video generation that couples action priors with occupancy-derived visual priors. Concretely, we align chunked 7-DoF actions with video latents via an Action-Expert AdaLN modulation, and inject 2D renderings of 4D semantic occupancy into the generation process as soft guidance. Meanwhile, a central obstacle is the lack of occupancy data for emb

405

strong_bridge

medium

Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally,

406

strong_bridge

medium

Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

Gaussian/radiance representation linked to pose/mapping/metric bridge

abstract

Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for system validation and training purposes. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite that we refer to as the AV log, which includes multi-view camera images and LiDAR point

407

strong_bridge

medium

Stereo World Model

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation.Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower com

408

strong_bridge

medium

U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present **U4D**, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) *uncertainty-region modeling*, which reconstructs high-entropy regions with fine geometric fidelity, and (2) *uncertainty-conditioned completion*, which synthesizes the remaining a

409

strong_bridge

medium

Unified Camera Positional Encoding for Controlled Video Generation

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; dynamic_4d; robotics_mapping; generation_editing; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce **Relative Ray Encoding**, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for **Absolute Orientation Encoding**, enabling full con

410

strong_bridge

medium

UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; robotics_mapping; generation_editing; data_benchmark

Gaussian/radiance representation linked to pose/mapping/metric bridge

abstract

Recent progress in 3D reconstruction, such as NeRFs and 3D Gaussian Splatting, has made it easy to recover geometry and appearance from images. However, these static representations remain blind to the physics that govern how objects deform and respond to forces. Building interactive 3D worlds therefore requires predicting not only shape but the underlying material properties. Prior approaches either rely on slow test-time optimization or, more recently, a fast feed-forward predictor such as Pixie. However, these models produce only a single point estimate of physical parameters and are limited to a single simulation backend, restricting both expressiveness and portability. We introduce UniPixie, a generative physics-from-pixels framework that overcomes both limitations. UniPixie predicts a controllable, continuous soft-to-stiff distribution of plausible material properties from a single

411

strong_bridge

medium

WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery

Remote Sensing & Earth / Remote Sensing

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Aligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth–based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task—which lacks suitable benchmarks—we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivin

412

strong_bridge

medium

HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

gaussian_radiance; dynamic_4d; robotics_mapping; generation_editing; data_benchmark

Gaussian/radiance representation linked to pose/mapping/metric bridge

abstract

Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce \textbf{HorizonForge}, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose \textbf{HorizonSuite}, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian Splatting delivers substantially higher fidelity than a

413

strong_bridge

medium

GEM: Generating LiDAR World Model via Deformable Mamba

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose **GEM**: a **G**enerative LiDAR world model that leverages d**E**formable **M**amba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separ

414

strong_bridge

medium

An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Panoptic occupancy prediction aims to jointly infer voxel-wise semantics and instance identities within a unified 3D scene representation. Nevertheless, progress in this field remains constrained by the absence of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets. Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. Specifically, we introduce ADMesh, the first unified 3D mesh library tailored for autonomous driving, which integrates over 15K high-quality 3D models with diverse textures and rich semantic

415

strong_bridge

medium

Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; gaussian_radiance; robotics_mapping; data_benchmark

Gaussian/radiance representation linked to pose/mapping/metric bridge

abstract

Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning

416

strong_bridge

medium

Multi-Scale Gaussian-Language Map for Embodied Navigation and Reasoning

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mapping

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Understanding the geometric and semantic structure of environments is essential for embodied agents. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics,and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance and region level concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian par

417

strong_bridge

medium

Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras

Computational Imaging / Computational Imaging

D. adjacent but useful context

pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Conventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further intro

418

strong_bridge

medium

DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Although multimodal large language models (MLLMs) have shown remarkable capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs within a unified framework remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework for autonomous driving, performing spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through joint optimization. We term it 4D MLLM as it outputs both 3D occupancy and flow, capturing fine-grained spatial-temporal dynamics. Specifically, to capture both precise geometric information and rich appearance, our approach integrates point clouds, multi-view images and language instructions within a single MLLM architecture. Remarkably, desp

419

strong_bridge

medium

OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, indu

420

strong_bridge

medium

Spatial Retrieval Augmented Autonomous Driving

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Existing autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under limited view scope, occlusion or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this "recall" ability, we propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. These images are easy to obtain from offline caches (e.g, Google Maps or stored autonomous driving datasets) without requiring additional sensors, making it a plug-and-play extension for existing AD stacks.For experiments, we first extend the nuScenes dataset with geographic images retrieved via Google Maps APIs and align the new data with ego-vehicle trajector

421

strong_bridge

medium

LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d; robotics_mapping

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Accurate metric depth is critical for autonomous driving perception and simulation, yet current approaches struggle to achieve high metric accuracy, multi-view and temporal consistency, and cross-domain generalization.To address these challenges, we present MVS-Pro, a novel multi-view stereo framework that reconciles these competing objectives through two key insights: (1) Sparse but metrically accurate LiDAR observations can serve as geometric prompts to anchor depth estimation in absolute scale, and (2) deep fusion of diverse cues is essential for resolving ambiguities and enhancing robustness, while a spatio-temporal decoder ensures consistency across frames.Built upon these principles, MVS-Pro embeds the LiDAR prompt in two ways: as a hard geometric prior anchoring the cost volume, and as soft feature-wise guidance fused by a triple cues combiner.As for temporal consistency, MVS-Pro

422

strong_bridge

medium

Scene Reconstruction as Mapping Priors for 3D Detection

Detection & Tracking / Detection

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

In autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks like 3D object detection. Maps can provide robust structural priors of the static environment, suited to resolving ambiguities and correcting for sensor data sparsity or noise — issues especially prevalent for distant objects or during adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for achieving efficient, large-scale deployment. In this paper, we propose a scalable solution to systemically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Prior Augmented 3D det

423

strong_bridge

medium

URScenes: A Multi-scenario Dataset for Unstructured Road Environments

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

As autonomous driving technology transitions from small-scale validation to large-scale deployment, its development in unstructured road environments has become a critical and inevitable trend. Autonomous vehicles increasingly rely on high-quality and diverse datasets for perception systems. However, existing public datasets predominantly focus on clear-weather and urban-road scenarios, leaving a significant gap in the coverage of unstructured road environments. To bridge this gap, we construct URScenes, the first multi-scenario, open-source perception dataset for unstructured road environments. The dataset consists of 472 scenes, each lasting 30 seconds, and provides over 28K annotated samples and 119K sweeps. URScenes, for the first time, covers eight typical scenarios, including rainy, snowy, foggy, dusty, glare, night, cloudy, and sunny conditions. Additionally, URScenes supports mul

424

strong_bridge

medium

QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving.Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels.Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability.We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames.The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data.To enable long-range supervision and reasoning under constant memory, we intr

425

strong_bridge

medium

UniGeoRS: A Unified Benchmark for Tri-view Geo-Localization

Remote Sensing & Earth / Remote Sensing

D. adjacent but useful context

pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Cross-view geo-localization (CVGL) aims to estimate an image’s geographic location by matching it with geo-referenced images from different viewpoints, supporting applications such as autonomous driving, UAV navigation, and visual surveillance. However, due to the high cost of image collection, current CVGL datasets often suffer from limited diversity in both drone and ground imagery, which constrains model generalization. Furthermore, existing methods primarily focus on either ground-to-satellite or drone-to-satellite matching, lacking a unified framework capable of handling image matching across all three platforms: satellite, drone, and ground. To this end, we introduce the Unified Geo-localization dataset with Real-world and Synthetic imagery (UniGeoRS), a comprehensive benchmark featuring satellite, drone, and ground-view images, with a particular emphasis on the richness and divers

426

strong_bridge

medium

NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single, unified framework.Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-t

427

strong_bridge

medium

OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Offline vectorized maps constitute critical infrastructure for high-precision autonomous driving and mapping services. Existing approaches rely predominantly on single ego-vehicle trajectories, which fundamentally suffer from viewpoint insufficiency: while memory-based methods extend observation time by aggregating ego-trajectory frames, they lack the spatial diversity needed to reveal occluded regions. Incorporating views from surrounding vehicles offers complementary perspectives, yet naive fusion introduces three key challenges: computational cost from large candidate pools, redundancy from near-collinear viewpoints, and noise from pose errors and occlusion artifacts.We present OptiMVMap, which reformulates multi-vehicle mapping as a select-then-fuse problem to address these challenges systematically. An Optimal Vehicle Selection (OVS) module strategically identifies a compact subset

428

strong_bridge

medium

Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

3D occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to lack of pixel-level accurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D$^2$-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R$^2$-EFormer) that adaptively allocate

429

strong_bridge

medium

Complementary Prototype Mapping for Efficient Multimodal Anomaly Detection

Detection & Tracking / Detection

D. adjacent but useful context

pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Multimodal unsupervised anomaly detection has garnered increasing attention for robust defect localization.Recent approaches rely on establishing cross-modal matching relationships under normal conditions without explicit guidance.However, in practice, a single modality may have multiple distinct representations corresponding to another modality, and such unconditional mappings struggle to adaptively capture these variations, resulting in mapping ambiguity and the misclassification of diverse yet normal variations as anomalies.Moreover, existing methods suffer from slow inference speed and high memory overhead, hindering their deployment in real-world production lines.To address these issues, we propose an efficient and effective Complementary Prototype Mapping (\textbf{CPMAD}) framework, which dynamically extracts consensus and supplementary prototypes to serve as complementary prior

430

strong_bridge

medium

PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence

Recognition & Classification / Retrieval

D. adjacent but useful context

pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Cross-view geo-localization is a critical task for UAV navigation, event detection, and aerial surveying, which establish correspondence between drone-captured and satellite imagery. Most existing approaches embed cross-view data into a joint feature space to maximize similarity between paired images. However, these methods typically assume perfect alignment of image pairs in training data, an assumption that rarely holds in practical scenarios. In real-world conditions, factors such as urban canyon effects, electromagnetic interference, and adverse weather frequently induce GPS drift, resulting in systematic **alignment shifts** where only partial correspondences exist between image pairs. Despite its prevalence, this source of noisy correspondence has received limited attention in current research.To our best knowledge, this work presents the first systematic investigation of the **Noi

431

strong_bridge

medium

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensor data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that expert specialization enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. First, we introduce Drive-π0, a Vision-Language-Action (VLA) baseline adapted from Embodied AI for autonomous driving, which serves as the foundation model for DriveMoE. Building on this, we strengthen perception through a carefully designed Vision MoE, where a router adaptively selects context-relevant camera views. This mechanism is inspired by human driving cognition, in which

432

strong_bridge

medium

Think Before You Drive: World Model-Inspired Multimodal Grounding

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

pose_calibration_localization; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods in AD struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we presen

433

strong_bridge

medium

NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks

Learning Algorithms / Optimization

D. adjacent but useful context

depth_correspondence; surface_occupancy; robotics_mapping

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Partially Supervised Multi-Task Learning (PS-MTL) aims to leverage knowledge across tasks when annotations are incomplete. Existing approaches, however, have largely focused on the simpler setting of homogeneous, dense prediction tasks, leaving the more realistic challenge of learning from structurally diverse tasks unexplored. To this end, we introduce NexusFlow, a novel, lightweight, and plug-and-play framework effective in both settings. NexusFlow introduces a set of surrogate networks with invertible coupling layers to align the latent feature distributions of tasks, creating a unified representation that enables effective knowledge transfer. The coupling layers are bijective, preserving information while mapping features into a shared canonical space. This invertibility avoids representational collapse and enables alignment across structurally different tasks without reducing expres

434

strong_bridge

medium

ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

depth_correspondence; surface_occupancy; robotics_mapping

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it

435

strong_bridge

medium

Lipschitz Optimization for Formal Verification of Homographies

Robustness & Safety / Safety

D. adjacent but useful context

pose_calibration_localization; robotics_mapping; data_benchmark

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, aerospace, and autonomous vehicles. However, current approaches are confined to incomplete statistical verification, or robustness to $\ell_p$-norm or affine transforms which represent a limited subset of perturbations to the image formation process.In this paper, we present a formal verification approach when the capturing camera undergoes 3D motion perturbations. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. While our formulae are grounded in the vision-based landing problem, they gene

436

strong_bridge

medium

WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

pose_calibration_localization; robotics_mapping

system bridge signal: pose/localization/mapping/world-model plus reconstruction representation

abstract

Collaborative perception is vital for autonomous driving yet remains constrained by tight communication budgets. Earlier work reduced bandwidth by compressing full feature maps with fixed-rate encoders, which adapts poorly to a changing environment, and it further evolved into spatial selection methods that improve efficiency by focusing on salient regions, but this object-centric approach often sacrifices global context, weakening holistic scene understanding. To overcome these limitations, we introduce WhisperNet, a bandwidth-aware framework that proposes a novel, receiver-centric paradigm for global coordination across agents. Senders generate lightweight saliency metadata, while the receiver formulates a global request plan that dynamically budgets feature contributions across agents and features, retrieving only the most informative features. A collaborative feature routing module t

437

adjacent_context

low

AVGGT: Rethinking Global Attention for Accelerating VGGT

Learning Algorithms / Efficient Models

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy

adjacent genus=Efficient Models; useful only if manually connected to reconstruction

abstract

Since DUSt3R, models such as VGGT and $\pi^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $\pi^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by sub

438

adjacent_context

low

CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation

Generative Models / Video Generation

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: By encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional con

439

adjacent_context

low

Group Editing: Edit Multiple Images in One Go

Generative Models / Image Editing

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

adjacent genus=Image Editing with no direct reconstruction/SLAM/map signal in title

abstract

In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two t

440

adjacent_context

low

MuM: Multi-View Masked Image Modeling for 3D Vision

Learning Algorithms / Self-supervised

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence

adjacent genus=Self-supervised with no direct reconstruction/SLAM/map signal in title

abstract

Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensive

441

adjacent_context

low

VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

Segmentation & Dense Prediction / Segmentation

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

adjacent genus=Segmentation; useful only if manually connected to reconstruction

abstract

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in t

442

adjacent_context

medium

Any Resolution Any Geometry: From Multi-View To Multi-Patch

Robustness & Safety / Robustness

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. We address this challenge by adapting the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth--normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sa

443

adjacent_context

low

MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation

Multimodal & Language / Grounding

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; surface_occupancy; data_benchmark

adjacent genus=Grounding; useful only if manually connected to reconstruction

abstract

Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier,

444

adjacent_context

low

Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis

Recognition & Classification / Retrieval

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence

adjacent genus=Retrieval; useful only if manually connected to reconstruction

abstract

Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform geo-spatial tasks, CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effect

445

adjacent_context

low

Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers

Learning Algorithms / Efficient Models

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; depth_correspondence; data_benchmark

adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title

abstract

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, $\pi^3$ and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method acc

446

adjacent_context

low

G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Multimodal & Language / VLM / MLLM

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; surface_occupancy; generation_editing

adjacent genus=VLM / MLLM; useful only if manually connected to reconstruction

abstract

Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations.Experim

447

adjacent_context

low

LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

Learning Algorithms / Efficient Models

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; data_benchmark

adjacent genus=Efficient Models; useful only if manually connected to reconstruction

abstract

3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However it is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10× speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: 1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; 2) token similarity acroses adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging . We analyze each token’s geometric importance, optimizing anchor token selection to better pr

448

adjacent_context

low

Rewis3d: Reconstruction for Weakly-Supervised Semantic Segmentation

Segmentation & Dense Prediction / Segmentation

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; surface_occupancy

adjacent genus=Segmentation; useful only if manually connected to reconstruction

abstract

We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student–teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geo

449

adjacent_context

medium

Sky2Ground: A Benchmark for Site Modeling under Varying Altitude

Remote Sensing & Earth / Remote Sensing

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; depth_correspondence; surface_occupancy; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

In this work, we propose the problem of localizing cameras and producing renders of a scene, given multiple images captured from ground/aerial/satellite viewpoints. We introduce a dataset called Sky2Ground, which contains synthetic/real images across all 3 viewpoints, along with camera parameters, and dense depth-maps/surface-normals. Recent works have shown that transformer-based nets like VGGT are capable of inferring scene-parameters in a single-forward pass. However, we formally reveal that simply fine-tuning such models reduces performance, and can't be solved simply by bruteforce-scaling. We find the culprit to be satellite images, which inject too much noise during the learning process. Therefore, we propose SkyNet to enable learning using satellite-images. SkyNet is a two-stream neural-net, with one stream explicitly processing satellite, and another processing all modalities tog

450

adjacent_context

medium

3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds

3D Vision & Geometry / Point Cloud

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; surface_occupancy; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds from unlabeled videos. We first introduce \data, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D

451

adjacent_context

low

GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation

Segmentation & Dense Prediction / Segmentation

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction; surface_occupancy

adjacent genus=Segmentation with no direct reconstruction/SLAM/map signal in title

abstract

We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation that casts the task as multi-view 2D mask prediction. Given a textureless object, we render normal and point maps from predefined viewpoints and accept simple 2D prompts—clicks or boxes—to guide part selection. These prompts are processed by a shared SAM2 backbone augmented with LoRA and residual geometry fusion, enabling view-specific reasoning while preserving pretrained priors. The predicted masks are back-projected to the object, aggregated across views.Our method enables fine-grained, part-specific control without requiring text prompts, per-shape optimization, or full 3D labels. In contrast to global clustering or scale-based methods, prompts are explicit, spatially grounded, and interpretable. We achieve state-of-the-art class-agnostic performance on PartObjaverse-Tiny and PartNetE, outperforming both s

452

adjacent_context

medium

SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

Robotics & Embodied AI / Embodied AI

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; dynamic_4d; robotics_mapping

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Vision–Language–Action (VLA) models built on pretrained Vision–Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that incrementally extracts 4D features from 2D images. Then, to enhance the VLM’s ability to exploit both 2D images and 4D features, we introduce \textit{Fusion Tokens}, a set of learnable tokens

453

adjacent_context

low

Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers

Learning Algorithms / Efficient Models

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; general_reconstruction

adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title

abstract

We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me employs a light-weight distilled confidence predictor to rank tokens and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.

454

adjacent_context

low

Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models

Learning Algorithms / Efficient Models

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; data_benchmark

adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title

abstract

With the emergence of 3D foundation models, such as DUSt3R, VGGT, and their variants, there is a growing interest in fine-tuning them for various downstream tasks, where using LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in geometry, texture, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA sub-spaces associated with each type of variation? 2) Are these sub-spaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA sub-space associated with each type of variation. We show that these sub-spaces are approximately disentangled. Integrating them leads to a reduced LoRA

455

adjacent_context

low

Towards Hierarchical 3D Spatial Understanding in Vision-Language Models

Multimodal & Language / VLM / MLLM

A. thesis anchor: VGGT/feed-forward geometry

vggt_lineage; data_benchmark

adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title

abstract

Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex stages, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that generates over 1 billion 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised finetuning. We also develop an RGB-D VLM that incorporates metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoni

456

adjacent_context

medium

Captain Safari: A Real-time World Engine

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers.To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dyn

457

adjacent_context

medium

ESAM++: Efficient Online 3D Perception on the Edge

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and gen- eralized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently capture

458

adjacent_context

medium

Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization pattern

459

adjacent_context

medium

Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

One of the most exciting applications of vision models involve pixel-level reasoning.Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level.Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction.We present a framework that learns pixel-accurate feature descriptors from videos, LILA.The core element of our training framework is linear in-context learning.LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks.Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner.We dem

460

adjacent_context

medium

GeoWorld: Geometric World Models

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive expe

461

adjacent_context

medium

Order Matters: 3D Shape Generation from Sequential VR Sketches

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD software. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for 3D shape generation from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates ordered VR sketches from arbitrary shapes, (ii) a dataset comprising over 20k synthetic and 900 hand-drawn sketch–shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work and generalizes effectively from synthetic to real sketches with minimal supervision. All data and models will

462

adjacent_context

medium

RenderFlow: Single-Step Neural Rendering via Flow Matching

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Conventional physically-based rendering (PBR) pipelines generate photorealistic images through computationally expensive light transport simulations. Although recent deep learning approaches leverage diffusion model priors with geometry buffers (G-buffers) to produce visually compelling results without explicit scene geometry or light simulation, they remain constrained by two major limitations. First, the iterative nature of the diffusion process introduces substantial latency. Second, the inherent stochasticity of these generative models compromises physical accuracy and temporal consistency. In response to these challenges, we propose a novel, end-to-end, deterministic single-step neural rendering framework \textit{RenderFlow} built upon a flow matching paradigm. To further strengthen both rendering quality and generalization, we propose an efficient and effective module for sparse ke

463

adjacent_context

medium

Spatial Matters: Position-Guided 3D Referring Expression Segmentation

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

3D Referring Expression segmentation (3D-RES) is an emerging field that segments 3D objects in point cloud scenes based on given referring expressions. Although existing methods have achieved substantial progress, they primarily focus on semantic cues and often overlook spatial relations, which are essential for segmenting the referred objects in complex 3D scenes, especially those containing multiple visually similar instances. In this paper, we propose Position3D, a novel approach that explicitly incorporates spatial relation modeling into 3D-RES. Specifically, we introduce a spatial-aware query generation module that constructs point proxies by aggregating local context and incorporating spatial relations, from which the most text-relevant are selected as queries. Furthermore, we design a position-guided deformable attention in the decoder, which progressively refines attention to con

464

adjacent_context

medium

SRGCD: Stability-Driven Region Growth Framework for 3D Change Detection

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

With the growing accessibility of large-scale 3D point clouds from LiDAR and photogrammetric techniques, 3D change detection (3DCD) has become essential for understanding dynamic scenes. Existing methods typically formulate this as segmentation, treating each point independently for binary classification. This leads to isolated misclassified noise points inside regions. Meanwhile, feature similarity at boundaries causes boundary ambiguity. The more severe class imbalance inherent to change detection further exacerbates this issue. To address these challenges, we propose SRGCD, a Stability-Driven Region Growth Framework that redefines 3DCD as region growing rather than segmentation. Our key insight is that progressively expanding from highly confident seeds avoids pitfalls of point-wise classification while elegantly alleviating class imbalance. Specifically, we first apply strict constra

465

adjacent_context

medium

StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone.To address this challenge, we present **StereoWorld**, an **end-to-end framework** that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a **geometry-aware regularization** to ensure 3D structural fidelity.A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis.To enable large-scale training and evaluation, we curate a **high-definition stereo video dataset** containing over 11M frames aligned to natural human interpupillary distance (IPD).Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generat

466

adjacent_context

medium

TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Current football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the acitons of all 11 players in one team in the given broadcast footage of Premier League matches under a single broadcast view. Under a offensive or defensive broadcast footage, TacSIm projects the beginning positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactics style imitation is measured by using spatial occupancy similarity and movement vector similarity in defined time, supporting the evaluation of spatial and temporal similarities for on

467

adjacent_context

medium

Tokenizing Vector Animation for Autoregresive Generation

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides , 3D meshes , LEGO sequences , and indoor layouts , suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometr

468

adjacent_context

medium

Towards Visual Query Localization in the 3D World

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Visual query localization (VQL) aims to predict a spatial-temporal response of the most recent occurrence from a sequence given a query. Currently, most research focuses on visual query localization from 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities including point clouds (PC), RGB and depth images to support flexible research. To ensure high-quality annotation, each sequence is manually annotated with multiple rounds of verification and refinement. To our best knowledge, 3DVQL is the first benchmark towards 3D multimodal visual query loc

469

adjacent_context

medium

VABench: A Comprehensive Benchmark for Audio-Video Generation

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization,

470

adjacent_context

medium

VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Though rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployments are challenged by slow inference speeds. Previous acceleration methods rely on caching and reusing, neglecting the growing mismatch between static cached values and evolving input, leading to reduced generated content fidelity.This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating.VDE periodically anchors the model’s state with a full forward pass and estimates subsequent outputs analytically. VDE first decomposes the model’s velocity output into components parallel and orthogonal to the input, then exploiting the temporal predictability of the components' coefficients and the consistency of the orthogonal direction for precise, input-ad

471

adjacent_context

medium

Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet truly dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies with high fidelity, while sidestepping the quadratic overhead of full attention and reducing network total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally c

472

adjacent_context

medium

MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: dynamic/4D recon

general_reconstruction; dynamic_4d; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing

473

adjacent_context

medium

Dynamic-Static Decomposition for Novel View Synthesis of Dynamic Scenes with Spiking Neurons

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Novel view synthesis for dynamic scenes remains challenging due to complex motion variations.Recent methods represent dynamic and static regions with separate Gaussians to improve efficiency and accuracy, but inaccurate assignment of static and dynamic Gaussian primitive still limits performance.We identify two key issues, namely inaccurate mask priors and improper tag representations, which lead to boundary artifacts, loss of fine-grained motion details, and overfitting on input views, resulting in degraded side-view synthesis.To address these problems, we propose a spatio-temporally fine-grained mask field and a discontinuous dynamic–static tagging field to achieve accurate assignment of dynamic and static Gaussian primitives, enabling high-quality novel view synthesis, especially in fine-grained motions, motion boundary regions, and side viewpoints.Experiments show that our method ac

474

adjacent_context

medium

Feed-forward Gaussian Registration for Head Avatar Creation and Editing

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; generation_editing

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatars require time-consuming head tracking, which is followed by an expensive avatar optimization, often resulting in a total creation time that exceeds one day. MATCH instead directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in 0.5 seconds per frame. While the learned intra-subject correspondence across frames allows us to quickly build personalized head avatars, correspondence across subjects enables various applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We learn to establish such correspondences end-to-end, with a transformer-based model that predicts textures of Gaussian

475

adjacent_context

medium

FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

We present a novel formulation for mesh-free, reduced-order simulation of deformable hyperelastic objects. Existing work in reduced-order elastodynamic simulation represents the input geometry by either meshes, which can be difficult to obtain due to challenges in scanning and triangulating complex shapes, or by neural fields that require per-shape optimization. We propose to adopt a Reproducing Kernel Particle Method (RKPM) representation, which enables the construction of reduced-order skinning weights by solving a generalized eigensystem on the Hessian matrix of the elastic energy. We demonstrate that this formulation not only leads to a 40$\times$ training speedup compared with the per-shape optimization of neural fields, but also achieves lower simulation error when evaluated against the converged results of finite element method. We show our simulation results on a wide variety of

476

adjacent_context

medium

LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; robotics_mapping; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain impractically slow, memory-intensive, and overly complex due to iterative optimization and dense feature assignments for every Gaussian. To address these limitations, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantics only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. To further streamline inference and ensure semantic consistency, we cluster Gaussians in a single step by linking geometrically and semantically related masks in 3D. In evaluation, we assess our method on diverse benchm

477

adjacent_context

medium

MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. Existing single-view approaches often ignore cross-view relationships, leading to inconsistent results, while multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallicity, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models traine

478

adjacent_context

medium

OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation

3D Vision & Geometry / Pose Estimation

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Estimating the 3D pose of unseen objects from a single image remains a fundamental yet challenging problem in computer vision, especially under a CAD model-free setting.Pioneering attempts address this issue by matching templates generated through Novel View Synthesis (NVS), which essentially aims to learn the geometric transformation from a reference to a target view. While promising, these methods can only approximate this transformation under pixel-level supervision, as the starting orientation remains undefined. In the absence of explicit geometric constraints to verify the correctness of the predicted transformation, existing methods often synthesize novel views with geometry-distorted structures or severely blurred local textures, leading to unreliable template matching and suboptimal pose estimation results. To this end, we propose OrienPose, a novel object pose estimation framew

479

adjacent_context

medium

Photo3D: Advancing Photorealistic 3D Generation through Structure‑Aligned Detail Enhancement

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; generation_editing; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Although recent 3D‑native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich surface details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non‑rigid motions of objects, and the limited precision of scanners.We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT‑4o‑Image model.Considering that the generated images can distort 3D structures due to their lack of multi‑view consistency, we design a structure‑aligned multi‑view synthesis pipeline and construct a detail‑enhanced multi‑view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages percep

480

adjacent_context

medium

PhysHead: Simulation-Ready Gaussian Head Avatars

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Realistic digital avatars require expressive and dynamic hair motion, yet most existing head avatar methods assume rigid hair movement.These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods

481

adjacent_context

medium

PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality wit

482

adjacent_context

medium

Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

We present _Relightable Holoported Characters_ (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse h

483

adjacent_context

medium

REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Recent generative models have shown strong performance in generating diverse 3D assets from 2D images, a fundamental research topic in computer vision and graphics. However, these models still struggle to generate voluminous 3D assets when the input is a flat image that provides limited 3D cues. We introduce REVIVE 3D, a two-stage, plug-and-play pipeline for generating voluminous 3D assets from flat images. In Stage 1, we construct an Inflated Prior by inflating the foreground silhouette to recover global volume and superimposing part-aware details to capture local structure. In Stage 2, 3D Latent Refinement injects Gaussian noise into the Inflated Prior's latent and then denoises it, guided by the prior's geometric cues and the backbone's pretrained 3D knowledge. By initializing the process with the encoded latent of a source mesh instead of the prior, the framework also supports 3D edi

484

adjacent_context

medium

Scaling View Synthesis Transformers

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Recently, geometry-free view synthesis transformers have achieved state-of-the-art results in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. However, the specific factors that govern how their performance scales with compute remain poorly understood. In this work, we conduct a rigorous analysis of the scaling laws for view synthesis transformers and elucidate a series of design choices for training compute-optimal NVS models. Most significantly, we find that an encoder–decoder architecture, which was previously found to be less scalable, can in fact be compute-optimal. We attribute the previously inferior performance of previous encoder–decoder methods to certain architectural choices and inconsistent training compute across comparisons. Across several compute levels, we demonstrate that our encoder–decoder architecture, which we

485

adjacent_context

medium

Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Video Novel View Synthesis (VNVS) aims to render arbitrary novel viewpoints of dynamic scenes from a single-view video, but its algorithmic training faces a major challenge: the lack of large-scale multi-view video datasets. Prior methods often train on monocular data by framing it as an inpainting task, which typically leads to a train-inference gap and visual artifacts. While synthetic multi-view data can partially alleviate the data scarcity issue, its high acquisition costs and limited diversity restrict scalability. To address these problems, we propose Scaling4D, a novel strategy that theoretically avoids the train-inference gap while leveraging large-scale monocular videos for training. Specifically, we take a higher-level perspective on the problem, reformulating VNVS into a general correspondence-guided generation task. Furthermore, in conjunction with extensive real-world data,

486

adjacent_context

medium

Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised globa

487

adjacent_context

medium

ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

We introduce ELITE, an Efficient Gaussian head avatar synthesis from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic

488

adjacent_context

medium

DiffBMP: Differentiable Rendering with Bitmap Primitives

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

We introduce **DiffBMP**, a scalable and efficient differentiable rendering engine for a collection of bitmap images. Our work addresses a limitation that traditional differentiable renderers are constrained to vector graphics, given that most images in the world are bitmaps. Our core contribution is a highly parallelized rendering pipeline, featuring a custom CUDA implementation for calculating gradients. This system can, for example, optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives all in under 1 min using a consumer GPU. We employ and validate several techniques to facilitate the optimization: soft rasterization via Gaussian blur, structure-aware initialization, noisy canvas, and specialized losses/heuristics for videos or spatially constrained images. We demonstrate DiffBMP is not just an isolated tool, but a practical one designed to integ

489

adjacent_context

medium

WonderZoom: Multi-Scale 3D World Generation

3D Vision & Geometry / 3D Reconstruction

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; surface_occupancy

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

We present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to ``zoom into'' a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly

490

adjacent_context

medium

Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

3D Vision & Geometry / 3D Gaussian Splatting

A. thesis anchor: representation shift

general_reconstruction; gaussian_radiance; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Generating large-scale 3D head avatars of non-existent identities with high-fidelity and strong multi-view consistency (MVC) is essential for applications such as synthetic crowds, digital twins, and large asset libraries. For high scalability, avatars must be generated from minimal resources, without costly MV studio captures or any 3D data. In this work, we target this challenging minimal-resource setting for 3D head generation. Second, we argue that the common strategy of enforcing MVC via intermediate MV image generation is both expensive and fundamentally fragile. Instead, we analyze how MVC can be induced by design, showing that intermediate view synthesis is unnecessary. To this end, we introduce MVCHead — a fast, single-shot state space model that directly predicts Gaussians, without intermediate generation. At its core, we propose a Hierarchical State Space (HiSS) block that enf

491

adjacent_context

medium

Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual–Inertial Odometry

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Visual-Inertial Odometry (VIO) is a critical component for robust ego-motion estimation, enabling foundational capabilities such as autonomous navigation in robotics and real-time 6-DoF tracking for augmented reality.Existing methods face a well-known trade-off: filter-based approaches are efficient but prone to drift, while optimization-based methods, though accurate, rely on computationally prohibitive Visual-Inertial Bundle Adjustment (VIBA) that is difficult to run on resource-constrained platforms.Rather than removing VIBA altogether, we aim to reduce how often and how heavily it must be invoked. To this end, we cast two key design choices in modern VIO, when to run the visual frontend and how strongly to trust its output, as sequential decision problems, and solve them with lightweight reinforcement learning (RL) agents. Our framework introduces a lightweight, dual-pronged RL polic

492

adjacent_context

medium

Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

We address the problem of making 3D scenes interactive by asking: what would objects feel like if touched in a virtual environment? State-of-the-art 3D rendering methods provide compelling visual realism, but they fall short in modeling physical interactions, such as haptic feedback. We propose a framework that learns the correspondence between user actions and tactile responses, enabling the generation of touch-based signals directly from simulated interactions in 3D scenes. Our approach leverages a neural field representation conditioned on geometry and action to synthesize material-specific tactile signals. Experiments show that the generated signals reliably convey material properties and interaction dynamics. This paves the way toward interactive, touch-aware virtual environments with realistic haptic feedback.

493

adjacent_context

medium

MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must satisfy both stylistic diversity, fine-grained, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language–driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorea

494

adjacent_context

medium

PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce \textbf{PhysX-Anything}, the first \textbf{simulation-ready} physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by \textbf{193$\times$}, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning a

495

adjacent_context

medium

SAGE: Scalable Agentic 3D Scene Generation for Embodied AI

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simula

496

adjacent_context

medium

SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Object pose estimation is a key task for embodied robots, enabling them to interact with objects effectively. Category-level object pose estimation provides a way for robots to estimate the pose of unknown objects. However, estimating object pose from point clouds alone remains challenging. In this paper, we introduce SEGPose, a novel category-level object pose estimation method based on point clouds. Unlike previous methods, SEGPose leverages geometric, topological information, and SE(3)-equivariance, enhancing the network's accuracy in pose prediction. To utilize geometric and topological features, we propose a constraint-based feature extraction and 3D reconstruction method, enabling effective object shape reconstruction. We also design an SE(3)-equivariance feature prediction network to handle pose transformations consistently across viewpoints, improving pose accuracy. Experimental

497

adjacent_context

medium

Volumetric Functional Maps

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

The computation of volumetric correspondences between 3D shapes is a prominent tool for medical and industrial applications. In this work, we pave the way for spectral volume mapping, extending for the first time the functional maps framework from the surface to the volumetric setting. We show that the eigenfunctions of the volumetric Laplace operator define a functional space that is suitable for high-quality signal transfer. We also experiment with various techniques that edit this functional space, porting them to volume domains. We validate our method on novel volumetric datasets and on tetrahedralizations of well-established surface datasets, also showcasing practical applications involving both discrete and continuous signal mapping, for segmentation transfer, mesh connectivity transfer, and solid texturing. Last but not least, we show that considering the volumetric spectrum great

498

adjacent_context

medium

Deep Feature Deformation Weights

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping; generation_editing

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Handle-based mesh deformation has been a long-standing paradigm in computer graphics, enabling intuitive shape edits from sparse controls. Classic techniques offer precise and rapid deformation control. However, they solve an optimization problem with constraints defined by the choice of control handles, requiring a user to know apriori the ideal distribution of handles on the shape to accomplish the desired edit. The mapping from handle set to deformation behavior is often unintuitive and, importantly, non-semantic. Modern data-driven methods, on the other hand, leverage the data prior to obtain semantic edits, at the cost of fine-grained control and speed. We propose a technique that achieves the best of both worlds by leveraging the semantic prior of data and the precise control and speed of traditional frameworks. Our approach is surprisingly simple yet effective: deep feature proxim

499

adjacent_context

medium

HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping; generation_editing

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

The 3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires extensive and tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for automatic 3D scene synthesis. We present HOG-Layout that enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with ex

500

adjacent_context

medium

SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer t

501

adjacent_context

medium

UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based models in the second stage to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we bypass the limitations of UV mapping by introducing a Large Texturing Model (LTM) that directly regresses textures in a unified 3D functional space. Moreover, to enable more effective and complete supervision of LTM, we propose to extend surface-defined textures into a continuous volumetric field to serve as an advanced training objective, which we refer to as Texture Functions (TF). Finally, we develop an advanced LoRA-based strategy for efficiently adapting large-scale 2D Diffusion Transformers (DiTs) for high-qual

502

adjacent_context

medium

Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Image-to-point-cloud (I2P) registration aims to align 2D images with 3D point clouds by establishing reliable 2D-3D correspondences. The drastic modality gap between images and point clouds makes it challenging to learn features that are both discriminative and generalizable, leading to severe performance drops in unseen scenarios. We address this challenge by introducing a heterogeneous graph framework that jointly refines cross-modal features and correspondences within a unified architecture. The proposed graph represents a mapping between segmented 2D and 3D regions, which enhances cross-modal feature interaction and thus improves feature discriminability. In addition, modeling the consistency among vertices and edges within the graph enables pruning of unreliable correspondences. Building on these insights, we propose a heterogeneous graph embedded I2P registration method, termed Hg-

503

adjacent_context

medium

Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events

3D Vision & Geometry / 3D Gaussian Splatting

B. bridge: reconstruction becomes mapping/world model

gaussian_radiance; dynamic_4d; robotics_mapping

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We utilize NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above NeRF-rendered HDR pixel values with the sensor-recorded LD

504

adjacent_context

medium

Lifting Unlabeled Internet-scale Data for 3D Scene Understanding

3D Vision & Geometry / 3D Reconstruction

B. bridge: reconstruction becomes mapping/world model

general_reconstruction; surface_occupancy; robotics_mapping

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We systematically identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Lanugage Navigation (VLN). Models trained on our generated data demonst

505

adjacent_context

medium

CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

pose_calibration_localization; depth_correspondence; robotics_mapping

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

We present Consistent–Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework that learns feature flow for robust cross-modal registration. CRFT learns a modality-consistent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coheren

506

adjacent_context

medium

RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

pose_calibration_localization; robotics_mapping; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Estimating robot pose from a monocular RGB image is a challenge in robotics and computer vision. Existing methods typically build networks on top of 2D visual backbones and depend heavily on labeled data for training, which is often scarce in real-world scenarios, causing a sim-to-real gap.Moreover, these approaches reduce the 3D-based problem to 2D domain, neglecting the 3D priors. To address these, we propose Robot Topological Alignment Graph (RoboTAG), which incorporates a 3D branch to inject 3D priors while enabling co-evolution of the 2D and 3D representations, alleviating the reliance on labels. Specifically, the RoboTAG consists of a 3D branch and a 2D branch, where nodes represent the states of the camera and robot system, and edges capture the dependencies between these variables or denote alignments between them. Closed loops are then defined in the graph, on which a consistenc

507

adjacent_context

medium

DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces

3D Vision & Geometry / Pose Estimation

B. bridge: reconstruction becomes mapping/world model

pose_calibration_localization; robotics_mapping; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Articulated object pose estimation is a core task in embodied AI and computer vision. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) failing to incorporate intrinsic kinematic constraints. In this paper, we introduce DICArt (DIsCrete Diffusion for Articulated Object Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the ground-truth pose.To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a h

508

adjacent_context

medium

PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models

3D Vision & Geometry / Point Cloud

B. bridge: reconstruction becomes mapping/world model

surface_occupancy; robotics_mapping

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations.To address these limitations, we propose {\mname}, a novel feature-level alignment regularization method. {\mname} explicitly supervises intermediate point cloud representations to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By trainin

509

adjacent_context

medium

DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis

3D Vision & Geometry / Pose Estimation

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization; dynamic_4d; data_benchmark

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aw

510

adjacent_context

medium

FMPose: 3D Pose Estimation via Flow Matching

3D Vision & Geometry / Pose Estimation

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization; depth_correspondence

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses.In particular, diffusion-based models have demonstrated strong performance, but their iterative denoising process typically requires many time steps for each prediction, making inference computationally expensive.In contrast, Flow Matching (FM) learns an ODE-based velocity field, enabling efficient generation of 3D pose samples with only a few integration steps. Inspired by this capability, we propose a novel generative pose estimation framework, FMPose, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned on 2D inputs. While the ODE trajectories are dete

511

adjacent_context

medium

Landscape-Awareness for Geometric View Diffusion Model

3D Vision & Geometry / Pose Estimation

B. bridge: representation meets metric pose

gaussian_radiance; pose_calibration_localization

editorial thesis/bridge bucket but weaker direct reconstruction signal

abstract

Accuracy camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123, which synthesize novel views conditioned on relative viewpoint, and have demonstrated promising performance when repurposed for viewpoint estimation via optimization with MSE loss. However, existing methods often suffer from non-convex loss landscape with numerous local minima, which makes them sensitive to initialization and reliant on na\"ive multi-start strategies to achieve reasonable results. We analyze these optimization challenges and visualize failure cases, showing that ambiguities in object geometry, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization lands

512

adjacent_context

low

FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Recent work in 3D scene understanding has begun to shift from purely spatial analysis to the more complex challenge of functional scene understanding.However, existing methods often consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependencies that humans use to resolve ambiguity. We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relations. These candidates are converted into factor graph variables and constrained by both LLM-derived common-sense priors and geometric priors. This formulation enables joint probabilistic inference over all functional edges and their uncertainties, yielding substantially better-calibrated con

513

adjacent_context

low

FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy

3D Vision & Geometry with weak but relevant signal

abstract

Registration of multiview point clouds typically depends on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and ill-posed without holistic geometric constraints. In this paper, we propose FUSER, the first feed-forward multi-view registration transformer that processes all scans jointly in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER employs a sparse 3D CNN to encode each scan into low-resolution superpoint features preserving absolute translation cues, followed by a Geometric Alternating Attention module for efficient intra- and inter-scan reasoning. Particularly, we transfer 2D attention priors from off-the-shelf foundation models (i.e., $\pi^3$) to enhance 3D feature attention. Building upon FUSER and its estimates, we further introduce

514

adjacent_context

low

Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; surface_occupancy

3D Vision & Geometry with weak but relevant signal

abstract

Recent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to localization, and often require full finetuning for strong performance. Accurate localization is a fundamental component of 3D perception, thus bridging this gap is crucial for progressing toward true 3D foundation models that supports all downstream tasks on 3D data. In this work, we introduce PointINS, a localization-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal localization branch to jointly learn high-level semantic understanding and geometric reasoning, yield

515

adjacent_context

low

MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence

3D Vision & Geometry with weak but relevant signal

abstract

Establishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate in a pairwise manner, often producing fragmented and geometrically inconsistent tracks when their predictions are chained across views. We propose \textbf{MV-RoMa}, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient model architecture which avoids high computational cost of full cross-attention for multi-view feature interaction: (i) multi-view encoder that leverages pair-wise matching results as a geometric prior, and (ii) multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model's consistent multi-view correspon

516

adjacent_context

low

GazeShift: Unsupervised Gaze Estimation and Dataset for VR

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Gaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity—particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze—the first large-scale off-axis gaze estimation dataset for VR—comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye infrared imagery, achieving effective gaze–appear

517

adjacent_context

low

KV-Tracker: Real-Time Pose Tracking with Transformers

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence

3D Vision & Geometry with weak but relevant signal

abstract

Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $\pi^3$~\cite{wang2025pi3} with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to $15\times$ speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining.We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-f

518

adjacent_context

low

MoBind: Motion Binding for Fine-Grained IMU–Video Pose Alignment

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; dynamic_4d

3D Vision & Geometry with weak but relevant signal

abstract

We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchica

519

adjacent_context

low

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; dynamic_4d

3D Vision & Geometry with weak but relevant signal

abstract

Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs’ temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we c

520

adjacent_context

low

Zoo3D: Zero-Shot 3D Object Detection at Scene Level

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing $Zoo3D$, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. $Zoo3D$ operates in two modes: the zero-shot $Zoo3D_{0}$, which requires no training at all, and the self-supervised $Zoo3D_{1}$, which refines 3D box prediction by training a class-agnostic detector on $Zoo3D_{0}$-generated pseu

521

adjacent_context

low

TESO: Online Tracking of Essential Matrix by Stochastic Optimization

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Reliable perception of autonomous systems relies on fusion of data from multiple sensors, which requires maintaining accurate geometric calibration during operation. This work aims to track the drift of the calibration parameters caused by mechanical stress, thermal effects, or minor accidents. We focus on five parameters of the essential matrix and propose TESO, whose core mechanisms are: 1) a robust loss function based on kernel correlation over tentative correspondences instead of robust matching and estimators, 2) an adaptive online stochastic optimization on the essential manifold. Both contribute to reduced CPU and memory requirements. TESO relies on a few hyperparameters and eliminates the need for data-driven training, enabling use in resource-constrained online perception systems. We evaluated TESO based on the geometric precision of the tracked extrinsic parameters, the rectifi

522

adjacent_context

low

TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; surface_occupancy; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows.End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios.To address these issues, we propose TDATR (Table Detail-Aware Table Recognition) improves end-to-end TR through table detail-aware learning and cell-level visual alignment.TDATR adopts a “perceive-then-fuse” strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness.The model then int

523

adjacent_context

low

ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; surface_occupancy

3D Vision & Geometry with weak but relevant signal

abstract

Category-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency.To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and

524

adjacent_context

low

UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Even though industrial inspection systems should be capable of recognizing unprecedented defects, most existing approaches operate under a closed-set assumption, which prevents them from detecting novel anomalies. While the visual prompting approach provides a scalable alternative, it struggles in industrial settings where subtle inter-class differences and high intra-class variance make prompt-to-region matching ambiguous and cause prompt embeddings to collapse, limiting the effectiveness of existing methods. To address these challenges, we introduce UniSpector— a Universal Inspector for open-set defect detection and segmentation. To empower defect prompt embeddings for robust recognition of novel defects, it comprises two key components: the Spatial–Spectral Prompt Encoder (SSPE) and the Contrastive Prompt Encoder (CPE). SSPE extracts orientation-invariant frequency cues and fuses them

525

adjacent_context

low

Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Accurate uncertainty estimation is essential for reliable appearance-based gaze tracking. However, domain shifts between training and testing often lead to incorrect uncertainty estimates, which is a problem overlooked in existing uncertainty-aware gaze tracking models. To overcome this problem efficiently, we formulate uncertainty estimation as a conditional distribution problem and treat the correction process as an output-level conditional distribution matching task. We therefore introduce a data-efficient post-hoc calibration method to align the predicted, high-error conditional distribution with the empirically observed distribution extracted from a small set of calibration samples. To more faithfully assess the accuracy of the resulting uncertainty estimates, we further introduce a new metric, Coverage Probability Error (CPE), to quantify the distribution-level mismatch between pre

526

adjacent_context

low

Global-Aware Edge Prioritization for Pose Graph Initialization

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization

3D Vision & Geometry with weak but relevant signal

abstract

The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initializ

527

adjacent_context

low

Minimal Constraint Relaxation for Multiview Autocalibration

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization

3D Vision & Geometry with weak but relevant signal

abstract

Polynomial systems in multiview geometry are often highly over-constrained, and naïve subsampling or elimination can lead to unstable or inconsistent estimation. We revisit this issue through the lens of \emph{constraint relaxation}—the selective removal of equations to recover a finite and well-conditioned solution space. Focusing on the Kruppa equations for camera autocalibration, we introduce the notion of \emph{minimal relaxation}, a principled framework for identifying constraint subsets that preserve geometric validity while restoring solvability. Through symbolic analysis of the full three-view Kruppa system, we enumerate and classify all relaxation patterns, revealing algebraically minimal families that yield finite, well-conditioned problems.Comprehensive experiments validate this analysis across symbolic and numerical settings.Using homotopy continuation and synthetic perturbat

528

adjacent_context

low

Parallel Rigidity Matters for Bundle Adjustment

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization

3D Vision & Geometry with weak but relevant signal

abstract

Bundle adjustment is a long-standing problem in computer vision that solves for camera parameters and 3D point coordinates from 2D image observations. While there has been much work on various aspects, like adaptation to different camera models and sensors, and considerations for solving the optimization problem, in this paper, we deal with a fundamental and distinct aspect of the uniqueness of its solution. In particular, we examine the unique solvability of the 3D reconstruction problem using parallel rigidity theory. We design an algorithm to ensure that the topology of the bipartite graph formed by the camera-3D point relations in bundle adjustment does not result in independent scaling of the edges in its subgraphs. To tackle the generally large-sized bipartite graph, we leverage camera-camera relationships in 3D reconstruction problems for efficiency. We demonstrate the benefits of

529

adjacent_context

low

QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization

3D Vision & Geometry with weak but relevant signal

abstract

In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover $n$ cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of (4,4,4,4) independent of $n$. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly syn

530

adjacent_context

low

Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding

3D Vision & Geometry / Point Cloud

C. cluster representative

pose_calibration_localization; surface_occupancy; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

While recent Transformer and Mamba architectures have advanced point cloud representation learning, they are typically developed for single-task or single-domain settings. Directly applying them to multi-task domain generalization (DG) leads to degraded performance. Transformers effectively model global dependencies but suffer from quadratic attention cost and lack explicit structural ordering, whereas Mamba offers linear-time recurrence yet often depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions, causing structural drift and unstable sequential modeling. In this paper, we propose Structure-Aware Domain Generalization (SADG), a Mamba-based In-Context Learning framework that preserves structural hierarchy across domains and tasks. We design structure-aware serialization (SAS) that generates transformation-invariant sequences using cent

531

adjacent_context

low

Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; dynamic_4d

3D Vision & Geometry with weak but relevant signal

abstract

Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temp

532

adjacent_context

low

UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; dynamic_4d

3D Vision & Geometry with weak but relevant signal

abstract

With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform on holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve a comprehensive and multi-grained video perception. To bridge this gap, we introduce $\textbf{UFVideo}$, the first Video LLM with $\textbf{unified multi-grained cooperative understanding}$ capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct the $\textbf{UFVideo-Ben

533

adjacent_context

low

Deformation-based In-Context Learning for Point Cloud Understanding

3D Vision & Geometry / Point Cloud

C. cluster representative

pose_calibration_localization; surface_occupancy; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Recent advances in point cloud In-Context Learning (ICL) have demonstrated strong multitask capabilities. Existing approaches typically adopt a Masked Point Modeling (MPM)-based paradigm for point cloud ICL. However, MPM-based methods directly predict the target point cloud from masked tokens without leveraging geometric priors, requiring the model to infer spatial structure and geometric details solely from token-level correlations via transformers. Additionally, these methods suffer from a training–inference objective mismatch, as the model learns to predict the target point cloud using target-side information that is unavailable at inference time. To address these challenges, we propose DeformPIC, a deformation-based framework for point cloud ICL. Unlike existing approaches that rely on masked reconstruction, DeformPIC learns to deform the query point cloud under task-specific guidanc

534

adjacent_context

low

Fusion of Depth and Semantic for Probabilistic Floorplan Localization

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

3D Vision & Geometry with weak but relevant signal

abstract

Floorplan localization aims to estimate the camera pose of a query image with respect to a 2D floorplan, providing a lightweight and long-term stable alternative to localization based on 3D maps or large image databases for indoor robotics and AR. Recent methods frame the problem as ray-based matching, representing the image as a set of rays annotated with depth or semantic labels and aligning them with the floorplan. However, they still face challenges in addressing the complexity of indoor environments, which can be decomposed into environmental, geometric, and semantic ambiguities.To address these ambiguities, we propose a floorplan-aware probabilistic fusion framework that models both depth and semantic information within a unified architecture. Our framework also combines a distribution-based ray confidence estimator, which down-weights uncertain geometric hypotheses, with a probabi

535

adjacent_context

low

PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-based Structure Matching

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

3D Vision & Geometry with weak but relevant signal

abstract

While structure-based relocalizers have long strived for *point* correspondences when establish or regress query-map associations, in this paper, we pioneer the use of **planar primitives** and planar 3D maps for lightweight 6-DoF camera relocalization in structured environments.Planar primitives, beyond being fundamental entities in projective geometry, also serve as region-based representations that encapsulate both structural and semantic richness.This motivates us to introduce *PlanaReLoc*, a streamlined "plane-centric" paradigm where a deep matcher associates planar primitives across the query image and the map within a learned unified embedding space, after which the 6-DoF pose is solved and refined under a robust framework.Through extensive experiments on the *ScanNet* and *12Scenes* datasets across hundreds of scenes, our method demonstrates the superiority of planar primitives i

536

adjacent_context

low

LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

3D Vision & Geometry with weak but relevant signal

abstract

LiDAR relocalization has attracted increasing attention as it can deliver accurate 6-DoF pose estimation in complex 3D environments. Recent learning-based regression methods offer efficient solutions by directly predicting global poses without the need for explicit map storage. However, these methods often struggle in challenging scenes due to their equal treatment of all predicted points, which is vulnerable to noise and outliers. In this paper, we propose **LEADER**, a robust LiDAR-based localization framework enhanced by a simple, yet effective geometric encoder. Specifically, a Robust Projection-based Geometric Encoder architecture which captures multi-scale geometric features is first presented to enhance descriptiveness in geometric representation. A Truncated Relative Reliability loss is then formulated to model point-wise ambiguity and mitigate the influence of unreliable predict

537

adjacent_context

low

Gaze Target Estimation with Concepts

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Estimating human gaze targets from images in-the-wild is an important and formidable task. Existing approaches primarily employ brittle, multi-stage pipelines that require explicit inputs, like head bounding boxes and human pose, in order to identify the subject of gaze analysis. As a result, detection errors can cascade and lead to failure. Moreover, these prior works lack the flexibility of specifying the gaze analysis task via natural language prompting, an approach which has been shown to have significant benefits in convenience and scalability for other image analysis tasks. To overcome these liimtations, we introduce the **Promptable Gaze Target Estimation (PGE)** task, a new end-to-end, concept-driven paradigm for gaze analysis. PGE conditions gaze prediction on flexible user text or visual prompts (e.g., "the boy in the red shirt" or "person in point [0.52, 0.48]") to identify a

538

adjacent_context

low

AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

3D Vision & Geometry with weak but relevant signal

abstract

Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers.We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, para

539

adjacent_context

low

CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation

3D Vision & Geometry / Point Cloud

C. cluster representative

depth_correspondence; surface_occupancy; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce $\textbf{CLIPoint3D}$, the first framework for $\textit{few-shot unsupervised 3D point cloud domain adaptation}$ built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design

540

adjacent_context

low

LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

3D Vision & Geometry with weak but relevant signal

abstract

Advanced autonomous systems rely on multi-sensor fusion for safer and more robust perception. To enable effective fusion, calibrating directly from natural driving scenes (i.e., target-free) with high accuracy is crucial for precise multi-sensor alignment. Existing learning-based calibration methods are typically designed for only a single pair of sensor modalities (i.e., a bi-modal setup). Unlike these methods, we propose LiREC-Net, a target-free, learning-based calibration network that jointly calibrates multiple sensor modality pairs, including LiDAR, RGB, and event data, within a unified framework. To reduce redundant computation and improve efficiency, we introduce a shared LiDAR representation that leverages features from both its 3D nature and projected depth map, ensuring better consistency across modalities. Trained and evaluated on established datasets, such as KITTI and DSEC,

541

adjacent_context

low

SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions---implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satel

542

adjacent_context

low

Latent Action Pretraining Meets Pose Estimation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations, similar to Genie from large-scale driving videos.Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world-models or as proxies of robot action parameters in policy networks.Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations.This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks

543

adjacent_context

low

LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 [89] achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces $\textbf{InsLoD-Loc}$ - the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance-level building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance-level silhouette alignment, which significantly reduces pose e

544

adjacent_context

low

Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Dense scenes containing numerous tiny objects pose a fundamental challenge for segmentation models, where small localization errors can significantly degrade downstream measurements. We present Structure-Aware Representation Distillation (SARD), a teacher-compatible framework that transfers structural knowledge from a large teacher to a compact student via feature-space alignment rather than mask imitation. SARD constructs a structure-importance map that combines boundary salience, local density, and teacher confidence, and uses it to weight a unified representation loss integrating feature consistency, distribution alignment, and structural contrast. This encourages the student to allocate capacity to geometrically informative regions while preserving global context. Experiments on Cityscapes, ADE20K, and a challenging rock fragmentation benchmark (RockFrag) show that SARD consistently

545

adjacent_context

low

UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; generation_editing

3D Vision & Geometry with weak but relevant signal

abstract

Personalized visual understanding has advanced significantly, yet existing approaches struggle to localize and extract specific concepts when input images contain multiple objects. Many prior methods rely heavily on segmentation-based supervision or exhibit poor compositional generalization, limiting their ability to accurately disentangle and manipulate individual concepts. In this work, we propose UniVerse, a Unified Modulation Framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers. Our method allows for composable and decomposable concept extraction, enabling fine-grained localization and representation of target objects without explicit segmentation masks. UniVerse learns to decompose complex scenes into concept-specific representations and then compose them in a unified manner, enabling robust personalization across diverse visual conte

546

adjacent_context

low

Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization

3D Vision & Geometry with weak but relevant signal

abstract

Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference.To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model’s focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described.In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36\%. In addition, VGA is fully

547

adjacent_context

low

TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization

3D Vision & Geometry with weak but relevant signal

abstract

Weakly Supervised Video Anomaly Detection (WSVAD) aims to localize abnormal segments using only video-level labels during training.Although the paradigm significantly reduces annotation costs, the coarse-grained labels fail to precisely describe the full videos, resulting in the introduction of substantial Weakly Labeled Information (WLI) during training. The presence of WLI makes it difficult for the model to accurately learn the boundary between normal and abnormal behaviors, leading to misclassifications and compromising the precision of anomaly localization.To tackle the challenges posed by WLI, we propose a triplet learning strategy that selects hard segments from normal videos as anchors. By combining contrastive learning with Multiple Instance Learning (MIL) strategy, we increase the projection distance between abnormal segments and anchor samples, to reduce the interference of W

548

adjacent_context

low

Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization

3D Vision & Geometry with weak but relevant signal

abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fin

549

adjacent_context

low

A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection

3D Vision & Geometry / Point Cloud

C. cluster representative

pose_calibration_localization; surface_occupancy

3D Vision & Geometry with weak but relevant signal

abstract

3D anomaly detection targets the detection and localization of defects in 3D point clouds trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE)—where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically

550

adjacent_context

low

SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization

3D Vision & Geometry with weak but relevant signal

abstract

Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO , a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. Our SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonica

551

adjacent_context

low

Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

3D shape anomaly detection is a crucial task for industrial inspection and geometric analysis. Existing deep learning approaches typically learn representations of normal shapes and identify anomalies via out-of-distribution feature separation or decoder-based reconstruction. They often fail to generalize across diverse anomaly types and scales, such as global geometric errors (e.g., planar shifts, surface misalignments), and are sensitive to noisy or incomplete local points during training. To address these limitations, we propose a hierarchical point–patch anomaly scoring network that jointly models regional part features and local point features for robust anomaly reasoning. An adaptive patchification module integrates self-supervised decomposition to capture complex structural deviations. Beyond evaluations on public benchmarks (Anomaly-ShapeNet and Real3D-AD), we release an industri

552

adjacent_context

low

PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy; data_benchmark

3D Vision & Geometry with weak but relevant signal

abstract

Point clouds are a fundamental 3D representation in computer vision, enabling a wide range of perception tasks. However, real-world point clouds often suffer from degradations such as incompleteness, noise, outliers, and irregular density, caused by sensor limitations or occlusions. Recovering clean and detailed shapes from such degraded data is crucial for downstream applications. While existing learning-based methods achieve progress on individual tasks like completion or denoising, they typically rely on global bottleneck features, which lose fine-grained geometry and remain sensitive to varying input quality. We propose a unified 3D restoration network that directly takes point clouds as input and adaptively reconstructs high-quality geometry under diverse degradation scenarios. At the core of our approach is a Pseudo-Query module, implemented within a Transformer backbone, which ref

553

adjacent_context

low

BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy

3D Vision & Geometry with weak but relevant signal

abstract

We introduce BuildAnyPoint, a novel generative framework for structured 3D building reconstruction from point clouds with diverse distributions, such as those captured by airborne LiDAR and Structure-from-Motion.To recover artist-created building abstraction in this highly underconstrained setting, we capitalize on the role of explicit 3D generative priors in autoregressive mesh generation.Specifically, we design a Loosely Cascaded Diffusion Transformer (Loca-DiT) that initially recovers the underlying distribution from noisy or sparse points, followed by autoregressively encapsulating them into compact meshes.We first formulate distribution recovery as a conditional generation task by training latent diffusion models conditioned on input point clouds, and then tailor a decoder-only transformer for conditional autoregressive mesh generation based on the recovered point clouds.Our method

554

adjacent_context

low

3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, v

555

adjacent_context

low

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Multimodal & Language / VLM / MLLM

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title

abstract

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting.We tackle these issues by introducing:(a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception;(b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and(c) \ourbenchmark, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline.Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchma

556

adjacent_context

low

4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, Embodied Intelligence, and content creation.However, prior benchmarks, however, each emphasize different evaluation dimensions and lack a unified assessment of world-realism capability.To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition–4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as

557

adjacent_context

low

BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Text-guided dynamic 3D character generation has advanced rapidly, yet producing high-quality motion that faithfully reflects rich textual descriptions remains challenging. Existing methods tend to generate limited sub-actions or incoherent motion due to fixed-length temporal inputs and discrete frame-wise representations that fail to capture rich motion semantics. We address these limitations by representing motion with continuous differentiable B-spline curves, enabling more effective motion generation without modifying the capabilities of the underlying generative model. Specifically, our closed-form, Laplacian-regularized B-spline solver efficiently compresses variable-length motion sequences into compact representations with a fixed number of control points. Further, we introduce a normal-fusion strategy for input shape adherence along with correspondence-aware and local-rigidity

558

adjacent_context

low

Breaking the 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch for Category 6D Pose Estimation and Robotic Grasping

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

While 2D vision has been revolutionized by large-scale datasets like ImageNet, 3D vision remains constrained by the scarcity of high-quality, canonically aligned data. We introduce the first scalable, automated framework that generates complete category-level 6D pose datasets directly from text prompts, bypassing the need for existing 3D assets. Our method overcomes key challenges by: (1) ensuring reliable, scalable asset generation via a controlled text-to-image-to-3D pipeline; (2) enforcing built-in canonical alignment through depth-conditioned generation, achieving a 96\% pose consistency rate; and (3) enabling large-scale 6D annotation via mixed reality rendering. The pipeline produces high-quality, aligned 3D meshes in under 3 minutes per object—a 5–20$\times$ speedup over traditional scanning. We generate over 1,000 instances for each of the 153 categories in the Omni6Dpose benchma

559

adjacent_context

low

Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation

Generative Models / Image Editing

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing

adjacent genus=Image Editing with no direct reconstruction/SLAM/map signal in title

abstract

Recent advances in 3D scene editing using NeRF and 3DGS enable high-quality static scene editing. In contrast, dynamic scene editing remains challenging, as methods that directly extend 2D diffusion models to 4D often produce motion artifacts, temporal flickering, and inconsistent style propagation. We introduce Catalyst4D, a framework that transfers high-quality 3D edits to dynamic 4D Gaussian scenes while maintaining spatial and temporal coherence. At its core, Anchor-based Motion Guidance (AMG) builds a set of structurally stable and spatially representative anchors from both original and edited Gaussians. These anchors serve as robust region-level references, and their correspondences are established via optimal transport to enable consistent deformation propagation without cross-region interference or motion drift. Complementarily, Color Uncertainty-guided Appearance Refinement (CUA

560

adjacent_context

low

Charge: A Comprehensive Benchmark and Dataset for Dynamic Novel View Synthesis

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

This paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-qualit

561

adjacent_context

low

DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Generating dynamic and interactive 3D trees has wide applications in virtual reality, games, and world simulation. However, existing methods still face various challenges in generating structurally consistent and realistic 4D motion for complex real trees. In this paper, we propose DynamicTree, the first framework that can generate long-term, interactive 3D motion for 3DGS reconstructions of real trees. Unlike prior optimization-based methods, our approach generates dynamics in a fast feed-forward manner. The key success of our approach is the use of a compact sparse voxel spectrum to represent the tree movement. Given a 3D tree from Gaussian Splatting reconstruction, our pipeline first generates mesh motion using the sparse voxel spectrum and then binds Gaussians to deform the mesh. Additionally, the proposed sparse voxel spectrum can also serve as a basis for fast modal analysis under

562

adjacent_context

low

Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

We present Ego-1K, a large-scale, time-synchronized collection of egocentric multiview videos designed to advance neural 3D video synthesis, dynamic scene understanding, and embodied perception. The dataset contains nearly 1,000 short egocentric videos taken with a custom rig with 12 synchronous cameras surrounding a VR headset worn by the user. Scene content focuses on hand motions and hand-object interactions in different settings. We describe rig design, data processing, and calibration. Our dataset enables new ways to benchmark egocentric scene reconstruction methods. We believe this is an important area of research as smart glasses with multiple cameras become omnipresent. Our experiments demonstrate that our dataset presents unique challenges for existing 3D and 4D novel view synthesis methods due to high disparities and image motion caused by close dynamic objects and rig ego

563

adjacent_context

low

Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency, or iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediary outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce ``geometry-as-context". It iteratively completes the following steps using an autoregressive camera-controlled video generation model: (1) estimates the geometry of the current view necessary for 3D reconstruction, and (2) simulates and restores novel view images rendered by the 3D scene. Under this multi-task framework, we develop the camera gated attention module to enhance the model's capability to effectively leverage camera poses. During the training phase, text contexts a

564

adjacent_context

low

GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry

Segmentation & Dense Prediction / Segmentation

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy

adjacent genus=Segmentation with no direct reconstruction/SLAM/map signal in title

abstract

Motion segmentation in dynamic scenes is highly challenging, as conventional methods heavily rely on estimating camera poses and point correspondences from inherently noisy motion cues. Existing statistical inference or iterative optimization techniques that struggle to mitigate the cumulative errors in multi-stage pipelines often lead to limited performance or high computational cost. In contrast, we propose a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation. Our key insight is to bypass explicit correspondence estimation and instead let the model learn to implicitly disentangle object and camera motion. Supported by recent advances in 4D scene geometry reconstruction (e.g., $\pi^3$), the proposed method leverages reliable camera poses and rich spatial

565

adjacent_context

low

GM-R$^2$: Generative Matching Learning for Unsupervised Geometric Representation and Registration

Learning Algorithms / Self-supervised

D. adjacent but useful context

gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping

adjacent genus=Self-supervised with no direct reconstruction/SLAM/map signal in title

abstract

This paper proposes GM-R^2, a novel Generative Matching Learning framework for unsupervised geometric descriptor learning and correspondence matching. By reformulating descriptor learning as geometry-conditioned cross-view image generation, GM-R^2 leverages the proxy supervisory signal from structurally aligned view synthesis to implicitly enforce feature consistency across correspondence, enabling robust 3D matching. To instantiate GM-R^2, we introduce Denoising-Agnostic Coupled ControlNet conditioned on depth maps as the required geometry-conditioned cross-view generator. It effectively extends the single-view generation of naive ControlNet to the cross-view via coupled depth-map input design and further remove the latent noise dependency to support geometry-only inference (expected by 3D matching). Moreover, we present Zoomable Equirectangular Projection for intrinsics-free point clou

566

adjacent_context

low

MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy

adjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title

abstract

Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relat

567

adjacent_context

low

MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

Existing 4D human datasets fall short for fashion-specific research, lacking either realistic garment dynamics or task-specific annotations. Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. MV-Fashion features 3,273 sequences (72.5 million frames) from 80 diverse subjects wearing 3-10 outfits each. It is designed to capture complex, real-world garment dynamics, including multiple layers and varied styling (e.g., tucked shirts, rolled sleeves). A core contribution is a rich data representation that includes pixel-level semantic annotations, ground-truth material properties like elasticity, and 3D point clouds. Crucially for VT

568

adjacent_context

low

PhysGaia: A Physics-aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

We introduce PhysGaia, a novel physics-aware dataset specifically designed for Dynamic Novel View Synthesis (DyNVS), encompassing both structured objects and unstructured physical phenomena.Unlike existing datasets that primarily focus on photorealistic reconstruction, PhysGaia is created to actively support physics-aware dynamic scene modeling.Our dataset provides complex dynamic scenarios with rich interactions among multiple objects, where they realistically collide with each other and exchange forces.Furthermore, it contains a diverse range of physical materials, such as liquid, gas, textile, and rheological substances, which moves beyond the rigid bodies prevalent in existing datasets.All scenes in PhysGaia are faithfully generated to strictly adhere to physical laws, leveraging carefully selected material-specific physics solvers. To enable quantitative evaluation of physical mode

569

adjacent_context

low

PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Despite advances in physics-based 3D motion synthesis, current methods face key limitations: reliance on pre-reconstructed 3D Gaussian Splatting (3DGS) built from dense multi-view images with time-consuming per-scene optimization; physics integration via either inflexible, hand-specified attributes or unstable, optimization-heavy guidance from video models using Score Distillation Sampling (SDS); and naïve concatenation of prebuilt 3DGS with physics modules, which ignores physical information embedded in appearance and yields suboptimal performance. To address these issues, we propose PhysGM, a feed-forward framework that jointly predicts 3D Gaussian representation and physical properties from a single image, enabling immediate simulation and high-fidelity 4D rendering. Unlike slow appearance-agnostic optimization methods, we first pre-train a physics-aware reconstruction model that dire

570

adjacent_context

low

RAYNOVA: Geometry-Free Auto-Regressive 4D World Modeling with Unified Spatio-Temporal Representation

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RayNova, a geometry-free world model that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RayNova constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RayNova achieves state-of-the-art multi-vi

571

adjacent_context

low

ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes

Segmentation & Dense Prediction / Segmentation

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark

adjacent genus=Segmentation with no direct reconstruction/SLAM/map signal in title

abstract

Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining consistent instance identities across intermittently captured 3D scans with unobserved change or, equivalently, performing 4D indoor semantic instance segmentation (SIS)---the joint task of segmenting, identifying, and temporally associating object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and 4D LiDAR approaches, which show limited performance due to their reliance on continuous temporal measurements that is uncommon in indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores temporal fusion strategies to share information across observations, demonstrating that this shared context

572

adjacent_context

low

STUR3D: Spatio-Temporal Unified Representation Learning for 3D Object Detection

Detection & Tracking / Detection

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

adjacent genus=Detection with no direct reconstruction/SLAM/map signal in title

abstract

Surrounding-view 3D object detection is a fundamental task in autonomous driving, which aims to locate 3D objects from multiple camera views. Existing methods predominantly followed a 2D-to-3D pipeline, leveraging 2D detectors to enhance 3D detection performance. However, these methods ignored the inherent disparities in both temporal and feature dimensional representations between 2D and 3D detection, resulting in the positional deviations in 3D space. Furthermore, the absence of temporal information in 2D detection leads to object omission in occluded scenarios. To address these limitations, we propose STUR3D, a unified framework that builds spatio-temporal alignment between 2D and 3D perception. First, we project historical 3D detection features onto the 2D image plane, guiding the 2D detector to distill the requisite representations for 3D detection, thereby harmonizing feature repre

573

adjacent_context

low

SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d; generation_editing; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as

574

adjacent_context

low

Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal

Data & Evaluation / Benchmark

D. adjacent but useful context

pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

LiDAR has become an essential sensing modality in autonomous driving, robotics, and smart-city applications. However, ghost points (or ghost), which are false reflections caused by multi-path laser returns from glass and reflective surfaces, severely degrade 3D mapping and localization accuracy. Prior ghost removal rely on geometric consistency in dense point clouds, failing on mobile LiDAR's sparse, dynamic data. We address this by exploiting full-waveform LiDAR (FWL), which captures complete temporal intensity profiles rather than just peak distances, providing crucial cues for distinguishing ghosts from genuine reflections in mobile scenarios. As this is a new task, we present Ghost-FWL, the first and largest annotated mobile FWL dataset for ghost detection and removal. Ghost-FWL comprises 24K frames across 10 diverse scenes with 7.5 billion peak-level annotations, which is 100$\times

575

adjacent_context

low

Learning Multi-View Spatial Reasoning from Cross-View Relations

Multimodal & Language / VLM / MLLM

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark

adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title

abstract

Vision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple views. XVR comprises 100K vision-question-answer samples derived from 18K diverse 3D scenes and 70K robotic manipulation trajectories, spanning three fundamental spatial reasoning tasks: Correspondence (matching objects across views), Verification (validating spatial relationships), and Localization (identifying object positions). VLMs fine-tuned on XVR achieve substantial improvements on established multi-view and robotic spatial reasoning benchmarks (MindCube and RoboSpatial). When integrated as

576

adjacent_context

low

MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

Understanding animal behavior requires modeling how bodies, objects, and other agents interact over time, not simply detecting isolated actions or estimating pose frame by frame. Existing animal video datasets target pose estimation or coarse, passively observed actions, and rarely provide the structured, multi-entity interaction annotations needed to study behavioral dynamics. We introduce MooCap, a multi-view video benchmark for animal-object-human interaction understanding under controlled experimental protocols. MooCap contains 42 hours of synchronized multi-camera video from 43 individually tested cows across seven standardized interaction scenarios, including novel environment, novel object, novel human, human approach, unfamiliar conspecifics (restricted and unrestricted) and Dam reunion (restricted and unrestricted). Recordings are densely annotated with 23 fine-grained behavior

577

adjacent_context

low

EMMA: Extracting Multiple physical parameters from Multimodal Data

Multimodal & Language / VLM / MLLM

D. adjacent but useful context

gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; data_benchmark

adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title

abstract

We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under for

578

adjacent_context

low

Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos

Low-level Vision / IQA

D. adjacent but useful context

gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmark

adjacent genus=IQA with no direct reconstruction/SLAM/map signal in title

abstract

High Dynamic Range (HDR) user-generated (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR’s higher bit depth, wide color gamut, and elevated luminance range expose distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate \textbf{HDR-UGC-44K}, a large-scale subjective dataset of $\sim$44K videos from 6.5K sources with >1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce \textbf{HDR-Q}, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL fin

579

adjacent_context

low

XR-Poser: Accurate Egocentric Human Motion Estimation for AR/VR

Video & Motion / Human Motion

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmark

adjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title

abstract

Egocentric 3D human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present XR-Poser, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training.The proposed model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget.The proposed auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher–st

580

adjacent_context

low

Choreographing a World of Dynamic Objects

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we study a universal generative pipeline for synthesizing this type of phenomena. Traditional rule-based graphics pipelines to create these dynamics are based on category-specific heuristics, yet are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories in interest. Our approach instead inherits the universality from the video generative models by proposing a distillation-based pipeline to extract the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a divers

581

adjacent_context

low

R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection

Detection & Tracking / Detection

D. adjacent but useful context

pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping

adjacent genus=Detection with no direct reconstruction/SLAM/map signal in title

abstract

4D radar–camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D Radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated T

582

adjacent_context

low

HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Recent methods have made notable progress in the visual quality of hand-object interaction video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware hand-object interaction (HOI) video synthesis conditioned on explicit 3D representations. To achieve a diffusion-based architecture augmented with a 3D ControlNet, which encodes geometric and motion cues from 3D inputs to enable explicit 3D reasoning during video synthesis, as well as the corresponding training and inference setting. To achieve high-quality synthesis, HVG-3D is designed with two core components: (i) a 3D-aware HOI video generation diffusion architecture that encodes geometric and motion cues from 3D inputs for explicit 3D reasoning; and (i

583

adjacent_context

low

WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments

Generative Models / Diffusion

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization; data_benchmark

adjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title

abstract

We present **WildRayZer**, a self-supervised framework for novel view synthesis (NVS) in dynamic environments, where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, causing ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-

584

adjacent_context

low

Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos

Generative Models / Video Generation

D. adjacent but useful context

gaussian_radiance; dynamic_4d; robotics_mapping; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Egocentric ``walking tour'' videos provide a rich source of image data to develop rich and diverse visual models of environments around the world. However, the significant presence of humans in frames of these videos due to crowds and eye-level camera perspectives mitigates their usefulness in environment modeling applications. We focus on addressing this challenge by developing a generative algorithm that can realistically remove (i.e., inpaint) humans and their associated shadow effects from walking tour videos. Key to our approach is the construction of a rich semi-synthetic dataset of video clip pairs to train this generative model. Each pair in the dataset consists of an environment-only background clip, and a composite clip of walking humans with simulated shadows overlaid on the background. We randomly sourced both foreground and background components from real egocentric walking

585

adjacent_context

low

GauMVC: Generative Decoupled Gaussian Representation for Human-centric Multi-view Video Compression

Learning Algorithms / Efficient Models

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; generation_editing

adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title

abstract

Human-centric multi-view video has a clear semantic structure: a static background and dynamic human motion. We propose a generative compression framework that explicitly decouples these components. The background is modeled once with 3D Gaussian Splatting, while the human is represented by a personalized Gaussian avatar reconstructed from a sparse set of key views that are transmitted only once and driven by compact per-frame pose parameters from the Skinned Multi-Person Linear (SMPL) model. The encoder sends only three elements: the background, the key views, and the SMPL parameters, enabling high-fidelity multi-viewpoint synthesis at dramatically reduced bitrates. This shifts compression from low-level redundancy removal to semantics-aware generative modeling. Experiments across multiple human-centric datasets demonstrate superior rate–distortion performance, particularly for long and

586

adjacent_context

low

PAM: A Pose–Appearance–Motion Engine for Sim-to-Real HOI Video Generation

Generative Models / Video Generation

D. adjacent but useful context

pose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mapping; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Hand–object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of previous work, we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose–Appearance–Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, w

587

adjacent_context

low

GH-NAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark

adjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title

abstract

The advent of hash encodings has evolved neural radiance fields (NeRF)-based methods into fast and efficient 3D reconstruction techniques. In medical imaging, this framework has been extended to CT/CBCT reconstruction through neural attenuation fields (NAF), which directly model attenuation properties from projection data. Existing NeRF-based attenuation fields typically assume an idealized monoenergetic CBCT setting and therefore fail to model real-world projection inconsistencies such as scatter and noise contamination. Moreover, uniformly concatenating multi-resolution hash-grid features blends heterogeneous frequency components and noise into a single representation, causing artifacts: homogeneous regions acquire spurious high-frequency patterns, structural boundaries become blurred, and projection-induced bias propagates throughout the learned field. Given these limitations, we intr

588

adjacent_context

low

Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding

Multimodal & Language / VLM / MLLM

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; data_benchmark

adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title

abstract

Human perception of social environments is inherently a multi-view synthesis problem, requiring the integration of complementary and often occluded information across space and time. However, existing benchmarks for Multimodal Large Language Models (MLLMs) are overwhelmingly predicated on a "sufficient-view" assumption, rewarding single-view pattern recognition while failing to evaluate cross-view fusion. To address this critical gap, we introduce \textbf{CVBench}, a large-scale, multi-task benchmark for cross-view human understanding. CVBench comprises 3,000 challenging questions across 12 spatial and temporal tasks, where every item is designed with \textit{verifiable single-view insufficiency}, mandating that models synthesize disparate evidence to resolve ambiguities. Our comprehensive evaluation of state-of-the-art open and closed-source MLLMs (from InternVL to Gemini 2.5 Pro) revea

589

adjacent_context

low

Personalized Audio-driven Whole-body Talking Avatars

Video & Motion / Human Motion

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; generation_editing

adjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title

abstract

Prior conversational 3D avatar systems map audio to parametric poses and then render, creating a lossy bottleneck where quantization, retargeting, and tracking errors accumulate. This degrades audio–motion synchronization and suppresses micro-articulations critical for realism—such as bilabial closures, cheek inflation, nasolabial motion, blinks, and fine hand gestures—especially under single-image personalization. We propose an end-to-end framework that builds a full-body, photorealistic 3D conversational avatar from a single image and drives it directly from audio, bypassing intermediate pose prediction. The avatar is modeled as a particle-based deformation field of 3D Gaussian primitives in a canonical space, with an audio-conditioned dynamics module that outputs per-particle trajectories for face, hands, and body, enabling localized high-frequency control with globally coherent motio

590

adjacent_context

low

Changes in Real Time: Online Scene Change Detection with Multi-View Fusion

Detection & Tracking / Detection

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization; data_benchmark

adjacent genus=Detection with no direct reconstruction/SLAM/map signal in title

abstract

Online Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches. Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offlin

591

adjacent_context

low

Describe Anything Anywhere At Any Moment

Multimodal & Language / VLM / MLLM

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title

abstract

Computer vision and robotics applications ranging from agumented reality to robot autonomy in large-scale environments require spatio-temporal memory frameworks that capture both geometric structure for accurate language-grounding as well as semantic detail. Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D.To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding.DAAAM introduces a novel optimization-based frontend to infer detailed semantic descriptions from localized captioning models, such as the Describe Anything Model (DAM), leveraging batch processing to speed up inference by an order of magnitude for online processing.It leve

592

adjacent_context

low

DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video

Detection & Tracking / Detection

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

adjacent genus=Detection with no direct reconstruction/SLAM/map signal in title

abstract

Reliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that capture continuous reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-t

593

adjacent_context

low

Enhancing Vision Language Models for 4D Perception

Multimodal & Language / VLM / MLLM

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark

adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title

abstract

Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about 3D motion, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection on 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these, we present a QA generation pipeline that focuses on motion-related scene understanding. We take particular care of the entanglement of camera and object motion by casting tracking in both the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate large-scale 400K training samples and a 2.2K-sample benchmark. Training existing models on our dataset yields performance improvements on an external benchmark

594

adjacent_context

low

Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals

Generative Models / Video Generation

D. adjacent but useful context

dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Recent advancements in video generation have enabled the development of ``world models'' capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives—such as elastic collisions and falling dominos—teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, inc

595

adjacent_context

low

Grounded Latents for Entity-Centric 4D Scene Generation

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Although recent work has explored generative modeling of 3D or 4D driving scenes, most approaches operate on dense voxel-based representations, which are computationally expensive and struggle to maintain temporal or structural consistency. These methods often produce blurred or merged entities (i.e., cars, trucks, pedestrians) and lack fine-grained control over individual scene elements. We propose to perform generative modeling in a compact, entity-centric latent space, where each grounded 3D latent represents a semantically meaningful local region of the scene. This formulation enables precise, consistent control of both foreground and background elements while preserving geometric detail. We further extend this representation to 4D by learning a motion diffusion model for both ego and dynamic actors, conditioned on the generated 3D scene, and by propagating the grounded latents throu

596

adjacent_context

low

ORBIT: Benchmarking SfM in the Wild with 360° Video

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; dynamic_4d; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

Structure-from-Motion (SfM) is a cornerstone of 3D perception, yet current methods often fail when applied to complex videos involving challenging camera motions or dynamic scenes.Compounding the problem, the field lacks reliable ground-truth benchmarks for such difficult scenarios, making it hard to gauge real-world progress, or pinpoint where improvements are most needed.To address this gap, we introduce a new benchmark for evaluating camera pose estimation.Our key insight is to leverage online panoramic 360° as a source of data from which to construct challenging clips, while still enabling robust ground-truth trajectory recovery.The panoramic nature of these videos provides richer visual context for tracking camera motion, even when parts of the view are affected by blur, motion, or dynamic objects.By tracking camera motion across full 360° videos, we crop and reproject selected port

597

adjacent_context

low

Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding

Multimodal & Language / Grounding

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping

adjacent genus=Grounding with no direct reconstruction/SLAM/map signal in title

abstract

Accurate 3D scene description is fundamental to robotic navigation and augmented reality, yet current dense captioning methods face significant limitations in processing sparse point cloud data. Existing approaches that apply Euclidean embedding spaces struggle to simultaneously preserve fine-grained local geometric details and model exponentially growing global semantic hierarchies, leading to either inaccurate localization or disjointed, shallow scene descriptions. In this work, we propose a novel \textbf{\textsc{Curvature-Aware Captioning}} framework, integrating novel non-Euclidean geodesic attention mechanisms, to resolve the localization-contextualization conflict. Specifically, self-attention within Oblique space enforces dimensional homogeneity while establishing long-range dependencies. Bidirectional geodesic cross-attention within Lorentz space models hierarchical semantic rela

598

adjacent_context

low

Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves

Video & Motion / Human Motion

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; data_benchmark

adjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title

abstract

Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information, such as contact forces and motion dynamics, and are prone to frequent occlusions. To address these challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove data in HOI videos into photorealistic bare-hand representations, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures both temporal and multi-view rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we introduce HandSense, the first multi-modal HOI dataset featuring multi-view

599

adjacent_context

low

ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two

600

adjacent_context

low

Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial–temporal consistency—constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift.We argue that 360° video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffu

601

adjacent_context

low

PerpetualWonder: Long-horizon Action-conditioned 4D Scene Generation

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from lon

602

adjacent_context

low

SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Images and videos are discrete 2D projections of the 4D world (3D space + time). Most visual understanding, prediction, and generation operate directly on 2D observations, leading to suboptimal performance. We propose SeeU, a novel approach that learns the continuous 4D dynamics and generate the unseen visual contents. The principle behind SeeU is a new 2D$\to$4D$\to$2D learning framework. SeeU first reconstructs the 4D world from sparse and monocular 2D frames (2D$\to$4D). It then learns the continuous 4D dynamics on a low-rank representation and physical constraints (discrete 4D$\to$continuous 4D). Finally, SeeU rolls the world forward in time, re-projects it back to 2D at sampled times and viewpoints, and generates unseen regions based on spatial-temporal context awareness (4D$\to$2D). By modeling dynamics in 4D, SeeU achieves continuous and physically-consistent novel visual generati

603

adjacent_context

low

Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance

Computational Imaging / Computational Imaging

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy

adjacent genus=Computational Imaging with no direct reconstruction/SLAM/map signal in title

abstract

We present a differentiable framework to adaptively compute 4D illumination conditions with respect to an object, for efficient, high-quality simultaneous acquisition of its shape and reflectance, with a unified spatial-angular structured light and a single camera. Using a simple histogram-based pixel-level probability model for depth and reflectance, we differentiably link the next illumination condition(s) with a loss that encourages the reduction in depth uncertainty. As new structured illumination is cast, corresponding image measurements are used to update the uncertainty at each pixel. Finally, a fine-tuning-based approach reconstructs the depth map and reflectance parameter maps, by minimizing the differences between all physical measurements and their simulated counterparts. The effectiveness of our framework is demonstrated on physical objects with wide variations in shape and a

604

adjacent_context

low

S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance

Multimodal & Language / VLM / MLLM

D. adjacent but useful context

general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping

adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title

abstract

3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle with understanding 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly utilize viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstr

605

adjacent_context

low

Endless World: Real-Time 3D-Aware Long Video Generation

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Producing long, coherent video sequences with stable 3D structure remains a major challenge, particularly in streaming scenarios. Motivated by this, we introduce Endless World, a real-time framework for infinite, 3D-consistent video generation. To support infinite video generation, we introduce a conditional autoregressive training strategy that aligns newly generated content with existing video frames. This design preserves long-range dependencies while remaining computationally efficient, enabling real-time inference on a single GPU without additional training overhead. Moreover, our Endless World integrates global 3D-aware attention to provide continuous geometric guidance across time. Our 3D injection mechanism enforces physical plausibility and geometric consistency throughout extended sequences, addressing key challenges in long-horizon and dynamic scene synthesis. Extensive experi

606

adjacent_context

low

LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors

Generative Models / Image Editing

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence; generation_editing

adjacent genus=Image Editing with no direct reconstruction/SLAM/map signal in title

abstract

The task of multi-view inpainting necessitates 3D consistency in the inpainted images. Most prior methods first employ single-view 2D inpainting and then enforce multi-view consistency in a post-hoc 3D optimization stage, which leads to undesirable artifacts and lengthy optimization times. The existing single-stage method, MVInpainter, uses video priors and is pose-free, making it less suitable for inputs beyond video sequences. In this paper, we propose a framework that trains an inpainting model to condition on the explicit and reliable multi-view correspondences from a 3D foundation model. Central to our framework is a cross-view conditioning architecture, LaRP, carefully designed to utilize both the generative prior of a pretrained diffusion inpainting model and the reprojected cross-view appearance latents. We additionally propose a scalable data pipeline for stable training of LaRP

607

adjacent_context

low

ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; dynamic_4d; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.

608

adjacent_context

low

Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Video diffusion models have rich world priors, but their use in spatial tasks is limited by poor control, spatial-temporal inconsistent results, and entangled scene-camera dynamics. Current approaches, such as per-task fine-tuning or post-process warping strategies, are insufficient, often introducing visual artifacts, failing to generalize, or incurring high computational costs. We introduce a novel, training-free framework that operates purely at inference time to resolve these issues. Our method is comprised of three synergistic components. First, an intra-step refinement loop injects fine-grained motion guidance during the denoising process, iteratively correcting the output to ensure strict adherence to the target camera path. Second, an optical flow-based analysis identifies and isolates motion-related channels within the latent space. This allows our framework to selectively apply

609

adjacent_context

low

ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V g

610

adjacent_context

low

From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction

Video & Motion / Human Motion

D. adjacent but useful context

general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping

adjacent genus=Human Motion; useful only if manually connected to reconstruction

abstract

Two-hand reconstruction from monocular images is hampered by complex poses and severe occlusions, which often cause interaction misalignment and two–hand penetration. We address this by decoupling the problem into 2D structural alignment and 3D spatial interaction alignment, each handled by a tailored component. For 2D alignment, we pioneer the attempt to unify heterogeneous structural priors (keypoints, segmentation, and depth) from vision foundation models as complementary structured guidance for two-hand recovery. Instead of extracting priors prediction as explicit inputs, we propose a fusion-alignment encoder that absorbs their structural knowledge implicitly, achieving foundation-level guidance without foundation-level cost. For 3D spatial alignment, we propose a two-hand diffusion model that learns a generative mapping from interpenetrated poses to realistic, collision-free configu

611

adjacent_context

low

Align Images Before You Generate

Generative Models / Diffusion

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d; generation_editing

adjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title

abstract

Multi-image diffusion models can generate images like multi-views or videos to describe static or dynamic scenes, yet texture and structure drift persist, severely undermining the spatiotemporal consistency. Addressing this issue remains challenging, especially without any external geometric or semantic priors during the pure generative inference. In this paper, we introduce CorrAdapter, a plug-and-play adapter that discovers and exploits an innate property of the multi-image diffusion itself, aligning all output images before they are in fact generated. Specifically, CorrAdapter designs a bypass branch for transformer blocks in the multi-image diffusion model, encompassing a native correspondence constructor that builds reliable correspondences from the diffusion model's intermediate features, and an aligned area aggregator that integrates messages from only matching regions to avoid am

612

adjacent_context

low

Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

Recognition & Classification / Retrieval

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence; data_benchmark

adjacent genus=Retrieval with no direct reconstruction/SLAM/map signal in title

abstract

Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model’s dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel viewpoint-pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. F

613

adjacent_context

low

SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

pose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmark

adjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title

abstract

Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce **SurgCoT,** a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across **7 surgical specialties** and **35 diverse procedures**. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue–Action Alignment, Affordance Mapping, Micro‑Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (*Question → Option → Knowledge → Clue → Answer*), where the *Knowledge* field provides essential background context and *Clue* provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants

614

adjacent_context

low

PAVAS: Physics-Aware Video-to-Audio Synthesis

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical r

615

adjacent_context

low

Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields

Computational Imaging / Computational Imaging

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d

adjacent genus=Computational Imaging with no direct reconstruction/SLAM/map signal in title

abstract

With the success of static black-hole imaging, the next frontier is the dynamic and 3D imaging of black holes. Recovering the dynamic 3D gas near a black hole would reveal previously-unseen parts of the universe and inform new physics models. However, only sparse radio measurements from a single viewpoint are possible, making the dynamic 3D reconstruction problem significantly ill-posed. Previously, BH-NeRF addressed the ill-posed problem by assuming Keplerian dynamics of the gas, but this assumption breaks down near the black hole, where the strong gravitational pull of the black hole and increased electromagnetic activity complicate fluid dynamics. To overcome the restrictive assumptions of BH-NeRF, we propose *PINeRF*, a physics-informed approach that uses differentiable neural rendering to fit a 4D (time + 3D) emissivity field given EHT measurements. Our approach jointly reconstructs

616

adjacent_context

low

FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference

617

adjacent_context

low

SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; generation_editing; data_benchmark

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter both the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video’s motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This simple yet crucial strategy enables the model to learn temporal control, directly producing the observed spa

618

adjacent_context

low

Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis

Computational Imaging / Computational Imaging

D. adjacent but useful context

general_reconstruction; gaussian_radiance; surface_occupancy

adjacent genus=Computational Imaging with no direct reconstruction/SLAM/map signal in title

abstract

We present GRaF, Generalizable Radio-Frequency (RF) Radiance Fields, a framework that models RF signal propagation to synthesize spatial spectra at arbitrary transmitter or receiver locations, where each spectrum measures signal power across all surrounding directions at the receiver. Unlike state-of-the-art methods that adapt vanilla Neural Radiance Fields (NeRF) to the RF domain with scene-specific training, GRaF generalizes across scenes to synthesize spectra. To enable this, we prove an interpolation theory in the RF domain: the spatial spectrum from a transmitter can be approximated using spectra from geographically proximate transmitters. Building on this theory, GRaF comprises two components: (i) a geometry-aware Transformer encoder that captures spatial correlations from neighboring transmitters to learn a scene-independent latent RF radiance field, and (ii) a neural ray tracing

619

adjacent_context

low

Real2Sim2Real: RetinalDepth-64K for Depth Estimation in Posterior Segment Ophthalmic Surgery

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

general_reconstruction; depth_correspondence; robotics_mapping; data_benchmark

adjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title

abstract

Accurate depth estimation is crucial for 3D reconstruction and precise navigation in ophthalmic fundus surgery. However, acquiring annotated data remains challenging due to the impracticality of depth sensors under surgical microscopes.To overcome this limitation, we introduce RetinalDepth-64K, a novel synthetic dataset comprising 64,000 stereo image pairs across 1,280 diverse scenes, developed through a Real2Sim2Real pipeline that transforms real-world fundus surgery videos into synthetic data and facilitates model deployment in real scenarios. We analyzed key characteristics such as intricate retinal textures from real-world videos to guide the Real-to-Sim phase, enabling realistic data synthesis.To improving dataset fidelity for depth estimation, we created 3D eye models using Blender with ultra-wide-field retinal textures, glass-modeled aqueous humor, and dynamic instrument trajector

620

adjacent_context

low

Wavelet-Driven 3D Anomaly Detection under Pose-Agnostic and Sparse-View

Detection & Tracking / Detection

D. adjacent but useful context

general_reconstruction; gaussian_radiance; pose_calibration_localization

adjacent genus=Detection with no direct reconstruction/SLAM/map signal in title

abstract

Pose-agnostic anomaly detection (PAD) achieves strong performance in localizing anomalies from arbitrary viewpoints when trained on densely sampled normal data. However, under sparse-view conditions, existing methods face two key challenges: (1) sparse observations lead to overfitting and geometric detail loss in 3D reconstruction; (2) limited visual cues lead to inaccurate pose estimation, compromising the reliability of subsequent anomaly localization. To address these challenges, we propose Wave-Pose3D, a wavelet-driven 3D anomaly detection framework tailored for PAD under sparse-view conditions. First, we design a structure-aware and wavelet-optimized Gaussian modeling strategy that dynamically filters unreliable regions via structural priors to mitigate overfitting and leverages high-frequency supervision to restore fine-grained geometric details. Second, to improve pose estimation

621

adjacent_context

low

RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation

Multimodal & Language / Grounding

D. adjacent but useful context

pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark

adjacent genus=Grounding with no direct reconstruction/SLAM/map signal in title

abstract

Visual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations, we propose the RealVLG framework, which integrates the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, ove

622

adjacent_context

low

RPGFusion: 4D Radar Prior-Guided Multi-Modal Fusion for 3D Detection

Detection & Tracking / Detection

D. adjacent but useful context

depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping

adjacent genus=Detection with no direct reconstruction/SLAM/map signal in title

abstract

Accurate 3D object detection in autonomous driving relies on effectively combining complementary information from multiple sensors. 4D millimeter-wave radar provides sparse yet physically reliable measurements, whose potential for enhancing sensor fusion has not been fully utilized. In this work, we propose \textbf{R}adar \textbf{P}rior \textbf{G}uided \textbf{Fusion} (\textbf{RPGFusion}), a practical 4D radar–camera fusion framework. We first generate radar prior maps that encode spatial confidence and depth cues. These priors guide image feature sampling while preventing the uneven BEV feature distribution (near-dense, far-sparse) caused by Lift-Splat-Shoot view transformation. To address the sparsity and noise inherent in point clouds, we adopt a hybrid robust encoding and sparse-to-dense feature propagation. We further introduce spatial alignment and semantic fusion modules to reconc

623

adjacent_context

low

Scene Grounding in the Wild

Multimodal & Language / Grounding

D. adjacent but useful context

general_reconstruction; gaussian_radiance; data_benchmark

adjacent genus=Grounding with no direct reconstruction/SLAM/map signal in title

abstract

Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry.In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains

624

adjacent_context

low

Correspondence-Attention Alignment for Multi-view Diffusion Models

Generative Models / Diffusion

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence

adjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title

abstract

Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise co

625

adjacent_context

low

Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

Detection & Tracking / Detection

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy

adjacent genus=Detection with no direct reconstruction/SLAM/map signal in title

abstract

Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated sc

626

adjacent_context

low

RAM: Recover Any 3D Human Motion in-the-Wild

Video & Motion / Human Motion

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy

adjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title

abstract

Recovering 3D human motion from monocular videos in-the-wild remains challenging due to occlusions, rapid movements, and viewpoint variations. To address these challenges, we introduce **Recover-Anyone Module (RAM)**, a unified framework for real-time and accurate 3D human motion reconstruction. RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks su

627

adjacent_context

low

Unified Video Editing as Temporal Reasoner

Generative Models / Image Editing

D. adjacent but useful context

pose_calibration_localization; dynamic_4d; robotics_mapping; generation_editing

adjacent genus=Image Editing with no direct reconstruction/SLAM/map signal in title

abstract

Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "seeing, reasoning, then editing" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning toke

628

adjacent_context

low

HUMAPS-4D : A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations

Data & Evaluation / Benchmark

D. adjacent but useful context

general_reconstruction; dynamic_4d; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

Current advancements in human motion understanding are strongly reliant on video data. Nevertheless, privacy regulations and operational constraints increasingly restrict the use of visual data in real-world scenarios. Inferring posture through wearable sensors, such as instrumented insoles measuring plantar activation, presents itself as a promising alternative. However, the absence of large-scale multimodal datasets hinders the rigorous benchmarking of these methodologies. We introduce HUMAPS-4D, a novel multimodal dataset designed for human motion analysis, effectively bridging computer vision and biomechanics. This dataset integrates synchronized motion capture, multi-view video, IMUs, plantar pressure signals, sEMG activation patterns, and high-level semantic annotations. The data was collected from 32 subjects performing 30 actions over a total duration of 14 hours. Participants de

629

adjacent_context

low

Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining

Video & Motion / Human Motion

D. adjacent but useful context

general_reconstruction; gaussian_radiance; data_benchmark

adjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title

abstract

Human motion recovery for real-world interaction demands both precise action details and metric-scale trajectories. Recovering absolute human pose from monocular input presents a viable solution, but faces two main challenges: (1) models' reliance on 3D training data from constrained environments limits their out-of-distribution generalization; and (2) the inherent difficulty of estimating metric-scale poses from monocular observations. This paper introduces Mocap-2-to-3, a novel framework that differs from prior HMR methods by recovering absolute poses from monocular input and leveraging abundant 2D data to enhance 3D motion recovery. To effectively utilize the action priors and diversity in large-scale 2D datasets, we reformulate 3D motion as a multi-view synthesis process and divide the training into two stages: a single-view diffusion model is first pre-trained on extensive 2D data,

630

adjacent_context

low

SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

general_reconstruction; dynamic_4d; data_benchmark

adjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title

abstract

In surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, difficult to scale, and its expertise remains confined to institutions with available specialists. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHands), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. SHands captures linear incision and suturing using five RGB cameras from complementary viewpoints, performed by 52 participants (20 experts and 32 trainees) each completing three standardized trials per procedure. The videos are annotated at the frame level with 15 ges

631

adjacent_context

low

240FPS Stereo Vision from Monocular Mixed Spikes

Computational Imaging / Computational Imaging

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d

adjacent genus=Computational Imaging with no direct reconstruction/SLAM/map signal in title

abstract

Stereo vision is fundamental for enabling machines to perceive and interact with the world. While monocular stereo methods offer hardware compactness, they struggle with generalization due to reliance on data-driven priors. Binocular and multi-view systems improve accuracy but incur higher hardware complexity and data inefficiency. In this paper, we introduce a monocular solution for high-frame-rate stereo vision via temporal optical modulation. The modulation directs light from two views in a mixed manner while periodically attenuates one view at 60Hz. To capture the temporal variations introduced by this modulation, we employ a high-speed spike camera that records the mixed scene as temporally dense spikes. And the high temporal resolution of these spikes enables the construction of a linear system for efficient binocular video decoupling.Consequently, we introduce a two-stage decoding

632

adjacent_context

low

EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

Multimodal & Language / VLM / MLLM

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping

adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title

abstract

Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for “thinking with images’’ (e.g., ChatGPT–o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present \textbf{EagleVision}, a dual-stage framework for progressive spatial cognition through macro percep

633

adjacent_context

low

Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

general_reconstruction; depth_correspondence; dynamic_4d

adjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title

abstract

Endoscopic video analysis is crucial for early gastrointestinal screening, but its progress is constrained by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods designed for natural videos tend to prioritize dense spatio-temporal modeling and exhibit motion bias, neglecting the static, structured semantics that are critical for clinical decision-making. To address this challenge, we propose **F**ocus-to-**P**erceive **R**epresentation **L**earning (***FPRL***), a cognition-inspired hierarchical framework that emulates the clinical examination process of endoscopic videos. ***FPRL*** first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, ***FPRL*** employs a hierarchical semantic modeling mechanism that explicitly

634

adjacent_context

low

Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals

Computational Imaging / Computational Imaging

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping

adjacent genus=Computational Imaging with no direct reconstruction/SLAM/map signal in title

abstract

Robust 3D environmental perception is critical for applications like autonomous navigation and robotics, yet existing optical sensors like cameras and LiDAR fail in adverse conditions such as smoke, fog, and non-ideal lighting. While specialized radar systems can operate in these conditions, their reliance on bespoke, ultra-wideband hardware and licensed spectrum limits their scalability and cost-effectiveness. This paper introduces Rascene, a novel framework that enables high-fidelity 3D imaging by repurposing ubiquitous mmWave OFDM communication signals. Recognizing that a single-frame RF signal is inherently sparse, noisy, and highly ambiguous, the key innovation of Rascene is a multi-frame 3D imaging framework designed to fuse information from signals captured across multiple, arbitrary poses. This framework leverages a spatially adaptive fusion mechanism to find geometric consensus

635

adjacent_context

low

Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models

Multimodal & Language / VLM / MLLM

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping

adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title

abstract

Spatial reasoning is the process of locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length th

636

adjacent_context

low

Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements

Video & Motion / Human Motion

D. adjacent but useful context

general_reconstruction; dynamic_4d; data_benchmark

adjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title

abstract

Effective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchical representation consisting of Action Atoms which capture the atomic joint movements and Action Motifs which are formed by their temporal compositions and encode similar body movements found across different overall human actions. We derive A4Mer, a nested latent Transformer to learn this hierarchical representation from human pose data in a fully self-supervised manner. A4Mer splits a 3D pose sequence into variable-length segments and represents each segment as a single latent token (Action Atoms). Through bottom-up representation learning, temporal patterns composed of these Action Atoms, which capture meaningful temporal spans of reusable, semantic segments of body movements, naturally emerge (Action Motifs). A4Mer achieves this with

637

adjacent_context

low

Gloria: Consistent Character Video Generation via Content Anchors

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the "memory", leading to suboptimal consistency.Recognizing that character video generation inherently resembles an ``outside-looking-in" scenario. In this work, we propose represent the character’s visual attributes through a compact set of anchor frames.This design provides stable references for consistency, while reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Conditio

638

adjacent_context

low

Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection

Detection & Tracking / Detection

D. adjacent but useful context

pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark

adjacent genus=Detection with no direct reconstruction/SLAM/map signal in title

abstract

Integrating LiDAR and camera inputs into a unified Bird’s-Eye-View (BEV) representation is crucial for enhancing 3D perception capabilities of autonomous vehicles. However, existing methods suffer from spatial misalignment between LiDAR and camera features, which causes inaccurate depth supervision in camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from calibration inaccuracies and rolling shutter effect.The key insight of this work is that locations of these projection errors are not random but highly predictable, as they are concentrated at object-background boundaries which 2D detectors can reliably identify. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calib

639

adjacent_context

low

In Pursuit of Pixel Supervision for Visual Pre-training

Learning Algorithms / Self-supervised

D. adjacent but useful context

general_reconstruction; depth_correspondence; robotics_mapping

adjacent genus=Self-supervised with no direct reconstruction/SLAM/map signal in title

abstract

Pixels provide a lightweight, scalable way to encode the physical world, preserving rich visual information with minimal human inductive bias. We demonstrate that visual pre-training using pixel supervision alone can learn desirable visual properties and produce strong representations, while remaining simple, stable, and efficient. We present Pixo, a capable self-supervised model trained by purely predicting pixels. It is instantiated on the masked autoencoding (MAE) framework, but enhances MAE with a deeper decoder, larger-block masking, and additional class tokens. It is trained on 2B web-crawled images with a self-curated strategy. Pixo performs well on many downstream tasks, covering monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), object segmentation (e.g., SAM 2), and embodied AI. We will release the training code and pre-traine

640

adjacent_context

low

Fast Reasoning Segmentation for Images and Videos

Segmentation & Dense Prediction / Segmentation

D. adjacent but useful context

depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

adjacent genus=Segmentation with no direct reconstruction/SLAM/map signal in title

abstract

Reasoning segmentation enables open-set object segmentation via implicit text queries, therefore serving as a foundation for embodied agents that should operate autonomously in real-world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters that exceed the computational capabilities of edge devices that typically deploy the embodied AI systems. Distillation offers a pathway to compress these models while preserving their capabilities. Yet, existing distillation approaches fail to transfer the multi-step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re-frami

641

adjacent_context

low

GenMatter: Perceiving Physical Objects with Generative Matter Models

Segmentation & Dense Prediction / Segmentation

D. adjacent but useful context

general_reconstruction; dynamic_4d; data_benchmark

adjacent genus=Segmentation with no direct reconstruction/SLAM/map signal in title

abstract

Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion and appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates

642

adjacent_context

low

InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization

Generative Models / Image Editing

D. adjacent but useful context

general_reconstruction; dynamic_4d; generation_editing

adjacent genus=Image Editing with no direct reconstruction/SLAM/map signal in title

abstract

We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views.Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a spec

643

adjacent_context

low

MatLat: Material Latent Space for PBR Texture Generation

Generative Models / Diffusion

D. adjacent but useful context

depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

adjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title

abstract

We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, **MatLat**, through targeted fine-tuning. Unlike prior methods that freeze the embedding network and thus lead to distribution shifts when encoding additional PBR channels and hinder subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops

644

adjacent_context

low

Plenoptic Video Generation

Generative Models / Video Generation

D. adjacent but useful context

general_reconstruction; dynamic_4d; generation_editing

adjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title

abstract

Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view setting, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address it, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, Our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against l

645

adjacent_context

low

MambaSIC: Mamba-based Stereo Image Compression with Bi-directional Multi-reference Entropy Model

Learning Algorithms / Efficient Models

D. adjacent but useful context

general_reconstruction; depth_correspondence; robotics_mapping

adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title

abstract

Stereo image compression (SIC) has become increasingly vital with its applications surging in fields such as 3D reconstruction and autonomous navigation. Previous methods leverage cross-attention to model inter-view redundancy and employ autoregressive entropy models to predict probability distributions, achieving impressive rate-distortion performance. However, they suffer from slow coding speed due to the quadratic complexity of cross-attention mechanisms and the spatial autoregressive iterations of the entropy models. To address these limitations, we propose MambaSIC, which introduces two key innovations. First, we propose a Mamba-based stereo visual state space block (stereo VSSB) that leverages its linear complexity and long-range modeling capabilities to more rapidly and efficiently capture redundancy information between the two views. Second, to accelerate the compression process

646

adjacent_context

low

Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

pose_calibration_localization; dynamic_4d; robotics_mapping

adjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title

abstract

Diffeomorphic image registration (DIR) seeks topology-preserving transformations and is fundamental in medical imaging. Existing DIR methods rely on integration schemes (e.g., scaling-and-squaring) and multiple regularizers to enforce invertibility. We introduce **SGDIR**, a continuous-time registration framework, parameterized by known time-embedded backbones, that models diffeomorphisms using only a single semigroup-based regularization, eliminating explicit integration and auxiliary constraints. We mathematically prove that this formulation directly learns the flow of an underlying ODE, inherently enforcing inverse and cycle consistencies. We evaluate on eight 2D and 3D MR and CT datasets. Under strict semigroup enforcement, our model achieves near-perfect diffeomorphism (near-zero folding) and significantly outperforms existing diffeomorphic methods, while remaining competitive with

647

adjacent_context

low

SPREAD: Spatial-Physical Reasoning via gEometry Aware Diffusion

Generative Models / Diffusion

D. adjacent but useful context

surface_occupancy; robotics_mapping; generation_editing; data_benchmark

adjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title

abstract

Automated 3D scene generation is pivotal for applications spanning virtual reality, digital content creation, and Embodied AI. While computer graphics prioritizes aesthetic layouts, vision and robotics demand scenes that mirror real-world complexity which current data-driven methods struggle to achieve due to limited unstructured training data and insufficient spatial and physical modeling. We propose SPREAD, a diffusion-based framework that jointly learns spatial and physical relationships through a graph transformer, explicitly conditioning on posed scene point clouds for geometric awareness. Moreover, our model integrates differentiable guidance for collision avoidance, relational constraint, and gravity, ensuring physically coherent scenes without sacrificing relational context. Our experiments on 3D-FRONT and ProcTHOR datasets demonstrate state-of-the-art performance in spatial-rela

648

adjacent_context

low

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Multimodal & Language / VLM / MLLM

D. adjacent but useful context

general_reconstruction; robotics_mapping; data_benchmark

adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title

abstract

Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents.We attribute this to a modality gap between the 3D tasks and the 2D training of VLM, which led to inefficient retrieval of 3D information from 2D input.To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLM. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning.Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving an

649

adjacent_context

low

Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation

Generative Models / Diffusion

D. adjacent but useful context

gaussian_radiance; robotics_mapping; generation_editing

adjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title

abstract

In stroke-based rendering, search methods often get trapped in local minima due to discrete stroke placement, while differentiable optimizers lack structural awareness and produce unstructured layouts. To bridge this gap, we propose a dual representation that couples discrete polylines with continuous Bézier control points via a bidirectional mapping mechanism. This enables collaborative optimization: local gradients refine global stroke structures, while content-aware stroke proposals help escape poor local optima. Our representation further supports Gaussian-splatting-inspired initialization, enabling highly parallel stroke optimization across the image. Experiments show that our approach reduces the number of strokes by 30–50\%, achieves more structurally coherent layouts, and improves reconstruction quality, while cutting optimization time by 30–40\% compared to existing differentiab

650

adjacent_context

low

AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects

Data & Evaluation / Benchmark

D. adjacent but useful context

pose_calibration_localization; robotics_mapping; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking shape complexities and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and part assembly trajectories. We also propose a transformer-based model, AssemblyDyno, which uses the instructional manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior works in both assembly pose estimation and trajectory feasibility, where the latter is evaluated by our physics-based simulations.

651

adjacent_context

low

SceMoS: Local Scene-Aware Human Motion Synthesis by Planning with Geometry-Grounded Tokens

Video & Motion / Human Motion

D. adjacent but useful context

surface_occupancy; robotics_mapping; data_benchmark

adjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title

abstract

Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent (“walk to the couch”) and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a top-down bird’s-eye-view (BEV) image of the scene, encoded with DINOv2 features, as the scene representation, a

652

adjacent_context

low

PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis

Low-level Vision / IQA

D. adjacent but useful context

general_reconstruction; gaussian_radiance

adjacent genus=IQA with no direct reconstruction/SLAM/map signal in title

abstract

Diffusion models are promising for sparse-view novel view synthesis (NVS), as they can generate pseudo-ground-truth views to aid 3D reconstruction pipelines like 3D Gaussian Splatting (3DGS). However, these synthesized images often contain photometric and geometric inconsistencies, and their direct use for supervision can impair reconstruction. To address this, we propose Partial-Reference Image Quality Assessment (PR-IQA), a framework that evaluates diffusion-generated views using reference images from different poses, eliminating the need for ground truth. PR-IQA first computes a geometrically consistent partial quality map in overlapping regions. It then performs quality completion to inpaint this partial map into a dense, full-image map. This completion is achieved via a cross-attention mechanism that incorporates reference-view context, ensuring cross-view consistency and enabling t

653

adjacent_context

low

Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields

Learning Algorithms / Efficient Models

D. adjacent but useful context

general_reconstruction; gaussian_radiance

adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title

abstract

Novel-view synthesis plays a crucial role in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent approaches, such as 3D Gaussian Splatting (3DGS), have emerged as state-of-the-art solutions, offering high-quality novel view synthesis in real time. However, training 3DGS models remains slow, particularly for high-resolution images, often requiring hours to fit a scene with 200 views. In this work, we aim to accelerate the fitting process by reducing computational overhead and improving learning efficiency. Specifically, we introduce a dilated rendering technique that renders only a subset of pixels instead of the full image, significantly reducing computational costs. To enhance learning efficiency, we develop a convergence-aware budget control mechanism that balances the addition of new Gaussians with the optimization of existing ones. Additionally

654

adjacent_context

low

ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding

Data & Evaluation / Benchmark

D. adjacent but useful context

pose_calibration_localization; robotics_mapping; data_benchmark

adjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title

abstract

Electronic Navigational Charts (ENCs) are the safety-critical backbone of modern maritime navigation, yet it remains unclear whether multimodal large language models (MLLMs) can reliably interpret them. Unlike natural images or conventional charts, ENCs encode regulations, bathymetry, and route constraints via standardized vector symbols, scale-dependent rendering, and precise geometric structure---requiring specialized maritime expertise for interpretation. We introduce ENC-Bench, the first benchmark dedicated to professional ENC understanding. ENC-Bench contains 20,490 expert-validated samples from 840 authentic National Oceanic and Atmospheric Administration (NOAA) ENCs, organized into a three-level hierarchy: Perception (symbol and feature recognition), Spatial Reasoning (coordinate localization, bearing, distance), and Maritime Decision-Making (route legality, safety assessment, em

655

adjacent_context

low

Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments

Detection & Tracking / Detection

D. adjacent but useful context

surface_occupancy; robotics_mapping; data_benchmark

adjacent genus=Detection with no direct reconstruction/SLAM/map signal in title

abstract

Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of poin

656

adjacent_context

low

Translating Signals to Languages for sEMG-Based Activity Recognition

Recognition & Classification / Classification

D. adjacent but useful context

surface_occupancy; robotics_mapping; data_benchmark

adjacent genus=Classification with no direct reconstruction/SLAM/map signal in title

abstract

Surface electromyography (sEMG) signal-based activity recognition has attracted increasing research attention in recent years. To develop accurate sEMG signal-based activity recognizers, numerous approaches have been proposed. Some studies focus on designing larger and more expressive model architectures to enhance the representational capacity of sEMG signals, while others aim to enrich model priors through large-scale pretraining, thereby improving recognition performance. Recently, large language models (LLMs) have shown remarkable generalization and reasoning capabilities in natural language processing, whose implicit knowledge, learned from extensive linguistic descriptions of actions, opens new possibilities for interpreting sEMG signals and inferring activity intentions. Motivated by this, we propose LLM-sEMG, a novel framework that leverages LLMs as sEMG activity recognizers. Wit

657

adjacent_context

low

Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

Generative Models / Diffusion

D. adjacent but useful context

gaussian_radiance; robotics_mapping

adjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title

abstract

We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground-level AND by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis, compared to ground imagery alone. We train on curated georeferenced data sets and paired satellite--terrain data, mined from open mapping services.

658

adjacent_context

low

PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation

Medical & Scientific Imaging / Medical Imaging

D. adjacent but useful context

gaussian_radiance; pose_calibration_localization

adjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title

abstract

Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided Region Network)—an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions

659

adjacent_context

low

Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation

Generative Models / Diffusion

D. adjacent but useful context

gaussian_radiance; robotics_mapping

adjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title

abstract

We propose Decoupled Residual Denoising Diffusion models (DRDD) for unified and data-efficient image-to-image (I2I) translation. While diffusion models have advanced I2I translation in terms of quality and diversity, we uncover a previously under-explored property in diffusion models. Crucially, beyond its conventional role of manifold lifting (i.e., moving data off low-dimensional manifolds), injecting Gaussian noise facilitates domain harmonization by implicitly aligning feature distributions across domains, a property particularly advantageous for unified I2I translation. However, existing diffusion models prematurely erode this harmonization effect, as noise and residuals are simultaneously removed in a single coupled diffusion process. To address this, DRDD decouples the diffusion process into two sequential and independent diffusion stages: (1) a stochastic noise diffusion for doma

660

likely_noise

low

3D-Object Perception Transformer (3PT)

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Current approaches to zero-shot 3D-object perception typically rely on ensembles of frozen foundation models.This limits deep object understanding and cross-domain generalization, making performance inadequate for real-world deployment. The 3D-Object Perception Transformer (3PT) addresses this limitation by unifying detection, segmentation, and 6DoF pose estimation in a single framework, directly trained for 3D-object perception. Based on two large-scale trained Transformers that specialize in 2D and 3D object-centric scene understanding respectively, 3PT continuously refines its object representations without depth input, enhancing 3D understanding by incorporating multi-view information. 3PT surpasses task-specialized models for detection and pose estimation, often achieving double-digit percentage improvements on the diverse BOP-benchmarks. Achieving high accuracy and robustness, \alg

661

likely_noise

low

ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Symmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane's orientation. In this paper, we address these limitations by presenting the first framework for detecting *3D-grounded reflectional symmetries* from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchS

662

likely_noise

low

Extend3D: Town-scale 3D Generation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; generation_editing

weak or indirect keyword match

abstract

In this paper, we propose Extend3D, a novel training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces of object-centric models in representing wide scenes, we extend the latent space in $x$ and $y$ directions. Then, by dividing the extended latent into overlapping patches, we use the object-centric 3D generative model on each patch and couple them at each time step. Since object-centric models are sub-optimal for sub-scene generation, we use the input image and point cloud extracted from a depth estimator as priors to enable this process. Using the point cloud prior, we initialize the scene structure and refine the occluded region iteratively with under-noised SDEdit. Also, both priors are used to optimize the extended latent during the denoising process so that the denoisi

663

likely_noise

low

Fast Markov Random Field Optimisation for Topologically Noisy 3D Shape Matching

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

In many real world applications of non-rigid shape matching, the shapes are subject to topological noise (i.e. varying genus). In this paper, we propose a novel formulation based on Markov Random Fields (MRF) that can handle these cases with topological noise. The solutions to our optimisation problem can be approximated efficiently using the alpha expansion algorithm, which gives rise to theoretical approximation guarantees. In particular, we cast non-rigid 3D shape matching as a multi-labelling problem in which each triangle of the source shape is assigned a label that represents the matching to a specific surface element on the target shape. We propose a novel pairwise term that imposes that our matching prefers solutions in which neighbouring triangles on the source shape remain close on the target shape. Further, by exploiting the specific structure of our label space, we show that

664

likely_noise

low

From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

Shape matching is a fundamental task in computer graphics and vision, with deep functional map methods emerging as a preferred solution. However, existing approaches primarily focus on learning informative feature representations by constraining both pointwise and functional maps, while overlooking the optimization of a crucial component: the spectral basis, which plays a key role in the (deep) functional maps pipeline. This oversight leads to suboptimal matching performance. Furthermore, these approaches mostly rely on conventional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational overhead. To address those, we introduce Advanced Functional Maps, which generalizes standard functional maps from fixed basis functions to learnable basis functions, supported by rigorous theoretical guarantees. In this framework, the spectral basi

665

likely_noise

low

Native and Compact Structured Latents for 3D Generation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; generation_editing; data_benchmark

keyword noise pattern without direct reconstruction signal

abstract

Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper present an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models compris

666

likely_noise

low

Pano360: Perspective to Panoramic Vision with Geometric Consistency

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Prior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns.Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams.Additionally, to establish an evaluation benchmark and train our network, we collected a large-scale data

667

likely_noise

low

PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmark

weak or indirect keyword match

abstract

6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and ge

668

likely_noise

low

Rethinking 2D-3D Registration: A Novel Network for High-Value Zone Selection and Representation Consistency Alignment

3D Vision & Geometry / Pose Estimation

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

Both detection-then-match and detection-free methods have been extensively studied for image-to-point cloud registration, yet they still face significant challenges. The detection-then-match approach emphasizes high-quality correspondences but is limited by the availability of repeatable keypoints, making it susceptible to errors from incorrect matches. In contrast, detection-free methods aim for dense correspondences using a coarse-to-fine strategy to mitigate matching errors. However, non-overlapping regions and low-quality matches still introduce inaccuracies, and the differences between image texture and point cloud structure cause inconsistent region representations, increasing the likelihood of incorrect matches.To address these challenges, we propose two innovative modules: the High-Value Zone Reinforced Selection Module (HZRS) and the Zone Representation Consistency Alignment Mod

669

likely_noise

low

SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; surface_occupancy; generation_editing; data_benchmark

weak or indirect keyword match

abstract

We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor

670

likely_noise

low

UniCorn: Unified Correspondence Transformer Across 2D and 3D

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

Visual correspondence across image-to-image (2D-2D), image-to-point cloud (2D-3D), and point cloud-to-point cloud (3D-3D) geometric matching forms the foundation for numerous 3D vision tasks. Despite sharing a similar problem structure, current methods use task-specific designs with separate models for each modality combination. We present UniCorn, the first correspondence model with shared weights that unifies geometric matching across all three tasks. Our key insight is that Transformer attention naturally captures cross-modal feature similarity. We propose a dual-stream decoder that maintains separate appearance and positional feature streams. This design enables end-to-end learning through stack-able layers while supporting flexible query-based correspondence estimation across heterogeneous modalities. Our architecture employs modality-specific backbones followed by shared encoder an

671

likely_noise

low

Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing; data_benchmark

weak or indirect keyword match

abstract

Realistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualize the city through a top-down planning strategy that defines a hierarchical “City–District–Grid” structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a pr

672

likely_noise

low

SonoWorld: From One Image to a 3D Audio-Visual Scene

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing; data_benchmark

weak or indirect keyword match

abstract

Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation.

673

likely_noise

low

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dat

674

likely_noise

low

Lafite : A Generative Latent Field for 3D Native Texturing

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing; data_benchmark

weak or indirect keyword match

abstract

Generating detailed and seamless textures for 3D meshes remains an open challenge. Recent image and video generation models, empowered by large-scale visual priors, are capable of producing highly detailed images and are thus promising for multi-view texture synthesis. However, evaluating texture quality involves multiple dimensions beyond visual fidelity. Multi-view back-projection often introduces seams and inconsistencies between different views or near occluded regions, while direct generation on UV-unwrapped maps suffers from UV distortions and ambiguities.Generating textures directly in 3D space offers an inherent advantage in ensuring continuity and spatial coherence, making it a critical and worthwhile research direction. Therefore, we systematically investigate 3D-native texture generation from the perspectives of representation and generation, and present current best practices

675

likely_noise

low

MatE: Material Extraction from Single-Image via Geometric Prior

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

The creation of high-fidelity, physically-based rendering (PBR) materials remains a bottleneck in many graphics pipelines, typically requiring specialized equipment and expert-driven post-processing. To democratize this process, we present MatE, a novel method for generating tileable PBR materials from a single image taken under unconstrained, real-world conditions. Given an image and a user-provided mask, MatE first performs coarse rectification using an estimated depth map as a geometric prior, and then employs a dual-branch diffusion model. Leveraging a learned consistency from rotation-aligned and scale-aligned training data, this model further rectify residual distortions from the coarse result and translate it into a complete set of material maps, including albedo, normal, roughness and height. Our framework achieves invariance to the unknown illumination and perspective of the inp

676

likely_noise

low

Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Establishing semantic correspondence without supervision is essential for handling diverse in-the-wild images where annotations are scarce.While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features.In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving abovementioned ambiguity. However, since FGW is a computationally prohibitive quadrati

677

likely_noise

low

3D Instance Models are Implicit Generalizable Spatial Learners

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing; data_benchmark

weak or indirect keyword match

abstract

Generalization remains the central challenge for interactive 3D scene generation. Existing learning‑based approaches ground spatial understanding in limited scene dataset, restricting generalization to new layouts.We instead reprogram a pre‑trained 3D instance generator to act as a scene‑level learner via, replacing dataset-bounded supervision with model-centric spatial supervision.This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions.Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator’s transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues.Replacing widely used canonical space, we instantiate this insight with a view‑centric formul

678

likely_noise

low

Velox: Learning Representations of 4D Geometry and Appearance

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d; surface_occupancy

weak or indirect keyword match

abstract

We introduce a framework for learning latent representations of 4D objects which are descriptive, faithfully capturing object geometry and appearance; compressive, aiding in downstream efficiency; and accessible, requiring minimal input, i.e., an unstructured dynamic point cloud, to construct. Specifically, Velox trains an encoder to compress spatiotemporal color point clouds into a set of *dynamic shape tokens*. These tokens are supervised using two complementary decoders: a 4D surface decoder, which models the time-varying surface distribution capturing the geometry; and a Gaussian decoder, which maps the tokens to 3D Gaussians, helping learn appearance.To demonstrate the utility of our representation, we evaluate it across three downstream tasks—video-to-4D generation, 3D tracking, and cloth simulation via image-to-4D generation—and observe strong performances in all settings.

679

likely_noise

low

Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing; data_benchmark

weak or indirect keyword match

abstract

Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task.We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space.Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explici

680

likely_noise

low

HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing; data_benchmark

weak or indirect keyword match

abstract

The emergence of virtual reality has necessitated the generation of detailed and customizable 3D hand models for interaction in the virtual world. However, the current methods for 3D hand model generation are both expensive and cumbersome, offering very little customizability to the users. While recent advancements in zero-shot text-to-3D synthesis have enabled the generation of diverse and customizable 3D models using Score Distillation Sampling (SDS), they do not generalize very well to 3D hand model generation, resulting in unnatural hand structures, view-inconsistencies and loss of details. To address these limitations, we introduce HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. Our findings suggest that view-inconsistencies in SDS is primarily caused due to the ambiguity in the probability landscape described by the text prompt, resulting in

681

likely_noise

low

MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

3D pose transfer aims to transfer the pose-style of a source mesh to a target character while preserving both the target's geometry and the source's pose characteristic. Existing methods are largely restricted to characters with similar structures and fail to generalize to category-free settings (e.g., transferring a humanoid's pose to a quadruped). The key challenge lies in the structural and transformation diversity inherent in distinct character types, which often leads to mismatched regions and poor transfer quality. To address these issues, we first construct a million-scale pose dataset across hundreds of distinct characters. We further propose MimiCAT, a cascade-transformer model designed for category-free 3D pose transfer. Instead of relying on strict one-to-one correspondence mappings, MimiCAT leverages semantic keypoint labels to learn a novel soft correspondence that enables f

682

likely_noise

low

Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; dynamic_4d; surface_occupancy

weak or indirect keyword match

abstract

Predicting scene dynamics from visual observations is challenging. Existing methods capture dynamics only within observed boundaries failing to extrapolate far beyond the training sequence. Node-RF (Neural ODE-based NeRF) overcomes this limitation by integrating Neural Ordinary Differential Equations (NODEs) with dynamic Neural Radiance Fields (NeRFs), enabling a continuous-time, spatiotemporal representation that generalizes beyond observed trajectories at constant memory cost. From visual input, Node-RF learns an implicit scene state that evolves over time via an ODE solver, propagating feature embeddings via differential calculus. A NeRF-based renderer interprets calculated embeddings to synthesize arbitrary views for long-range extrapolation. Training on multiple motion sequences with shared dynamics allows for generalization to unseen conditions. Our experiments demonstrate that Nod

683

likely_noise

low

Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing; data_benchmark

weak or indirect keyword match

abstract

Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Models (LLMs) have inspired the LLM-based CAD generation by representing CAD as command sequences. But these methods struggle in practical scenarios because command sequence representation does not support entity selection (e.g. faces or edges), limiting its ability to support complex editing operations such as chamfer or fillet. Further, the discretization of a continuous variable during sketch and extrude operations may result in topological errors. To address these limitations, we present Pointer-CAD, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. In particular, Pointer-CAD decomposes CA

684

likely_noise

low

MHopReg: Efficient Hierarchical Multi-Hop Graph Search for Point Cloud Registration

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Outlier rejection for correspondence-based point cloud registration confronts two fundamental challenges in real-world scenarios. First, low-overlap regions yield sparse and fragmented inlier distributions that are difficult to discover using conventional one-step global search strategies. Second, large-scale scenes present dense correspondence inputs that impose stringent requirements on the accuracy-efficiency trade-off of search algorithms. To this end, we propose a hierarchical multi-hop graph search framework that progressively refines correspondences to address these challenges. Our method constructs a compatibility graph with transformation-invariant embeddings to predict correspondence confidence, establishing the foundation for cluster-balanced seed sampling that ensures comprehensive coverage across fragmented regions. These strategically selected seeds subsequently drive hiera

685

likely_noise

low

Universal 3D Shape Matching via Coarse-to-Fine Language Guidance

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

Establishing dense correspondences between shapes is a crucial task in computer vision and graphics, while prior approaches depend on near-isometric assumptions and homogeneous subject types (i.e., only operate for human shapes). However, building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To achieve this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift "coarse" semantic cues into "fine" correspondence, which is achieved through two stages. In the "coarse" stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vis

686

likely_noise

low

Homaloidal parametrization for detecting critical two-view configurations

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

We consider the problem of identifying degenerate configurations while estimating the fundamental matrix from (at least) 8 point correspondences. It is known that such configurations correspond to an ill-posed estimation of the fundamental matrix, so it is important to identify them in practice. So far, a practical degeneracy test is only available for the cases of planar scenes and pure rotation, while the case of the general critical surface (e.g., a hyperboloid/cone/cylinder containing 3D points and camera centres) is less studied, and the only available method is highly unstable, involving a pre-computed fundamental matrix. In this paper, we propose a novel degeneracy test for detecting points on the critical surface. By exploiting the geometry of the so-called ``homaloidal net of conics'', we are able to design a simple and very practical test that requires the linear estimation o

687

likely_noise

low

Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

weak or indirect keyword match

abstract

Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization,

688

likely_noise

low

Beyond Reassembly: Fractured Object Recovery with Missing Parts

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

We propose a novel learning-based task named fractured object recovery. Unlike previous fractured object reassembly task that only targets aligning existing parts with overlaps, our task aims to not only reassemble irrelevant parts but also predict missing parts, resulting in a complete shape recovery immediately. Our task coincides with practical experiences, where the prior knowledge of similar shapes can be leverage in the reassembly process, such that even non-overlapping parts can be reasoned into adequate locations. We also present the first learning model for the proposed task by correlating features of both existing and missing parts using a transformer, where the latter is naturally represented as missing tokens. Hence, our model can jointly estimate the poses of the existing parts and predict the shapes of the missing parts. To facilitate the task, we introduce a new dataset ba

689

likely_noise

low

CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

weak or indirect keyword match

abstract

Controllable, high-fidelity mesh editing remains a significant challenge in the domain of 3D content creation. Existing generative methods often struggle with complex geometries and fail to preserve fine-scale details. We propose CraftMesh, a novel framework for high-fidelity generative mesh manipulation based on Poisson Seamless Fusion. Our key insight is to decompose mesh editing into a pipeline that leverages the strengths of 2D image editing and 3D generative modeling: we first edit a 2D reference image, then generate a 3D mesh corresponding to the edited region, and fuse it seamlessly into the original mesh through a Joint Geometry and Appearance Fusion framework built on a hybrid SDF/Mesh representation to enable Poisson Geometry Blending and Poisson Texture Harmonization. Experimental results demonstrate that CraftMesh outperforms state-of-the-art methods, delivering improved stru

690

likely_noise

low

Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we pre

691

likely_noise

low

Parallelised Differentiable Straightest Geodesics for 3D Meshes

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

Machine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows to trace geodesics and parallel-transport vectors as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After proving our parallelisation performance and accuracy, we demon

692

likely_noise

low

PhysGen: Physically Grounded 3D Shape Generation for Industrial Design

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

Existing generative models for 3D shapes can synthesize high-fidelity and visually plausible shapes. For certain classes of shapes that have undergone an engineering design process, the realism of the shape is tightly coupled with the underlying physical properties, e.g., aerodynamic efficiency for automobiles. Since existing methods lack knowledge of such physics, they are unable to use this knowledge to enhance the realism of shape generation. Motivated by this, we propose a unified physics-based 3D shape generation pipeline, with a focus on industrial design applications. Specifically, we introduce a new flow matching model with explicit physical guidance, consisting of an alternating update process. We iteratively perform a velocity-based update and a physics-based refinement, progressively adjusting the latent code to align with the desired 3D shapes and physical properties. We furt

693

likely_noise

low

XPaintNet: An eXtreme Lightweight Framework for Stereoscopic Conversion without Inpainting Network

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

With the rapid growth of stereoscopic 3D devices, real-time stereoscopic conversion has become increasingly essential. However, most existing approach rely on depth estimation, forward warping, and heavy inpainting network, resulting in high computational cost and artifacts near occlusion boundaries. Diffusion-based models have also been explored, but they suffer from iterative sampling and geometric inconsistency, making them unsuitable for real-time deployment. To address these issues, we propose Bi-Warp, a simple yet effective approach that synthesizes the right view without inpainting network by leveraging warping operations. Our approach estimates backward flow, approximates the corresponding forward flow, and generates two candidate right views via bidirectional warping. A learnable mask adaptively fuses the candidates, preserving left–right geometric consistency. Building on Bi-Wa

694

likely_noise

low

FILTR: Extracting Topological Features from Pretrained 3D Models

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape's multiscale structure. In this paper we pose the question whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages

695

likely_noise

low

FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Autoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications.We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels.Extensive experiments show that FlashMesh achieves up to a 2$\times$ speedup over standard autoregressive models while also improving generation fidelity. Our results demo

696

likely_noise

low

FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and fl

697

likely_noise

low

LATTICE: Democratize High-Fidelity 3D Generation at Scale

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

keyword noise pattern without direct reconstruction signal

abstract

We present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. While 2D image synthesis benefits from fixed spatial grids and well-established transformer architectures, 3D generation remains fundamentally more challenging due to the need to predict both spatial structure and detailed geometric surfaces from scratch. These challenges are exacerbated by the computational complexity of existing 3D representations and the lack of structured and scalable 3D asset encoding schemes. To address this, we propose VoxSet, a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid, enabling efficient and position-aware generation. VoxSet retains the simplicity and compression advantages of prior VecSet methods while introducing explicit st

698

likely_noise

low

POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Face relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighti

699

likely_noise

low

Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

weak or indirect keyword match

abstract

We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls.Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, using renders of synthetic 3D assets, where annotations for control signals are available.While this approach can learn the desired controls, it often compromises the realism of the images due to domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images.To address this, we introduce Realiz3D, a lightweight framework that decouples controls and visual domain.The key idea is to explicitly learn visual domain, real or synthetic, separately from other control signals by introducing a co-varia

700

likely_noise

low

Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

weak or indirect keyword match

abstract

Recently, generating 3D assets using visual priors from pretrained diffusion models has shown remarkable results. However, due to the inherent lack of 3D geometric priors in 2D diffusion, the synthesized results often suffer from spatial hallucination and multi-view inconsistency. To address this limitation, we propose Thoughtful3D, a novel framework that enhances 3D content generation quality by introducing structural chain-of-thought (CoT) reasoning to alleviate inconsistent issues and mitigate hallucinations. Specifically, we design a dual-phase structural CoT strategy: (1) 3DBlueprint-CoT explicitly plans the 3D generation process through textual semantic parsing and logical deduction during the initialization phase. (2) 3DRefine-CoT dynamically evaluates latent inconsistencies by analyzing multiple renderings, employing a multi-round iterative refinement mechanism to suppress halluc

701

likely_noise

low

UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

weak or indirect keyword match

abstract

Part-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry–segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances

702

likely_noise

low

VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Human mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent probabilistic and diffusion-based methods tackle this ambiguity by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this issue, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Building upon this dataset, we propose a group preference alignment framework for finetuning diffusion-based HM

703

likely_noise

low

Learning Hierarchical Hyperbolic Mixture Model for Part-aware 3D Generation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

weak or indirect keyword match

abstract

3D shape generation has become increasingly important for graphics and vision applications. Current part-aware 3D generation usually overlooks hierarchical part relations or inefficiently encodes multi-level semantics in Euclidean space. Thus we propose a novel framework for hierarchical and efficient part-aware 3D generation in hyperbolic space. Our contributions are three-fold: (1) Hierarchical Hyperbolic Mixture Model (H$^2$MM): We propose part-aware semantic representation of objects within a hyperbolic manifold, providing a high-fidelity hierarchical part-aware representation of object details and semantics. (2) Hyperbolic Semantically Consistent Diffusion Model: We design the geodesic diffusion process that preserves the hierarchical and semantic structure of H$^{2}$MM, and progressively generates semantics from conditions and generates object under their joint guidance. We use an

704

likely_noise

low

Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

This paper introduces Nestwork, a unified latent-diffusion framework for conditional 3D furnished house layout generation using a heterogeneous graph of rooms and furniture. Designing reasonable and controllable 3D layouts that reflect the underlying semantic structure of a house is a key challenge in AI-assisted architectural design. Existing graph-based methods either produce unfurnished multi-room layouts or generate furnished scenes one room at a time, preventing joint reasoning over room structure and furniture placement. Nestwork represents an entire house as a heterogeneous graph with typed room and furniture nodes and multiple spatial relations. A single unconditional autoencoder based on a heterogeneous graph attention network embeds this graph into a compact latent space, and a low-rank relational field compensates for missing geometric edge information at test time. A diffusio

705

likely_noise

low

SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies hindering the model’s ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate–based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-c

706

likely_noise

low

Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence; data_benchmark

weak or indirect keyword match

abstract

Thermal cameras provide reliable visibility in darkness and adverse conditions, but thermal imagery remains significantly harder to use for novel view synthesis (NVS) than visible-light images. This difficulty stems primarily from two characteristics of affordable thermal sensors. First, thermal images have extremely low dynamic range, which weakens appearance cues and limits the gradients available for optimization. Second, thermal data exhibit rapid frame-to-frame photometric fluctuations together with slow radiometric drift, both of which destabilize correspondence estimation and create high-frequency floater artifacts during view synthesis, particularly when no RGB guidance is available. Guided by these observations, we introduce a lightweight preprocessing and splatting pipeline that expands usable dynamic range and stabilizes per-frame photometry. Our approach achieves state-of-the

707

likely_noise

low

Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; generation_editing

weak or indirect keyword match

abstract

Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show how the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.

708

likely_noise

low

Towards Intrinsic-Aware Monocular 3D Object Detection

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Monocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image.Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsic changes reshape how 3D scenes are projected onto the image plane.We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation.The key insight is that intrinsic variation is not a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry.To capture this effect, MonoIA employs large language models and vision–language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters.These embeddings are hierarchically integrated into the d

709

likely_noise

low

VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; dynamic_4d

weak or indirect keyword match

abstract

The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual–inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba’s dynamic parameterization for temporal modeling and Attention for spatial

710

likely_noise

low

Synthetic Knowledge-Guided Learning via Target-Region Gradients

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; dynamic_4d; generation_editing

keyword noise pattern without direct reconstruction signal

abstract

Training with synthetic data has become a standard strategy for improving robustness to distribution shifts. However, most existing approaches exploit synthetic samples only indirectly---for example, by enriching backgrounds, contexts, or negative examples---while providing no explicit signal about where the true target content resides.As a result, models can continue to rely on spurious correlations, which ultimately limit their robustness. In this work, we convert a basic but under-utilized provenance of synthetic data into explicit supervision: during synthesis, we know which pixels or elements originate from which source instances. We formalize this provenance as synthetic knowledge and propose a Synthetic Knowledge-Guided (SKG) training framework that uses it to shape gradients toward target regions and away from irrelevant ones via a Gradient Guide Loss. Our framework is generic an

711

likely_noise

low

ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; dynamic_4d; generation_editing

weak or indirect keyword match

abstract

Pose estimation remains challenging under sparse views, especially when visual overlap across images is extremely limited. Recent advances in video generation models offer a promising solution by enabling keyframe interpolation, which can enrich contextual cues and improve pose estimation performance. However, existing video generation models often lack 3D consistency, producing temporally plausible but spatially inconsistent frames that degrade downstream pose estimation. In this paper, we propose a framework ExPose that directly addresses 3D inconsistency when applying video generation to pose estimation in extreme-view settings. Specifically, we fine-tune a video generation model using Group Relative Preference Optimization (GRPO), aligning its outputs with 3D-consistent supervisory signals derived from pose estimation objectives. Our approach not only enhances the quality of temporal

712

likely_noise

low

Globally Optimal Pose from Silhouettes

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

We solve the problem of determining the pose of known shapes in $\mathbb{R}^3$ from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by corr

713

likely_noise

low

MGDHand: Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D Hand Pose Estimation from Sparse IMUs

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; dynamic_4d; data_benchmark

weak or indirect keyword match

abstract

3D hand pose estimation (HPE) from sparse inertial measurement units (IMUs) has shown great potential in human-computer interaction. However, due to the significant semantic gap between sparse local motion information and structured global pose information, estimating the hand poses from sparse IMU signals is ambiguous and challenging. Knowledge distillation can transfer rich knowledge from the stronger teacher to the student, so that the student enhances performance. Existing approaches distill morphological priors into the IMU-based student model, effectively improving its accuracy in complex scenarios. Nevertheless, overlooking the visual-inertial inherent semantic mismatch and information density difference leads to difficulties for students to learn coupled priors. In this paper, we propose a \textbf{M}ulti-\textbf{G}ranularity Prior-to-Inertial \textbf{D}istillation Framework for S

714

likely_noise

low

Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; data_benchmark

weak or indirect keyword match

abstract

Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, which are unable to generalize to complex real-world settings. We introduce Δynamics, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, Δynamics generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, Δynamics achieves a segmentation IoU of $0.30$, a $7\times$ improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Further, test-time sampling a

715

likely_noise

low

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; data_benchmark

weak or indirect keyword match

abstract

Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision language models show remarkable zero-shot capabilities.In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language-model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object or dataset-specific training, our approach achieves state-of-the-art results on common zero shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S)

716

likely_noise

low

Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; data_benchmark

weak or indirect keyword match

abstract

Multimodal image registration is a fundamental task for multimodal imagery and a prerequisite for downstream cross-modal analysis. Despite recent progress with shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, so modality-private cues can still leak into the shared space. Second, most multi-scale frameworks support only one transformation type, which limits their applicability in real-world scenarios where global misalignment and local deformation coexist.To address these issues, we view hybrid multimodal registration as jointly constructing a stable shared feature space and a unified hybrid transformation within that space. Building on this perspective, we introduce HRNet, a Hybrid Registration Network that couples representation disentanglement with

717

likely_noise

low

UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence; data_benchmark

weak or indirect keyword match

abstract

Low-visibility scenarios, such as low-light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low-light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to utilize well-lit labels by augmenting well-lit images to mimic low-light conditions.But handcrafted augmentations oversimplify noise patterns, while learning-based methods often fail to preserve high-frequency low-light characteristics, producing unrealistic images that lead pose models to generalize poorly to real low-light scenes.Moreover, recent pose estimators rely on image cues through image-to-keypoint cross-attention, but these cues become unreliable under low-light conditions.To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes realistic

718

likely_noise

low

S$^2$AM3D: Scale-controllable Part Segmentation of 3D Point Clouds

3D Vision & Geometry / Point Cloud

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Part-level point cloud segmentation has recently attracted significant attention in 3D computer vision.Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views.To address these challenges, we propose S$^2$AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providi

719

likely_noise

low

Generalized-CVO: Fast and Correspondence-Free Point Cloud Registration in RKHS with Second Order Riemannian Optimization

3D Vision & Geometry / Point Cloud

C. cluster representative

pose_calibration_localization; depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

We propose a fast and correspondence-free point cloud registration method that leverages local geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The proposed method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order methods used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR registration task in the driving domain, we achieve a reduction of $>55\%$ in both translati

720

likely_noise

low

VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement

3D Vision & Geometry / Point Cloud

C. cluster representative

general_reconstruction; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

We propose VIAFormer, a \textbf{V}oxel-\textbf{I}mage \textbf{A}lignment Trans-\textbf{former} model designed for Multi-view Conditioned Voxel Refinement—the task of repairing incomplete noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts on the voxel shape obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate VIAFormer as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to th

721

likely_noise

low

Voxify3D: Pixel Art Meets Volumetric Rendering

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with co

722

likely_noise

low

Image-Guided Geometric Stylization of 3D Meshes

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

Recent generative models can create visually plausible 3D representations of objects. However, the generation process often allows for implicit control signals, such as contextual descriptions, and rarely supports bold geometric distortions beyond existing data distributions. We propose a geometric stylization framework that deforms a 3D mesh, allowing it to express the style of an image. While style is inherently ambiguous, we utilize pre-trained diffusion models to extract an abstract representation of the provided image. Our coarse-to-fine stylization pipeline can drastically deform the input 3D model to express a diverse range of geometric variations while retaining the valid topology of the original mesh and part-level semantics. We also propose an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings. Extensive experiments demonstrate that our

723

likely_noise

low

OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for the open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel successfully build an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates

724

likely_noise

low

Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

Implicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high-frequency details. Existing methods partially mitigate this issue by using Fourier-based features, which usually rely on fixed frequency bases. This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task-relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev fea

725

likely_noise

low

FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

keyword noise pattern without direct reconstruction signal

abstract

Learning directly from boundary representations (B-reps) has significantly advanced 3D CAD analysis. However, state-of-the-art B-rep learning methods rely on absolute coordinates and normals to encode global context, making them highly sensitive to rotations. Our experiments reveal that models achieving over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary SO(3) rotations. To address this, we introduce FoV-Net, the first B-rep learning framework that captures both local surface geometry and global structural context in a rotation-invariant manner. Each face is represented by a Local Reference Frame (LRF) UV-grid that encodes its local surface geometry, and by Field-of-View (FoV) grids that capture the surrounding 3D context by casting rays and recording intersections with neighboring faces. Lightweight CNNs extract per-face features, which are propagated o

726

likely_noise

low

HiFi-Brep: High-Fidelity B-Rep Latent Representation and Robust Generation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

Boundary representation (B-rep) generation is a fundamental task in Computer-Aided Design (CAD), enabling automated modeling of 3D geometries. However, the direct synthesis of valid and high-quality B-reps remains a major challenge.Existing deep generative methods suffer from brittle representation and generation paradigms, due to: (1) representation noise from padding variable-length sequences and feature contamination between distant primitives, and (2) fragile generation pipelines marked by cascaded decoding error propagation and a train-inference mismatch from deferred validity enforcement.To address this, we propose HiFi-Brep. Our core insight is that robust, high-validity generation requires: first, building upon a compact and high-fidelity latent representation; and second, reformulating validity constraints as differentiable inductive biases within a single-stage generation proce

727

likely_noise

low

HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first one is to leverage existing rigged assets and animate them with extensive poses from daily life. The second strategy is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than

728

likely_noise

low

Mirror Illusion Art

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

Mirror Illusion Art is a novel reflection-conditioned 3D illusion where one object yields two target appearances (front and mirror). The task is formulated as inverse design from two target 2D images (front and mirror) to a printable 3D object with geometry and texture. Prior topology-driven and shadow-based approaches demand substantial manual effort, optimize shape only, and often yield non-smooth or incomplete geometry. To address these challenges, we propose AutoMIA, an automated Mirror Illusion Art design pipeline that jointly optimizes shape and color. To stabilize optimization and suppress artifacts, four mechanisms are introduced: (1) projection-alignment component (PAC) selection to reduce surface noise, (2) position-weighted adaptive (PWA) suppression for background noise, (3) internal voxel preservation (IVP) to prevent internal fractures, and (4) shape-color decoupled (SCD)

729

likely_noise

low

PaNDaS: Learnable Shape Interpolation Modeling with Localized Control

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

We present PaNDaS, a novel deep learning framework for Partial Non-Rigid Deformations and interpolations of Surfaces (PaNDaS). PaNDaS learns a per-face feature field on the source mesh and fuses it with a global encoding of the target. A deformation generator predicts a Jacobian field and recovers a smooth displacement, enabling precise regional control, pose mixing, and transferable local edits. Unlike previous approaches, our method can restrict the deformations to specific parts of the shape in a versatile way. Across various human body part datasets, PaNDaS achieves state-of-the-art interpolation accuracy and stronger locality than methods based on global shape codes or handles, while remaining robust to remeshing. We demonstrate several localized shape manipulation tasks and show that our method can generate new shapes by combining different input deformations.

730

likely_noise

low

ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; surface_occupancy

weak or indirect keyword match

abstract

In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive, streamable 3D representation method is needed that can be immediately deployed and continuously optimized as resources increase. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown byadaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face‑local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. ProgressiveAvatars supports incremental loading rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths

731

likely_noise

low

Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision–Language Understanding

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

Recent advances in 3D vision–language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning.However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec).Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next,

732

likely_noise

low

SuP: Sub-cloud Driven Point Cloud Registration

3D Vision & Geometry / Point Cloud

C. cluster representative

pose_calibration_localization; depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

While existing point-cloud-registration methods can well handle high-overlap scenarios of two point clouds, they often struggle with low-overlap scenarios, due to inevitable geometric/semantic ambiguities in the non-overlapping regions. In this paper, we introduce SuP, a novel framework that reformulates low-overlap registration as a high-overlap sub-cloud pairs (anchor pairs) mining problem. Central to SuP is our Dual-phase Sub-cloud Anchor Mining (DSAM) module, which first subdivides the source and target point clouds into multiple sub-clouds, followed by introducing a dual-phase weighting pipeline: 1) an efficient overlap-guided prior-weighting scheme (OPS) that leverages feature salience to identify candidate anchor pairs, and 2) a multi-scale post-weighting network (MPN) that exploits neighborhood feature consensus to further identify anchor pairs. Subsequently, final corresponde

733

likely_noise

low

UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

Zero-Shot 3D Visual Grounding (Zero-Shot 3DVG) aims to localize target objects in 3D scenes from natural language descriptions without relying on instance-wise description annotations. Existing methods rely on extra 2D images during inference and/or require multi-turn interactions with large language models (LLMs) or vision-language models (VLMs), which increase latency, computational cost, and deployment complexity. To overcome these limitations, we propose Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions (UZ3DVG), which is fed with 3D point clouds and textual descriptions only during inference and does not depend on external models. This is a new training paradigm: a VLM is employed solely to produce object-wise descriptions (pseudo labels) and reasoning chains for training a lightweight 3DVG model with robust spatial reasoning. Specifically, the introduced Open

734

likely_noise

low

AutoRegressive Generation with B-rep Holistic Token Sequence Representation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

Previous representation and generation approaches for the B-rep relied on graph-based representations that disentangle geometric and topological features through decoupled computational pipelines, thereby precluding the application of sequence-based generative frameworks, such as transformer architectures that have demonstrated remarkable performance. In this paper, we propose BrepARG, the first attempt to encode B-rep's geometry and topology into a holistic token sequence representation, enabling sequence-based B-rep generation with an autoregressive architecture. Specifically, BrepARG encodes B-rep into 3 types of tokens: geometry and position tokens representing geometric features, and face index tokens representing topology. Then the holistic token sequence is constructed hierarchically, starting with constructing the geometry blocks (i.e., faces and edges) using the above tokens, fo

735

likely_noise

low

CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency -- textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling.To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure.It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity.Coupled with a two-stage diffusion transformer, CaliTex makes geo

736

likely_noise

low

Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene layout estimation. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structural language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability.Furthermore, we design a pa

737

likely_noise

low

GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance; depth_correspondence

weak or indirect keyword match

abstract

Recent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions.We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling.To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples.Empirically, our method surpasses diffusion-based

738

likely_noise

low

LAM: Language Articulated Object Modelers

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

weak or indirect keyword match

abstract

We introduce LAM, a system that explores the collaboration of large-language mod-els and vision-language models to generate articulated objects from text prompts.Our approach differs from previous methods that either rely on input visual structure(e.g., an image) or assemble articulated models from pre-built assets. In contrast,we formulate articulated object generation as a unified code generation task, wheregeometry and articulations can be co-designed from scratch. Given an input text,LAM coordinates a team of specialized modules to generate code to represent thedesired articulated object procedurally. The LAM first reasons about the hierarchi-cal structure of parts (links) with Link Designer, then writes code, compiles it, anddebugs it with Geometry & Articulation Coders and self-corrects with Geometry& Articulation Checkers. The code serves as a structured and interpretable bridgebe

739

likely_noise

low

PP-Brep: Few-Shot B-rep Classification with Hybrid Graph Representation

3D Vision & Geometry / 3D Reconstruction

C. cluster representative

general_reconstruction; surface_occupancy

keyword noise pattern without direct reconstruction signal

abstract

In industrial settings, classification of 3D CAD models are critical for efficient manufacturing. However, the limited availability of annotated CAD models presents an obstacle to achieving rapid adaptation in few-shot part classification scenarios. In this paper, we propose a hybrid graph representation and a pre-training and graph prompt framework for B-rep few-shot classification. Specifically, hybrid graph representation captures comprehensive and multi-level structural information of B-rep models by constructing local topology graph, global parallel graph and regional association hypergraph. A hierarchical graph network then fuses component-level structures with topological details in the hybrid graph. Reinforcement-augmented contrastive pre-training produces robust universal representations while in-place perturbation reduces training time. Structure-aware graph prompts finally pro

740

likely_noise

low

Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation

3D Vision & Geometry / Point Cloud

C. cluster representative

pose_calibration_localization; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly in the realm of RGB-based methods. However, the use of RGB images is often limited by issues such as occlusion and privacy constraints. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance. In this work, we introduce a novel balanced multi-modal learning method for 3D HPE, which harnesses the power of RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to assess the contribution of each modality and detect modality imbalance. To address this imbalance, we design a modality learning regulation strategy that decelerates the learning process during the early stages of training. We conduct extensive experiments on the

741

likely_noise

low

4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis

3D Vision & Geometry / Point Cloud

C. cluster representative

dynamic_4d; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Rotation invariance remains a core challenge in point cloud analysis, where existing methods often struggle with structural ambiguities and insufficient global context. Most rotation-invariant (RI) representations are derived from local coordinate systems, which inherently suffer from point-pair ambiguities and fail to capture discriminative features in symmetric or repetitive structures, while discarding informative global pose cues. To overcome these limitations, we propose Ga4DPF, a novel framework that offers a robust, global-aware RI representation by converting rotation-equivariant geometric representations into invariant ones, while concurrently integrating global pose awareness. Specifically, Ga4DPF introduces a learnable steerable transform that equivariantly lifts point clouds into 4D space, facilitating robust local feature construction and mitigating point-pair ambiguities. C

742

likely_noise

low

Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis

3D Vision & Geometry / Point Cloud

C. cluster representative

pose_calibration_localization; surface_occupancy; data_benchmark

keyword noise pattern without direct reconstruction signal

abstract

A symmetry on rigid motion is one of the salient factors in efficient learning of 3D point cloud problems. Group convolution has been a representative method to extract equivariant features, but its realizations have struggled to retain both rigorous symmetry and scalability simultaneously. We advocate utilizing the intertwiner framework to resolve this trade-off, but previous works on it, which did not achieve complete SE(3) symmetry or scalability to large-scale problems, necessitate a more advanced kernel architecture. We present Equivariant Coordinate-based Kernel Convolution, or ECKConv. It acquires SE(3) equivariance from the kernel domain defined in a double coset space, and its explicit kernel design using coordinate-based networks enhances its learning capability and memory efficiency. The experiments on diverse point cloud tasks, e.g., classification, pose registration, part se

743

likely_noise

low

Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence

3D Vision & Geometry / Point Cloud

C. cluster representative

depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Unsupervised non-rigid point cloud correspondence aims to predict point-to-point correspondences without annotations. Existing methods leverage the spatial-relation-based feature propagation strategy that includes non-physical connections, which are sensitive to non-rigid deformation. To address this issue, we advocate to learn shape topology robust to non-rigid deformation, and propose the topology-aware feature propagation module integrated into a coarse-to-fine propagation and optimization pipeline. To extract point features robust to non-rigid deformation, we estimate keypoints as superpoints and encode superpoint features with topology weights, which learns reasonable topologies under non-rigid deformation. The vector quantization codebook is leveraged to enhance the original superpoint features with stored representative features across the dataset, improving feature robustness aga

744

likely_noise

low

Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; dynamic_4d

weak or indirect keyword match

abstract

Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between "perception" models that understand motion from video but only output text, and "generation" models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Mo

745

likely_noise

low

COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

weak or indirect keyword match

abstract

Estimating the 6DoF pose of a novel object with a single reference view is challenging due to occlusions, viewpoint changes, and outliers.A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse keypoints.We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem.COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as target marginals, naturally suppressing non-overlapping regions.Semantic priors from vision foundation model features further regularize the correspondences, leading to stable pose estimation.This design integrates confidence into the end-to-end correspondence finding and

746

likely_noise

low

Exploring 6D Object Pose Estimation with Deformation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; data_benchmark

weak or indirect keyword match

abstract

We present DeSOPE, a large-scale dataset designed for Deformed Six-DoF Object Pose Estimation. Most existing 6D object pose approaches assume rigid or articulated objects, leaving deformed daily objects largely unexplored. This gap limits the realism and robustness of current pose estimation methods, which often fail when objects deviate from their canonical shapes due to wear, collision, or deformation. To address this issue, we present DeSOPE, a large-scale real-world dataset specifically designed for deformed object pose estimation. DeSOPE contains two major components: (1) a collection of high-fidelity 3D scans of 26 common object categories, each captured in one canonical and three deformed states using a non-rigid alignment framework; and (2) a real-scene RGB-D dataset comprising 133K frames and 665K pose annotations across 104 deformed instances, recorded in both static and dynami

747

likely_noise

low

DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

weak or indirect keyword match

abstract

Noisy, partially overlapping data and the need for real-time processing pose major challenges for rigid registration. Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. First, we introduce an efficient filtering mechanism that incorporates a computationally lightweight single-point RANSAC algorithm followed by a refinement module to eliminate unreliable feature-based correspondences. Subsequently, we treat filtered correspondences as anchor points, extract geometric proxies, and formulate an effective objective function with a tailored solver to estimate the transformation. Experiments verify our method'

748

likely_noise

low

mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds

3D Vision & Geometry / Point Cloud

C. cluster representative

depth_correspondence; surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Millimeter-wave (mmWave) point clouds have attracted growing interest in human sensing due to their robustness, privacy preservation, and low cost. However, their practical use is hindered by the inherent sparsity of data and the lack of large-scale data. We revisit generative modeling for mmWave point clouds and propose a unified flow-matching framework mmWaveFlow that unifies enhancement and generation by learning an invertible transport between dense and sparse point clouds. We leverage paired data and a latent-alignment module to enforce semantic alignment and bridge the modality gap. We find that condition-free flow matching is more vulnerable to latent path crossings, which impair bidirectional transport. Therefore, we propose Origin-Aware Flow Matching (OA-Flow), which conditioning transport on the origin of the path mitigates ambiguity in bidirectional transport. Results of exper

749

likely_noise

low

RINO: Rotation-Invariant Non-Rigid Correspondences

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

weak or indirect keyword match

abstract

Dense 3D shape correspondence remains a central challenge in computer vision and graphics as many deep learning approaches still rely on intermediate geometric features or handcrafted descriptors, limiting their effectiveness under non-isometric deformations, partial data, and non-manifold inputs. To overcome these issues, we introduce RINO, an unsupervised, rotation-invariant dense correspondence framework that effectively unifies rigid and non-rigid shape matching. The core of our method is the novel RINONet, a feature extractor that integrates vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry. This allows for a fully end-to-end, data-driven approach that bypasses the need for shape pre-alignment or handcrafted features. Extensive experiments show unprecedented performance of RINO across challengi

750

likely_noise

low

Scalable Feature Matching via State Space Modeling and Sparse Correlation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

weak or indirect keyword match

abstract

Efficient and robust feature matching is crucial for latency-sensitive and resource-constrained applications. While current semi-dense feature matching approaches commonly suffer from quadratic complexity in spatial resolution due to transformer-based long-range context modeling or redundant full correlation computations. To overcome these limitations, we present a novel scalable feature matching method that delivers reliable correspondences with low memory footprint and latency, especially at high resolutions. Our approach introduces three key innovations: (1) a hybrid Conv-Mamba backbone for efficient cross-scale and cross-view feature extraction with linear complexity, (2) a training-free norm-based feature filtering mechanism, enabling sparse correlation that significantly reduces computation overhead during inference, and (3) a lightweight recurrent coordinate refinement that surpas

751

likely_noise

low

KASALv2: Fully Automatic 3D Rotational Symmetry Classification and Axis Localization

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; data_benchmark

keyword noise pattern without direct reconstruction signal

abstract

Rotational symmetry is an important prior in 6D pose estimation, improving pose accuracy and ensuring the consistency of symmetry-aware evaluation metrics. However, current symmetry annotations for 3D objects are still largely manual or semi-automatic, often requiring predefined symmetry types or rotational orders and thus limiting scalability. This work introduces a fully automatic and reference-free framework that performs symmetry-type classification, rotational-order identification, and full-axis localization across all eight canonical 3D rotational symmetry types. The method localizes a dominant high-order axis, infers its rotational order through self-consistency analysis, and reconstructs the complete symmetry structure under a hierarchy-guided geometric formulation. A texture-aware extension further models appearance-induced reductions in rotational order while preserving axis or

752

likely_noise

low

Affine Perspective-Three-Point Problem

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

weak or indirect keyword match

abstract

This paper addresses the Perspective-Three-Point (P3P) problem under affine camera models. We derive direct closed-form solvers for weak perspective and para perspective, which are representative affine camera models. The affine P3P solution reduces to a bi-quadratic equation. Unlike exact P3P solvers that require a cubic or quartic equation, it allows for the simple and stable calculation of real solutions using the quadratic formula. Since affine approximations are valid only when scene depth variation is small, we further propose an iterative correction that upgrades the affine solution to the exact P3P solution. Through extensive comparisons using synthetic data and public datasets, we demonstrate that affine P3P solvers with two upgrade iterations achieve performance substantially comparable to that of the state-of-the-art P3P solver.

753

likely_noise

low

Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

weak or indirect keyword match

abstract

In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods that typically predict intermediate 2D keypoints, followed by a Perspective-$n$-Point solver, have shown great performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. However, direct heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness in pose prediction. They also predict, in most cases, discontinuous pose representations that lack robustness. Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end

754

likely_noise

low

Linear Fundamental Matrix Estimation from 7 or 5 Points

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; depth_correspondence

weak or indirect keyword match

abstract

We revisit the problem of estimating the fundamental matrix of a pair of perspective cameras, a cornerstone of geometric computer vision.As is well-known, linear solvers require at least 8 point correspondences, whereas nonlinear minimal solvers require just 7 in the uncalibrated case or 5 in the calibrated case.In this paper, we consider a special case of the 7-point problem where 5 of the points are configured to lie on two lines, which has previously been shown to have a unique solution.As a theoretical contribution, we offer an analysis of how this uniqueness manifests in the standard 7-point algorithm. On a practical level, we provide the first practical linear solver for the minimal problem associated to this special configuration.Additionally, we evaluate a heuristic 5-point fundamental matrix solver based on the construction of virtual midpoints.When combined with early non-minim

755

likely_noise

low

GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global–Local Feature Fusion

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; data_benchmark

weak or indirect keyword match

abstract

We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations.Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture th

756

likely_noise

low

Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; data_benchmark

weak or indirect keyword match

abstract

Deep learning-based gaze estimation methods often exhibit significant performance degradation on unseen target domains. Through systematic frequency-domain analysis, we reveal that face images contain frequency components with distinct contributions: some facilitate cross-domain generalization while others introduce domain-specific interference that impedes it, with both components varying across datasets and constituting a key source of domain gap. Based on these observations, we propose the Frequency-Guided Adaptive Learning framework (FGAL), a novel framework enhancing domain generalization without accessing target domain data. The FGAL consists of two complementary modules: the Adaptive Interference Suppression Module (AISM) and the Spectrum Diversification Module (SDM). AISM adaptively suppresses sample-specific interfering frequency components through learnable modulation maps, whi

757

likely_noise

low

FusionRegister: Every Infrared and Visible Image Fusion Deserves Registrtaion

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; data_benchmark

weak or indirect keyword match

abstract

Spatial registration across different visual modalities is a critical but formidable step in multi-modality image fusion for real-world perception. Although there are several methods are proposed to address this issue, the existing registration joint fusion methods typically require extensive pre-registration operations, limiting their efficiency. To overcome these limitations, a general cross modality registration method guided by visual priors is proposed for multi-modality image fusion task, termed as FusionRegister.Firstly, FusionRegister achieves robustness by learning cross-modality misregistration representations rather than forcing alignment of all differences, ensuring stable outputs even under challenging input conditions.Moreover, FusionRegister demonstrates strong generality by operating directly on fused results, where misregistration is explicitly represented and effectivel

758

likely_noise

low

Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization; data_benchmark

weak or indirect keyword match

abstract

Estimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Gröbner-basis or resultant-based methods due to matrix inversion needed in the online solver. Here we propose a sampling-based, matrix inversion-free method that constructs the solvers using sparse hidden-variable resultants. The determinant polynomial in the hidden variable is efficiently reconstructed via inverse fast Fourier transform interpolation from sampled evaluations, avoiding symbolic expansion. Solving this polynomial yields the hidden variable, and the remaining unknowns are recovered by identifying rank-1 deficient submatrices and applying Cramer's rule. A greatest common divisor-based criterion ensures robust submatrix identification under noise. Experiments on diverse minimal problem

759

likely_noise

low

WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering

3D Vision & Geometry / 3D Gaussian Splatting

C. cluster representative

gaussian_radiance

weak or indirect keyword match

abstract

Existing methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local p

760

likely_noise

low

Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning

3D Vision & Geometry / Point Cloud

C. cluster representative

gaussian_radiance; surface_occupancy

weak or indirect keyword match

abstract

Large multimodal 3D vision--language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable.To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual param

761

likely_noise

low

PointCNN++: Performant Convolution on Native Points

3D Vision & Geometry / Point Cloud

C. cluster representative

pose_calibration_localization; surface_occupancy

weak or indirect keyword match

abstract

Existing convolutional learning methods for 3D point cloud data are divided into two paradigms: point-based methods that preserve geometric precision but often face performance challenges, and voxel-based methods that achieve high efficiency through quantization at the cost of geometric fidelity. This loss of precision is a critical bottleneck for tasks such as point cloud registration. We propose PointCNN++, a novel architectural design that fundamentally mitigates this precision-performance trade-off. It generalizes sparse convolution from voxels to points, treating voxel-based convolution as a specialized, degraded case of our more general point-based convolution. First, we introduce a point-centric convolution where the receptive field is centered on the original, high-precision point coordinates. Second, to make this high-fidelity operation performant, we design a computational stra

762

likely_noise

low

PointTPA: Test-Time Parameter Adaptation for 3D Scene Understanding

3D Vision & Geometry / Point Cloud

C. cluster representative

dynamic_4d; surface_occupancy

weak or indirect keyword match

abstract

Scene-level point cloud understanding remains challenging due to diverse geometries, imbalanced categories, and highly varied spatial layouts. Existing methods improve object-level performance but rely on static parameters during inference, limiting their adaptability to dynamic scene data. We propose Test-time Parameter Adaptation for Point Cloud Scene Perception (PointTPA), a test-time dynamic adaptation framework that constructs input-aware parameters for scene-level point clouds. PointTPA uses a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights, enabling the backbone to adjust its behavior according to scene-specific variations while keeping parameter cost low. Integrated into PTv3, PointTPA reduces trainable parameters by over 95% and achieves competitive or superior perform

763

likely_noise

low

Streamlined Open-Vocabulary Human-Object Interaction Detection

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization

keyword noise pattern without direct reconstruction signal

abstract

Open-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training.Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories.However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations.To address this issue, we introduce **SL-HOI**, a **S**tream**L**ined open-vocabulary **HOI** detection framework based solely on the powerful DINOv3 model.Our design leverages the complementary strengths of DINOv3's components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification.Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head's output,

764

likely_noise

low

ViTPrompt: Training-Free Prompt Refinement with Visual Tokens for Open-Vocabulary Detection

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization

keyword noise pattern without direct reconstruction signal

abstract

Test-Time Adaptive Object Detection (TTAOD) aims to maintain detection performance under distribution shifts without retraining. While recent vision-language models enable open-vocabulary detection, existing TTAOD methods—whether closed-set or open-vocabulary—focus exclusively on improving classification confidence and largely overlook the degradation of bounding box localization. To address this critical gap, we propose ViTPrompt (Visual Token-Prompting), a training-free framework that jointly refines both bounding boxes and class scores at test time. Our key insight is to augment the original text prompt with instance-aware visual tokens extracted from high-confidence detections in an initial forward pass; this enriched prompt is then used in a second inference stage, where the cross-modal decoder leverages the enhanced semantic context to produce more accurate box coordinates and clas

765

likely_noise

low

ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders

3D Vision & Geometry / Point Cloud

C. cluster representative

depth_correspondence; surface_occupancy

keyword noise pattern without direct reconstruction signal

abstract

Hierarchical LiDAR geometry compression encodes voxel occupancies from low to high bit-depths, yet prior methods treat each depth independently and re-estimate local context from coordinates at every level, limiting compression efficiency. We present ELiC, a real-time framework that combines cross-bit-depth feature propagation, a Bag-of-Encoders (BoE) selection scheme, and a Morton-order-preserving hierarchy. Cross-bit-depth propagation reuses features extracted at denser, lower depths to support prediction at sparser, higher depths. BoE selects, per depth, the most suitable coding network from a small pool, adapting capacity to observed occupancy statistics without training a separate model for each level. The Morton hierarchy maintains global Z-order across depth transitions, eliminating per-level sorting and reducing latency. Together these components improve entropy modeling and comp

766

likely_noise

low

Towards Generalized Multimodal Homography Estimation

3D Vision & Geometry / Pose Estimation

C. cluster representative

pose_calibration_localization

weak or indirect keyword match

abstract

Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when applied to unseen modalities. To address this issue, we propose a training data synthesis method that generates unaligned image pairs with ground-truth offsets from a single input image. Our approach renders the image pairs with diverse textures and colors while preserving their structural information. These synthetic data empower the trained model to achieve greater robustness and improved generalization across various domains. Additionally, we design a network to fully leverage cross-scale information and decouple color information from feature representations, thus improving estimation accuracy. Extensive experiments show that our training data synthesis method improves generalization perfor

767

likely_noise

low

AnyPcc: Compressing Any Point Cloud with a Single Universal Model

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy; data_benchmark

keyword noise pattern without direct reconstruction signal

abstract

Generalization remains a critical challenge in deep learning-based point cloud geometry compression. While existing methods perform well on standard benchmarks, their performance collapses in real-world scenarios due to two fundamental limitations: the lack of context models that are robust across diverse data densities, and the inability to efficiently adapt to out-of-distribution (OOD) data. To overcome both challenges, we introduce AnyPcc, a universal point cloud compression framework. AnyPcc first employs a Universal Context Model that leverages coarse-grained spatial priors with fine-grained channel priors to ensure robust context modeling across the entire density spectrum. Second, our novel Instance-Adaptive Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and implicit compression paradigms. For each instance, it fine-tunes a small subset of network weights and

768

likely_noise

low

Image-to-Point Cloud Feature Back-projection for Multimodal Training of 3D Semantic Segmentation

3D Vision & Geometry / Point Cloud

C. cluster representative

depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

The effective integration and utilization of multimodal data acquired from image cameras and LiDAR is of paramount importance for perception systems. This paper proposes **I**mage-to-**P**oint Cloud **F**eature Back-**P**rojection (**IPFP**), a novel method for training multimodal fusion networks that back-projects aggregated image-feature centers (from non-projection-aligned image pixels) into the point-cloud feature set via the estimated depth map. Consequently, image features and point cloud features reside within the same three-dimensional space, enabling the natural enrichment of image information into the point cloud during the network forward pass. This process can be selectively enabled when desired -- for instance, at training time -- and turned off in the absence of multimodal data -- for example, at testing time if only LiDAR sensors are available. Experimental results demonst

769

likely_noise

low

PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias reinforces short-range continuity but limits the model’s ability to capture long-range dependencies, thereby weakening its capacity to enforce global structural properties such as symmetry, geometric consistency, and large-scale spatial regularities. Inspired by the level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization al

770

likely_noise

low

PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Scene-level point cloud self-supervised learning (PC-SSL) has demonstrated potential in enhancing the generalization capability of 3D vision models. Despite the advances achieved in the field through existing methods, the sample-independent modelling paradigm still poses significant limitations in terms of maintaining consistent semantic representations across different scenes. This challenge hinders the construction of a unified and transferable semantic space. To address this issue, we propose a PC-SSL framework based on cross-sample semantic propagation (CSP), in which samples within a batch are serialized into a continuous input and processed by a state-space model to enable semantic state propagation. This mechanism explicitly models the dynamic dependencies across samples in the state space, allowing the network to establish cross-sample semantic consistency in the latent space, an

771

likely_noise

low

Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

Adverse-weather LiDAR point cloud generation is challenged by complex weather-induced degradations. These degradations affect geometry and reflectance in fundamentally different ways, making joint modeling difficult and ambiguous, especially when diverse real-world training data is limited. To address this, we propose $\textit{Structure-to-Intensity Diffusion}$ (SiD), a diffusion-based framework that explicitly factorizes the denoising process at each time step: it first reconstructs the geometric structure, then conditions reflectance intensity denoising on the estimated structure. This structure-conditioned design decomposes the joint distribution, reduces modeling ambiguity, and leads to point clouds that are both geometrically coherent and radiometrically realistic. To mitigate data scarcity, we introduce $\textit{Real-Prior Weather Simulation}$ (RPWS), a degradation module that leve

772

likely_noise

low

Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy; data_benchmark

weak or indirect keyword match

abstract

LiDAR semantic segmentation must remain robust under various sensor and environmental corruptions to be reliable in safety-critical applications.Existing test-time adaptation methods, including approaches based on pseudo-labels and normalization statistics, have shown promising results but can still struggle under severe distribution shifts.To complement these approaches, we propose a geometry-aware test-time training framework that leverages an auxiliary self-supervised objective.Our method is based on geometric inlier discrimination (GeoID), which injects synthetic off-manifold points into the input and trains the model to distinguish geometry-consistent inliers from synthetically displaced outliers, enabling adaptation on unlabeled test data.To further stabilize this process under real corruptions, we introduce bidirectional unreliable point filtering (BiUPF), which uses inlier scores

773

likely_noise

low

LitePT: Lighter Yet Stronger Point Transformer

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy

weak or indirect keyword match

abstract

Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has 3.6

774

likely_noise

low

Low-Rank Test-Time Training for Pre-Trained Point Cloud Models

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy

keyword noise pattern without direct reconstruction signal

abstract

Test-time training (TTT) enhances the robustness of pretrained models to out-of-distribution (OOD) data through auxiliary self-supervised tasks, without requiring labeled samples. However, existing TTT methods predominantly rely on decoder-based auxiliary objectives, which suffer from inefficient adaptation and weak coupling with the primary task. To solve these limitations, we revisit the mechanism of test-time training by analyzing masking-based pretrained models to uncover the fundamental source of their OOD robustness. Our investigation reveals that their generalization capability stems from a latent feature-level structural invariance, the consistency of encoded representations under masked perturbations. Building on this insight, we introduce LoTT-PC, a lightweight LoRA-based framework that operationalizes this invariance-preserving principle for 3D point cloud classification. LoTT

775

likely_noise

low

Point Cloud as a Foreign Language for Multi-modal Large Language Model

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy

weak or indirect keyword match

abstract

Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens—treating 3D data as a foreign language that naturally extends the LLM’s vocabulary. Furthermore, to enhance t

776

likely_noise

low

PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy

weak or indirect keyword match

abstract

This paper explores parallel thinking for Multi-modal Large Language Models (MLLMs), aiming to improve Chain-of-Thought (CoT) through multiple diverse reasoning paths. We guide the model to list multiple visual key points and develop an independent reasoning path for each. Therefore, we term this method PointThinker, which is characterized by starting each thinking path with a point. PointThinker offers two key advantages. (1) It amplifies the benefits of parallel thinking. While parallel thinking naturally benefits from multiple reasoning paths, explicitly listing key points further amplifies these benefits by eliminating redundancy and promoting path diversity, enabling the model to explore problems from more varied perspectives. (2) It uses a novel dense (point-wise) reward for reinforcement learning. We observe that during parallel thinking, some points are helpful while others are i

777

likely_noise

low

Routing on Demand: DSNet for Efficient Progressive Point Cloud Denoising

3D Vision & Geometry / Point Cloud

C. cluster representative

surface_occupancy

weak or indirect keyword match

abstract

Point cloud denoising is a critical preprocessing step for enhancing the reliability and accuracy of 3D perception systems. Most existing progressive denoising methods rely on fixed iterative pipelines that process all regions uniformly, resulting in redundant computation and over-smoothing of geometric details when handling point clouds with non-uniform noise distributions. To overcome these limitations, we introduce Dynamic Skip Net (DSNet), a novel progressive denoising framework that adaptively determines the optimal denoising path for each local patch based on its noise characteristics. DSNet incorporates a noise discriminator that quantifies local noise intensity by analyzing normal similarity, and a reverse monotonic decision function that maps this measure to an appropriate denoising module. Furthermore, we propose a Path-Selective Iteration mechanism that dynamically re-evaluate

778

likely_noise

low

A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views.Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy.To enable cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features.In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions.The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness.Furthermore, we propose a cross-view-aligned cylinder integration modul

779

likely_noise

low

Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer

Multimodal & Language / Agentic AI

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; generation_editing; data_benchmark

weak or indirect keyword match

abstract

Recent progress in 4D representations, such as Dynamic NeRF and 4D Gaussian Splatting (4DGS), has enabled dynamic 4D scene reconstruction. However, text-driven 4D scene editing remains under-explored due to the challenge of ensuring both multi-view and temporal consistency across space and time during editing.Existing studies rely on 2D diffusion models that edit frames independently, often causing motion distortion, geometric drift, and incomplete editing. We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. This mechanism consists of Spatio-Temporal Sub-Grid Attention (STGA) for locally consistent cross-view and temporal fusion, and Context Token Propagation (CTP) for global propagation via token inheritance and optical-flow-guided token replacement. Together, these components allow Dynamic-eDiTor t

780

likely_noise

low

GaussianDWM: Driving World Model using Language-aligned 3D Gaussians for Scene Understanding and Multi-modal Generation

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing

weak or indirect keyword match

abstract

Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality

781

likely_noise

low

OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world–scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded in dashcam.Existing VO methods are trained on fixed observation frequency (e.g., 10Hz or 12Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These significantly limit their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage.To address these challenges, OpenVO (1) expl

782

likely_noise

low

PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots, crucial for contact reasoning, while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advance

783

likely_noise

low

SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

pose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mapping; generation_editing; data_benchmark

weak or indirect keyword match

abstract

Controlling both camera motion and object dynamics is essential for coherent and expressive video generation, yet current methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement. We present SymphoMotion, a unified motion-control framework that jointly governs camera trajectories and object dynamics within a single model. SymphoMotion features a Camera Trajectory Control mechanism that integrates explicit camera paths with geometry-aware cues to ensure stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings to enable depth-aware, spatially coherent object manipulation. To support large-scale training and evaluation, we further construct RealCOD-25K, a comprehensive real-world dataset containing paired came

784

likely_noise

low

TACO: Task-Aware Contrastive Learning for Joint LiDAR Localization and 3D Object Detection

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Reliable navigation and decision-making of autonomous vehicles require both accurate localization and object detection. Traditionally, these two tasks are handled separately, leading to redundant computation and limited cross-task knowledge transfer. This paper proposes TACO, the first Task-Aware COntrastive learning framework, which performs joint LiDAR localization and 3D object detection within a single, unified network. TACO leverages contrastive learning to explicitly decouple and align static geographic features for localization and object-centric features for detection. This bidirectional mutual supervision not only enhances localization robustness in dynamic environments by filtering dynamic noise but also boosts detection accuracy via effective spatial context. Additionally, we propose OxfoLD, the first dataset that provides multi-traversal LiDAR localization ground truth with r

785

likely_noise

low

Test-Time 3D Occupancy Prediction

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping

weak or indirect keyword match

abstract

Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. Our method incrementally constructs, optimizes and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VFMs) at runtime. The flexible nature of 3D Gaussians allows voxelization at arbitrary user-specified resolutions, while the generalization ability of VFMs enables accurate perception and open-vocabulary recognition,

786

likely_noise

low

UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Dynamic driving scene reconstruction is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods have shown promise for 3D reconstruction, they struggle with long-range driving sequences due to quadratic complexity in sequence length and challenges in modeling dynamic objects over extended durations. We propose UFO, a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction. Our approach maintains a 4D scene representation that is iteratively refined as new observations arrive, using a visibility-based filtering mechanism to select informative scene tokens and enable efficient processing of long sequences. For dynamic objects, we introduce an object pose-guided modeling approach that supports accurate long-range motion capture. Experiments on the Waymo Open D

787

likely_noise

low

UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Perceiving and reconstructing objects from images are critical for real-to-sim transfer tasks, which are widely used in the robotics community.Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline.However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context.To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework.Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve the scale ambiguity.We introduce Pose-Aware Shape Representation to eliminate the need for per-category canonical definitions and to bridge the gap between reconstruction and pose estimation tasks.F

788

likely_noise

low

UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping

weak or indirect keyword match

abstract

Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods leverage the discrepancy between input images and rendered outputs, or multiview consistency constraints, as the driving force to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and overlook the importance of fully exploiting fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates uncertainty distribution of hand pose and constructs a probabilistic point cloud feature space, which enables the complex spatiotemporal relationship modeling. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and samples diver

789

likely_noise

low

SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Text-guided motion generation in 3D scenes has advanced the synthesis of human–scene interactions, contributing to embodied AI, scene understanding, and virtual agent simulation. While recent studies have begun exploring multi-agent scenarios, achieving temporally synchronised interactions among multiple agents remains an open challenge. Existing methods are often limited in flexibility and scalability when handling diverse interaction contexts.We present a method that enables synchronised multi-agent interaction using a single-agent motion synthesis model through two key components: a text-guided dependency-aware story planner and a temporal synchronisation module. The story planner interprets natural language instructions into structured event sequences with temporal dependencies. Our synchronisation module, built upon time-warping control and diffusion posterior sampling, aligns inter

790

likely_noise

low

PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency

Segmentation & Dense Prediction / Depth / Optical Flow

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmark

weak or indirect keyword match

abstract

Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes jitter but can also lead to estimation failures when the depth range changes abruptly. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we estimate camera pose and sparse depth from triangulation using optical flow between consecutive frames. The sparse depth estimates are used to update a recursive Bayesian estimate of the metric scale, which is then applied to rescale the relative depth predicted by a pre-trained depth estimation

791

likely_noise

low

DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing

weak or indirect keyword match

abstract

Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Fi

792

likely_noise

low

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV) —a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)–based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-sc

793

likely_noise

low

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Current Vision-Language-Action (VLA) models primarily focus on mapping 2D observations to actions but exhibit notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, introducing substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction, lacking alignment with the instruction-grounded scene and thus compromising spatiotemporal consistency. To address these challenges, we propose **ConsisVLA-4D**, a unified and efficient framework that enhances spatiotemporal consistency in 3D-Perception and 4D-Reasoning. Specifically, we design: **1) CV-Aligner**, which ensures **C**ross-**V**iew object semantic consistency via filtering instruction-relevant regions and aligning object identities across multiple viewpoints; **2) CO-Fuser**, which guarantees **C**ross-**O**bject spatial g

794

likely_noise

low

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark

weak or indirect keyword match

abstract

We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations—such as translating, rotating, or resizing objects—due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampli

795

likely_noise

low

Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views

Segmentation & Dense Prediction / Depth / Optical Flow

D. adjacent but useful context

general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy

weak or indirect keyword match

abstract

Soft boundaries, like thin hairs, are commonly observed in natural and computer-generated imagery, but they remain challenging for 3D vision due to the ambiguous mixing of foreground and background cues. This paper introduces Guardians of the Hair (HairGuard), a framework designed to recover fine-grained soft boundary details in 3D vision tasks. Specifically, we first propose a novel data curation pipeline that leverages image matting datasets for training and design a depth fixer network to automatically identify soft boundary regions. With a gated residual module, the depth fixer refines depth precisely around soft boundaries while maintaining global depth quality, allowing plug-and-play integration with state-of-the-art depth models. For view synthesis, we perform depth-based forward warping to retain high-fidelity textures, followed by a generative scene painter that fills disocclude

796

likely_noise

low

Structural Action Transformer for 3D Dexterous Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity.This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories.This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length.To encode structural priors and resolve ambiguity, we introduce

797

likely_noise

low

Spe-BEVHead: Rethinking the Detection Head Design for Bird’s-Eye-View Object Detection

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping

keyword noise pattern without direct reconstruction signal

abstract

Bird’s-Eye-View (BEV) detection has become a dominant paradigm for 3D object detection in autonomous driving, due to its strong perception capability. However, most existing methods mainly focus on constructing high-quality BEV feature representations, while neglecting the design of task-specific detection heads. In practice, they directly adopt the center-based head originally developed for 2D detection, without any specific optimization. This leads to three inherent limitations: (i) a geometric mismatch between the Gaussian kernel used for classification and the real BEV object, (ii) degraded end-to-end performance without Non-Maximum Suppression(NMS), and (iii) sparse supervisory signals. To address these issues, we propose Spe-BEVHead, a detection head specifically tailored for BEV 3D object detection. Spe-BEVHead introduces three BEV-specific adaptations: (1) a Rotated Box Kernel th

798

likely_noise

low

LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

Detection & Tracking / Tracking

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d

weak or indirect keyword match

abstract

Tracking 3D human motion from egocentric, multi-camera devices is challenged by severe egomotion and partial visibility or occlusions. Existing methods are designed for monocular video often recorded from static or slowly-moving cameras and cannot easily leverage multi-view, calibrated and localized input. This makes them brittle and prone to fail on dynamic egocentric captures. We propose LAMP ($\textbf{L}$ocalization $\textbf{A}$ware $\textbf{M}$ulti-camera $\textbf{P}$eople Tracking): a novel, simple framework to solve this via early disentanglement of observer and target motion. LAMP introduces a two-step process: First, we leverage the device's known 6-DoF pose and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained Transformer model fits 3D human motion directly to this spa

799

likely_noise

low

SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

Video & Motion / Video Understanding

D. adjacent but useful context

gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy

weak or indirect keyword match

abstract

Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, existing method focus merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Fur

800

likely_noise

low

Thinking in 360°: Humanoid Visual Search in the Wild

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, u

801

likely_noise

low

Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors

Robustness & Safety / Safety

D. adjacent but useful context

general_reconstruction; dynamic_4d; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Adversarial robustness of BEV 3D object detectors is critical for autonomous driving (AD). Existing invasive attacks require altering the target vehicle itself (*e.g.* attaching patches), making them unrealistic and impractical for real-world evaluation. While non-invasive attacks that place adversarial objects in the environment are more practical, current methods still lack the multi-view and temporal consistency needed for physically plausible threats. In this paper, we present the first framework for generating universal, non-invasive, and 3D consistent adversarial objects that expose fundamental vulnerabilities for BEV 3D object detectors. Instead of modifying target vehicles, our method inserts rendered objects into scenes with an occlusion-aware module that enforces physical plausibility across views and time. To maintain attack effectiveness across views and frames, we optimize a

802

likely_noise

low

VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; dynamic_4d; robotics_mapping; generation_editing

weak or indirect keyword match

abstract

Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a sh

803

likely_noise

low

Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Affordance segmentation aims to decompose 3D objects into parts that serve distinct functional roles, enabling models to reason about object interactions rather than mere recognition. Existing methods, mostly following the paradigm of 3D semantic segmentation or prompt-based frameworks, struggle when geometric cues are weak or ambiguous, as sparse point clouds provide limited functional information. To overcome this limitation, we leverage the rich semantic knowledge embedded in large-scale 2D Vision Foundation Models (VFMs) to guide 3D representation learning through a cross-modal alignment mechanism. Specifically, we propose Cross-Modal Affinity Transfer (CMAT), a pretraining strategy that compels the 3D encoder to align with the semantic structures induced by lifted 2D features. CMAT is driven by a core affinity alignment objective, supported by two auxiliary losses, geometric reconst

804

likely_noise

low

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

gaussian_radiance; dynamic_4d; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Hierarchical Vision–Language–Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. It typically comprising a Vision–Language backbone for perception and understanding, together with a generative policy for action generation. However, its performance is increasingly bottlenecked by the action generation proceess. (i) Low inference efficiency. A pronounced distributional gap between isotropic noise priors and target action distributions, which increases denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraint of history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM repla

805

likely_noise

low

ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradig

806

likely_noise

low

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that enables models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware

807

likely_noise

low

SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control.Yet their ability to generalize across new environments, tasks, and embodiments remains limited.We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs).However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world.Bridging this gap directly with large-scale robotic data is costly and difficult to scale.Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities.Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image.Building on SPEAR-VLM, we introduce

808

likely_noise

low

Structural–Semantic Perception for Diffusion-Guided Temporal Forgery Localization

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

pose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Temporal Forgery Localization (TFL) is crucial for enhancing the interpretability and accountability of deepfake forensics by precisely pinpointing the manipulated segments.However, existing methods face two limitations: (1) localization precision, where one-shot boundary prediction models fail to rectify inherent initial prediction biases, and temporal emphasis overlooks modality-internal semantic forgery cues, resulting in noise-sensitive localization, and (2) cross-dataset generalization, where fixed-scale temporal receptive fields struggle to accommodate varying manipulation durations across real-world scenarios. To address these challenges, we propose a unified framework based on structural–semantic perception and diffusion-guided refinement. The structural–semantic perception comprises two complementary components: (1) structural perception, which adaptively models manipulation dur

809

likely_noise

low

TopoMA: Topology-Guided Multi-Agent Dense RGB 3D Reconstruction via Distributed Inference

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Multi-agent 3D reconstruction, as a key technology for large-scale VR/AR, robot swarms, and digital twins, has attracted growing attention. Recent end-to-end 3D reconstruction methods achieve strong performance in single-agent scenarios, but they are difficult to directly extend to multi-agent collaborative settings, where they often suffer from unstable tracking, excessive memory consumption, and frequent loop-closure failures, thus failing to meet real-time and large-scale deployment requirements. To address these issues, we propose TOPOMA, a real-time end-to-end 3D reconstruction framework tailored for multi-agent collaboration. TOPOMA explicitly models the spatial topological structure of the scene and tightly couples it with end-to-end representation learning, thereby jointly solving core challenges such as inter-agent spatial alignment and submap fusion. Concretely, we introduce to

810

likely_noise

low

Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

We present Wave-Former, a novel method capable of high-accuracy 3D shape reconstruction for completely occluded, diverse, everyday objects. This capability can open new applications spanning robotics, augmented reality, and logistics. Our approach leverages millimeter-wave (mmWave) wireless signals, which can penetrate common occlusions and reflect off hidden objects. In contrast to past mmWave reconstruction methods, which suffer from limited coverage and high noise, Wave-Former introduces a physics-aware shape completion model capable of inferring full 3D geometry. At the heart of Wave-Former's design is a novel three-stage pipeline which bridges raw wireless signals with recent advancements in vision-based shape completion by incorporating physical properties of mmWave signals. The pipeline proposes candidate geometric surfaces, employs a transformer-based shape completion model desig

811

likely_noise

low

ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

gaussian_radiance; depth_correspondence; surface_occupancy; robotics_mapping

weak or indirect keyword match

abstract

Monocular 3D lane detection remains challenging due to depth ambiguity and weak geometric constraints. Mainstream methods rely on depth guidance, BEV projection, and anchor- or curve-based heads with simplified physical assumptions, remapping high-dimensional image features while only weakly encoding road geometry. Lacking an invariant geometric–topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in $\mathbb{R}^3$, lanes are embedded 1D submanifolds, and sampled lane points are dense observations, coupling metric and topology across surfaces, curves, and samples. Building on this, we propose ReManNet: it first produces initial lane predictions with an image backbone and detection head

812

likely_noise

low

From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

gaussian_radiance; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Vision–Language–Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that require defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating long-horizon planning with precise manipulation.Therefore, we aim to endow a VLA model with the capability to infer the “how” process from the “what” outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate

813

likely_noise

low

Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

pose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Autonomous trucking poses unique challenges due to articulated tractor–trailer geometry, and time-varying sensor poses caused by the fifth-wheel joint and trailer flex. Existing perception and calibration methods assume static baselines or rely on high-parallax and texture-rich scenes, limiting their reliability under real-world settings. We propose dCAP (dynamic Calibration and Articulated Perception), a vision-based framework that continuously estimates the 6-DoF (degree of freedom) relative pose between tractor and trailer cameras. dCAP employs a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. Integrated with BEVFormer, dCAP improves 3D object detection by replacing static calibration with dynamically predicted extrinsics. To facilitate eva

814

likely_noise

low

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; depth_correspondence; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from the large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive employs a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D v

815

likely_noise

low

UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

pose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision–language–action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. First, we construct UniDex-Dataset, a robot-centric dataset of 10M paired image–pointcloud–action frames and over 50K trajectories across eight dexterous hands (6–24 DoFs), derived from egocentric human video datasets. To transform human data into robot-executable trajectories, we employ a human-in-the-loop retargeting procedure to align fingertip trajectories while preserving plausible hand–object contacts, and we operate on explicit 3D pointclouds with human hands masked to narrow kinema

816

likely_noise

low

GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

gaussian_radiance; depth_correspondence; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation

817

likely_noise

low

Rethinking Visual Rearrangement from A Diffusion Perspective

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

gaussian_radiance; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Rearranging disarrayed objects to their intended goal states requires the agent to comprehend the changes that have occurred in the scene and to reason about the process of these changes. To address this, we propose a novel perspective on the visual rearrangement task, drawing inspiration from the diffusion processes in molecular thermodynamics. We model the room shuffle and unshuffle stages as the forward and reverse processes of diffusion. In contrast to conventional methods that rely on scene modeling and differential comparisons, our approach provides insight into the intrinsic evolution process between the goal and initial states of the scene, which allows for a more reasonable rearrangement of objects through fine-grained and progressive denoising steps with high confidence. By analyzing the task objectives, we represent the scene via spatial distributions of objects and model the

818

likely_noise

low

SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images

Remote Sensing & Earth / Remote Sensing

D. adjacent but useful context

pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimpl

819

likely_noise

low

Affostruction: 3D Affordance Grounding with Generative Reconstruction

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping

weak or indirect keyword match

abstract

This paper addresses the problem of affordance grounding from RGBD images of an object, which aims to localize surface regions corresponding to a text query that describes an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose a unified framework for affordance grounding and reconstruction, dubbed Affostruction, where affordance grounding actively combines with shape generation. In our approach, reconstructing complete geometry from partial observations enables affordance prediction on unobserved regions, while affordance heatmaps guide active view selection to improve reconstruction quality of functional regions. We make three core contributions: generative multi-view reconstruction via sparse voxel fusion that extrapolates unseen geometry while maintaining constant token complexity, flow-based affordance grounding that captures i

820

likely_noise

low

AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Affordance learning is a complex challenge in many applications, where existing approaches primarily focus on the geometric structures, visual knowledge, and affordance labels of objects to determine interactable regions. However, extending this learning capability to a scene is significantly more complicated, as incorporating object- and scene-level semantics is not straightforward; for example, 3D instance identification often struggles with small, interactable, functional parts (i.e., knobs, handles, etc.). In this work, we introduce AffordBridge, a large-scale dataset with 291,637 functional interaction annotations across 685 high-resolution indoor scenes in the form of point clouds. Our affordance annotations are complemented by RGB images that are linked to the same instances within scenes. Building upon our dataset, we propose AffordMatcher, an affordance learning method that esta

821

likely_noise

low

Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation

Remote Sensing & Earth / Remote Sensing

D. adjacent but useful context

pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV’s heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting their generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitl

822

likely_noise

low

MV-TAP: Tracking Any Point in Multi-View Videos

Detection & Tracking / Tracking

D. adjacent but useful context

general_reconstruction; dynamic_4d; data_benchmark

weak or indirect keyword match

abstract

Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to many applications. Point tracking serves as a key mechanism for capturing dynamic motion; however, conventional single-view approaches often fail due to the limited geometric information available in monocular video, which becomes a critical bottleneck for multi-view scenarios. In this work, we present \ours, a robust point tracker that tracks query points across multi-view videos of dynamic scenes by leveraging cross-view information.\ours utilizes camera geometry and cross-view attention to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tai

823

likely_noise

low

Unsupervised Multi-agent and Single-agent Perception from Cooperative Views

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping

keyword noise pattern without direct reconstruction signal

abstract

The LiDAR sensor based multi-agent and single-agent perception has shown promising performance in the environmental understanding for robots and automated vehicles. However, there is no existing method that simultaneously solves both multi-agent and single-agent perception in an unsupervised way. By sharing sensor data between multiple agents via communication, this paper discovers two key insights: 1) Improved point cloud density after the data sharing from cooperative views could benefit unsupervised object classification, 2) Cooperative view of multiple agents can be used as unsupervised guidance to the 3D object detection in the single view. Based on these two discovered insights, we propose an Unsupervised Multi-agent and Single-agent (UMS) perception framework that leverages multi-agent cooperation without human annotations to simultaneously solve multi-agent and single-agent perc

824

likely_noise

low

InternVideo-Next: Towards World-Understanding Video Models

Video & Motion / Video Understanding

D. adjacent but useful context

dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Large-scale video–text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks.We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning.To address these, we disentangle the traditional encoder–decoder design into an Encoder–Predictor–Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preservin

825

likely_noise

low

Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Collaborative perception (CP) enables data sharing among connected and autonomous vehicles (CAVs) to enhance driving safety. However, CP systems are vulnerable to adversarial attacks where malicious agents forge false objects via feature-level perturbations. Current defensive systems use threshold-based consensus verification by comparing collaborative and ego detection results. Yet, these defenses remain vulnerable to more sophisticated attack strategies that could exploit two critical weaknesses: (i) lack of robustness against attacks with systematic timing and target region optimization, and (ii) inadvertent disclosure of vulnerability knowledge through implicit confidence information in shared collaboration data. In this paper, we propose MVIG attack, a novel adaptive adversarial CP framework learning to capture vulnerability knowledge disclosed by different defensive CP systems from

826

likely_noise

low

Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping

weak or indirect keyword match

abstract

We propose a probabilistic discrepancy learning approach for roadside LiDAR scene completion (PDL). Conventional methods focus on object-level completion and scene completion from ego-vehicle viewpoint. These methods struggle to cope with long-term or total occlusions caused by roadside sensors with fixed viewpoints. To address this issue, we compensate for occlusion roadside point clouds by introducing external visual information. Specifically, Our PDL is mainly divided into probabilistic pose discrepancy minimization and scene discrepancy learning. We employ probabilistic pose discrepancy minimization to correct noisy poses from vision-based detectors, while utilizing a diffusion model within scene discrepancy learning for robust full-scene completion.Furthermore, we introduce regional and global sampling discrepancy learning losses to achieve robust and efficient training. We conducte

827

likely_noise

low

VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; surface_occupancy; robotics_mapping

weak or indirect keyword match

abstract

Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-pron

828

likely_noise

low

Foca-VLA: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose Foca-VLA, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. Foca-VLA introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a cross-scale routing Mixture-of-Experts (MoE) with impedance control in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force--position regulation. To support learning and evaluation, we construct Foca-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling

829

likely_noise

low

ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Recent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding.We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR.ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations.While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes.Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames

830

likely_noise

low

Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

We contend that embodied learning is fundamentally a lifecycle problem rather than a single-stage optimization. Systems that optimize only one link (data collection, simulation, learning, or deployment) rarely sustain improvement or generalize beyond narrow settings. We introduce Arcadia, a closed-loop framework that operationalizes embodied lifelong learning by tightly coupling four stages: (1) Self-evolving exploration and grounding for autonomous data acquisition in physical environments, (2) Generative scene reconstruction and augmentation for realistic and extensible scene creation, (3) a Shared embodied representation architecture that unifies navigation and manipulation within a single multimodal backbone, and (4) Sim-from-real evaluation and evolution that closes the feedback loop through simulation-based adaptation. This coupling is non-decomposable: removing any stage breaks th

831

likely_noise

low

General Process Reward Modeling for Robotic Reinforcement Learning

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

The primary obstacle for applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recently learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization.To address these, we introduce Robo-Dopamine, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a vast 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Pe

832

likely_noise

low

GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Driving planning is a critical component of end-to-end (E2E) autonomous driving. However, prevailing Imitative E2E Planners often suffer from multimodal trajectory mode collapse, failing to produce diverse trajectory proposals. Meanwhile, Generative E2E Planners struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. In this paper, we propose GuideFlow, a novel planning framework that leverages Constrained Flow Matching. Concretely, GuideFlow explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our core contribution lies in directly enforcing explicit constraints within the flow matching generation process, rather than relying on implicit constraint encoding. Crucially, Gu

833

likely_noise

low

MER-Tracker: Towards High-Speed 3D Point Tracking via Multi-View Event-RGB Hybrid Cameras

Detection & Tracking / Tracking

D. adjacent but useful context

general_reconstruction; dynamic_4d; data_benchmark

weak or indirect keyword match

abstract

This paper proposes the first task for high-speed 3D point tracking using multi-view Event-RGB hybrid cameras. We design a cuboid observation device comprising 4 RGB cameras (30fps) and 2 Event cameras to synchronously capture high-speed motions, and propose MER-Tracker, a high–frame-rate 3D point-tracking network that fuses the complementary strengths of dual modalities. We first respectively extract 2D motion-change features from the RGB and Event modalities, then apply linear interpolation and anchor sampling to fuse the discrete RGB 3D features and continuous Event 3D features after 3D lifting, and finally employ a LoRA-tuned Transformer based on temporal correlationship to predict the high-frame-rate 3D point trajectories over fast motions, accomplishing high-speed 3D point tracking. To verify the effectiveness of our method, we construct both real-world and simulated high-speed mot

834

likely_noise

low

Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

general_reconstruction; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their real-world deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images—a standard setup in AD systems that utilize six or even more synchronized cameras to perceive the environment comprehensively. This overhead stems from the large number of visual tokens generated during encoding, which significantly increases inference latency and memory consumption when passed to large language models, owing to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework specifically designed for multi-vi

835

likely_noise

low

SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition

Video & Motion / Video Understanding

D. adjacent but useful context

general_reconstruction; dynamic_4d; data_benchmark

weak or indirect keyword match

abstract

Event cameras action recognition (EAR) offers compelling privacy-protecting and efficiency advantages, where temporal motion dynamics is of great importance. Existing spatiotemporal multi-view representation learning (SMVRL) methods for event-based object recognition (EOR) offer promising solutions by projecting $H$-$W$-$T$ events alone spatial axis $H$ and $W$, yet are limited by its translation-variant spatial binning representation and naive early concatenation fusion architecture. This paper reexamines the key SMVRL design stages for EAR and propose: (i) a principled spatiotemporal multi-view representation through translation-invariant dense conversion of sparse events, (ii) a dual-branch, dynamic fusion architecture that models sample-wise complementarity between motion features from different views, and (iii) a bio-inspired temporal warping augmentation that mimics speed variabili

836

likely_noise

low

AeroAgent: A Vision–Physics–Decision Framework for Aerodynamic Vehicle Design

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

surface_occupancy; robotics_mapping; generation_editing; data_benchmark

weak or indirect keyword match

abstract

Modern generative models can propose striking 3D vehicle shapes from text and images, but turning these sketches intoaerodynamically efficient, regulation-compliant designs still requires weeks of high-fidelity computational fluiddynamics (CFD) and manual iteration. As a result, fast 3D generation without trustworthy physics in the loop doeslittle to reduce end-to-end design time. We study how an AI agent can close this loop under a strict CFD budget.We introduce AeroAgent, a vision–physics–decision framework built around a single 3D, editable surfacerepresentation for vehicle shapes. A vision module turns text and 2D references into diverse, standardized 3Dcandidates and supports image-level edits. A physics module, AeroFormer, is a geometry-guidedTransformer surrogate trained on a large-scale vehicle aerodynamics dataset of roughly 50k CFD simulations; threetask-specific heads predict

837

likely_noise

low

Instance-level Visual Active Tracking with Occlusion-Aware Planning

Detection & Tracking / Tracking

D. adjacent but useful context

general_reconstruction; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-

838

likely_noise

low

Learning to Act Robustly with View-Invariant Latent Actions

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance.Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization.We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences.Experiments in both simulation and the real world show tha

839

likely_noise

low

Convexity-Aware Noise Calibration: A Self-Supervised Framework for Noise-Level-Unknown Image Denoising

Low-level Vision / Restoration

D. adjacent but useful context

gaussian_radiance; pose_calibration_localization; data_benchmark

weak or indirect keyword match

abstract

Image denoising is a fundamental task in computer vision aimed at recovering clean images from noise-corrupted observations. While supervised deep learning methods achieve remarkable performance when trained on paired data with known noise levels, their real-world applicability is limited as noise characteristics are often unknown. Existing unsupervised techniques, such as blind-spot networks or methods based on statistical estimation, either compromise performance due to information loss or suffer from inaccuracies in noise level estimation. To address these challenges, we propose a novel two-stage self-supervised denoising framework that first accurately estimates the noise level directly from noisy images, without requiring clean references or prior noise knowledge. Building upon theoretical insights from Noisier2Noise, we rigorously derive a relationship between the noise level and t

840

likely_noise

low

Multi-modal Test-time adaptation via Adaptive Probabilistic Gaussian Calibration

Robustness & Safety / Robustness

D. adjacent but useful context

gaussian_radiance; pose_calibration_localization; data_benchmark

weak or indirect keyword match

abstract

Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category‑conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-mo

841

likely_noise

low

Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

gaussian_radiance; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

In real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is for each state, there exist a feasible action neighborhood (FAN) rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent VLA training methodologies are directly inherited from linguistic settings and does not exploit the FAN property, thus lead to poor generalization and low sample efficiency. To address this limitation, we introduce a FAN-guided regularizer that shapes the model's output distribution to align with the geometry of FAN. Concretely, we introduce a Gaussian prior that promotes locally smooth and unimodal predictions around the preferred direction and magnitude. In extensive experiments across both reinforced finetuning (RFT) and supervised finetuning (SFT), our method achieves significant improvement in sample

842

likely_noise

low

EgoRoC: Towards Egocentric Robotic Control via Task-Agnostic Visual Alignment

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

pose_calibration_localization; surface_occupancy; robotics_mapping

weak or indirect keyword match

abstract

Recent Vision-Language-Action (VLA) models map visual-textual inputs to robotic actions via end-to-end architectures, yet this approach entangles visual understanding with task-specific actions. This leads to an exhaustive collection of full operational sequences and parameter redundancy across tasks, while generic third-person camera setups require fine-tuning for different hardware due to implicit hand-eye assumptions. We argue that decoupling \textbf{how robots see} from \textbf{how robots act} is a missing primitive in VLA systems. We present \textbf{EgoRoC}, a plug-and-play egocentric alignment head that precedes any task policy and exposes only a thin 6-DoF pose interface. EgoRoC establishes task-agnostic viewpoint consistency from a wrist-mounted (first-person) camera and then alternates alignment with manipulation, while a diffusion-based online hand–eye module corrects the actio

843

likely_noise

low

OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

depth_correspondence; surface_occupancy; robotics_mapping

weak or indirect keyword match

abstract

Robust 3D semantic occupancy is essential for legged and humanoid robots, yet most Semantic Scene Completion (SSC) systems are built for wheeled platforms with forward-facing sensors. We present $\textbf{OneOcc}$, a vision-only panoramic SSC framework tailored to severe body jitter and $360^{\circ}$ continuity. OneOcc integrates four complementary modules: (i) $\textit{Dual-Projection fusion (DP-ER)}$, which jointly exploits the raw annular panorama and its equirectangular unfolding to preserve true $360^{\circ}$ continuity while enabling grid-aligned feature extraction and seam-aware context; (ii) $\textit{Bi-Grid Voxelization (BGV)}$, which reasons in Cartesian and polar/cylindrical voxel spaces to reduce discretization bias and better align with panoramic geometry, yielding sharper free/occupied boundaries; (iii) a lightweight decoder with $\textit{Hierarchical AMoE-3D}$ fusion that d

844

likely_noise

low

Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

gaussian_radiance; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Rapidly-exploring random trees (RRTs) have been widely adopted for robot motion planning due to their robustness and theoretical guarantees. However, existing RRT-based planners require explicit goal configurations specified as numerical joint angles, while many practical applications provide goal specifications through visual observations such as images or demonstration videos where precise goal configurations are unavailable. In this paper, we propose visual-RRT (vRRT), a motion planner that enables visual-goal planning by unifying gradient-based exploitation from differentiable robot rendering with sampling-based exploration from RRTs. We further introduce (1) a frontier-based exploration-exploitation strategy that adaptively prioritizes visually promising search regions, and (2) inertial gradient tree expansion that inherits optimization states across tree branches for momentum-consi

845

likely_noise

low

LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

dynamic_4d; surface_occupancy; robotics_mapping

weak or indirect keyword match

abstract

Millimeter-wave radar’s all-weather capability makes it increasingly vital for autonomous perception. However, the high cost of radar data collection drives the need for data generation to augment radar datasets. Existing works mainly target partial radar representations, e.g., 2D or 3D slices, leading to information loss and limited downstream performance. To overcome these issues, we introduce the novel task of LiDAR-to-4DRadar translation, which generates complete 4D radar tensors, with three spatial and one Doppler axes, guided by LiDAR data that preserve spatial and semantic consistency. We propose a novel diffusion bridge model in an aligned LiDAR-4DRadar latent space, namely \textbf{L2RLDB}, to tackle this task. Specifically, first, a key-voxel-aware VAE compresses high-dimensional, noisy radar tensors into a compact latent space, while enabling precise numerical reconstruction an

846

likely_noise

low

Towards Human-Like Robot Handwriting via Contour-Aware Generation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

pose_calibration_localization; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Empowering machines to simulate human handwriting is a promising research direction. Most existing methods, however, primarily focus on reproducing the writing trajectory to capture the overall character structure, while neglecting the critical aspect of stroke contour modeling. Consequently, these methods struggle to generate visually realistic, human-like handwriting, limiting their applicability in scenarios such as calligraphy robots. To address this issue, we propose a new task, called Contour-aware Handwriting Trajectory Reconstruction (CHTR). This task presents two major challenges: 1) Existing handwriting datasets lack stroke contour annotations, making supervised learning difficult; 2) Previous methods are unable to recover stroke contour and preserve the overall character structure jointly. To address the dataset limitation, we present CHTR-110K, a large-scale character datase

847

likely_noise

low

Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

pose_calibration_localization; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Aerial object-goal navigation (Aerial ObjectNav) requires an Unmanned Aerial Vehicle (UAV) to navigate to target objects in large-scale outdoor environments using only visual observations and high-level object descriptions, without detailed step-by-step instructions. Existing approaches rely on local observations or short-term history, lacking comprehensive scene understanding and efficient spatial exploration strategies, which constrains their navigation capability in complex aerial scenarios. To address these challenges, we propose OctMem-Agent, an octree memory-augmented framework for aerial object-goal navigation. Specifically, we introduce an Adaptive Octree Memory that incrementally aggregates RGB-D observations into a hierarchical 3D representation, capturing both explored regions and unexplored frontiers across large-scale aerial environments. We further propose a Instruction-Gu

848

likely_noise

low

Rethinking Camera Choice : An Empirical Study on Fisheye Camera Properties in Robotic Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

pose_calibration_localization; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient envi

849

likely_noise

low

Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner.We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap

850

likely_noise

low

Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Vision-Language-Action (VLA) models have enabled notable progress in general-purpose robotic manipulation, yet their learned policies often exhibit variable execution quality. We attribute this variability to the mixed-quality nature of human demonstrations, where the implicit principles that govern how actions should be carried out are only partially satisfied. To address this challenge, we introduce the LIBERO-Elegant benchmark with explicit criteria for evaluating execution quality. Using these criteria, we develop a decoupled refinement framework that improves execution quality without modifying or retraining the base VLA policy. We formalize Elegant Execution as the satisfaction of Implicit Task Constraints (ITCs) and train an Elegance Critic via offline Calibrated Q-Learning to estimate the expected quality of candidate actions. At inference time, a Just-in-Time Intervention (JITI)

851

likely_noise

low

Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

While existing equivariant methods enhance data efficiency, they suffer from high computational intensity, reliance on single-modality inputs, and instability when combined with fast-sampling methods. In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies. E3Flow overcomes these challenges, successfully unifying efficient rectified flow with stable, multi-modal equivariant learning for the first time. Our framework is built upon spherical harmonic representations to ensure rigorous SO(3) equivariance. We introduce a novel invariant Feature Enhancement Module (FEM) that dynamically fuses hybrid visual modalities (point clouds and images), injecting rich visual cues into the spherical harmonic features. We evaluate E3Flow on 8 manipulation tasks from the MimicGen benchmark and further conduct 4 real-world experiments to

852

likely_noise

low

Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

pose_calibration_localization; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-aware Virtual View Exploration (TVVE), a framework designed to overcome these challenges by integrating virtual view exploration with task-specific representation learning. TVVE employs an efficient exploration policy, accelerated by a novel pseudo-environment, to acquire informative views. Furthermore, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TVVE generates more complete and discriminative visual representations, demonstrating significan

853

likely_noise

low

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Vision-Language-Action (VLA) models have emerged essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings.Recent advancements have introduced explicit intermediary reasoning—such as subtask prediction (language) or goal image synthesis (vision)—to guide action generation. However, these intermediate reasoning are often indirect and inherently limited in their capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm where the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we p

854

likely_noise

low

Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Despite strong results on recognition and segmentation, current 3D visual pre-training methods often underperform on robotic manipulation. We attribute this gap to two factors: the lack of state–action–state dynamics modeling and the unnecessary redundancy of explicit geometric reconstruction. We introduce AFRO, a scalable self-supervised framework that learns dynamics-aware 3D representations directly from point clouds without action or label supervision. AFRO casts state prediction as a generative diffusion process and jointly models forward and inverse dynamics in a shared latent space to capture causal transition structure. To prevent feature leakage in action learning, we employ feature differencing and inverse-consistency supervision, improving the quality and stability of visual features. When combined with Diffusion Policy for control, AFRO substantially increases manipulation su

855

likely_noise

low

Contact-Aware Neural Dynamics

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

High-fidelity physics simulation is essential for scalable robotic learning, but the sim-to-real gap persists, especially for tasks involving complex, dynamic, and discontinuous interactions like physical contacts. Explicit system identification, which tunes explicit simulator parameters, is often insufficient to align the intricate, high-dimensional, and state-dependent dynamics of the real world. To overcome this, we propose an implicit sim-to-real alignment framework that learns to directly align the simulator's dynamics with contact information. Our method treats the off-the-shelf simulator as a base prior and learns a contact-aware neural dynamics model to refine simulated states using real-world observations. We show that using tactile contact information from robotic hands can effectively model the non-smooth discontinuities inherent in contact-rich tasks, resulting in a neural dy

856

likely_noise

low

Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Vision-Language-Action (VLA) models built upon Chain-of-Thought (CoT) have achieved remarkable success in advancing general-purpose robotic agents, owing to its significant perceptual comprehension. Recently, since text-only CoT struggles to adequately capture scene details in complex spatial environments, a highly promising strategy involves leveraging visual priors to guide robotic action generation. Nevertheless, these strategies face two inherent challenges: (i) a modality gap between visual observations and low-level actions, and (ii) unstable training due to competing objectives between visual prediction and action generation. To address these challenges, we propose a Vision-Integrated Trajectory Alignment (VITA) framework that learns a shared discrete latent space for vision and action, enabling joint modeling of perception and motor control. VITA introduces a implicit visual CoT:

857

likely_noise

low

Self-Attention Driven Tensor Representation for High-Order Data Recovery

Low-level Vision / Restoration

D. adjacent but useful context

surface_occupancy; robotics_mapping; data_benchmark

weak or indirect keyword match

abstract

Low-rank tensor representation (LRTR) is an effective tool for compactly modeling high-order data. While nonlinear LRTR models can better capture real-world nonlinear dependencies, most existing methods rely on fixed mappings of multilayer perceptrons (MLPs) or convolutional neural networks (CNNs), limiting their ability to model complex global dependencies. To overcome this limitation, we construct a novel paradigm called Self-Attention Driven Tensor Representation (SADTR), which is the first framework that models nonlinearity from the perspective of self-attention. Specifically, we design a factor self-representation mechanism to establish dynamic global mapping, thereby adaptively capturing both local and non-local nonlinear dependencies. Moreover, we introduce an implicit sparse representation to impose sparsity constraint while avoiding additional optimization problems. As a result,

858

likely_noise

low

GaussianMatch: Semi-Supervised Regression with Pseudo-Label Filtering via Multi-View Gaussian Consistency

Learning Algorithms / Optimization

D. adjacent but useful context

general_reconstruction; gaussian_radiance

keyword noise pattern without direct reconstruction signal

abstract

Semi-Supervised Regression (SSR) is essential in domains like sentiment analysis, healthcare, etc., where labeled data is limited but unlabeled data is plentiful. Despite its practical importance, SSR remains underexplored due to the lack of effective pseudo-labeling strategies for continuous outputs. Unlike classification, regression lacks inherent confidence measures, making it harder to filter and trust pseudo-labels. This limitation permits low-quality pseudo-labels to propagate during training without proper validation, significantly amplifying prediction errors in semi-supervised regression frameworks. In this work, we propose GaussianMatch, a novel SSR framework enabling high-quality pseudo-label filtering, which selects reliable pseudo-labels through multi-view prediction consistency under feature-space smoothness assumptions. Our framework introduces two key innovations: 1) Gau

859

likely_noise

low

SpatialStack: Layered Geometry-Semantic Fusion for 3D VLM Spatial Reasoning

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

general_reconstruction; robotics_mapping

weak or indirect keyword match

abstract

Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local

860

likely_noise

low

Tracking by Predicting 3-D Gaussians Over Time

Detection & Tracking / Tracking

D. adjacent but useful context

gaussian_radiance; robotics_mapping

weak or indirect keyword match

abstract

We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pre-training a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches.

861

likely_noise

low

Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning

Robustness & Safety / Safety

D. adjacent but useful context

gaussian_radiance; pose_calibration_localization

weak or indirect keyword match

abstract

Prototype-based Personalized Federated Learning (ProtoPFL) enables efficient cross-domain adaptation by communicating compact class prototypes, but directly sharing prototypes raises privacy risks. A common defense involves per-example $\ell_2$ clipping before prototype computation to limit sensitivity, followed by the addition of isotropic Gaussian noise during upload to enforce Local Differential Privacy (LDP). However, this Isotropic Gaussian Prototype Perturbation (IGPP) often over-perturbs key discriminative dimensions and struggles to balance the clipping threshold with representation fidelity. We propose VPDR, a client-side privacy plug-in that can be seamlessly integrated into existing ProtoPFL frameworks. Motivated by the statistical prior that dimension-wise class variance reflects discriminability, we introduce Variance-adaptive Prototype Perturbation (VPP), which uses groupwi

862

likely_noise

low

BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images

Autonomous Driving / Autonomous Driving

D. adjacent but useful context

pose_calibration_localization; robotics_mapping

weak or indirect keyword match

abstract

We present BEV-SLD, a LiDAR global localization method building on the Scene Landmark Detection (SLD) concept. Unlike scene-agnostic pipelines, our new self-supervised approach leverages bird’s-eye-view (BEV) images to discover scene-specific patterns at a prescribed spatial density and treat them as landmarks. A consistency loss aligns a learnable set of global landmark coordinates with per-frame heatmaps, yielding consistent detection and reliable occurrence across the scene. Across campus, industrial, and forest environments, BEV-SLD delivers robust localization and outperforms state-of-the-art methods. Code and trained models will be released after publication.

863

likely_noise

low

HQC-NBV: A Hybrid Quantum-Classical View Planning Approach

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

pose_calibration_localization; robotics_mapping

weak or indirect keyword match

abstract

Efficient view planning is a fundamental challenge in computer vision and robotic perception, critical for tasks ranging from search and rescue operations to autonomous navigation. While classical approaches, including sampling-based and deterministic methods, have shown promise in planning camera viewpoints for scene exploration, they often struggle with computational scalability and solution optimality in complex settings. This study introduces HQC-NBV, a hybrid quantum-classical framework for view planning that leverages quantum properties to efficiently explore the parameter space while maintaining robustness and scalability. We propose a specific Hamiltonian formulation with multi-component cost terms and a parameter-centric variational ansatz with bidirectional alternating entanglement patterns that capture the hierarchical dependencies between viewpoint parameters. Comprehensive e

864

likely_noise

low

Test-Time Perturbation Tuning with Delayed Feedback for Vision-Language-Action Models

Robotics & Embodied AI / Embodied AI

D. adjacent but useful context

pose_calibration_localization; robotics_mapping

weak or indirect keyword match

abstract

Vision-Language-Action models (VLAs) achieve strong performance in sequential decision-making but remain fragile to subtle environment shifts, such as small changes in object pose. We attribute this brittleness to trajectory overfitting, where VLAs over-attend to spurious cues and replicate memorized actions. We propose Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves decision performance without fine-tuning the base model. PDF mitigates spurious correlations through uncertainty-based data augmentation and action voting, while an adaptive scheduler allocates augmentation budgets to balance performance and efficiency. To further improve stability, PDF learns a lightweight perturbation module that retrospectively adjusts action logits guided by delayed feedback, correcting high-confidence errors. Experiments on LIBERO (+7.4\% s

core_reconstruction	362
strong_bridge	74
adjacent_context	223
likely_noise	205
core_reconstruction_high_conf	283
strong_bridge_high_conf	14
adjacent_context_high_conf	0
likely_noise_high_conf	0

CVPR 2026 3D Reconstruction Curated Relevance Audit

Summary

Rows