CVPR 2026 3D Reconstruction Curated Relevance Audit

This is a relevance-curated pass over the earlier 864 strict candidates. It is not a quality ranking. The goal is to separate core reconstruction papers from strong system bridges, adjacent context, and likely keyword noise.

Summary

core_reconstruction: 362
strong_bridge: 74
adjacent_context: 223
likely_noise: 205
core_reconstruction_high_conf: 283
strong_bridge_high_conf: 14
adjacent_context_high_conf: 0
likely_noise_high_conf: 0
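
As a quick consistency check on the figures above, the following minimal Python sketch confirms that the four buckets add up to the 864 strict candidates and reports the high-confidence share per bucket. The dictionary names and the `high_conf_share` helper are illustrative, and reading each `*_high_conf` count as a subset of its bucket is an assumption, not something the audit states.

```python
# Bucket totals copied from the Summary above.
buckets = {
    "core_reconstruction": 362,
    "strong_bridge": 74,
    "adjacent_context": 223,
    "likely_noise": 205,
}

# Assumption: each *_high_conf figure counts a subset of the matching bucket.
high_conf = {
    "core_reconstruction": 283,
    "strong_bridge": 14,
    "adjacent_context": 0,
    "likely_noise": 0,
}

STRICT_CANDIDATES = 864  # size of the earlier strict candidate pool


def high_conf_share(bucket: str) -> float:
    """Fraction of a bucket's papers that carry a high-confidence label."""
    return high_conf[bucket] / buckets[bucket]


total = sum(buckets.values())
assert total == STRICT_CANDIDATES, "buckets should partition the candidate pool"

for name, count in buckets.items():
    print(f"{name:20s} {count:4d}  high-conf share: {high_conf_share(name):.1%}")
```

Under that assumption, roughly 78% of the core_reconstruction bucket and 19% of the strong_bridge bucket are high-confidence assignments.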

Rows

# | Relevance | Paper | Editorial bucket | Matched groups | Reason | Abstract
#1
Relevance: core_reconstruction (high)
Paper: Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation
Autonomous Driving / Autonomous Driving
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark
Reason: VGGT/feed-forward geometry lineage with direct geometry signal
Abstract: Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing self-supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consis
#2
Relevance: core_reconstruction (high)
Paper: Dynamic Visual SLAM using a General 3D Prior
Robotics & Embodied AI / Embodied AI
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping
Reason: VGGT/feed-forward geometry lineage with direct geometry signal
Abstract: Reliable incremental estimation of camera poses and 3D reconstruction is key to enable various applications including robotics, interactive visualization, and augmented reality. However, this task is particularly challenging in dynamic natural environments, where scene dynamics can severely deteriorate camera pose estimation accuracy. In this work, we propose a novel monocular visual SLAM system that can robustly estimate camera poses in dynamic scenes. To this end, we leverage the complementary strengths of geometric patch-based online bundle adjustment and recent feed-forward reconstruction models. Specifically, we propose a feed-forward reconstruction model to precisely filter out dynamic regions, while also utilizing its depth prediction to enhance the robustness of the patch-based visual SLAM. By aligning depth prediction with estimated patches from bundle adjustment, we robustly ha
#3
Relevance: core_reconstruction (high)
Paper: DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving
Autonomous Driving / Autonomous Driving
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping
Reason: VGGT/feed-forward geometry lineage with direct geometry signal
Abstract: Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that
#4
Relevance: core_reconstruction (medium)
Paper: E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
Learning Algorithms / Self-supervised
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy
Reason: 3D Vision & Geometry paper with direct reconstruction title and abstract signal
Abstract: Self-supervised pre-training has revolutionized foundation models for language, 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Exp
#5
Relevance: core_reconstruction (high)
Paper: Emergent Extreme-View Geometry in 3D Foundation Models
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: 3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with ded
#6
Relevance: core_reconstruction (high)
Paper: Emergent Outlier View Rejection in Visual Geometry Grounded Transformers
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: Reliable 3D reconstruction from in-the-wild image collections is often hindered by noisy images—irrelevant inputs with little or no view overlap with others. While traditional Structure-from-Motion pipelines handle such cases through geometric verification and outlier rejection, feed-forward 3D reconstruction models lack these explicit mechanisms, leading to degraded performance under in-the-wild conditions. In this paper, we discover that the existing feed-forward reconstruction model, e.g., VGGT, despite lacking explicit outlier-rejection mechanisms or noise-aware training, can inherently distinguish distractor images. Through an in-depth analysis under varying proportions of synthetic distractors, we identify a specific layer that naturally exhibits outlier-suppressing behavior. Further probing reveals that this layer encodes discriminative internal representations that enable an effe
#7
Relevance: core_reconstruction (high)
Paper: FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; surface_occupancy
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: 3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of **descriptor tokens**. Global attention is then computed as cro
#8
Relevance: core_reconstruction (high)
Paper: Flow3r: Factored Flow Prediction for Visual Geometry Learning
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: We propose Flow3r, a scalable framework for visual geometry learning that leverages flow prediction to guide learning using unlabeled monocular videos. Current 3D/4D reconstruction systems primarily rely on dense geometry and pose supervision, and cannot easily generalize to diverse dynamic real-world scenes. In this work, we propose a mechanism to augment training directly from unlabeled videos, leveraging dense 2D correspondences (or ‘flow’) between arbitrary image pairs as supervision. Our key insight is that a factored flow prediction module that computes from two images using ‘geometry latents’ from one image and the ‘pose latent’ from the other can guide visual geometry learning. We first highlight the benefits and scalability of flow supervision in controlled settings and then leverage large-scale unlabeled data to improve off-the-shelf visual geometry models. We evaluate Flow3r ac
#9
Relevance: core_reconstruction (high)
Paper: FRM: Linear-Time 3D Reconstruction via Test-Time Training
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: Feed-forward transformer models such as VGGT and $\pi^3$ are highly accurate, but their computational cost grows quadratically with the number of input images, making them slow to evaluate on large collections. More efficient approaches ameliorate this cost at the expense of reconstruction quality. We introduce Fast Reconstruction Model, a stateful feed-forward reconstruction model that uses a bidirectional architecture that scales linearly in the number of input views, while matching or surpassing the reconstruction quality of quadratic-time methods. FRM employs test-time training layers to compress images into a compact hidden scene state during a single forward pass, enabling our model to reconstruct 3D scenes at speeds up to 75 FPS on a single H100 GPU---over 20 times faster than SOTA methods such as VGGT. This hidden state also serves as an implicit scene representation which can be
#10
Relevance: core_reconstruction (high)
Paper: GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance
3D Vision & Geometry / 3D Gaussian Splatting
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy
Reason: core genus=3D Gaussian Splatting with direct reconstruction/geometry signal
Abstract: 3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians still remains a challenge without proper geometric priors. Existing methods have explored to predict point maps as geometric references for inferring Gaussian primitives, while the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain on novel views generated at non-preset camera poses iden
#11
Relevance: core_reconstruction (high)
Paper: Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
#12
Relevance: core_reconstruction (high)
Paper: Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization with appearance embeddings or dynamic masks, requiring extensive per-scene training and failing under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across div
#13
Relevance: core_reconstruction (high)
Paper: Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
Autonomous Driving / Autonomous Driving
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; gaussian_radiance; depth_correspondence; surface_occupancy; robotics_mapping
Reason: VGGT/feed-forward geometry lineage with direct geometry signal
Abstract: Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric s
#14
Relevance: core_reconstruction (high)
Paper: GGPT: Geometry-Grounded Point Transformer
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit sparse-geometry supervision using an optimised guidance encoding. Extensive experiments dem
#15
Relevance: core_reconstruction (high)
Paper: HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; pose_calibration_localization; surface_occupancy
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: Visual Geometry Grounded Transformer (VGGT) has shown significant progress in 3D vision tasks. However, its global attention layers incur quadratic computational cost with respect to the number of input views, becoming a critical bottleneck for scalability. Several sparsification-based acceleration techniques have been proposed to alleviate this issue, but they often suffer from substantial accuracy degradation. We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. In the first stage, we measure head-wise sparsification sensitivity using a novel metric, the Head Sensitivity Score (HeSS), which approxi
#16
Relevance: core_reconstruction (high)
Paper: HTTM: Head-wise Temporal Token Merging for Faster VGGT
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the un
#17
Relevance: core_reconstruction (high)
Paper: LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($Sim(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignmen
#18
Relevance: core_reconstruction (high)
Paper: Learning 3D Reconstruction with Priors in Test Time
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks, without retraining or modifying the pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference. The optimization loss is composed of a self-supervised objective and prior penalty terms. The self-supervised objective is defined as the compatibility among multi-view predictions, implemented by the photometric or geometric loss between the renderings from other views and each view itself. Any available priors are turned into the penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method cons
#19
Relevance: core_reconstruction (high)
Paper: Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
Autonomous Driving / Autonomous Driving
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; pose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmark
Reason: VGGT/feed-forward geometry lineage with direct geometry signal
Abstract: Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals
#20
Relevance: core_reconstruction (high)
Paper: LongStream: Long-Sequence Streaming Autoregressive Visual Geometry
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They typically anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we solve Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination. We propose cache-consistent training combined with
#21
Relevance: core_reconstruction (high)
Paper: MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; surface_occupancy; data_benchmark
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry mod
#22
Relevance: core_reconstruction (high)
Paper: MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability. Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense sem
#23
Relevance: core_reconstruction (high)
Paper: MotionCrafter: Repurposing Video Generators for Dense Geometry and Motion Reconstruction
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; generation_editing
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: We introduce MotionCrafter, the first video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. To represent them effectively in latent space, we propose a 4D VAE that encodes point maps and scene flows as a unified latent compatible with pretrained video generators. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents—despite their fundamentally different distributions—we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in joint 4D geometry reconstruction and dense scene flow estimation, d
#24
Relevance: core_reconstruction (high)
Paper: OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed,
#25
Relevance: core_reconstruction (high)
Paper: Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
3D Vision & Geometry / 3D Reconstruction
Editorial bucket: A. thesis anchor: VGGT/feed-forward geometry
Matched groups: vggt_lineage; general_reconstruction; surface_occupancy
Reason: core genus=3D Reconstruction with direct reconstruction/geometry signal
Abstract: We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips alongside object-level semantics; and 2D–3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation — marking a step forward toward real-time, semantics-aware Spatial AI.
26core_reconstruction
high
Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; surface_occupancy; generation_editing; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractCurrent compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete 360° environments. To address these limitations, we design Pano3DComposer, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to Alignment-VGGT by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geom
27core_reconstruction
high
PhysIR-Splat: Physically Consistent Thermal Infrared Radiative Transfer in 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localizationcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractThermal infrared (TIR) 3D reconstruction provides geometry that is intrinsically coupled to the temperature field, even in low-light, nighttime, and smoke-obscured environments. TIR imaging measures self-emitted thermal radiation driven by object temperature and is largely independent of external illumination; therefore, simply carrying over visible-spectrum assumptions to TIR-based 3D reconstruction and novel view synthesis (NVS) often results in floating artifacts and blurred edges. In addition, radiometric inconsistency and low contrast in TIR weaken structure-from-motion (SfM) initialization, which in turn hinders subsequent 3D Gaussian Splatting (3DGS) optimization. We present PhysIR-Splat, a 3DGS framework that follows infrared radiative transfer: we explicitly model temperature, emissivity, and environmental irradiance on Gaussian primitives and, during rendering, jointly account
28core_reconstruction
high
Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractUnderstanding how the 3D world evolves over time is a fundamental task in computer vision, essential for embodied settings, autonomous driving, etc. It requires not only the reconstruction of the observed scene but also the anticipation of how the scene dynamics will unfold in the future. While the area of 3D reconstruction has progressed rapidly with the advent of recent feed-forward neural networks, forecasting future dynamics in 3D given the 2D frames of a video remains unexplored. We present Point4Cast, a unified framework that processes streaming 2D frame sequences of a video to estimate the past, present, and future of the underlying dynamic scene in 3D. At the core of our approach lies a persistently evolving latent spacetime representation that models the environment’s evolution across time. Upon receiving a new 2D frame, an update operation integrates the incoming evide
29core_reconstruction
high
Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; surface_occupancy; generation_editing; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractRecent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain (from active sensors such as LiDAR or from feed-forward predictors like VGGT), offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a st
30core_reconstruction
high
QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractEstimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camer
31core_reconstruction
high
Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWith recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g. 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up. In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and image sparse correspondences given by zero-shot predictions of pretrained models. At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilita
32core_reconstruction
high
Scal3R: Scalable Test-Time Training for Feed-forward Large-Scale 3D Reconstruction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractThis paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lig
33core_reconstruction
high
Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractNovel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapt
34core_reconstruction
high
SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
Autonomous Driving / Autonomous Driving
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkVGGT/feed-forward geometry lineage with direct geometry signal
abstractThis paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird’s eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state
35core_reconstruction
medium
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
Learning Algorithms / Efficient Models
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractOnline 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. While causal VGGT transformers address this challenge through a key-value (KV) cache mechanism, the linear growth of the cache introduces a significant memory bottleneck. When memory constraints trigger early eviction, reconstruction quality and temporal consistency deteriorate markedly. In this work, we observe that attention patterns in causal transformers for 3D reconstruction exhibit intrinsic spatio-temporal sparsity. Leveraging this insight, we propose STAC, a Spatio-Temporally Aware Cache compression framework specifically designed for streaming 3D reconstruction using large causal transformers. STAC incorporates three key components: a Working Temporal Token Caching mechanism that preserves long-term informative tokens based on decayed cumula
36core_reconstruction
high
TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; gaussian_radiance; dynamic_4dcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractIn this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance o
37core_reconstruction
high
Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
Autonomous Driving / Autonomous Driving
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mappingVGGT/feed-forward geometry lineage with direct geometry signal
abstractRecent advancements in feed-forward architectures for visual geometry estimation have achieved significant progress. Interestingly, per-frame visual geometry estimation approaches typically exhibit weaker multi-frame consistency but demonstrate superior per-frame accuracy compared to multi-frame algorithms. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals three key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furtherm
38core_reconstruction
high
V-DPM: Video Reconstruction with Dynamic Point Maps
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractNew, powerful 3D representations such as DUSt3R’s invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feed-forward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend the concept to dynamic 3D content, also representing 3D scene motion. However, DPMs have so far been limited to image pairs and, like DUSt3R, require post-processing via optimization when more than two views are involved. We argue that DPMs are far more meaningful when applied to videos and introduce V-DPM to demonstrate this. First, we show how to set up DPMs for videos to optimize their representational power, ease of neural prediction, and reuse of pre-trained models. Second, we implement these ideas on top of VGGT, a recent state-of-the-art 3D reconstructor. Although VGGT was trained on static scenes, we show that a small amount of synthetic
39core_reconstruction
high
VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and achieves an 11.6× speed-up over baselines that rely on softmax attention for reconstructing a 1k image collection in just 54 seconds. Because our method retains global scene aggregation capability, our resulting point map reconstruction err
40core_reconstruction
high
VGGT-Ω
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present VGGT-Ω, a feed-forward model for 3D reconstruction that substantially advances the state of the art in accuracy, efficiency, and capability for both static and dynamic scenes. Prior models such as VGGT have shown that feed-forward 3D reconstruction can already be competitive with traditional optimization-based methods. Here, we further demonstrate that the accuracy and robustness of these models scale predictably with model capacity and data size. To enable training 3D reconstruction models at an unprecedented scale, we introduce a high-quality data annotation pipeline that handles dynamic scenes, a self-supervised learning protocol, and architectural changes that greatly reduce memory requirements. We significantly simplify VGGT’s architecture by replacing multiple dense prediction heads with loss-driven multitask learning, removing unstable DPT blocks, and introducing more e
41core_reconstruction
high
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; depth_correspondence; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractThis paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that together form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT’s perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (i
42core_reconstruction
high
VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-view Indoor 3D Object Detection
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractCurrent multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth) are available. The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key c
43core_reconstruction
high
VGGTracker: Fast Spatial Tracking with Visual Geometry Transformer
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractExisting 3D point tracking methods mostly rely on heuristic designs or scene reconstruction, which incurs significant computational overhead and makes it difficult to meet the demands of real-time applications. To address this problem, in this work, we present VGGTracker, a novel spatial tracker that leverages a feed-forward visual geometry transformer to predict the trajectories of arbitrary query points from monocular videos in real time. Specifically, we employ a query initialization mechanism to maintain and update a global feature vector and a set of frame-level feature vectors for each query point. Then, we propose a new spatial tracking framework, which consists of a visual geometry transformer backbone, a global embedding branch, a frame-level embedding branch, and a tracking head. The key innovation lies in the dual-branch embedding design, where the global embedding branch integrat
44core_reconstruction
high
WildPose: A Unified Framework for Robust Pose Estimation in the Wild
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; pose_calibration_localization; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractEstimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume inputs from static environments. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models sometimes perform badly on static-only scenes. We present WildPose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect the two powerful paradigms in modern 3D vision: the rich perceptual frontend of feed-forward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this by enhancing the differentiable BA pipeline in two ways. First, we introduce a new 3D-aware update o
45core_reconstruction
high
DVGT: Visual Geometry Transformer for Autonomous Driving
Autonomous Driving / Autonomous Driving
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; dynamic_4d; robotics_mappingVGGT/feed-forward geometry lineage with direct geometry signal
abstractPerceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, it still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Visual Geometry Transformer specifically designed for autonomous Driving (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. Finally, we use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego pose for each frame. Our DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-
46core_reconstruction
high
OccAny: Generalized Unconstrained Urban 3D Occupancy
Autonomous Driving / Autonomous Driving
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; gaussian_radiance; surface_occupancy; robotics_mappingVGGT/feed-forward geometry lineage with direct geometry signal
abstractRelying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework
47core_reconstruction
high
VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
Remote Sensing & Earth / Remote Sensing
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; pose_calibration_localization; robotics_mappingVGGT/feed-forward geometry lineage with direct geometry signal
abstractAerial-ground visual localization is a challenging task due to the significant differences in scene scale and viewpoint between the two views. In this work, we explore the practical benefit of jointly learning camera calibration and bird’s-eye-view (BEV) projection for estimating the full 6-degree-of-freedom relative camera pose between uncalibrated aerial and ground views. We present Visual Geometry Alignment (VGA), a unified framework that jointly learns a global gravity-alignment prior inferred from dense monocular perspective fields, and a planar alignment prior complementing the unobserved azimuth angle through Procrustes alignment in a shared BEV plane. At inference, we jointly refine the relative camera pose by integrating the predicted per-camera gravity alignment and relative planar azimuth angle, yielding improved orientation and translation alignment from visual input with
48core_reconstruction
high
AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance
Multimodal & Language / Agentic AI
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstructionVGGT/feed-forward geometry lineage with direct geometry signal
abstractActive 3D reconstruction enables an agent to autonomously select viewpoints to build accurate and complete scene geometry efficiently, rather than passively reconstructing scenes from pre-collected images. Existing active reconstruction methods often rely on geometric heuristics, which may result in redundant observations without improving reconstruction quality. To address this, we propose AREA3D, an active reconstruction agent for 3D reconstruction by leveraging feed-forward 3D models and vision-language guidance. The framework decouples view uncertainty modeling from feed-forward reconstruction, enabling precise uncertainty estimation without online optimization. Moreover, the integrated Vision-Language Model provides high-level semantic guidance that guides exploration beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks (Replic
49core_reconstruction
high
4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system e
50core_reconstruction
high
ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractExisting hand-object interaction (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods for articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize the object's metric scale and pose for grounding its normalized mesh i
51core_reconstruction
high
CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractAccurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming a ground-truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and
52core_reconstruction
high
Catch Me if You Can: Active Mapping of Moving 3D Objects
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractCurrent 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object's motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target's trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding.
53core_reconstruction
high
Complet4R: Geometric Complete 4D Reconstruction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe introduce Complet4R, a novel end-to-end framework for Geometric Complete 4D Reconstruction, which aims to recover temporally coherent and geometrically complete reconstruction for dynamic scenes. Our method formalizes the task of Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion, by directly accumulating full contexts onto each frame. Unlike previous approaches that rely on pairwise reconstruction or local motion estimation, Complet4R utilizes a decoder-only transformer to operate on all context globally, directly from sequential video input, reconstructing a complete geometry for every single time step, including occluded regions visible in other frames. Our method demonstrates state-of-the-art performance on our proposed benchmark for Geometric Complete 4D Reconstruction and the 3D point tracking task. Code will be released to support futur
54core_reconstruction
high
Efficiently Reconstructing Dynamic Scenes one D4RT at a Time
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractUnderstanding and reconstructing the complex geometry and motion of dynamic 4D scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward network designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our unified decoding interface allows the model to independently and efficiently probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state-of-the-art, outper
55core_reconstruction
high
EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractStrand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability. To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that
56core_reconstruction
high
FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; dynamic_4d; surface_occupancy; generation_editingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractSingle-image 3D human reconstruction holds significant promise due to its convenience and high demand in various applications. Previous methods have garnered tremendous progress by employing 2D multi-view diffusion models to generate auxiliary views as reconstruction priors, but they struggle with 3D inconsistencies and limited generalization capabilities. In this paper, we present FISHuman, which aims to generate fine-grained, high-fidelity, and content-wise diverse 3D humans from a single-view input, providing production-ready 3D assets. We propose an elaborately designed workflow that reconstructs dynamic 3D meshes from multi-view inconsistent guidance. Specifically, we adapt a dual-stream transformer-based video diffusion model to generate cross-modally aligned multi-view RGB and normal sequences. We find that naively employing static 3D reconstruction can lead to geometric distortio
57core_reconstruction
high
Inferring Compositional 4D Scenes without Ever Seeing One
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractScenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, a
58core_reconstruction
high
MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruc
59core_reconstruction
high
Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractTemporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast optimization method based on a novel preconditioned surface encoding that estimates coherent non-rigid deformations without sacrificing temporal stability or accuracy. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multi-layer percep
60core_reconstruction
high
PatchScene: Patch-based Voxel Diffusion Model for Large-Scale Scene Completion
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-th
61core_reconstruction
high
ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, c
62core_reconstruction
high
ReGenHOI: Unifying Reconstruction and Generation for 3D Human–Object Interaction Understanding
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; dynamic_4d; surface_occupancy; generation_editingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractUnderstanding 3D human–object interaction (HOI) involves two highly-related abilities: reconstruction, which perceives observed geometry, and generation, which imagines plausible future interactions. However, most existing methods treat these abilities as separate tasks, limiting their capacity to capture the unified nature of human spatial reasoning. To address this, we propose a unified framework that bridges reconstruction and generation through a shared semantic–geometric reasoning space. Specifically, a 3D Contact Reasoning mechanism enables direct reasoning in 3D space, jointly modeling geometric structure and semantic relationships, while a Reasoning Trace Refinement module iteratively refines contact predictions by integrating geometric and semantic cues. The framework builds a unified latent representation via explicit reasoning on human–object contact regions. To further enhanc
63core_reconstruction
medium
Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling
3D Vision & Geometry / Pose Estimation
A. thesis anchor: dynamic/4D recongeneral_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractVisual SLAM is one of the most fundamental problems in computer vision, with direct applications to real-time localization tasks such as AR/VR, robotics, and 3D scene reconstruction. Although significant progress has been made in both sparse and dense approaches, real-time monocular SLAM remains challenging—particularly in the uncalibrated setting, where existing methods are often inefficient and lack modularity. In this paper, we present a new visual SLAM pipeline implemented from scratch in C++ that explicitly leverages the spatio-temporal structure of the scene for improved localization, and is designed to be modular so that off-the-shelf components can be easily integrated. We introduce a temporal representation based on a buffer of recent keyframes that preserves short-term scene continuity. To complement this, we incorporate a spatial representation based on a 3D cell-based scene m
64core_reconstruction
high
Vista4D: Video Reshooting with 4D Point Clouds
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quali
65core_reconstruction
high
WorldStereo: Bridging Controllable Video Generation and Scene Reconstruction via 3D Geometric Memories
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; generation_editingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractRecent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-co
66core_reconstruction
medium
TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
3D Vision & Geometry / Pose Estimation
A. thesis anchor: dynamic/4D recongeneral_reconstruction; pose_calibration_localization; dynamic_4ddirect reconstruction/3DGS/4D title linked to core representation cluster
abstractReconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human–scene–camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES (Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos), a unified framework tailored for this task. TROPHIES features a Human Branch that models humans through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment
67core_reconstruction
high
Any4D: Unified Feed-Forward Metric 4D Reconstruction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; dynamic_4d; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordi
68core_reconstruction
high
4DSurf: High-Fidelity Dynamic Scene Surface Reconstruction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractThis paper addresses the problem of dynamic scene surface reconstruction using Gaussian Splatting (GS), aiming to recover temporally consistent geometry. While existing GS-based dynamic surface reconstruction methods can yield superior reconstruction, they are typically limited to either a single object or objects with only small deformations, struggling to maintain temporally consistent surface reconstruction of large deformations over time. We propose "4DSurf", a novel and unified framework for generic dynamic surface reconstruction that does not require specifying the number or types of objects in the scene, can handle large surface deformations and temporal inconsistency in reconstruction. The key innovation of our framework is the introduction of Gaussian deformations induced Signed Distance Function Flow Regularization that constrains the motion of Gaussians to align with the evo
69core_reconstruction
high
$L^{2}DGS$: Low-Light Dynamic Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractSynthesizing novel spatiotemporal views of dynamic scenes is inherently challenging due to both object and camera motion, as well as sparsity of observations. Recent advances in Neural Radiance Fields (NeRFs) and Gaussian Splatting (GS) have enabled 4D dynamic scene reconstruction, but predominantly from well-lit images or videos. Some works address the problem of reconstructing a well-lit scene from low-light input, but these are limited to static scenes. Moreover, prior methods primarily emphasize improving illumination, while overlooking the underlying scene characteristics. Reconstructing well-lit dynamic scenes from inputs captured under low-light conditions is particularly challenging due to shadows, occlusions, and disocclusions caused by object motion, which makes the problem highly ambiguous and ill-posed. We propose $L^{2}DGS$ (Low-Light Dynamic Gaussian Splatting), a self-supe
70core_reconstruction
high
3D Gaussian Splatting from unposed Spike Stream
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4dcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has significantly advanced 3D reconstruction with its impressive performance. However, its reliance on sharp images and precise camera pose priors limits its effectiveness in high-speed scenarios. Recent advances have integrated spike camera, a bio-inspired sensor with a high temporal resolution, to enhance 3DGS in such conditions. Although spike-based methods reduce the need for sharp images, they still face challenges in achieving precise camera pose estimation due to unstable observations and visual texture deficiency. To address these challenges, we propose Nope-SGS, the first framework that reconstructs high-speed 3D scenes from unposed captures of the bio-inspired high-temporal-resolution spike camera. To achieve robust 3D reconstruction and pose estimation, we first reformulate the spike model from a probabilistic perspective and extend its applicat
71core_reconstruction
high
3D Gaussian Splatting with Self-Constrained Prior for High Fidelity Surface Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRendering 3D surfaces has been revolutionized within the modeling of radiance fields through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constrain the movement of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is a TSDF grid fused by the rendered depth during the learning of 3D Gaussians. The prior measures a band on both sides of the estimated surface for imposing more specific constraints on the right 3D Gaussians, such as removing 3D Gaussians outside the band, encouraging larger opacity for Gaussians near the center of the band or smaller opacity for Gaussians near the boundary of the band. We regularly update the prior by fusing more rec
72core_reconstruction
high
4C4D: 4 Camera 4D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4dcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractThis paper tackles the challenge of recovering 4D dynamic scenes from videos captured by as few as four portable cameras. Learning to model scene dynamics for temporally consistent novel-view rendering is a foundational task in computer graphics, where previous works often require dense multi-view captures using camera arrays of dozens or even hundreds of views. We propose 4C4D, a novel framework that enables high-fidelity 4D Gaussian Splatting from video captures of extremely sparse cameras. Our key insight is that geometric learning under sparse settings is substantially more difficult than modeling appearance. Driven by this observation, we introduce a Neural Decaying Function on Gaussian opacities for enhancing the geometric modeling capability of 4D Gaussians. This design mitigates the inherent imbalance between geometry and appearance modeling in 4DGS by encouraging the 4DGS
73core_reconstruction
high
4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstract4D reconstruction of the equine family (e.g., horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPose
74core_reconstruction
high
ActivePolicy: Active Gaussian Reconstruction and Optimization Strategy Based on Global-Local Information Gain
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractActive 3D Gaussian reconstruction achieves superior completeness and rendering quality by intelligently selecting viewpoints. However, existing methods suffer from two critical limitations: information gain metrics that prioritize geometric coverage while ignoring rendering quality, and overfitting to sparse view configurations that degrades novel view synthesis. We introduce ActivePolicy, a novel framework addressing both challenges through principled NBV selection and regularization. We propose GL-Graph, a graph-theoretic strategy that unifies geometric consistency, rendering quality, and observation redundancy into a single stability criterion. To counteract overfitting, we introduce 4D-Reg, which identifies floaters through manifold discrepancies among three depth types (R-Depth, $\alpha$-Depth, C-Depth) and suppresses them via adaptive dropout. Extensive experiment
75core_reconstruction
high
AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractMonocular UAV videos pose a fundamental challenge for 3D reconstruction: dynamic scene modeling requires accurate camera poses, yet recovering poses from long UAV trajectories often fails under texture-sparse regions and moving objects. Existing approaches typically handle either pose-free static reconstruction or dynamic reconstruction with known poses, but jointly solving both from casual aerial footage remains difficult due to motion coupling and severe scale variation. We introduce AeroGS, a scale-aware Gaussian splatting framework that jointly recovers camera trajectories and reconstructs dynamic scenes from pose-free monocular videos. Central to our method are scale-aware spatio-temporal anchors (S$^2$A-Anchors), which enable a unified optimization via three key decoupling mechanisms: (i) separating ego-motion from object motion, (ii) isolating static geometry from temporal deformat
76core_reconstruction
high
AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractScene-level 3D reconstruction has attracted increasing attention, and feed-forward 3D Gaussian Splatting (3DGS) has emerged as a promising paradigm for novel view synthesis. However, most existing methods adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, making the number of Gaussians tightly coupled with the input images. This leads to several limitations: (i) reconstruction quality is sensitive to the quantity and viewpoint coverage of input images, often causing Gaussians to accumulate more densely in regions with frequent viewpoints; (ii) alignment errors become more pronounced under sparse-view conditions; and (iii) the lack of explicit geometric consistency can degrade depth estimation and downstream 3D tasks. In this paper, we propose AnchorSplat, a novel multi-view feed-forward 3DGS framework for scene-level reconstruction that departs from pixel-aligned
77core_reconstruction
high
ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects. To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while
78core_reconstruction
high
BA-GS: Bayesian Adaptive Gaussian Splatting for SFM-Free 3D Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has demonstrated exceptional performance in reconstruction and novel view synthesis tasks. However, its reliance on Structure-from-Motion preprocessing may lead to degraded performance under sparse-view scenarios. Recent works attempt to address this limitation by leveraging pre-trained image matching models to generate Gaussian primitives but overlook the probabilistic uncertainty embedded in both the initial primitive distribution and iterative position updates. This uncertainty can accumulate and degrade reconstruction fidelity. Hence, we propose BA-GS, a Bayesian framework that models both the global distribution and local uncertainty of Gaussian primitives. At global initialization, a Variational Bayesian Gaussian Mixture Model (VB-GMM) models the latent distribution of primitives, capturing region-wise density and gradient patterns. At local refinement,
79core_reconstruction
high
BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractThe boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering B-rep representation from unstructured data is a challenging and valuable task of computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a specific fitting strategy. To disentangle geometry reconstruction and feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Ex
80core_reconstruction
high
CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present a compact pipeline for high-fidelity hair reconstruction from multi-view images. While recent 3D Gaussian Splatting (3DGS) methods achieve realistic results, they often require millions of primitives, leading to high storage and rendering costs. Observing that hair exhibits structural and visual similarities across a hairstyle, we cluster strands into representative hair cards and group these into shared texture codebooks. Our approach integrates this structure with 3DGS rendering, significantly reducing reconstruction time and storage while maintaining comparable visual quality. In addition, we propose a generative prior accelerated method to reconstruct the initial strand geometry from a set of images. Our experiments demonstrate a 4-fold reduction in strand reconstruction time and achieve comparable rendering performance with over 200× lower memory footprint.
81core_reconstruction
high
ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractDynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with hig
82core_reconstruction
high
Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondencecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation; translation; scale), even when they are of different objects in the same category (e.g., different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g., the same car) and often must be given true scale as input, while we estimate it successfully. Our approach leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns models while keeping the 3DGS models fixed. First, we perform an iterative, feature-guided coarse registration that is robust to extremely poor initialization (e.g., 180° misalignment or a 10× scale gap), followed by a fine registration step enforcing multi-view feature consistency, inspired
83core_reconstruction
high
CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractSingle-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advancements, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, often yielding subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, laden with challenges such as: 1) extensive occlusions, 2) low clarity, and 3) numerous and various appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables the pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs. Furthermore, we in
84core_reconstruction
high
Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractNovel view synthesis has been significantly advanced by NeRFs and 3D Gaussian Splatting (3DGS), which require ordering volumetric samples or primitives for correct color blending. While the recent Gaussian-Enhanced Surfels (GES) enable high-performance, sort-free rendering, they suffer from aliasing artifacts and suboptimal reconstruction. To address these limitations, we propose DP-GES, a novel representation that augments opaque surfels with semi-transparent boundaries and leverages Depth Peeling to establish accurate per-pixel ordering. This design enables sort-free Gaussian splatting with correct transmittance modulation, effectively eliminating aliasing and popping artifacts while facilitating a fully differentiable joint optimization. Extensive experiments demonstrate that our method achieves superior reconstruction quality and compares favorably against state-of-the-art techniques
85core_reconstruction
high
DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRadiance field reconstruction aims to recover high-quality 3D representations from multi-view RGB images. Recent advances, such as 3D Gaussian splatting, have achieved real-time rendering with high visual fidelity, given sufficiently powerful graphics hardware. However, drastic model simplification (i.e., reducing the number of primitives by several orders of magnitude) is required to enable efficient online transmission and rendering across diverse hardware platforms. We introduce DiffSoup, a radiance field representation that employs a soup (i.e., a highly unstructured set of primitives) of a small number of triangles with neural textures that have binary opacity. We show that the binary opacity representation is directly differentiable via stochastic opacity masking, enabling stable training without a mollifier (i.e., smooth rasterization). DiffSoup can be rasterized with a traditional depth-
86core_reconstruction
high
Disco-GS: Gaussian Splatting in Dynamic Color Lighting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent advances in Gaussian Splatting (GS) have significantly improved 3D scene reconstruction and novel view synthesis. However, most existing methods typically assume that training inputs are captured under stable lighting conditions and achromatic light. In contrast, scenes recorded under temporally varying color light, as in “disco lights” commonly seen in events, performances, and decorative settings, introduce severe ambiguities in both scene photometry and geometry. We propose Disco-GS, a framework that leverages GS for reconstructing the 3D scene while simultaneously recovering the underlying canonical appearance from videos captured under dynamic lighting conditions. Disco-GS estimates the effective per-pixel transient light, which, when applied to the canonical image, results in the observed color image of the scene, thereby enabling self-supervised learning. Disco-GS is an end
87core_reconstruction
high
Distilling Unsigned Distance Function for Surface Reconstruction from 3D Gaussian Splatting
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractUnsigned distance fields (UDFs) are well suited for representing open surfaces, but learning them from multi-view images is challenging because ground-truth surfaces are unavailable for supervision in most cases and the gradient of a UDF is undefined on the underlying surface. Prior methods optimize UDFs with global objectives and apply gradient-based priors, ignoring the non-differentiability for queries on the target surface, which leads to unstable training and over-smoothing of fine details. We address these issues by distilling a patch-based UDF prior, trained on synthetic ground-truth algebraic surfaces with closed-form expressions, into a lightweight student UDF inside the Gaussian optimization process. We design a band-limited knowledge distillation strategy that leverages a pretrained patch-based UDF predictor to provide reliable near-surface UDF supervision, enabling stable student t
88core_reconstruction
high
Dropping Anchor and Spherical Harmonics for Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent 3D Gaussian Splatting (3DGS) dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic coefficients (SH) to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-based dropout strategy. Rather than dropping Gaussians independently, our method randomly selects certain Gaussians as anchors and simultaneously removes their spatial neighbors. This effectively disrupts local redundancies and encourages the model to learn more robust, globally informed representations. Furthermore, we extend the dropout to color attributes by randomly dropping higher-degree SH coe
89core_reconstruction
high
DualPrim: Compact 3D Reconstruction with Positive and Negative Primitives
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present Compact 3D Reconstruction with Positive and Negative Primitives (DualPrim), a novel approach for reconstructing compact and topologically regular 3D meshes from multi-view images. Unlike traditional methods that rely on implicit representations such as signed distance functions, or explicit formats such as meshes and point clouds, our method models geometry using quadrics-based 3D primitives. Each primitive is defined by a positive-density superquadric that contributes to the shape, and a negative-density superquadric that carves out local volumes, enabling fine-grained geometric control and flexible topology. This dual-primitive representation yields compact, well-regularized, and efficiently parameterized mesh reconstructions. To infer primitive parameters from multi-view images, we design a differentiable rendering pipeline that jointly estimates positive and negative super
90core_reconstruction
high
E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractThe emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods still rely on known poses or depend on depth estimation models and auxiliary modalities such as RGB-D. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS. To extract edges from noisy event streams, we
91core_reconstruction
high
EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; robotics_mappingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractUnderstanding a 3D scene immediately with its exploration is essential for embodied tasks, where an agent must construct and comprehend the 3D representation in an online and nearly real-time manner. In this study, we propose EmbodiedSplat, an online feed-forward 3DGS for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from the streaming images. Unlike existing open-vocabulary 3DGS methods, our objectives are two-fold: 1) Reconstructs the semantic-embedded 3DGS of the entire scene from over 300 streaming images in an online manner. 2) Highly generalizable to novel scenes with feed-forward design and supports nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook where it binds the 2D C
92core_reconstruction
medium
Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow
3D Vision & Geometry / Pose Estimation
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractHigh-quality 3D scene representation in radiance fields relies on accurate camera poses which are often difficult to acquire in real-world scenarios. An effective solution is to use RGB images for the joint optimization of radiance fields and camera poses, an approach that has been well explored in NeRF series methods. However, unlike NeRF, joint optimization in 3D Gaussian Splatting (3DGS) often requires additional regularization or prior spatial knowledge to reach comparable performance. To eliminate these dependencies, we introduce Energy-GS, a pose-aware Gaussian splatting framework that jointly optimizes scene representation and camera poses using only RGB images. We observe that pose gradients in joint optimization are unstable due to the point-based rendering mechanism. Furthermore, unlike NeRF’s spatial sampling framework that enables coarse-to-fine pose alignment, rasterization-
93core_reconstruction
high
FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractThe demand for dynamic 3D assets in AR/VR has recently popularized Deformable Gaussian Splatting. However, traditional RGB cameras are limited in their ability to reconstruct high-speed scenes due to motion blur and low temporal resolution. While event cameras offer a promising alternative, reconstructing a complete scene from their sparse and noisy output is a significant challenge. Existing event-based methods rely on an auxiliary sensor, such as a frame camera, thereby inducing tedious hardware and calibration challenges. We introduce FastEventDGS, a novel Deformable Gaussian Splatting-based framework that leverages a single event camera for high-fidelity 4D reconstruction. Our method utilizes a continuous camera trajectory parametrization and integrates two event generation models to provide both photometric and geometric constraints. We further propose a local patch event motion loss
94core_reconstruction
high
FastGS: Training 3D Gaussian Splatting in 100 Seconds
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractThe dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.29× training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45× acceleration
95core_reconstruction
high
FHAvatar: Fast and High-Fidelity Reconstruction of Face-and-Hair Composable 3D Head Avatar from Few Casual Captures
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; generation_editingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present FHAvatar, a novel framework for reconstructing 3D Gaussian avatars with composable face and hair components from an arbitrary number of views. Unlike previous approaches that couple facial and hair representations within a unified modeling process, we explicitly decouple the two components in texture space by representing the face with planar Gaussians and the hair with strand-based Gaussians. To overcome the limitations of existing methods that rely on dense multi-view captures or costly per-identity optimization, we propose an aggregated transformer backbone to learn geometry-aware cross-view priors and head-hair structural coherence from multi-view datasets, enabling effective and efficient feature extraction and fusion from few casual captures. Extensive quantitative and qualitative experiments demonstrate that FHAvatar achieves state-of-the-art reconstruction quality from on
96core_reconstruction
high
FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchor to aggregate flexible input-number-agnostic, camera-pose-free and expression-free inputs into a robust canonical 3D representation. For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which can produce detailed expression-dependent deformations in real time. To better capture rare but critical expressions like wrinkles and bared teeth, we also adopt a data distribution adjustment strategy during training to balance the distribution of these expressions in the training set. Moreover, a lightweight 10-second refinemen
97core_reconstruction
high
FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReal objects inhabit a physical world and must behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. We consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object function, beyond visual cues? We propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to
98core_reconstruction
high
FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractThe increasing need for augmented reality and robotics calls for articulated object reconstruction with high scalability. However, the existing settings of reconstructing from discrete articulation states or casual monocular video need non-trivial axis alignment or suffer from insufficient coverage, limiting their applications. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under a free-moving scenario, a new setting with a simpler setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, free-moving part segmentation discovers rigid parts from relative motion in unconstrained capture. The joint estimation module proposes a noise-resistant approa
99core_reconstruction
high
From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractIn this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. Given uncalibrated and unposed multi-view images, NAS3R reconstructs 3D Gaussian primitives from context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates scene reconstruction and camera estimation within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior result
100core_reconstruction
medium
From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting, but title/abstract signal is narrower
abstractFeed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity
101core_reconstruction
high
FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractGaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstruction from free sparse-view images often leads to poor surfaces due to limited overlap and overfitting. We introduce FSFSplatter for fast geometrically accurate reconstruction from free sparse-view images. Our method integrates end-to-end dense Gaussian scene initialization and geometry-enhanced scene optimization. Specifically, FSFSplatter employs a large transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a batch-based self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting by leveraging depth and multi-view f
102core_reconstruction
high
GaussFusion: Improving 3D Reconstruction in the Wild with Geometry-Informed Video Generator
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; generation_editingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generaliz
103core_reconstruction
high
Gaussian Mapping for Evolving Scenes
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractMapping systems with novel view synthesis (NVS) capabilities are widely used in computer vision, as well as in various applications, including augmented reality, robotics, and autonomous driving. Most notably, 3D Gaussian Splatting-based systems show high NVS performance; however, many current approaches are limited to static scenes. While recent works have begun addressing short-term dynamics (motion within the camera's view), long-term dynamics (the scene evolving through changes out of view) remain less explored. To overcome this limitation, we introduce a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest changes. In addition, since maintaining geometric and semantic consistency remains challenging due to stale observations disrupting the reconstruction process, we propose a novel keyframe management mechanism that discards outdate
104core_reconstruction
high
GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs. To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution. To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks&Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and r
105core_reconstruction
high
GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractReconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to distill strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are distilled into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-
106core_reconstruction
high
Geometric-Photometric Event-based 3D Gaussian Ray Tracing
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractEvent cameras offer higher temporal resolution than traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage the fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves state-of-the-art performance on the real-world datasets and competitive performance on the synthetic datasets. Also, the proposed method works without prior information (e.g., pretrained image reconstruction mod
107core_reconstruction
high
GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian splatting (3DGS) has emerged as a promising approach for high-fidelity 3D scene representation. However, relighting and composition of Gaussian splatting remain challenging because path tracing is not directly applicable. Existing relighting methods for Gaussian splatting typically adopt either approximate rendering formulations or rely on Gaussian ray tracing, yielding low relighting performance and low rendering efficiency. To address these limitations, we propose Gaussian hybrid path tracing (GHPT), a three-stage framework to acquire relightable Gaussian splatting models. The first stage utilizes planar-based Gaussian splatting reconstruction representation (PGSR) to enable multi-view consistent depth rendering and reconstruct the surface mesh of a scene. The second stage performs physically-based differentiable rendering on the obtained mesh to reconstruct the material ma
108core_reconstruction
high
GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractFeed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine the current 3D scene using rendering evidence, achieving a favorable balance between efficie
109core_reconstruction
high
GOR-IS: 3D Gaussian Object Removal In the Intrinsic Space
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; generation_editing; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent advances in Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have made it standard practice to reconstruct 3D scenes from multi-view images. Removing objects from such 3D representations is a fundamental editing task that requires complete and seamless inpainting of occluded regions, ensuring consistency in geometry and appearance. Although existing methods have made notable progress in improving inpainting consistency, they often neglect global lighting effects, leading to physically implausible results. Moreover, these methods struggle with view-dependent non-Lambertian surfaces, where appearance varies across viewpoints, leading to unreliable inpainting. In this paper, we present 3D Gaussian Object Removal in the Intrinsic Space (GOR-IS), a novel framework for physically consistent and visually coherent 3D object removal. Our approach decompose
110core_reconstruction
high
GP-4DGS: Probabilistic Analysis of 4D Gaussian Splattings for Monocular Video Reconstruction via Variational Gaussian Processes
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present GP-4DGS, a probabilistic framework for monocular video reconstruction that models the motion of 4D Gaussian Splatting (GS) primitives using variational Gaussian Processes (GPs). In contrast to prior approaches that depend on manually designed motion priors, our kernel-based probabilistic formulation enables flexible, data-adaptive motion modeling while implicitly providing appropriate priors for unobserved regions. GP-4DGS employs variational GPs with spatial kernels to capture geometric correlations and periodic kernels to characterize temporal dynamics, achieving efficient scalability to large sets of primitives compared to standard GPs. To train GP-4DGS, we introduce an optimization strategy that jointly optimizes GS primitive parameters as well as GP hyperparameters, establishing a complementary relationship between probabilistic and geometric modeling. Beyond improved rec
111core_reconstruction
high
HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractDiffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content (artifacts inconsistent with the input views) into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning
112core_reconstruction
high
Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractRecent advances in novel view synthesis have enabled differentiable rendering methods to reconstruct 3D scenes directly from images. Algorithms such as 3D Gaussian Splatting and RayGauss use local basis functions to represent radiance fields, enabling fast, high-quality rendering of real-world scenes. However, these methods lack an exact geometric representation of the scene. In this work, inspired by Hermite Radial Basis Function (HRBF) implicits, we introduce a global implicit function constructed from local RBFs and their derivatives to represent surfaces. The proposed formulation enables learning scene geometry through differentiable rendering of an implicit function. By leveraging local basis functions, it achieves both an efficient geometric representation and fast rendering, using a bounding volume hierarchy (BVH) to accelerate intersections with the local basis functions. The imp
113core_reconstruction
high
HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has recently emerged as a promising approach in novel view synthesis, combining photorealistic rendering with real-time efficiency. However, its success heavily relies on dense camera coverage; under sparse-view conditions, insufficient supervision leads to irregular Gaussian distributions, characterized by globally sparse coverage, blurred background, and distorted high-frequency areas. To address this, we propose HeroGS (Hierarchical Guidance for Robust 3D Gaussian Splatting), a unified framework that establishes hierarchical guidance across the image, feature, and parameter levels. At the image level, sparse supervision is converted into pseudo-dense guidance, globally regularizing the Gaussian distributions and forming a consistent foundation for subsequent optimization. Building upon this, Feature-Adaptive Densification and Pruning (FADP) at the feature leve
114core_reconstruction
high
IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractGeneralizable 3D Gaussian Splatting aims to directly predict Gaussian parameters using a feed-forward network for scene reconstruction. Among these parameters, Gaussian means are particularly difficult to predict, so depth is usually estimated first and then unprojected to obtain the Gaussian sphere centers. Existing methods typically rely solely on a single warp to estimate depth probability, which hinders their ability to fully leverage cross-view geometric cues, resulting in unstable and coarse depth maps. To address this limitation, we propose IDESplat, which iteratively applies warp operations to boost depth probability estimation for accurate Gaussian mean prediction. First, to eliminate the inherent instability of a single warp, we introduce a Depth Probability Boosting Unit (DPBU) that integrates multi-level epipolar attention maps in a multiplicative manner. Next, we construct a
115core_reconstruction
high
Illumination-Consistent Human-Scene Reconstruction from Monocular Video
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReconstructing 3D humans and scenes from monocular videos is a challenging task, particularly due to human motion, varying illumination, and dynamic scene shadows. While recent works have explored scene disentanglement by jointly modeling humans and their surrounding scenes, they often overlook illumination and shadow effects—resulting in inconsistent human appearance and degraded scene realism. To address this gap, we propose a photometrically consistent integration of human and scene reconstruction based on 3D Gaussian Splatting, with a key focus on modeling spatially-varying illumination and shadows. Central to our method is a learnable light volume that provides localized lighting cues to human Gaussians, enabling more realistic and consistent appearance synthesis. To further ensure accurate human geometry and alignment, we adopt a two-stage reconstruction strategy: we first optimize
116core_reconstruction
high
iLRM: An Iterative Large 3D Reconstruction Model
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractFeed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input images to enabl
117core_reconstruction
high
iSplat: Iterative Learning for Fine-Grained Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent advances in feed-forward 3D Gaussian splatting have demonstrated remarkable efficiency by reconstructing scenes in a single pass. However, the reconstruction fidelity of these methods lags behind that of traditional optimization-based approaches, which gradually correct reconstruction flaws through a lengthy iterative process. In this paper, we leverage the strengths of both paradigms and introduce iSplat, a novel framework that reformulates reconstruction as an iterative feed-forward process involving multiple (typically three) passes. Central to iSplat is a recurrent GRU-based optimizer that refines both geometry and appearance in a synergistic loop. To address geometric inaccuracies, we propose an uncertainty-driven depth refinement strategy that progressively narrows the search space for each Gaussian based on its estimated uncertainty from the previous step. To further improve
118core_reconstruction
high
Layered 4D-Rotor Gaussian Splatting: A Compressed Representation for Long Dynamic Scenes
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe address the challenge of reconstructing long dynamic scenes from multi-view videos in a storage-efficient manner. Recent advances in Gaussian Splatting and its extensions to dynamic scenes have demonstrated impressive visual quality, but remain limited to short duration (<10 s), large storage size (>500 MB), and high GPU VRAM usage. To overcome these limitations, we introduce Layered 4D-Rotor Gaussian Splatting (L4DRotorGS), a novel compressed representation designed for long dynamic scenes. Our approach integrates a layered 4D representation, efficient training, and effective compression into a unified framework. Specifically, 4D Gaussians are first organized into layers based on their temporal extents and then partitioned into discrete temporal buckets. This structure allows for selective access and rendering of only the necessary subsets of 4D Gaussians, substantially reducing GPU m
119core_reconstruction
high
Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mappingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractRobust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural informatio
120core_reconstruction
high
Learning Compact 3D Representations from Feed-Forward Novel View Synthesis
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReconstructing and understanding 3D scenes from sparse views in a feed-forward manner remains challenging. While recent approaches use per-pixel 3D Gaussian Splatting for reconstruction and 2D-to-3D feature lifting for scene understanding, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation. We propose a feed-forward framework that estimates compact Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns to efficiently lift features. Extensive experiments on 3D open-vocabulary segmentation and view-invariant feature generation demon
121core_reconstruction
high
LumiMotion: Improving Gaussian Relighting with Scene Dynamics
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractIn 3D reconstruction, the problem of inverse rendering, namely recovering the illumination of the scene and the material properties, is fundamental. Existing Gaussian Splatting-based methods primarily target static scenes and often assume simplified or moderate lighting to avoid entangling shadows with surface appearance. This limits their ability to accurately separate lighting effects from material properties, particularly in real-world conditions. We address this limitation by leveraging dynamic elements (regions of the scene that undergo motion) as a supervisory signal for inverse rendering. Motion reveals the same surfaces under varying lighting conditions, providing stronger cues for disentangling material and illumination. This thesis is supported by our experimental results, which show we improve LPIPS by 23% for albedo estimation and by 15% for scene relighting relative to nex
122core_reconstruction
medium
LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; generation_editing; data_benchmarkcore genus=3D Reconstruction, but title/abstract signal is narrower
abstractWe present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and
123core_reconstruction
high
MAPo: Motion-Aware Partitioning of Deformable 3D Gaussian Splatting for High-Fidelity Dynamic Scene Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4dcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting, known for enabling high-quality static scene reconstruction with fast rendering, is increasingly being applied to multi-view dynamic scene reconstruction. A common strategy involves learning a deformation field to model the temporal changes of a canonical set of 3D Gaussians. However, these deformation-based methods often produce blurred renderings and lose fine motion details in highly dynamic regions due to the inherent limitations of a single, unified model in representing diverse motion patterns. To address these challenges, we introduce Motion-Aware Partitioning of Deformable 3D Gaussian Splatting (MAPo), a novel framework for high-fidelity dynamic scene reconstruction. Its core is a dynamic score-based partitioning strategy that distinguishes between high- and low-dynamic 3D Gaussians. For high-dynamic 3D Gaussians, we recursively partition them temporally an
124core_reconstruction
high
Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object’s complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object’s overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We
125core_reconstruction
high
MeshSplatting: Differentiable Rendering with Opaque Meshes
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractPrimitive-based splatting methods like 3D Gaussian Splatting (3DGS) have revolutionized novel view synthesis with real-time rendering. However, their point-based representations remain incompatible with mesh-based pipelines that power AR/VR and game engines. We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering. By enforcing connectivity via restricted Delaunay triangulation and refining surface consistency, MeshSplatting creates end-to-end smooth, visually high-quality meshes that render efficiently in real-time 3D engines. On Mip-NeRF360, it boosts PSNR by +0.69 dB over the current state-of-the-art MiLo for mesh-based novel view synthesis, while training 2x faster and using 2x less memory, bridging neural rendering and interactive 3D graphics for seamless real-time scene interaction.
126core_reconstruction
high
MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractRecently, 3D Gaussian Splatting and its derivatives have achieved significant breakthroughs in large-scale scene reconstruction. However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. Our method is built upon a distributed 2D Gaussian Splatting representation as the core foundation, serving as a unified backbone for subsequent modules. To handle potential sparse regions in complex scenes, we propose a structured dense enhancement scheme that utilizes SfM priors and a pointmap model to achieve a denser initialization, while incorporating a sparsity compensation mechanism to improve reconstruction completeness. Furthermore, we design a progressive hybrid geometric optimization strategy that
127core_reconstruction
high
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; robotics_mappingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractOpen-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs. free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-base
128core_reconstruction
high
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractRecent advances in 4D Gaussian Splatting (4DGS) have extended the high-speed rendering capability of 3D Gaussian Splatting (3DGS) into the temporal domain, enabling real-time rendering of dynamic scenes. However, one of the major remaining challenges lies in modeling long-range motion-contained dynamic videos, where a naïve extension of existing methods leads to severe memory explosion, temporal flickering, and failure to handle appearing or disappearing occlusions over time. To address these challenges, we propose a novel 4DGS framework characterized by an Anchor Relay-based Bidirectional Blending (ARBB) mechanism, named MoRel, which enables temporally consistent and memory-efficient modeling of long-range dynamic scenes. Our method progressively constructs locally canonical anchor spaces at key-frame time index and models inter-frame deformations at the anchor level, enhancing temporal c
129core_reconstruction
high
MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4dcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractOnline reconstruction of dynamic scenes aims to learn from streaming multi-view inputs under low-latency constraints. The fast training and real-time rendering capabilities of 3D Gaussian Splatting have made on-the-fly reconstruction practically feasible, enabling online 4D reconstruction. However, existing online approaches, despite their efficiency and visual quality, fail to learn per-Gaussian motion that reflects true scene dynamics. Without explicit motion cues, appearance and motion are optimized solely under photometric loss, causing per-Gaussian motion to chase pixel residuals rather than true 3D motion. To address this, we propose MoRGS, an efficient online per-Gaussian motion reasoning framework that treats Gaussian movement as a core modeling object. Specifically, we efficiently leverage optical flow on a sparse set of key views as a lightweight motion cue to guide per-Gaussi
130core_reconstruction
high
MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4dcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting. Monocular reconstruction is inherently ill-posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage. Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings. To enable compact representations, fast training, and real-time rendering
131core_reconstruction
high
MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRealistic reconstruction of dynamic 4D scenes is essential for understanding the physical world. Despite recent progress in monocular view synthesis, existing methods still struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences, enabling faithful reconstruction of high-fidelity scene structures and coherent motion representation under complex dynamics. To handle motion with arbitrary flexibility and long-term variation, we introduce a scalable motion field built upon cluster-based bases that adaptively grow to capture diverse motion patterns over time. Moreover, we introduce a progressive optimization strategy that extends naturally to unseen frames. This strategy comprises two propagation modules: 1)
132core_reconstruction
high
MSCD-GS: Motion-Separated Cooperative Deblurring Dynamic Reconstruction via Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractAlthough 4D reconstruction based on Gaussian Splatting has achieved many impressive results, reconstructing real-world images captured by a casual monocular camera remains a significant challenge. In dynamic scenes, as the camera and objects move during the exposure time, these input images inevitably contain a considerable amount of motion blur, which severely compromises the quality of reconstruction and new viewpoint synthesis. The existing deblurring 3D Gaussian models still cannot handle motion blur issues in real dynamic scenes. To address these challenges, we propose MSCD-GS—a novel method for motion-separated collaborative deblurring 4D reconstruction via Gaussian Splatting, capable of effectively handling motion-blurred inputs. Specifically, due to the distinct motion characteristics of static and dynamic Gaussians, we perform separate motion modeling to achieve dynamic scene re
133core_reconstruction
high
MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractGeneralizable Neural Radiance Fields (GeNeRF) enable high-quality scene reconstruction from a limited number of views and can generalize to unseen scenes. However, in real-world environments, transient distractors disrupt structural consistency across views, leading to deviated supervision signals and degraded reconstruction quality. Existing distractor-free NeRF methods rely on per-scene optimization and estimate uncertainty from per-view reconstruction errors to remove distractors, but this is unreliable for GeNeRF, because it may misjudge inconsistent static structures from source views as distractors. To address this issue, we propose MU-GeNeRF, a multi-view uncertainty-guided distractor-aware GeNeRF method, aiming to effectively alleviate GeNeRF's robust modeling challenges in dynamic scenes with transient distractions. We explicitly decompose distractor awareness into two complemen
134core_reconstruction
high
Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; generation_editingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent years have witnessed the rapid emergence of 3D Gaussian Splatting (3DGS) as a powerful approach for 3D reconstruction and novel view synthesis. Its explicit representation with Gaussian primitives enables fast training, real-time rendering, and convenient post-processing such as editing and surface reconstruction. However, 3DGS suffers from a critical drawback: the number of primitives grows drastically for scenes with high-frequency appearance details, since each primitive can represent only a single color, requiring multiple primitives for every sharp color transition. To overcome this limitation, we propose Neural Gabor splatting, which augments each Gaussian primitive with a lightweight multi-layer perceptron (MLP) that models a wide range of color variations within a single primitive. To further control primitive numbers, we introduce a frequency-aware densification strategy t
135core_reconstruction
high
No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: getting aligned RGB-X data, where most RGB-X prior work assumes such pairs exist and focuses on modality fusion, but it empirically requires huge engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate them in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for X-sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning by a scalable solution that breaks through the bottlene
136core_reconstruction
high
Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; generation_editingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractFeed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, "Off The Grid" distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine de
137core_reconstruction
high
P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; robotics_mappingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination, especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency. To address this limitation,
138core_reconstruction
high
PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; generation_editing; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractVolumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, suffer from temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlases, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions,
139core_reconstruction
high
Part$^{2}$GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractArticulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part$^{2}$GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part$^{2}$GS leverages a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometry. To ensure physically consistent motion, we propose a motion-aware canonical representation guided by physics-based constraints, including contact enforcement, velocity consistency, and vector-field alignment. Furthermore, we introduce a field of repel points to prevent part collisions and maintain stable articulation paths, significantly improving
140core_reconstruction
high
ParticleGS: Learning Neural Gaussian Particle Dynamics from Videos for Prior-free Physical Motion Extrapolation
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; generation_editingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractThe ability to extrapolate dynamic 3D scenes beyond the observed timeframe is fundamental to advancing physical world understanding and predictive modeling. Existing dynamic 3D reconstruction methods have achieved high-fidelity rendering of temporal interpolation, but typically lack physical consistency in predicting the future. To overcome this issue, we propose ParticleGS, a physics-based framework that reformulates dynamic 3D scenes as physically grounded systems. ParticleGS comprises three key components: 1) an encoder that decomposes the scene into static properties and initial dynamic physical fields; 2) an evolver based on Neural Ordinary Differential Equations (Neural ODEs) that learns continuous-time dynamics for motion extrapolation; and 3) a decoder that reconstructs 3D Gaussians from evolved particle states for rendering. Through this design, ParticleGS integrates physical re
141core_reconstruction
high
PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; generation_editingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing - two tasks that are inherently challenging due to ambiguity in plausible explanations for the same input. At the heart of our approach lies our novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely-adopted low-level losses like LPIPS, SSIM or L1, we rely on deep visual understanding of images and the resulting generalized supervision signals. We show that our new loss can be a drop-in replacement for standard losses and used to improve visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view-consistency and in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in
142core_reconstruction
high
PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractUnderstanding physical properties such as friction, stiffness, hardness, and material composition is essential for enabling robots to interact safely and effectively with their surroundings. However, existing 3D reconstruction methods focus on geometry and appearance and cannot infer these underlying physical properties. We present PhysGS, a Bayesian-inferred extension of 3D Gaussian Splatting that estimates dense, per-point physical properties from visual cues and vision-language priors. We formulate property estimation as Bayesian inference over Gaussian splats, where material and property beliefs are iteratively refined as new observations arrive. PhysGS also models aleatoric and epistemic uncertainties, enabling uncertainty-aware object and scene interpretation. Across object-scale (ABO-500), indoor, and outdoor real-world datasets, PhysGS improves accuracy of the mass estimation by
143core_reconstruction
high
Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractHigh dynamic range novel view synthesis (HDR-NVS) reconstructs scenes with dynamic details by fusing multi-exposure low dynamic range (LDR) views, yet it struggles to capture ambient illumination-dependent appearance. Implicitly supervising HDR content by constraining tone-mapped results fails in correcting abnormal HDR values, and results in limited gradients for Gaussians in under/over-exposed regions. To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. PhysHDR-GS employs a complementary image-exposure (IE) branch and Gaussian-illumination (GI) branch to faithfully reproduce standard camera observations and capture illumination-dependent appearance changes, respectively. During training, the proposed cross-branch HDR consistency loss provides explicit supervision fo
144core_reconstruction
high
Plug-and-Play PDE Optimization for 3D Gaussian Splatting: Toward High-Quality Rendering and Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction by achieving high-quality novel view synthesis with fast rendering speed, introducing 3D Gaussian primitives to represent the scene. However, 3DGS encounters blurring and floaters when applied to complex scenes, caused by the reconstruction of redundant and ambiguous geometric structures. We attribute this issue to the unstable optimization of the Gaussians. To address this limitation, we present a plug-and-play PDE-based optimization method that overcomes the optimization constraints of 3DGS-based approaches in various tasks, such as novel view synthesis and surface reconstruction. Firstly, we theoretically derive that the 3DGS optimization procedure can be modeled as a PDE, and introduce a viscous term to ensure stable optimization. Secondly, we use the Material Point Method (MPM) to obtain a stable numerical
145core_reconstruction
medium
PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting
3D Vision & Geometry / Point Cloud
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mappingdirect reconstruction/3DGS/4D title linked to core representation cluster
abstractUnsupervised point cloud segmentation is critical for embodied intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. Integrating 2D pre-trained models such as SAM to supplement semantic information is a natural choice, yet this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations and achieve semantic-consistent segmentation, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds
146core_reconstruction
high
PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractPolarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but face challenges in slow training, inefficient rendering, and strong dependencies on material/viewpoint assumptions. However, 3D Gaussian Splatting (3DGS) enables real-time rendering yet struggles with accurate reflection reconstruction due to reflection-geometry entanglement, and adding a deferred reflection module introduces environment map dependence. We address these limitations by proposing PolarGuide-GSDR, a polarization-forward-guided paradigm establishing a bidirectional coupling mechanism between polarization and 3DGS: first, 3DGS’s geometric priors are leveraged to resolve polarization ambiguity, and then the refined polarization cues are used to guide 3DGS’s normal and spherical harmonic representation. This process achieves high-fidelity reflection separation an
147core_reconstruction
high
Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractOmnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse points priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D–3D correspondences between the reconstructed Gaussians and the unposed images using Gaussians' internal depth priors. Besides, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photoreali
148core_reconstruction
medium
PoseGaussian: 6D Pose Estimation for Unseen Objects via Sparse-View Object-Level 3D Gaussian Splatting
3D Vision & Geometry / Pose Estimation
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstract6D pose estimation is a key technology in computer vision and robotic manipulation. However, many methods remain heavily dependent on CAD models that are difficult to obtain. Object-level 3D reconstruction provides an alternative route, and 3D Gaussian Splatting (3DGS) shows convincing potential owing to its training and rendering efficiency. Nevertheless, under sparse reference views, 3DGS is prone to floating artifacts and appearance overfitting, which weakens the stability of pose estimation. We present PoseGaussian, a method for sparse-view 6D pose estimation for unseen objects that builds on improved 3DGS. First, we use sparse RGB-D views to inject a depth structure prior into the 3DGS initialization for stable structure, and we adopt adaptive density control, view-warping augmentation, and joint photometric–depth supervision to reduce floaters and appearance overfitting under sparse
149core_reconstruction
high
Radiance Meshes for Volumetric Reconstruction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; generation_editingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe introduce Radiance Meshes for representing radiance fields with constant density tetrahedral cells produced with a Delaunay tetrahedralization. Unlike a Voronoi diagram, a Delaunay tetrahedralization yields simple triangles that are natively supported by existing hardware. As such, our model is able to perform exact and fast volume rendering using both rasterization and ray-tracing. We introduce a new rasterization method that achieves faster rendering speeds than all prior radiance field representations (assuming an equivalent number of primitives and resolution) across a variety of platforms. Optimizing the positions of Delaunay vertices introduces topological discontinuities (edge flips). To solve this, we use a Zip-NeRF-style backbone which allows us to express a smoothly varying field even when the topology changes. Our rendering method exactly evaluates the volume rendering equation
150core_reconstruction
high
REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractArticulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consisten
151core_reconstruction
high
RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; robotics_mapping; generation_editing; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has become a standard approach to reconstruct and render photorealistic 3D head avatars. A major challenge is to relight the avatars to match any scene illumination. For high quality relighting, existing methods require subjects to be captured under complex time-multiplexed illumination, such as one-light-at-a-time (OLAT). We propose a new generalized relightable 3D Gaussian head model that can relight any subject observed in single- or multi-view images without requiring OLAT data for that subject. Our core idea is to learn a mapping from flat-lit 3DGS avatars to corresponding relightable Gaussian parameters for that avatar. Our model consists of two stages: a first stage that models flat-lit 3DGS avatars without OLAT lighting, and a second stage that learns the mapping to physically-based reflectance parameters for high-quality relighting. This two-stage
152core_reconstruction
high
ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractHigh-quality 3D garment reconstruction plays a crucial role in mitigating the sim-to-real gap in applications such as digital avatars, virtual try-on and robotic manipulation. However, existing garment reconstruction methods typically rely on unstructured representations, such as 3D Gaussian Splats, which struggle to provide accurate reconstructions of garment topology and sewing structures. As a result, the reconstructed outputs are often unsuitable for high-fidelity physical simulation. We propose ReWeaver, a novel framework for topology-accurate 3D garment and sewing pattern reconstruction from sparse multi-view RGB images. Given as few as four input views, ReWeaver predicts seams and panels as well as their connectivities in both the 2D UV space and the 3D space. The reconstructed seams and panels align precisely with the input images, and can be easily convert
153core_reconstruction
high
RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractNeural fields (NFs) have achieved remarkable success in scene reconstruction and novel view synthesis. However, existing NF approaches that rely on RGB or LiDAR inputs often struggle under adverse weather conditions, limiting their robustness in real-world outdoor environments such as autonomous driving. In contrast, millimeter-wave radar is inherently resilient to environmental variations, yet its integration with NFs remains largely underexplored. Moreover, outdoor driving scenes frequently involve dynamic objects, making spatiotemporal modeling crucial for temporally consistent novel view synthesis. To address these challenges, we present RF4D, a radar-based neural field framework tailored for novel view synthesis in outdoor dynamic scenes. RF4D explicitly incorporates temporal information into its representation, enabling more accurate modeling of object motion. A dedicated s
154core_reconstruction
high
RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReconstructing people, objects, and their interactions in 3D is a long-standing and fundamental goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a novel three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and
155core_reconstruction
high
S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractExplicit 3D representations have already become an essential medium for 3D simulation and understanding. However, the most commonly used point cloud and 3D Gaussian Splatting (3DGS) each suffer from non-photorealistic rendering and significant degradation under sparse inputs. In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs. Specifically, the S2D lifting is two-fold. We first present an efficient one-step diffusion model that lifts sparse point cloud for high-fidelity image artifact fixing. Meanwhile, to reconstruct 3D consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradient for robust model fitting from sparse input views to dense novel views. Extensive experiments show that S2D achieves the best consis
156core_reconstruction
medium
ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; generation_editing; data_benchmarkcore genus=3D Reconstruction, but title/abstract signal is narrower
abstractRecent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned by specifying inputs such as 3D bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images
157core_reconstruction
high
SDGS: Spatial Difference Guided Gaussian Splatting for Simultaneous Localization and 3D Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit 3D representation, enabling photorealistic and real-time novel view synthesis. However, most 3DGS pipelines still assume precomputed camera poses and offline optimization, which introduces latency and makes them brittle in fast-motion, real-world scenarios. Existing online 3DGS systems mostly fall into two camps: (1) hybrid systems that rely on a separate traditional SLAM system for camera poses and optimize Gaussians decoupled from tracking, increasing system complexity; and (2) purely Gaussian-based systems that estimate poses from dense photometric errors, requiring repeated rendering of a large number of Gaussians and thus incurring high computational cost. Moreover, current online methods are often sensitive to motion blur and high dynamic range scenes, limiting their applicability in practice. We address these
158core_reconstruction
high
Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; generation_editingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractCurrent generation scene reconstruction methods like 3D Gaussian Splatting are capable of producing photo-realistic novel view synthesis at real-time speeds, yet see only limited adoption in many practical graphics applications. One significant contributing factor to this gap is the difficulty of interacting with and editing these representations in comparison to classic human-authored 3D assets. While work has been done to impose semantic decomposition onto these representations, there are still significant limitations in the quality and consistency of these segmentations. We address this by proposing a semantically decomposed variant of the recently introduced Radiant Foam method. Our approach, Semantic Foam, combines the natural spatial volumetric decomposition provided by Radiant Foam's Voronoi mesh with an explicit semantic feature field parameterized on the cells. The explicit mesh stru
159core_reconstruction
high
SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. Unlike existing 3D Gaussian Splatting (3DGS) based methods that focus on object-centric reconstruction and fail to work under sparse view settings, our method achieves high-quality geometry reconstruction and accurate disentanglement of material and illumination. The core idea is to construct a dense and geometry-consistent Gaussian semantic field guided by semantic and geometric priors, providing a reliable foundation for subsequent inverse rendering. Building upon this, we perform material–illumination disentanglement by combining a hybrid illumination model and material prior to effectively capture illumination–material interactions. To mitigate the impact of cast shadows and enhance the robustness of material recovery, we introduce illumination-invariant material constrai
160core_reconstruction
high
SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractNovel view synthesis of dynamic scenes is fundamental to achieving photorealistic 4D reconstruction and immersive visual experiences. Recent progress in Gaussian-based representations has significantly improved real-time rendering quality, yet existing methods still struggle to maintain a balance between long-term static and short-term dynamic regions in both representation and optimization. To address this, we present SharpTimeGS, a lifespan-aware 4D Gaussian framework that achieves temporally adaptive modeling of both static and dynamic regions under a unified representation. Specifically, we introduce a learnable lifespan parameter that reformulates temporal visibility from a Gaussian-shaped decay into a flat-top profile, allowing primitives to remain consistently active over their intended duration and avoiding redundant densification. In addition, the learned lifespan modulates each
161core_reconstruction
high
Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractReconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms. Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then
162core_reconstruction
high
SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. Besides, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy b
163core_reconstruction
high
SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReconstructing dynamic fluids from sparse views is a long-standing and challenging problem, due to the severe lack of 3D information from insufficient view coverage. While several pioneering approaches have attempted to address this issue using differentiable rendering or novel view synthesis, they are often limited by time-consuming optimization under ill-posed conditions. We propose SmokeSVD, an efficient and effective framework to progressively reconstruct dynamic smoke from a single video by integrating the generative capabilities of diffusion models with physically guided consistency optimization. Specifically, we first propose a physically guided side-view synthesizer based on diffusion models, which explicitly incorporates velocity field constraints to generate spatio-temporally consistent side-view images frame by frame, significantly alleviating the ill-posedness of single-view
164core_reconstruction
high
SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; robotics_mappingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractArticulated 3D objects are critical for embodied AI, robotics, and interactive scene understanding, yet creating simulation-ready assets remains labor-intensive and requires expert modeling of part hierarchies and motion structures. We introduce SPARK, a framework for reconstructing physically consistent, kinematic part-level articulated objects from a single RGB image. Given an input image, we first leverage VLMs to extract coarse URDF parameters and generate part-level reference images. We then integrate the part-image guidance and the inferred structure graph into a generative diffusion transformer to synthesize consistent part and complete shapes of articulated objects. To further refine the URDF parameters, we incorporate differentiable forward kinematics and differentiable rendering to optimize joint types, axes, and origins under VLM-generated open-state supervision. Extensive exp
165core_reconstruction
high
Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractNeural Radiance Fields (NeRF) have shown remarkable success in image novel view synthesis (NVS), inspiring extensions to LiDAR NVS. However, most methods heavily rely on accurate camera poses for scene reconstruction. The sparsity and textureless nature of LiDAR data also present distinct challenges, leading to geometric holes and discontinuous surfaces. To address these issues, we propose SG-NLF, a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. Specifically, we design a hybrid representation based on spectral priors to reconstruct smooth geometry. For pose optimization, we construct a confidence-aware graph based on feature compatibility to achieve global alignment. In addition, an adversarial learning strategy is introduced to enforce cross-frame consistency, thereby enhancing reconstruction quality. Comprehensive experiments demonstrate
166core_reconstruction
high
Splatent: Splatting Diffusion Latents for Novel View Synthesis
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRadiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: The VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D spa
167core_reconstruction
high
SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4dcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively app
168core_reconstruction
high
SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D super-resolution (3DSR) aims to reconstruct high-resolution (HR) 3D scenes from low-resolution (LR) multi-view images. Existing methods rely on dense LR inputs and per-scene optimization, which restricts the high-frequency priors for constructing HR 3D Gaussian Splatting (3DGS) to those inherited from pretrained 2D super-resolution (2DSR) models. This severely limits reconstruction fidelity, cross-scene generalization, and real-time usability. We propose to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. This fundamentally changes how 3DSR acquires high-frequency knowledge and enables robust generalization to unseen scenes. Specifically, we introduce SR3R, a feed-forward framework that directly predi
169core_reconstruction
high
STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractReconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gaussians to mesh triangles and model deformations solely via Linear Blend Skinning, which results in rigid motion and limited expressiveness. Moreover, they struggle to reconstruct frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image- and FLAME-based priors to learn per-Gaussian feature offsets within the UV space. This UV representation supports dynamic resampling, ensuring full compatibility with Adaptive Density Control (ADC) and enhanced adaptability to geometric and textural variations. (2) a Temporal ADC strategy, which first clusters
170core_reconstruction
high
SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractReconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object’s motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven
171core_reconstruction
high
Tavatar: Topology-Aware Gaussian Attribute Derivation for Animatable Human Avatars
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; robotics_mappingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractReconstructing high-fidelity, animatable human avatars from monocular videos remains a critical challenge. Existing 3DGS-based human animation methods constrain Gaussian parameters but exclude scale, which we argue is crucial for adapting human poses to challenging out-of-distribution poses. To achieve robust animation under unseen poses, we propose Tavatar, which derives key parameters such as scale, rotation, and other geometric attributes directly from the local mesh geometry, instead of learning them through unconstrained optimization. This paradigm shift enforces topological consistency by design, as each Gaussian is analytically anchored to the local mesh geometry, inheriting its spatial structure and deformation behavior. Specifically, we bind Gaussians to mesh faces and vertices, deriving their scales and orientations from triangle properties and local edge lengths to ensure cohe
172core_reconstruction
high
TokenSplat: Token-aligned 3D Gaussian Splatting for Feed-forward Pose-free Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present TokenSplat, a feed-forward framework for joint 3D Gaussian reconstruction and camera pose estimation from unposed multi-view images. At its core, TokenSplat introduces a Token-aligned Gaussian Prediction module that aligns semantically corresponding information across views directly in the feature space. Guided by coarse token positions and fusion confidence, it aggregates multi-scale contextual features to enable long-range cross-view reasoning and reduce redundancy from overlapping Gaussians. To further enhance pose robustness and disentangle viewpoint cues from scene semantics, TokenSplat employs learnable camera tokens and an Asymmetric Dual-Flow Decoder (ADF-Decoder) that enforces directionally constrained communication between camera and image tokens. This maintains clean factorization within a feed-forward architecture, enabling coherent reconstruction and stab
173core_reconstruction
high
Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; generation_editing; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractGenerating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance. We present Track4DGen, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that cur
174core_reconstruction
high
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model’s capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforwar
175core_reconstruction
high
TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractNovel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrat
176core_reconstruction
high
Uika: Universal Head Avatar from Pose-Free Images
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike the traditional avatar method, which requires a studio-level multi-view capture system and reconstructs a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise UV coordinate estimation. Such UV coordinate estimation allows us to project each valid pixel from screen space to UV space, which is independent of camera pose and character expression. We thus leverage this UV space to represent our Gaussian head avatar. To this end, we design learnable UV tokens on which the attent
177core_reconstruction
high
Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe propose Unblur-SLAM, an RGB SLAM pipeline for sharp 3D reconstruction from blurred image inputs. In contrast to previous work, our approach is able to handle different types of blur and demonstrates state-of-the-art performance in the presence of both motion blur and defocus blur. Moreover, we adjust the computation effort with the amount of blur in the input image. As a first stage, our method uses a feed-forward image deblurring model for which we propose a suitable training scheme that can improve both tracking and mapping modules. Frames that are successfully deblurred by the feed-forward network obtain refined poses and depth through local-global multi-view optimization and loop closure. Frames that fail the first stage deblurring are directly modeled through the global 3DGS representation and an additional blur network to model multiple blurred sub-frames and simulate the blur for
178core_reconstruction
high
Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; robotics_mappingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. The resulting visibility field is integrated into a Bayesian Network–based uncertainty-aware volume rendering process, enabling real-time (200 FPS) uncertainty quantification for synthesized views. Active mapping is further performed within a maximum information gain framework building on this formulation. Extensive experiments across diverse environments demonstrate that GAVIS consistently and signifi
179core_reconstruction
high
Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce a novel feed-forward framework that reconstructs 3D scenes from unposed multi-view images. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction—all within a single, feed-forward pass. Extensive experiments demonstrate this method establishes a new state-of-the-art across multiple benchmarks, including RE10K and ScanNet. Our work signifies a novel paradigm towards generalizable 3D scene reconstruction.
180core_reconstruction
high
VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones, incapable of reconstructing geometry from missing structures. To address these limitati
181core_reconstruction
high
VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractSimultaneous Localization and Mapping (SLAM) with 3D Gaussian Splatting (3DGS) enables fast, differentiable rendering and high-fidelity reconstruction across diverse real-world scenes. However, existing 3DGS-SLAM approaches handle measurement reliability implicitly, making pose estimation and global alignment susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties. To this end, we introduce VarSplat, an uncertainty-aware 3DGS-SLAM system that explicitly learns per-splat appearance variance. By using the law of total variance with alpha compositing, we then compute a corresponding differentiable per-pixel uncertainty map. This variance map guides tracking, submap registration, and loop detection toward focusing on reliable regions and contributes to more stable optimization. Experimental results on Replica (synthetic) and TUM-RGBD, Sca
182core_reconstruction
high
VDFE: Difference-Aware 3D Scene Editing with Non-Intrusive Video Diffusion Priors for Multi-View Consistency and Efficiency
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; generation_editing; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractText-driven 3D editing, enabled by advancements in 3D reconstruction techniques such as NeRF and 3D Gaussian Splatting, aims to provide intuitive scene customization. However, existing methods frequently exhibit limitations in controllability and consistency. To address these shortcomings, we propose VDFE, a difference-aware 3D scene editing method based on non-intrusive utilization of pre-trained video diffusion priors, which integrates Optimal Control Guided Flow Editing (FlowOCE), Decoupled Flow Difference (DFD), and Difference-Aware Gaussians Editing (DAGE). Specifically, FlowOCE treats the editing process as an optimal control problem, optimizing a noise-free editing trajectory to minimize unintended modifications in non-target regions; DFD precisely locates the editing region by analyzing flow differences, which supplies priors for the subsequent optimization process; and DAGE
183core_reconstruction
high
Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractExisting single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow
184core_reconstruction
high
GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localizationcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWhile 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels, which are prevalent in everyday environments. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and separates outgoing radiance into reflection and transmission components according to its optical properties, enabling coherent Gaussian radiance transport. During the optimization, GLINT bootstraps transparency localization by utilizing geometry separation cues that emerge from our decomposition with the geometry and material priors from a pre-trained video relighting model. Extensive experime
185core_reconstruction
high
MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractManual modeling of material parameters and 3D geometry is a time consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a
186core_reconstruction
high
Multi-view Pyramid Transformer: Look Coarser to See Broader
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the un
187core_reconstruction
high
Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondencecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) represents scenes through primitives with coupled intrinsic properties: geometric attributes (position, covariance, opacity) and appearance attributes (view-dependent color). Faithful reconstruction requires intrinsic geometry-appearance consistency, where geometry accurately captures 3D structure while appearance reflects photometry. However, sparse observations lead to appearance overfitting and underconstrained geometry, causing severe novel-view artifacts. We present ICO-GS (Intrinsic Geometry-Appearance Consistency Optimization for 3DGS), a principled framework that enforces this consistency through tightly coupled geometric regularization and appearance learning. Our approach first regularizes geometry via feature-based multi-view photometric constraints by employing pixel-wise top-k selection to handle occlusions and edge-aware smoothness to preserve sh
188core_reconstruction
high
RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractHumans perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry unmodeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG n
189core_reconstruction
medium
Motion-Aware Animatable Gaussian Avatars Deblurring
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; data_benchmarkcore genus=3D Gaussian Splatting, but title/abstract signal is narrower
abstractThe creation of 3D human avatars from multi-view videos is a significant yet challenging task in computer vision. However, existing techniques rely on high-quality, sharp images as input, which are often impractical to obtain in real-world scenarios due to variations in human motion speed and intensity. This paper introduces a novel method for directly reconstructing sharp 3D human Gaussian avatars from blurry videos. The proposed approach incorporates a 3D-aware, physics-based model of blur formation caused by human motion, together with a 3D human motion model designed to resolve ambiguities in motion-induced blur. This framework enables the joint optimization of the avatar representation and motion parameters from a coarse initialization. Comprehensive benchmarks are established using both a synthetic dataset and a real-world dataset captured with a 360-degree synchronous hybrid-expos
190core_reconstruction
high
PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractMulti-view 3D reconstruction methods remain highly sensitive to photometric inconsistencies arising from camera optical characteristics and variations in image signal processing (ISP). Existing mitigation strategies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly to novel views. We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. A dedicated PPISP controller, trained on the input views, predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This design enables realistic and fair evaluation on novel views without access to ground-truth images. PPISP achieves SoTA performance on standard benchmarks, while providing intuitive control a
191core_reconstruction
medium
High-Fidelity Mobile Avatars with Pruned Local Blendshapes
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractWe propose a method to reconstruct high-fidelity human avatars from multi‑view video that can run on mobile devices. Many works can model high‑quality Gaussian-based full-body avatars from multi‑view video. However, these methods require heavy computation to obtain pose‑dependent appearance, making deployment on mobile devices very difficult. Recent methods distill from pretrained models and model pose‑dependent nonlinear Gaussian attributes by linearly combining global pose features with blendshapes. Although they can run on mobile devices, they suffer some loss of detail. We observe that nearby Gaussians are often highly correlated within a local region of the body, and can be linearly modeled with less error. Therefore, we use local linear blendshapes in small body parts to capture global nonlinear changes of Gaussian attributes. To further reduce computation and model size, we propos
192core_reconstruction
medium
Learning Convex Decomposition via Feature Fields
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractThis work proposes a new formulation to the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world learning of convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications. The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity. Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets resulting in the first learned open-world for convex decompositio
193core_reconstruction
high
EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractFeed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per-view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT), where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF), where our model learns to rank primitives and adaptively adjust their param
194core_reconstruction
high
More Natural, More Real: Object-aware Gaussian Splatting for 3D Visual Decoding from Human Brain
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractExploring human visual perception and understanding of the stereoscopic world represents a significant topic in computational neuroscience. Recent studies have provided rich Brain-3D datasets and conducted preliminary explorations into 3D visual reconstruction. However, existing research struggles to capture the differences in dynamic changes of 3D stimulus views, and there remains room for improvement in high-fidelity reconstruction and rendering. 3D Gaussian Splatting (3DGS) has recently achieved significant progress in stereoscopic view synthesis. Inspired by it, we propose BrainGS -- an innovative framework for decoding more realistic 3D objects from the brain. BrainGS incorporates a Fusion Time-Spatial Network to achieve comprehensive encoding of the brain; combined with the Multi-Attribute Controller (MAC), it decouples features using visual, semantic, and color as anchors, effectivel
195core_reconstruction
high
CoRoGS: Contextual Gaussian Splatting for Robust Large-Deviation View Synthesis
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractNovel view synthesis (NVS) under large view deviations remains an underexplored challenge for 3D Gaussian Splatting (3DGS). In urban scenes with limited training coverage, models often fail to maintain geometric consistency when extrapolating to unseen viewpoints, resulting in severe distortions and degraded rendering quality. We introduce Context-Aware Gaussian Splatting (CoRoGS), a Context-aware framework for Robust large-deviation novel view synthesis (LD-NVS) that embeds contextual reasoning into 3DGS. Instead of treating Gaussians as independent primitives, CoRoGS adopts a contextual formulation that explicitly models inter-Gaussian dependencies. This representation is implemented by constructing a 3D Gaussian graph, which propagates relational geometry and semantics via message passing, resulting in context-aware Gaussian updates. To further maintain structura
196core_reconstruction
high
DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) Black borders at image edges cause information loss and negate the fisheye’s large FOV advantage; 2) Undistortion’s stretch‐and‐interpolate resampling spreads each pixel’s value over a larger area, diluting detail density and causing 3DGS to overfit these low‐frequency zones, producing blur and floating artifacts. In this work, we integrate a fisheye camera model into the original 3DGS framework, enab
197core_reconstruction
high
Evidential Neural Radiance Fields
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractUnderstanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare mult
198core_reconstruction
high
LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractNovel View Synthesis has often relied on explicit 3D representations, which inject a strong 3D bias in the process; however, recent work has shown that network-based rendering can work better despite lacking 3D inductive biases. In this paper, we show that much better quality can be obtained by leveraging a strong 3D bias without a 3D representation. To do so, we introduce LagerNVS, an encoder-decoder network that uses 3D-aware features as a latent scene encoding. The encoder is initialized from a 3D reconstruction network, paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis results (including 31.1 PSNR on Re10k), with and without known cameras, renders in real-time, generalizes to in-the-wild data without known cameras, and can be paired with a diffusion decoder for generati
199core_reconstruction
high
Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present an approach for high-quality dynamic Gaussian Splatting from monocular videos. To this end, we go one step beyond previous methods and explicitly model continuous position and orientation deformation of dynamic Gaussians, using SE(3) B-spline motion bases with a compact set of control points. To improve computational efficiency while enhancing the ability to model complex motions, an adaptive control mechanism is devised to dynamically adjust the number of motion bases and control points. Besides, we develop a soft segment reconstruction strategy to mitigate long-interval motion interference, and employ a multi-view diffusion model to provide multi-view cues for avoiding overfitting to training views. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in novel view synthesis. Our code and trained model will be made pub
200core_reconstruction
high
NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe present NimbusGS, a unified framework for reconstructing high-quality 3D scenes from degraded multi-view inputs captured under diverse and mixed adverse weather conditions. Unlike existing methods that target specific weather types, NimbusGS addresses the broader challenge of generalization by modeling the dual nature of weather: a continuous, view-consistent medium that attenuates light, and dynamic, view-dependent particles that cause scattering and occlusion. To capture this structure, we decompose degradations into a global transmission field and per-view particulate residuals. The transmission field represents static atmospheric effects shared across views, while the residuals model transient disturbances unique to each input. To enable stable geometry learning under severe visibility degradation, we introduce a geometry-guided gradient scaling mechanism that mitigates gradient i
201core_reconstruction
high
RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has emerged as a leading technology for high-quality 3D scene reconstruction. However, the iterative refinement and densification process leads to the generation of a large number of primitives, each contributing to the reconstruction to a substantially different extent. Estimating primitive importance is thus crucial, both for removing redundancy during reconstruction and for enabling efficient compression and transmission. Existing methods typically rely on rendering-based analyses, where each primitive is evaluated through its contribution across multiple camera viewpoints. However, such methods 1) are sensitive to the number and selection of views; 2) rely on specialized differentiable rasterizers; and 3) have long calculation times that grow linearly with view count, making them difficult to integrate as plug-and-play modules, as well as resulting in limi
202core_reconstruction
high
ReLaGS: Relational Language Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractAchieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language-derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated a
203core_reconstruction
high
ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractThe ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that, the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby achieving results that outperform the performance of optimization-based
204core_reconstruction
high
PromptDepth: Efficient and Promptable Geometric 3D Vision Model for Embodied Intelligence
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractVision models for embodied intelligence require efficient 3D comprehension and interaction with objects within the scene. Existing 3D reconstruction models either overlook instance-level perception or rely on time-consuming offline reasoning, showing less adaptability in real-time embodied scenarios. In this paper, we present PromptDepth, the first promptable vision model that features both geometric 3D understanding and instance-level interaction especially designed for embodied intelligence. PromptDepth is a feed-forward network that quickly yields panoptic, instanced, or tracked depth maps from two corresponding frames, enabling real-time inference on sequences from embodied agents. Specifically, following the minimal prediction problem, we design a promptable Dense Prediction Transformer, making it flexible to interact with unified dense prediction according to a specific prompt. Consid
205core_reconstruction
medium
Representing 3D Faces with Learnable B-Spline Volumes
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mappingcore genus=3D Reconstruction, but title/abstract signal is narrower
abstractWe present CUBE (Control-based Unified B-Spline Encoding), a new geometric representation for digital humans that combines B-Spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-Spline representations that use 3D control points, CUBE is parametrized by a lattice (e.g., $8 \times 8 \times 8$) of high-dimensional control features, increasing the model's expressivity. These control features define a continuous mapping from a 3D parametric domain to 3D Euclidean space through an intermediate feature space, which is evaluated in two stages. First, high-dimensional control features are locally blended using the B-Spline bases, yielding a high-dimensional feature vector, where the first three values are the 3D coordinates of a coarse base mesh. This feature vector is input to a small MLP to pre
206core_reconstruction
medium
SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractMonocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustm
207core_reconstruction
high
SMVRT: Implicit Human 3D Modeling Using Sparse Multi-view Volumetric Reconstruction with Transformer Fusion
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractRecently, the community has witnessed significant progress in human modeling from a single view or multi-views, which often involves "guessing" the occluded parts using either generative models or template fitting. In this work, we address these challenges by exploring optimal fusion strategies from sparse views only. We propose an end-to-end implicit 3D reconstruction framework using a sparse multi-view setup. Specifically, we achieve this by exploring fusion blocks at three stages of the network. First, 2D feature encoders operate locally and globally to produce enhanced features. Second, a 3D feature grid is formed by attentional fusion of warped multi-view and multi-level 2D features, followed by 3D regularization of the feature grids to aggregate spatially coherent multi-view features. Third, attentional 2D-3D feature aggregation associated with each query point generates enhanced latent
208core_reconstruction
high
OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
B. bridge: reconstruction becomes mapping/world modelgaussian_radiance; depth_correspondence; robotics_mapping; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractOpen-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build a local consistency map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within the sliding window into complete instances. Subsequently, to update the global map, we construct explicit spatial attribute grids for the
209core_reconstruction
high
Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; surface_occupancy; robotics_mappingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present FunREC, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunREC operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunREC surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5$-$10$\times$ lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation
210core_reconstruction
high
X-Part: High Fidelity And Structure Coherent Shape Decomposition And Completion
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; surface_occupancy; robotics_mappingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractGenerating 3D shapes at part level is pivotal for downstream applications such as mesh retopology, UV mapping, and 3D printing. However, existing part-based generation methods often lack sufficient controllability and suffer from poor semantically meaningful decomposition. To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. X-Part exploits the bounding box as prompts for the part generation and injects point-wise semantic features for meaningful decomposition. Furthermore, we design an editable pipeline for interactive part generation. Extensive experimental results show that X-Part achieves state-of-the-art performance in part-level shape generation. This work establishes a new paradigm for creating production-ready, editable, and structural
211core_reconstruction
high
AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM
3D Vision & Geometry / 3D Gaussian Splatting
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian splatting (3DGS) has emerged as a revolutionary scene representation in simultaneous localization and mapping (SLAM) research. However, existing research on 3DGS-based SLAM fails to accurately address the appearance variations induced by camera auto-exposure in prevalent real-world scenarios, resulting in reduced localization and photorealistic mapping accuracy. To address this issue, we propose a stereo auto-exposure-robust Gaussian splatting SLAM (AERGS-SLAM), a framework that is robust to such variations and enables both reliable localization and exposure-controlled photorealistic mapping. Our key contributions are twofold. Firstly, we propose a camera exposure network to model the camera exposure process, which we integrate with Gaussian splatting to achieve exposure-controlled novel view synthesis. Secondly, we exploit an illumination-robust geometric feature for localization a
212core_reconstruction
high
Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM
3D Vision & Geometry / 3D Gaussian Splatting
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mappingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractHandling dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling their temporal centers at keyframes. These centers are propagated usin
213core_reconstruction
high
ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM
3D Vision & Geometry / 3D Gaussian Splatting
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractThis work presents ODGS-SLAM, an omnidirectional simultaneous localization and mapping (SLAM) system utilizing 3D Gaussian Splatting (3DGS) as the unified representation for tracking and mapping. Thus, it reconstructs scene geometry from panoramic image sequences (RGB or RGBD) via splats while also detecting the camera poses. Such a framework is important to understand the full surrounding, *e.g.*, for augmented reality applications or autonomous systems. We extended existing 3DGS-SLAM methods to handle omnidirectional input by including closed-form gradients for mapping and camera pose estimation, utilizing an equirectangular projection model. To lower memory footprint, a key frame removal procedure based on graph analysis is proposed, enabling the application to handle larger input sizes. For evaluation, we provide a data set of controlled real-world and synthetic test scenes (indoor and ou
214core_reconstruction
high
SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
3D Vision & Geometry / 3D Gaussian Splatting
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has made huge progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified for improving scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian function, and then use these points to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity.
215core_reconstruction
medium
ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting
3D Vision & Geometry / Pose Estimation
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractVisual localization is a core technology for augmented reality and autonomous navigation. Recent methods combine the efficient rendering of 3D Gaussian Splatting (3DGS) with feature-based localization. These methods rely on direct matching between 2D query features and the 3D Gaussian feature field, but this often results in mismatches due to an inherent bias in the learned Gaussian feature. We theoretically analyze the feature learning process in 3DGS, revealing that the widely adopted $\alpha$-blending optimization inherently introduces bias into 3D point features. This bias stems from the entanglement between individual Gaussians and their neighboring Gaussians, making the learned features unsuitable for precise matching tasks. Motivated by these findings, we propose ULF-Loc, an unbiased landmark feature framework that replaces biased feature optimization with geometry-weighted featur
216core_reconstruction
high
Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
3D Vision & Geometry / 3D Gaussian Splatting
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localization; depth_correspondencecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address these uncertainties, we introduce a relocaliza
217core_reconstruction
high
Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localization; depth_correspondencecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractVisual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera’s pose when it revisits a previously known scene. While point-based hierarchical localization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on
218core_reconstruction
high
GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension
3D Vision & Geometry / 3D Gaussian Splatting
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localization; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractIn this paper, we propose GenSplat, a novel approach for language comprehension in 3D Gaussian Splatting (3DGS). Unlike previous methods that either achieve cross-scene generalization by being bounded to a predefined vocabulary or handle free-form language by overfitting to individual scenes, GenSplat is robust to free-form language queries and generalizable across 3DGS scene representations. Our key insight for this problem is to formulate a structured learning process to progressively align linguistic concepts with 3D Gaussians. It contains two novel technical contributions. First, we propose a Progressive Language Grounding Curriculum that structurally guides the model through learning category-level semantics to instance-level concepts and free-form language, preventing overfitting by building a generalizable language feature space. Second, we design a Multi-modal Large Language Mode
219core_reconstruction
high
GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localizationcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS^2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing
220core_reconstruction
high
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; surface_occupancy; generation_editing; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe introduce 3D-Fixer, a novel generalizable and efficient scheme for single-image to compositional 3D scene generation. Unlike existing feed-forward frameworks that lack generalization ability in open-set scenarios due to the limited dataset, or divide-and-conquer frameworks that suffer from slow inference or accumulated registration errors during layout alignment, 3D-Fixer extends pre-trained object-level 3D generation priors to perform in-place completion on the single-view estimated geometry, eliminating the need for pose alignment while preserving feed-forward efficiency. At its core, 3D-Fixer introduces a coarse-to-fine scheme to accurately determine the completion boundary and generate high quality completion 3D asset based on the single-view estimated fragmented geometry. Also, we design a dual-branch conditioning network that integrates 2D and 3D contextual information to guide
221core_reconstruction
high
AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present AMB3R, a multi-view feed-forward model for dense 3D reconstruction at metric scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation, 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.
222core_reconstruction
high
Cupid: Generative 3D Reconstruction via Joint Object and Pose Modeling
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe introduce Cupid, a generative 3D reconstruction framework that jointly models the full distribution over both canonical objects and camera poses. Our two-stage flow-based model first generates a coarse 3D structure and 2D-3D correspondences to estimate the camera pose robustly. Conditioned on this pose, a refinement stage injects pixel-aligned image features directly into the generative process, marrying the rich prior of a generative model with the geometric fidelity of reconstruction. This strategy achieves exceptional faithfulness, outperforming state-of-the-art reconstruction methods by over 3 dB PSNR and 10% in Chamfer Distance. As a unified generative model that decouples the object and camera pose, Cupid naturally extends to multi-view and scene-level reconstruction tasks without requiring post-hoc optimization or fine-tuning.
223core_reconstruction
high
Edges Compete for Trust: Group Relative Edge Optimization for Building Reconstruction from Point Clouds
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractBuilding reconstruction aims to extract compact wireframes from point clouds. Recent edge-based methods achieve impressive results but suffer from sparse supervision from one-to-one matching, which leaves most edge proposals under-optimized. In this paper, we present Group Relative Edge Optimization (GREO), the first attempt to incentivize dense supervision across edge proposals through reinforcement learning-style optimization in wireframe reconstruction. Specifically, GREO computes edge-level rewards based on geometric alignment quality and transforms them into target confidence distributions via group-wise normalization. In addition, we incorporate entropy regularization to maintain distributional stability and prevent confidence collapse. This joint optimization enables dense and discriminative supervision across all edge proposals through cross-entropy minimization. Experiments
224core_reconstruction
high
JRM: Joint Reconstruction Model for Multiple Objects without Alignment
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractObject-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned
225core_reconstruction
high
Long-Tail Internet Photo Reconstruction
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractInternet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed, while most real-world sites contain only sparse, noisy, and uneven imagery that defeats classical and learned 3D methods. Existing 3D foundation models generalize well to curated datasets but collapse under the sparsity, ambiguity, and irregularity of Internet photos. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large-scale, clean, and depth-refined dataset, together with a sparse-aware sampling strategy that mimics camera distributions in long-tai
226core_reconstruction
high
ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractJointly optimizing camera poses and object geometry from unposed images is a challenging task in neural surface reconstruction. Existing methods often suffer from pose drift and geometric distortion, stemming from the easy-view bias: uniform view optimization favors easy-to-optimize views with abundant texture and good overlap that dominate gradient updates, while hard-to-optimize counterparts with weak texture or limited overlap yet critical for geometric completeness are progressively marginalized. To address this, we propose ManifoldNeuS, a novel framework that explicitly models and leverages per-view optimizability to guide pose-free neural surface reconstruction. Specifically, we introduce the manifold-aware view optimizability score (MaVOS), which jointly assesses immediate fitness (the ease of optimizing each view) and long-term coverage gain (the value of optimizing each view)
227core_reconstruction
high
Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractThe 3D characterization of microstructures is crucial for understanding and designing functional materials. However, the scanning electron microscope (SEM), widely used in scientific research, captures only 2D electron intensity distributions. Existing SEM 3D reconstruction methods struggle with textureless regions, shadowing artifacts, and calibration dependencies, whereas advanced learning-based approaches fail to generalize to microscopic SEM domains due to the lack of physical priors and domain-specific data. To address these challenges, we introduce NFH-SEM, a neural field-based hybrid reconstruction framework that recovers high-fidelity 3D surfaces from multi-view, multi-detector SEM images. NFH-SEM integrates coarse multi-view geometry with photometric stereo cues from detector signals through a continuous neural field, incorporating a learnable forward model that embeds SEM imagi
228core_reconstruction
high
PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractPanoramic imagery offers a full $360^\circ$ field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth a
229core_reconstruction
medium
Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractRecent learning-based face reconstruction and registration frameworks such as ToFu and TEMPEH have shown that dense correspondence between facial scans and a common topology can be learned directly from images. However, these approaches still depend on precomputed registrations obtained through iterative optimization pipelines that often require manual verification and correction by human annotators. We introduce MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a fully differentiable and registration-free alternative. Instead of relying on optimization-based registrations, we employ a pseudo-linear inverse kinematic solver in conjunction with dense 2D keypoints produced by a tracker trained only on synthetic data to directly enforce a common face topology at the vertex level. We further find that the commonly used point-to-surface distance can lead to unstable training
230core_reconstruction
high
TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractTopology-consistent dynamic model sequences are essential for applications such as animation and model editing. However, existing 4D reconstruction methods face challenges in generating high-quality topology-consistent meshes. To address this, we propose a topology-aware dynamic reconstruction framework based on Gaussian Splatting. We introduce a Gaussian topological structure that explicitly encodes spatial connectivity. This structure enables topology-aware densification and pruning, preserving the manifold consistency of the Gaussian representation. Temporal regularization terms further ensure topological coherence over time, while differentiable mesh rasterization improves mesh quality. Experimental results demonstrate that our method reconstructs topology-consistent mesh sequences with significantly higher accuracy than existing approaches. Moreover, the resulting meshes enable prec
231core_reconstruction
medium
TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancy; generation_editingcore genus=3D Reconstruction, but title/abstract signal is narrower
abstractThe dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE's reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the representation mismatch between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (e.g., SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topologic
232core_reconstruction
high
UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real
233
core_reconstruction
medium
ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; depth_correspondence; surface_occupancy; generation_editing
core genus=3D Reconstruction, but title/abstract signal is narrower
abstract
Single-image-to-3D shape generation has seen remarkable progress, driven by latent diffusion models trained on the compressed latent space of 3D VAEs. However, the task remains intrinsically ill-posed: recovering complete 3D geometry—especially occluded surfaces—from a single view is inherently ambiguous. Existing VecSet-based approaches further exacerbate this challenge by treating shape tokens as an unordered set without explicit positional encoding. This design forces diffusion models to simultaneously learn visible correspondences from the input image and hallucinate invisible geometry within a large, permutation-invariant token space, where the lack of structure significantly hinders training efficiency and convergence stability. To address this, we propose Visibility Learning, a training paradigm that injects visibility structure and positional inductive bias into the image
234
core_reconstruction
high
ART: Articulated Reconstruction Transformer
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
We introduce ART, Articulated Reconstruction Transformer—a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as a part-based prediction problem. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable to standard simulation formats. Trained o
235
core_reconstruction
high
PE3R: Perception-Efficient 3D Reconstruction
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
Recent advances in 2D-to-3D perception have enabled the recovery of 3D scene semantics from unposed images. However, prevailing methods often suffer from limited generalization, reliance on per-scene optimization, and semantic inconsistencies across viewpoints. To address these limitations, we introduce PE3R, a tuning-free framework for efficient and generalizable 3D semantic reconstruction. By integrating multi-view geometry with 2D semantic priors in a feed-forward pipeline, PE3R achieves zero-shot generalization across diverse scenes and object categories without any scene-specific fine-tuning. Extensive evaluations on open-vocabulary segmentation and multi-view depth estimation show that PE3R not only achieves up to 9× faster inference but also sets new state-of-the-art accuracy in both semantic and geometric metrics. Our approach paves the way for scalable, language-driven 3D
236
core_reconstruction
high
PhyGaP: Physically-Grounded Gaussians with Polarization Cues
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representative
gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark
core genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract
Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated great success in modeling reflective 3D objects and their interaction with the environment via deferred rendering (DR). However, existing methods often struggle with correctly reconstructing physical attributes such as albedo and reflectance, and therefore they do not support high-fidelity relighting. Observing that this limitation stems from the lack of shape and material information in RGB images, we present PhyGaP, a physically-grounded 3DGS method that leverages polarization cues to facilitate precise reflection decomposition and visually consistent relighting of reconstructed objects. Specifically, we design a polarimetric deferred rendering (PolarDR) process to model polarization by reflection, and a self-occlusion-aware environment map building technique (GridMap) to resolve indirect lighting of non-convex o
237
core_reconstruction
high
SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; pose_calibration_localization; surface_occupancy
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
Sinusoidal neural networks (SIRENs) are powerful implicit neural representations (INRs) for low-dimensional signals in vision and graphics. By encoding input coordinates with sinusoidal functions, they enable high-frequency image and surface reconstruction. However, training SIRENs is often unstable and highly sensitive to frequency initialization: small frequencies produce overly smooth reconstructions in detailed regions, whereas large ones introduce spurious high-frequency components that manifest as noise in smooth areas such as image backgrounds. To address these challenges, we propose SASNet, a Spatially-Adaptive Sinusoidal Network that couples a frozen frequency embedding layer, which explicitly fixes the network’s frequency support, with jointly learned spatial masks that localize neuron influence across the domain. This pairing stabili
238
core_reconstruction
high
Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representative
gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark
core genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract
We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14×/16× (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across a
239
core_reconstruction
medium
Particulate: Feed-Forward 3D Object Articulation
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; generation_editing; data_benchmark
core genus=3D Reconstruction, but title/abstract signal is narrower
abstract
We introduce Particulate, a feed-forward model that, given a single static 3D mesh of an everyday object, predicts its 3D parts, kinematic structure, and articulation parameters. Unlike prior work on articulated 3D object modeling that is limited by costly per-object optimization and small retrieval databases or requires large vision or language foundation models, our approach is based on a flexible, scalable and lightweight transformer architecture. Trained on a diverse collection of articulated 3D assets from public datasets, Particulate accurately infers the articulated structure of novel objects, including those generated by image-to-3D models, in a single feed-forward pass. We further introduce a benchmark for articulated 3D object estimation curated from high-quality public 3D assets. Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art
240
core_reconstruction
high
SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; depth_correspondence; surface_occupancy
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
Learning-based Multi-View Stereo (MVS) methods have become the mainstream in the field, relying on the construction of cost volumes through multi-view feature similarity computation and regularization. However, existing methods depend heavily on photometric consistency across views, leading to poor performance in challenging regions, such as weakly textured or non-Lambertian surfaces. To overcome this limitation, we propose SPE-MVS, a novel MVS framework enhanced with Spatial Position Encoding (SPE). The SPE represents the 3D positional information of pixels in each image within a unified metric space, constructed using monocular depth priors. We integrate the SPE alongside image data as input and introduce a Photometric-Spatial Hybrid Feature Extractor, along with an SPE-enhanced cost volume construction module. These components incorporate spatial position-based similarity computation,
241
core_reconstruction
high
OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; generation_editing; data_benchmark
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality and editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two major limitations: (1) Fragility due to reliance on 3D geometry priors, which are difficult to estimate accurately, especially under facial occlusions or in stylized domains; and (2) A lack of semantic disentanglement, inhibiting region-specific texture editing and style transfer. Our work addresses both challenges simultaneously. Our core innovation is a geometry-free pipeline that directly maps a 2D face image to its corresponding editable UV texture. We introduce two key techniques: First, to address the challenge of UV misalignment common in diffusion generation, we introduce a gradient-guided refinement strategy at inference time, which explicitly corrects structural consistency. Sec
242
core_reconstruction
high
eRetinexGS: Retinex Modeling for Low-Light Scene Enhancement via Event Streams and 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representative
gaussian_radiance; dynamic_4d; data_benchmark
core genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract
Perception under low illumination remains a major challenge for computer vision systems, as RGB sensors often fail to capture sufficient structural and color information in extremely dark environments. Event cameras, with their high dynamic range and temporal resolution, provide complementary cues that are well suited for such conditions. In this work, we present eRetinexGS, a novel framework that jointly leverages event streams and low-light frames through 3D Gaussian Splatting for scene-level enhancement and reconstruction. Unlike previous approaches that operate on individual frames, eRetinexGS enforces geometric and photometric consistency across multiple views, bridging the gap between degraded images and noisy event signals. By introducing an event-assisted Retinex decomposition and a reflectance–illumination representation within the 3DGS pipeline, our method reconstructs normal-l
243
core_reconstruction
high
Global Structure-from-Motion Meets Feedforward Reconstruction
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; pose_calibration_localization; surface_occupancy
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
Structure-from-Motion -- the process of simultaneously estimating camera poses and 3D scene structure from a collection of images -- remains a central challenge in computer vision, with many open problems yet to be solved. Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited image overlap, and symmetries. However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, and robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new state-of-the-art Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. Exten
244
core_reconstruction
high
RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representative
gaussian_radiance; dynamic_4d; generation_editing
core genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract
Temporal retiming, the ability to reconstruct and render dynamic scenes at arbitrary timestamps, is crucial for applications such as slow-motion playback, temporal editing, and post-production. However, most existing 4D Gaussian Splatting (4DGS) methods overfit at discrete frame indices but struggle to represent continuous-time frames, leading to ghosting artifacts when interpolating between timestamps. We identify this limitation as a form of temporal aliasing and propose RetimeGS, a simple yet effective 4DGS representation that explicitly defines the temporal behavior of the 3D Gaussian and mitigates temporal aliasing. To achieve smooth and consistent interpolation, we incorporate optical flow–guided initialization and supervision, triple-rendering supervision, and other targeted strategies. Together, these components enable ghost-free, temporally coherent rendering even under large mo
245
core_reconstruction
medium
ArtLLM: Generating Articulated Assets via 3D LLM
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; data_benchmark
core genus=3D Reconstruction, but title/abstract signal is narrower
abstract
Creating interactive digital environments for gaming, robotics, and simulation relies on articulated 3D objects whose functionality emerges from their part geometry and kinematic structure. However, existing approaches remain fundamentally limited: optimization-based reconstruction methods require slow, per-object joint fitting and typically handle only simple, single-joint objects, while retrieval-based methods assemble parts from a fixed library, leading to repetitive geometry and poor generalization. To address these challenges, we introduce ArtLLM, a novel framework for generating high-quality articulated assets directly from complete 3D meshes. At its core is a 3D multimodal large language model trained on a large-scale articulation dataset curated from both existing articulation datasets and procedurally generated objects. Unlike prior work, ArtLLM autoregressively predicts a varia
246
core_reconstruction
medium
Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; generation_editing
core genus=3D Reconstruction, but title/abstract signal is narrower
abstract
Existing 3D editing methods rely on computationally intensive scene-by-scene iterative optimization and suffer from multi-view inconsistency. We propose an effective and fully feedforward 3D editing framework based on the TRELLIS generative backbone, capable of modifying 3D models from a single editing view. Our framework addresses two key issues: adapting training-free 2D editing to structured 3D representations, and overcoming the bottleneck of appearance fidelity in compressed 3D features. To ensure geometric consistency, we introduce Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that achieves globally consistent 3D deformation in a single pass. To restore photorealistic details, we develop a normal-guided single to multi-view generation module as an external appearance prior, successfully recovering high-frequency textures. Experiments demonstrate that our meth
247
core_reconstruction
high
LaS-Comp: Zero-shot 3D Completion with Latent–Spatial Consistency
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; data_benchmark
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage that ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and q
248
core_reconstruction
high
Unified Primitive Proxies for Structured Shape Completion
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; data_benchmark
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
Structured shape completion recovers missing geometry as primitives rather than as unstructured points, which enables primitive-based surface reconstruction. Instead of following the prevailing cascade, we rethink how primitives and points should interact, and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which in a single feed-forward pass predicts a set of primitives with complete geometry, semantics, and inlier membership. To drive this unified representation, we introduce primitive proxies, learnable queries that are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently ou
249
core_reconstruction
high
2D-LFM: Lifting Foundation Model without 3D supervision
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; depth_correspondence; surface_occupancy
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
Recent vision foundation models give the impression that 3D reconstruction from RGB is largely solved. Yet these systems struggle with object-specific 3D structure: the fine-grained geometry implied by an object’s landmarks or skeleton. In this paper, we show that when a model is given only 2D landmarks, it can recover more accurate 3D structure than state-of-the-art depth-from-RGB foundation models. Classical lifting approaches such as PAUL demonstrate this principle but do not scale beyond single categories, while methods like 3D-LFM scale but require extensive 3D supervision. We present the first lifting foundation model that learns object-specific 3D geometry using only 2D supervision. The key idea is to inject correspondence structure into the model via a positional encoding inspired by classical structure-from-motion. This simple inductive bias enables robust, object-agnostic 3D li
250
core_reconstruction
high
EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representative
gaussian_radiance; dynamic_4d; generation_editing
core genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract
Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Few-shot methods enable instant personalization by reconstructing high-fidelity avatars from only a few seconds of video. However, achieving natural talking-head generation further requires strong emotion-aware motion modeling, and existing few-shot approaches exhibit geometric instability and audio-emotion mismatch under expressive facial motion. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, which introduces strong geometric priors for stable and interpretable motion. Building upon this, we propose a Gated Residual Motion Network (GRMN), w
251
core_reconstruction
high
Dehallu3D: Hallucination-Mitigated 3D Generation from a Single Image via Cyclic View Consistency Refinement
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; generation_editing
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
Large 3D reconstruction models have revolutionized the 3D content generation field, enabling broad applications in virtual reality and gaming. Just like other large models, large 3D reconstruction models suffer from hallucinations as well, introducing structural outliers (e.g., odd holes or protrusions) that deviate from the input data. However, unlike other large models, hallucinations in large 3D reconstruction models remain severely underexplored, leading to malformed 3D-printed objects or insufficient immersion in virtual scenes. Such hallucinations mainly arise because existing methods reconstruct 3D content from sparsely generated multi-view images, which suffer from large viewpoint gaps and discontinuities. To mitigate hallucinations by eliminating the outliers, we propose Dehallu3D for 3D mesh generation. Our key idea is to design a balanced multi-view continuity constraint
252
core_reconstruction
high
Faithful Contouring: Near-Lossless 3D Voxel Representation Free from Iso-surface
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; generation_editing
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surface heavily rely on water-tightening or rendering optimization, which inevitably compromise geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the isosurface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiment
253
core_reconstruction
medium
PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; data_benchmark
core genus=3D Reconstruction, but title/abstract signal is narrower
abstract
Existing autoregressive (AR) methods for generating artist-designed meshes struggle to balance global structural consistency with high-fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi-autoregressive diffusion framework for point-cloud-to-mesh generation. The method first performs semantic segmentation on the mesh and then operates in a "part-wise" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high-frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part-aware cross-attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local gene
254
core_reconstruction
high
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; data_benchmark
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
We introduce PixARMesh, the first method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative modeling, we enrich a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream of context, pose, and mesh tokens, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing
255
core_reconstruction
high
RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representative
gaussian_radiance; surface_occupancy; generation_editing
core genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract
3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present RT-Splatting, a framework that disentangles each Gaussian's geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into
256
core_reconstruction
high
GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; depth_correspondence; surface_occupancy
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than
257
core_reconstruction
medium
Foundry: Distilling 3D Foundation Models for the Edge
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; data_benchmark
core genus=3D Reconstruction, but title/abstract signal is narrower
abstract
Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient 'specialist' models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher’s token-l
258
core_reconstruction
high
Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representative
gaussian_radiance; depth_correspondence; data_benchmark
core genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract
3D Gaussian Splatting (3DGS) has emerged as an efficient approach for achieving photorealistic rendering. Recent MLP-based variants further improve visual fidelity but introduce substantial decoding overhead during rendering. To alleviate computation cost, several pruning strategies and level-of-detail (LOD) techniques have been introduced, aiming to effectively reduce the number of Gaussian primitives in large-scale scenes. However, our analysis reveals that significant redundancy still remains due to the lack of occlusion awareness. In this work, we propose Proxy-GS, a novel pipeline that exploits a proxy to introduce Gaussian occlusion awareness from any view. At the core of our approach is a fast proxy system capable of producing precise occlusion depth maps at resolution 1000×1000 under 1 ms. This proxy serves two roles: first, it guides the culling of anchors and Gaussians to
259
core_reconstruction
high
SAM 3D: 3Dfy Anything in Images
3D Vision & Geometry / 3D Reconstruction
C. cluster representative
general_reconstruction; surface_occupancy; data_benchmark
core genus=3D Reconstruction with direct reconstruction/geometry signal
abstract
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruc
260core_reconstruction
high
SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractCompositional scene reconstruction seeks to create object-centric representations rather than holistic scenes from real-world videos, which is natively applicable for simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize visual appearance and show limited generalization ability to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline towards cluttered scene reconstruction, which first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles these assets in the simulator. However, naively combining these three stages leads to visual infidelity of generated assets and physical implausibility of the final scene, a problem particularly severe for complex scenes. Thus, we further propose two bridging
261core_reconstruction
high
WorldGen: From Text to Traversable and Interactive 3D Worlds
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe introduce WorldGen, a method for generating large, fully formed, navigable 3D worlds from a single text prompt. Existing approaches to 3D scene generation often trade off scene diversity, completeness, and correctness in different ways. We push this envelope by producing large scenes explicitly decomposed into individual, high-quality 3D meshes, making them compatible with standard game engines. Our approach first uses a language-driven procedural generator to lay out the scene's basic volumes and navigable regions. An image generator then establishes the scene's theme, style, and details. Next, we obtain a high-quality, compositional 3D reconstruction of the planned scene. This step first uses an image-to-3D model to perform a holistic reconstruction that implicitly determines the shape and location of all scene objects, accounting for context and navigability. The reconstruction is
262core_reconstruction
high
Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; depth_correspondence; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractSparse-view 3D Gaussian Splatting (3DGS) reconstructs scenes using 3D Gaussians from sparse input views. Yet, this method is prone to overfitting, which is exacerbated at higher resolutions as the expanded dimensionality amplifies floating artifacts and reconstruction ambiguities. In this paper, we present a systematic study of 3DGS under sparse-view conditions and varying input resolutions. While prior work has overlooked resolution as a key factor in sparse-view performance, we identify and quantify a trade-off: lower-resolution inputs facilitate stable global geometry reconstruction, whereas higher-resolution inputs enable finer detail recovery but introduce high-frequency artifacts and instability. Building on this insight, we further propose CAGS, a Confidence-Guided Multi-Scale Aggregation that reconstructs scenes through a coarse-to-fine hierarchical optimization process. Our
263core_reconstruction
medium
Efficient unrolled networks for large-scale 3D inverse problems
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkcore genus=3D Reconstruction, but title/abstract signal is narrower
abstractDeep learning-based methods have revolutionized the field of imaging inverse problems, yielding state-of-the-art performance across various imaging domains. The best performing networks incorporate the imaging operator within the network architecture, typically in the form of deep unrolling. However, in large-scale problems, such as 3D imaging, most existing methods fail to incorporate the operator in the architecture due to the prohibitive amount of memory required by global forward operators, which hinder typical patching strategies. In this work, we present a domain partitioning strategy and normal operator approximations that enable the training of end-to-end reconstruction models incorporating forward operators of arbitrarily large problems into their architecture. The proposed method achieves state-of-the-art performance on 3D X-ray cone-beam tomography and 3D multi-coil accelerate
264core_reconstruction
high
EI-Part: Explode for Completion and Implode for Refinement
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editingcore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractPart-level 3D generation is crucial for various downstream applications, including gaming, film production, and industrial design. However, decomposing a 3D shape into geometrically plausible and meaningful components remains a significant challenge. Previous part-based generation methods often struggle to produce well-constructed parts, exhibiting either poor structural coherence, geometric implausibility, inaccuracy, or inefficiency. To address these challenges, we introduce EI-Part, a novel framework specifically designed to generate high-quality 3D shapes with components distinguished by structural coherence, geometric plausibility, accuracy, and generation efficiency. We propose utilizing distinct representations at different stages: an Explode state for part completion and an Implode state for geometry refinement. This strategy allows us to fully leverage spatial resolution, enabli
265core_reconstruction
medium
Fresco: Frequency–Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkcore genus=3D Reconstruction, but title/abstract signal is narrower
abstractWe propose Fresco, a unified optimization paradigm designed to mitigate early over-sharpening and cross-view drifting in head avatar reconstruction. Fresco combines a Laplacian-pyramid-based frequency curriculum with UV-space consistency regularization to progressively enhance reconstruction quality. The optimization begins by stabilizing low-frequency appearance in the image domain, which suppresses spurious details and promotes reliable convergence. As learning proceeds, consistency across different viewpoints is reinforced through pixel-level alignment on shared UV texture coordinates. Finally, high-frequency components are refined under explicit frequency-band constraints, and seam boundary regularization is applied to preserve local continuity. By optimizing in a frequency- and UV-aligned space, Fresco achieves robust convergence without pseudo high-frequency artifacts and yields c
266core_reconstruction
medium
LoST: Level of Semantics Tokenization for 3D Shapes
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editingcore genus=3D Reconstruction, but title/abstract signal is narrower
abstractTokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D s
267core_reconstruction
high
SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; dynamic_4dcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractDynamic extensions of 3D Gaussian Splatting (3DGS) achieve high-quality reconstructions through neural motion fields, but per-Gaussian neural inference makes these models computationally expensive. Building on DeformableGS, we introduce Speedy Deformable 3D Gaussian Splatting (SpeeDe3DGS), which bridges this efficiency–fidelity gap through three complementary modules: Temporal Sensitivity Pruning (TSP) removes low-impact Gaussians via temporally aggregated sensitivity analysis, Temporal Sensitivity Sampling (TSS) perturbs timestamps to suppress floaters and improve temporal coherence, and GroupFlow distills the learned deformation field into shared SE(3) transformations for efficient groupwise motion. On the 50 dynamic scenes in MonoDyGauBench, integrating TSP and TSS into DeformableGS accelerates rendering by 6.78$\times$ on average while maintaining neural-field fidelity and using 10$\
268core_reconstruction
high
GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; dynamic_4dcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios includi
269core_reconstruction
high
EMR-SM: Explicit Mesh Reconstruction with Dynamic Topology Adaptation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose EMR-SM, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with re
270core_reconstruction
high
Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; dynamic_4dcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison. In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation. The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5$\times$ faster training while ma
271core_reconstruction
high
Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe introduce a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, directly animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based te
272core_reconstruction
medium
PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractCurrent foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language-model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes. We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D
273core_reconstruction
medium
Bringing Your Portrait to 3D Presence
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractWe present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head
274core_reconstruction
high
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWhile 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a va
275core_reconstruction
high
CrossHOI: Learning Cross-View Representations for Monocular 3D Human-Object Interaction Reconstruction
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReconstructing 3D human-object interaction (HOI) from monocular images is highly challenging, especially when the human and object are mutually occluded. Existing methods primarily rely on single-view inputs, which fundamentally limit their ability to recover occluded regions and accurately estimate contact areas. To address these challenges, we introduce, for the first time, novel-view feature priors to enhance monocular 3D HOI reconstruction. We first design a cross-view generator that learns to infer novel-view image features from a single-view input, enriching spatial geometry at the feature level without requiring extra inputs during inference. Guided by both real and generated view features, a spatial cross-view feature fusion module adaptively aggregates complementary cues to enhance the initial reconstruction of human and object meshes. Built upon this reconstruction, we sa
276core_reconstruction
high
EDGS: Eliminating Densification for Efficient Convergence of 3DGS
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; depth_correspondencecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting reconstructs scenes by starting from a sparse Structure-from-Motion initialization and refining under-reconstructed regions. This process is slow, as it requires multiple densification steps where Gaussians are repeatedly split and adjusted, following a lengthy optimization path. Moreover, this incremental approach often yields suboptimal renderings in high-frequency regions. We propose a fundamentally different approach: eliminate densification with a one-step approximation of scene geometry using triangulated pixels from dense image correspondences. This dense initialization allows us to estimate the rough geometry of the scene while preserving rich details from input RGB images, providing each Gaussian with well-informed color, scale, and position. As a result, we dramatically shorten the optimization path and remove the need for densification. Unlike methods that rely
277core_reconstruction
medium
FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractAutoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our "one-face-one-token" strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder
278core_reconstruction
high
Human Interaction-Aware 3D Reconstruction from a Single Image
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group contex
279core_reconstruction
high
Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images. Material reconstruction is highly underconstrained and typically relies on analysis-by-synthesis, which requires expensive and noisy path tracing. To better constrain the optimization, we incorporate single-view priors into the reconstruction process. We leverage a diffusion-based material estimator that produces multiple, but often inconsistent, candidate decompositions per view. To reduce the inconsistency, we fit an explicit low-dimensional parametric function to the predictions. We then propose a robust optimization framework using soft per-view prediction selection together with confidence-based soft multi-view inlier set to fuse the most consistent predictions of the most confident views into a consistent parametric material space. Finally, we use inverse pat
280core_reconstruction
medium
Learning to Infer Parameterized Representations of Plants from 3D Scans
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractPlants frequently contain numerous organs, organized in 3D branching systems defining the plant's architecture. Reconstructing the architecture of plants from unstructured observations is challenging because of self-occlusion and spatial proximity between organs, which are often thin structures. To achieve the challenging task, we propose an approach that allows to infer a parameterized representation of the plant's architecture from a given 3D scan of a plant. In addition to the plant's branching structure, this representation contains parametric information for each plant organ, and can therefore be used directly in a variety of tasks. In this data-driven approach, we train a recursive neural network with virtual plants generated using a procedural model. After training, the network allows to infer a parametric tree-like representation based on an input 3D point cloud. Our method is ap
281core_reconstruction
medium
Learning to Solve PDEs on Neural Shape Representations
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractSolving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance
282core_reconstruction
high
Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractReconstructing object geometry from radio frequency (RF) signals is fundamentally challenging due to the lensless imaging nature of RF sensing, which leads to low spatial resolution and high noise. Unlike light signals, RF signals can penetrate occlusions and thus capture information about hidden scenes. Existing Non-Line-of-Sight (NLoS) 3D neural reconstruction methods can recover coarse surfaces inside enclosed environments but often suffer from unstable optimization, noisy surface geometry, and surface ambiguity, failing to produce accurate zero-level sets from the signed distance field (SDF). These limitations largely stem from neglecting the role of Line-of-Sight (LoS) geometry outside the enclosed region, which provides valuable physical constraints for modeling signal propagation. In this paper, we introduce a Unified LoS and NLoS neural geometry reconstruction framework that leve
283core_reconstruction
high
SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; surface_occupancycore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent progress in feed-forward 3D Gaussian Splatting (3DGS) has notably improved rendering quality. However, the spatially uniform and highly redundant 3DGS map generated by previous feed-forward 3DGS methods limits their integration into downstream reconstruction tasks. We propose SparseSplat, the first feed-forward 3DGS model that adaptively adjusts Gaussian density according to scene structure and information richness of local regions, yielding highly compact 3DGS maps. To achieve this, we propose entropy-based probabilistic sampling, generating large, sparse Gaussians in textureless areas and assigning small, dense Gaussians to regions with rich information. Additionally, we designed a specialized point cloud network that efficiently encodes local context and decodes it into 3DGS attributes, addressing the receptive field mismatch between the general 3DGS optimization pipeline and f
284core_reconstruction
high
TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractHand mesh reconstruction has attracted growing attention in recent years. Despite significant progress, existing methods often struggle to balance reconstruction quality and inference efficiency. In this work, we propose TokenHand, a novel framework for single-view 3D hand mesh reconstruction that achieves both high accuracy and real-time inference. Our method represents a 3D hand model using $M$ discrete tokens, each describing a specific sub-structure of the hand. This compositional representation enables efficient modeling with minimal reconstruction error. Furthermore, we reformulate hand mesh reconstruction as a classification problem rather than a regression task. Specifically, a classifier predicts the categories of the $M$ tokens from an input image, and a pre-trained decoder network subsequently reconstructs the 3D hand mesh from the predicted tokens without any post-processing. Extens
285core_reconstruction
high
TouchDream: 3D Object Completion through Imagined Touch
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractPoint cloud completion is crucial for robust 3D perception but remains challenging due to its ill-posed nature. Coarse-to-fine methods can lead to unconstrained local guesses in the absence of key structures, whereas diffusion-based approaches may introduce geometric inconsistencies. To overcome these limitations, we present TouchDream, a novel framework that leverages a diffusion model to 'dream' of tactile sensing on object surfaces, which reformulates the sensing process as a learnable generative modeling task. Unlike visual cues, tactile data provides rich local geometry that can be directly converted into 3D space for point fusion, offering a powerful guide for detail-aware completion. Specifically, our approach generates compact tactile latent representations conditioned on coarse points and sampled touch poses. A touch-guided refinement module then leverages touch features to optim
286core_reconstruction
high
CaT-GS: Efficient 3DGS Rendering for Large Scale Scenes via Inter-frame Caching and Tile Scheduling
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent breakthroughs in 3D Gaussian Splatting (3DGS) have advanced neural rendering with high fidelity and speed. However, its performance degrades significantly in large-scale scenes due to the computational burden of tile-based rasterization. Existing optimization efforts either require costly scene re-training or focus on narrow aspects of the pipeline, overlooking critical inefficiencies in real-world deployments. Through a comprehensive analysis, we identify three primary sources of redundancy and low GPU utilization: redundant inter-frame pre-processing, viewpoint-based occlusion redundancy, and severe tile-level load imbalance. To address these issues, we propose CaT-GS, a novel and efficient 3DGS rendering pipeline. CaT-GS introduces a speculative multi-frame preprocessing method to eliminate redundant computations across consecutive frames, and an inter-frame caching mechanism t
287core_reconstruction
high
ASFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractIndoor monocular semantic scene completion (MSSC) is notably more challenging than its outdoor counterpart due to complex spatial layouts and severe occlusions. While transformers are well suited for modeling global dependencies, their high memory cost and difficulty in reconstructing fine-grained details have limited their use in indoor MSSC. To address these limitations, we introduce ASFormer, a serialized transformer framework tailored for indoor MSSC. Our model features three key designs: (1) an Adaptive Serialized Transformer with learnable shifts that dynamically adjust receptive fields; (2) a Center-Relative Positional Encoding that captures spatial information richness; and (3) a Convolution-Modulated Layer Normalization that bridges heterogeneous representations between convolutional and transformer features. Extensive experiments on NYUv2 and Occ-ScanNet demonstrate that ASForm
288core_reconstruction
medium
Bidirectional Query-Driven Generation of Parametric CAD Sketch
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractLearning-based CAD modeling shows great promise in automating parametric design, yet existing approaches often overlook the incremental and state-dependent nature of sketch construction. We present CADSketcher, a query-driven bidirectional framework for completing partial parametric sketches by internalizing the non-linear construction logic of interactive CAD processes. At the core of CADSketcher are two key innovations. First, a bidirectional sketch learner recovers both prior and posterior contexts from arbitrary-span partial sketches via a bidirectional query mechanism, enabling exploration of multiple plausible modeling trajectories. Second, a confidence-guided completion pipeline adaptively determines the expansion direction through a confidence gate and ensures executable instruction generation using a validity compiler, while a progressive context updater preserves sketch consist
289core_reconstruction
medium
BrepVGAE: Variational Graph Autoencoder with Unified Latent Representation for B-rep
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractDue to the heterogeneity of faces and edges in B-rep, conventional graph-based representations are incapable of establishing a unified formulation for faces and edges, thereby constraining the capabilities of B-rep generative models. We propose a B-rep Variational Graph Auto Encoding (BrepVGAE), the first variational graph autoencoder framework capable of holistically encoding and decoding boundary representations of B-rep models. First, in a novel formulation, we represent both geometry faces and edges as nodes in a graph representation. We then design a sparse graph autoencoder to aggregate the complete B-rep structure into a compact global latent vector. We then construct a decoder that employs set-based generation, which uses bilinear layers to reconstruct adjacency relationships, i.e., topology, with a single latent vector. Afterwards, the same decoder generates node features for all faces and edge
290core_reconstruction
medium
Erasing Invisible Watermarks via Novel View Synthesis
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; depth_correspondencecore genus=3D Gaussian Splatting, but title/abstract signal is narrower
abstractInvisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative "view" of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffu
291core_reconstruction
medium
LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractGenerating high-fidelity 3D contents remains a fundamental challenge due to the complexity of representing arbitrary topologies—such as open surfaces and intricate internal structures—while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs)—a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutio
292core_reconstruction
high
MatMart: Material Reconstruction of 3D Objects via Diffusion
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractApplying diffusion models to physically-based material estimation and generation has recently gained prominence. In this paper, we propose MatMart, a novel material reconstruction framework for 3D objects, offering the following advantages. First, MatMart adopts a two-stage reconstruction, starting with accurate material prediction from inputs and followed by prior-guided material generation for unobserved views, yielding high-fidelity results. Second, by utilizing progressive inference alongside the proposed view-material cross-attention (VMCA), MatMart enables reconstruction from an arbitrary number of input images, demonstrating strong scalability and flexibility. Finally, MatMart achieves both material prediction and generation capabilities through end-to-end optimization of a single diffusion model, without relying on additional pre-trained models, thereby exhibiting enhanced stabil
293core_reconstruction
medium
NeAR: Coupled Neural Asset–Renderer Stack
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractNeural asset authoring and neural rendering have emerged as largely disjoint threads: one generates digital assets using neural networks for traditional graphics pipelines, while the other develops neural renderers that map conventional assets to images. However, the joint design of the asset representation and renderer remains largely unexplored. We argue that coupling them can unlock an end-to-end learnable graphics stack with benefits in fidelity, consistency, and efficiency. In this paper, we explore this possibility with **NeAR**: a Coupled Neural Asset–Renderer Stack. On the **asset** side, we build on Trellis-style Structured 3D Latents and introduce a lighting-homogenized neural asset: from a casually lit input, a rectified-flow backbone predicts a Lighting-Homogenized SLAT that encodes geometry and intrinsic material cues in a compact, view-agnostic latent. On the **renderer** s
294core_reconstruction
high
Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractWe present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. Specifically, we introduce a set of learnable lightweight visual prompts into a pretrained, frozen geometry foundation model to capture the knowledge of new environments while preserving the fundamental capability of the foundation model for geometry prediction. To solve the problems of missing groundtruth and the requirement of high efficiency when updating these visual prompts at test time, we introduce a local-global self-supervised learning strategy by enforcing the local and global consistency constraints on predictions. The local consistency constraints are conducted on intermediate and previously local fused results, enabling the model to be trained with high-quality pseudo groundtruth signals; the global cons
295core_reconstruction
high
Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstract3D reconstruction of transparent objects from multiple views has been a long-standing challenge. In contrast to opaque objects, transparent objects exhibit complex refraction that causes serious image distortion, resulting in a highly ill-posed problem. Existing reconstruction methods commonly depend on special capture devices or controlled environments, which provide more priors and simplify the modeling of refraction. More importantly, these methods lack the capability for reconstruction of mixed transparent and opaque objects, being confined to transparent or opaque materials. To address these challenges, we propose Opti-NeuS, a novel method for reconstructing transparent and opaque objects without controlled environments or additional input. Opti-NeuS incorporates a novel IoRNetwork to obtain spatially-varying IoR for tracing the refractive ray paths, which can finally model refracti
296core_reconstruction
high
PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; depth_correspondencecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe introduce Primitive-based Representations of Uncertainty (PRIMU), a post-hoc uncertainty estimation (UE) framework for Gaussian Splatting (GS). Reliable UE is essential for deploying GS in safety-critical domains such as robotics and medicine. Existing approaches typically estimate Gaussian-primitive variances and rely on the rendering process to obtain pixel-wise uncertainties. In contrast, we construct primitive-level representations of error and visibility/coverage from training views, capturing interpretable uncertainty information. These representations are obtained by projecting view-dependent training errors and coverage statistics onto the primitives. Uncertainties for novel views are inferred by rendering these primitive-level representations, producing uncertainty feature maps, which are aggregated through pixel-wise regression on holdout data. We analyze combinations of uncerta
297core_reconstruction
medium
Residual Primitive Fitting of 3D Shapes with SuperFrusta
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractWe introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative inference algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is an analytical primitive that is simultaneously (1) expressive, being able to express various common solids such as cylinders, spheres, cones, and their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a signed distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover
298core_reconstruction
high
Revisiting 3D Reconstruction Kernels as Low-Pass Filters
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstract3D reconstruction aims to recover 3D signals from sampled discrete 2D pixels, with the goal of converging to continuous 3D spaces. In this paper, we revisit 3D reconstruction from the perspective of signal processing, identifying the periodic spectral extension induced by discrete sampling as the fundamental challenge. Previous 3D reconstruction kernels, such as Gaussians, Exponential functions, and Student's t distributions, serve as low-pass filters to isolate the baseband spectrum. However, their non-ideal low-pass property results in the overlap of high-frequency components with low-frequency components in the discrete-time signal’s spectrum. To this end, we introduce the Jinc kernel, with an instantaneous drop to zero magnitude exactly at the cutoff frequency, which corresponds to the ideal low-pass filter. As the Jinc kernel suffers from low decay speed in the spatial domain, we further pro
299core_reconstruction
high
SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; depth_correspondencecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has received tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-Lambertian or transparent materials. To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods propose to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based methods is still underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the inter-independence among individual Gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as the active set method. To this end, we propose SparseOIT, an OIT-based 3DGS reconstruction algorithm that maintains an active set of g
300core_reconstruction
high
Stochastic Ray Tracing for the Reconstruction of 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; depth_correspondencecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRay-tracing-based 3D Gaussian splatting (3DGS) enjoys the generality of supporting non-pinhole camera models and relightable formulations. However, such methods usually lag in performance, partially due to the need for depth-based sorting of all intersecting Gaussians along the traced rays. In this paper, we introduce a sorting-free differentiable stochastic formulation for ray-traced 3DGS, enabling efficient reconstruction and rendering of both standard and relightable 3DGS scenes. For standard 3DGS, our method offers performance comparable to rasterization-based 3DGS and outperforms sorting-based ray tracing. For relightable 3DGS, our technique provides higher-quality reconstructions and renderings thanks to the accurate shadow and shading computation provided by fully ray-traced shadow and light rays.
301core_reconstruction
high
TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction with direct reconstruction/geometry signal
abstractJoint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human–object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human–object interactions to enforce semantic alignment between the 3D reconstruction an
302core_reconstruction
medium
Underground Plant Exploration: Non-Destructive 3D Root Assessment with GPR Based on Point Graph Neural Network
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancycore genus=3D Reconstruction, but title/abstract signal is narrower
abstractThis paper introduces a novel application of machine learning in agriculture for non-destructive 3D root structure reconstruction. Plant roots are critical for providing resources for the entire plant. Ground Penetrating Radar (GPR) is a key tool for identifying subterranean objects with simple and obvious shapes, such as large pipes, but assessing the 3D shapes of roots with it remains challenging. In our study, we introduce a novel approach specifically designed based on GPR signal shape priors to detect target signals and perform curve parameter regression based on multiple B-scans from GPR. This process enables the derivation of a precise curve from the detection and regression outcomes. To achieve the reconstruction of a comprehensive 3D root structure, we have developed a shape reconstruction network that processes sparse sliced 3D points through a dedicated point graph network and an upsa
303core_reconstruction
high
A Geometric Algebra-Informed 3DGS Framework for Wireless Channel Prediction
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractIn this paper, we introduce Geometric Algebra–Informed 3D Gaussian Splatting (GAI-GS), a framework for wireless modeling that couples 3D Gaussian splatting with a geometric-algebra–based attention mechanism to explicitly model ray–object interactions in complex propagation environments. GAI-GS encodes joint spatial–electromagnetic (EM) relations into token representations, enabling scene-level aggregation within a unified, end-to-end neural architecture. This design renders ray tracing for wireless propagation physically grounded, with token interactions that respect EM constraints including multipath, path-dependent attenuation, and reflection/diffraction. Through extensive evaluations on multiple real-world indoor datasets, GAI-GS consistently surpasses current baselines across various wireless tasks.
304core_reconstruction
high
Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian splatting (3DGS) has become a vital tool for learning a radiance field from multiple posed images. Although 3DGS shows great advantages over NeRF in terms of rendering quality and efficiency, it remains a research challenge to further improve the efficiency of learning 3D Gaussians. To overcome this challenge, we propose novel training strategies and losses to shorten each Gaussian list used to render a pixel, which speeds up the splatting by involving fewer Gaussians along a ray. Specifically, we shrink the size of each Gaussian by resetting their scales regularly, encouraging smaller Gaussians to cover fewer nearby pixels, which shortens the Gaussian lists of pixels. Additionally, we introduce an entropy constraint on the alpha blending procedure to sharpen the weight distribution of Gaussians along each ray, which drives dominant weights larger while making minor weights s
305core_reconstruction
medium
OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; depth_correspondencecore genus=3D Gaussian Splatting, but title/abstract signal is narrower
abstractWe propose OMG-Avatar, a novel One-shot method that leverages a Multi-LOD (Level-of-Detail) Gaussian representation for animatable 3D head reconstruction from a single image in 0.2s. Our method enables LOD head avatar modeling using a unified model that accommodates diverse hardware capabilities and inference speed requirements. To capture both global and local facial characteristics, we employ a transformer-based architecture for global feature extraction and projection-based sampling for local feature acquisition. These features are effectively fused under the guidance of a depth buffer, ensuring occlusion plausibility. We further introduce a coarse-to-fine learning paradigm to support Level-of-Detail functionality and enhance the perception of hierarchical details. To address the limitations of 3DMMs in modeling non-head regions such as the shoulders, we introduce a multi-region decom
306core_reconstruction
high
B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta–Bernoulli Bayesian Updates
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; generation_editingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractInteractive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B$^3$-Seg (Beta–Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta–Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1-1/e)$ approximation to the optimal view sampling policy. Experiments on multiple datasets show t
307core_reconstruction
high
BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; generation_editingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractMost Gaussian Splatting techniques that provide a 3D semantic representation of the scene don't optimize the underlying 3D geometry of the scene. This makes object-level editing or asset extraction challenging. Recent methods, like COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the geometry of the scene to represent the underlying semantics. We go a step further and propose a novel solution that provides near-perfect boundaries in object extraction. We do so by introducing two new losses in the optimization that take care of: 1. modifying the geometry of visible Gaussians to respect semantic boundaries, and 2. modifying the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization to allow for seamless integration within the optimization of the Gaussian
308core_reconstruction
high
Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractUnderstanding hand-object interaction from monocular videos is crucial for immersive and dexterous interactions in AR/VR and robotic applications. However, existing monocular reconstruction methods primarily assume rigid grasping and static object geometry. When applied to articulated manipulations, the continuous joint rotations and frequent component deformations introduce a strong coupling between shape and motion, leading to severe ambiguity and instability in articulation optimization under monocular observation. To address this challenge, we propose a Clay-to-Stone dual-phase framework, modeling the articulated manipulation at hierarchical granularities, enabling a progression from flexible semantic exploration to structured articulation recovery. In the CLAY phase, our method performs fine-grained control over geometric deformation, guided by inter-part semantic correlation learni
309core_reconstruction
high
ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractLifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. Mainstream methods, built on an embedding paradigm, suffer from three key flaws: (i) geometry-semantic inconsistency, where points, rather than objects, serve as the semantic basis, limiting semantic fidelity; (ii) semantic bloat from injecting gigabytes of feature data into the geometry; and (iii) semantic rigidity, as one feature per Gaussian struggles to capture rich polysemy. To overcome these limitations, we introduce ExtrinSplat, a framework built on the extrinsic paradigm that decouples geometry from semantics. Instead of embedding features, ExtrinSplat clusters Gaussians into multi-granularity, overlapping 3D object groups. A Vision-Language Model (VLM) then interprets these groups to generate lightweight textual hypotheses, creating an extrinsic index layer that natively su
310core_reconstruction
high
HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractWe introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. While tremendous successes have been achieved for static faces, animatable avatars from dynamic videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on
311core_reconstruction
high
MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; generation_editingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting (3DGS) has emerged as a mainstream representation for 3D scenes, drawing increasing research attention to its understanding, generation, and editing. However, existing studies remain limited to low-level perception, low-quality generation, and low-efficiency editing, lagging far behind their image counterparts in the era of Multimodal Large Language Models (MLLMs). To bridge this gap, we propose MLLMSplat, a novel framework that adapts 2D MLLMs to achieve high-level understanding, high-quality generation, and high-efficiency editing of 3DGS scenes. Specifically, our comprehensive framework consists of three core designs: (1) a 3DGS tokenizer that can be seamlessly integrated into existing MLLMs in a training-free manner; (2) a 3DGS de-tokenizer that non-intrusively extends the 2D latent diffusion model in MLLMs using a dual positional encoding space, while augmentin
312core_reconstruction
high
Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with photorealistic quality. 3DGS often requires a large number of primitives to achieve high fidelity, leading to redundant representations and high resource consumption, thereby limiting its scalability for complex or large-scale scenes. Consequently, effective pruning strategies and more expressive primitives that can reduce redundancy while preserving visual quality are crucial for practical deployment. We propose an efficient, integrated reconstruction-aware pruning strategy that adaptively determines pruning timing and refining intervals based on reconstruction quality, thus reducing model size while enhancing rendering quality. Moreover, we introduce a 3D Difference-of-Gaussians primitive that jointly models both positive and negative densi
313core_reconstruction
medium
Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; data_benchmarkcore genus=3D Gaussian Splatting, but title/abstract signal is narrower
abstractRadiance field methods (e.g., 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and critically fail to capture specular reflections, a key component of realistic rendering. While alternatives like Spherical Gaussians offer improvements, they introduce significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while maintaining simpler optimization compared to existing al
314core_reconstruction
high
FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinkage for Large-scale LoD 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; data_benchmarkcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting has revolutionized neural rendering with real-time performance. However, scaling this approach to large scenes using Level-of-Detail methods faces critical challenges: inefficient serial traversal consuming over 60% of rendering time, and redundant Gaussian-tile pairs that incur unnecessary processing overhead. To address these limitations, we propose FilterGS, featuring a parallel filtering mechanism with two complementary filters that enable efficient selection without tree traversal, coupled with a scene-adaptive Gaussian shrinkage strategy that minimizes redundancy through opacity-based scaling. Extensive experiments demonstrate that FilterGS achieves state-of-the-art rendering speeds while maintaining competitive visual quality across multiple large-scale datasets.
315core_reconstruction
high
Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; generation_editingcore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting has recently enabled fast and photorealistic reconstruction of static 3D scenes. However, dynamic editing of such scenes remains a significant challenge. We introduce a novel framework, Physics-Guided Score Distillation, to address a fundamental conflict: physics simulation provides a strong motion prior that is insufficient for photorealism, while video-based Score Distillation Sampling (SDS) alone cannot generate coherent motion for complex, multi-particle scenarios. We resolve this through a unified optimization framework where physics simulation guides Score Distillation to jointly refine the motion prior for photorealism while simultaneously optimizing appearance. Specifically, we learn a neural dynamics model that predicts particle motion and appearance, optimized end-to-end via a combined loss integrating Video-SDS for photorealism with our physics-guidance
316core_reconstruction
high
3D Gaussian Splatting at Arbitrary Resolution with Compact Proxy Anchors
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractDespite achieving high-quality rendering, 3D Gaussian Splatting suffers from aliasing when the rendering resolution changes, as it is typically trained at a fixed resolution. To address this limitation, we introduce a method that enables the model to generate resolution-adaptive 3D Gaussians under arbitrary resolution changes. In particular, built upon Scaffold-GS, we enhance the anchor feature representation by incorporating a resolution-embedding to encode continuous resolution information. From these enhanced anchor features, a pixel coverage gate dynamically forms resolution-adaptive 3D Gaussians. Furthermore, we drastically reduce storage requirements by selecting a compact subset of proxy anchors and designing a residual anchor predictor that reconstructs the unselected leaf anchors based on the proxy anchors, enabling faithful scene representation without compromising visual fidel
317core_reconstruction
high
IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractApplying 3D Gaussian Splatting to inverse rendering, especially for relightable assets under high-illuminance conditions, remains challenging. Strong specular highlights and complex reflections complicate material-light disentanglement, often baking in shadows and losing specular detail. To address this, we introduce IR-HGP, a framework that achieves robust disentanglement using three synergistic modules: First, a Hybrid Visibility Decomposition module ensures physical visibility consistency. Second, a Generative Illumination Field Prior module infers detailed and high-dynamic range environmental lighting. Finally, a Physics-Aware Radiance Correction module stabilizes optimization and mitigates illumination artifacts. Our framework achieves SOTA material recovery and relighting performance, outperforming existing methods under challenging illumination conditions. It reconstructs the vi
318core_reconstruction
high
Learning Differentiable Hierarchies in 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractAlthough 3D Gaussian Splatting (3DGS) has achieved impressive performance in real-time rendering, its unordered Gaussians make level-of-detail (LoD) construction and model compression highly challenging, limiting its applicability in customized scenarios. In this work, we propose a learning-based Gaussian hierarchy representation that ranks Gaussians by their contribution to the scene, enabling flexible LoD representations across arbitrary Gaussian counts. We first introduce a unified, continuous formulation and metric for Gaussian hierarchy. Then, we introduce a hierarchy-based modulated rendering method built upon a Differentiable Decreasing Step Function, which enables efficient hierarchy learning while maintaining approximately equivalent rendering. Moreover, we develop a PDF-Guided Active-Region Sampling strategy that encourages the learned hierarchy to become widely distributed withi
319core_reconstruction
high
NG-GS: NeRF-guided 3D Gaussian Splatting Segmentation
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring
320core_reconstruction
high
Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, enabling efficient and high-fidelity novel view synthesis. However, seamless integration of both aerial and street view images to model urban scenes remains a significant challenge for 3DGS. This joint setting suffers from extreme view coverage disparity, complex multi-scale details, and imbalanced viewpoint distributions. In this work, we present Urban-GS, a novel framework built upon Gaussian Splatting for the compact unified reconstruction and high-fidelity rendering of urban scenes from both aerial and street views. Specifically, we first develop an Aerial-Street Joint Adaptive Densification method to resolve the densification conflicts arising from large view coverage disparity. We then introduce a Contribution-based Anchor Pruning strategy to effectively mitigate the storage overhead from captu
321core_reconstruction
high
Write Where It Matters: Policy-Guided Watermarks for 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent advances in 3D Gaussian Splatting (3DGS) enable photorealistic real-time rendering but also increase the risks of unauthorized copying and redistribution. Existing 3DGS watermarking methods typically rely on handcrafted thresholds or globally fixed hyperparameters to balance invisibility and robustness, making their embedding behavior static and scene-agnostic. We instead formulate 3DGS watermarking as a goal-directed decision process and introduce Write Where It Matters (W2M), the first reinforcement learning-based framework that adaptively learns where and how much to embed. By modeling the embedding process as a Markov Decision Process, W2M uses a lightweight policy network to allocate precise Gaussian updates directly from immediate reward feedback, iteratively. The reward incentivizes both rendering-space invisibility and decoding robustness under various image- and model-lev
322core_reconstruction
high
Z-Order Transformer for Feed-Forward Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractRecent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this paper, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively
323core_reconstruction
high
NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstract3D Gaussian Splatting can exploit frustum culling and level-of-detail strategies to accelerate rendering of scenes containing a large number of primitives. However, the semi-transparent nature of Gaussians prevents the application of another highly effective technique: occlusion culling. We address this limitation by proposing a novel method to learn the viewpoint-dependent visibility function of all Gaussians in a trained model using a small, shared MLP across instances of an asset in a scene. By querying it for Gaussians within the viewing frustum prior to rasterization, our method can discard occluded primitives during rendering. Leveraging tensor cores for efficient computation, we integrate these neural queries directly into a novel instanced software rasterizer. Our approach outperforms the current state of the art for composed scenes in terms of VRAM usage and image quality, utili
324core_reconstruction
high
SunFaded: Illumination-Aware Gaussian Splatting for Dark Scenes with Camera-Mounted Active Lighting
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiancecore genus=3D Gaussian Splatting with direct reconstruction/geometry signal
abstractGaussian Splatting has emerged as a popular 3D representation technique, but still struggles with appearance inconsistencies, especially in dark scenes that require active illumination (e.g., camera flashes or co-moving light sources) to capture usable images, leading to dramatic local appearance fluctuations. While existing methods mainly focus on modeling global appearance changes for in-the-wild scenes, such as those caused by different times of day or weather conditions, they fail to handle the severe variations present in dark scenes with moving light sources. In this paper, we propose a novel Gaussian Splatting–based approach for constructing scene representations in dark scenes where active light sources are rigidly attached to the camera and move together with it. Within this framework, we introduce an illumination-weighted loss function that drives the representation toward the und
325core_reconstruction
medium
Where, What, Why: Toward Explainable 3D-GS Watermarking
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiancecore genus=3D Gaussian Splatting, but title/abstract signal is narrower
abstractAs 3D Gaussian Splatting becomes the de facto representation for interactive 3D assets, robust yet imperceptible watermarking is critical. We present a representation-native framework that separates where to write from how to preserve quality. A Trio-Experts module operates directly on Gaussian primitives to derive priors for carrier selection, while a Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark carriers—optimized for bit resilience under perturbation and bitrate budgets—and to visual compensators that are insulated from watermark loss. To maintain fidelity, we introduce a channel-wise group mask that controls gradient propagation for carriers and compensators, thereby limiting Gaussian parameter updates, repairing local artifacts, and preserving high-frequency details without increasing runtime. Our design yields view-consistent watermark persistence and strong
326core_reconstruction
medium
Hyper-PCN: Hypergraph-based Point Cloud Completion via High-order Correlation Modeling
3D Vision & Geometry / Point Cloud
C. cluster representativesurface_occupancy3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractPoint cloud completion is an important yet challenging problem in 3D computer vision, which aims to reconstruct complete and dense 3D shapes from partial point clouds. Although transformer-based and geometry-based approaches have made significant progress, they often struggle to capture the complex, high-order correlations inherent in point clouds. To address this limitation, we propose Hyper-PCN, a point cloud completion framework that leverages hypergraphs to explicitly model complex, higher-order correlations within incomplete inputs for more accurate completion. It comprises two key modules: Hyper Refinement Stack, designed to progressively capture coarse-to-fine high-order correlations through a series of hypergraph learning stages, and Anchor-based Hypergraph Neural Network, which employs a two-stage sampling strategy to construct collaborative hypergraphs, ensuring robust modeling
327core_reconstruction
medium
3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractAccurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces remains a significant challenge. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the reliance on distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, thereby offering limited insight into performance under real-world material complexities. In this paper, we introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 100,000 synthetic instances generated via physically-based rendering of more than 10,000 shapes, and over 1,000 real-world objects scanned using consumer RGB-D devices. Together, these data consist of more than 7 m
328core_reconstruction
medium
AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
Remote Sensing & Earth / Remote Sensing
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractRecent advances in 4D scene reconstruction have greatly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and traject
329core_reconstruction
medium
Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction
Computational Imaging / Computational Imaging
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractThe task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years. However, most conventional cameras are bandwidth-limited to 30–60 FPS, restricting these methods to static or slowly evolving scenes. While overcoming bandwidth limitations is difficult in general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific scenarios (e.g., motion capture and particle image velocimetry). However, most of these methods require modifications to camera optics or the addition of mechanically moving components, limiting them to a single-view high-speed capture. Consequently, these cannot be readily used to capture a 3D representation of rapid scene motion. In this paper, we propose a novel method to capture and reconstruct a volumetric representation of a high-speed
330core_reconstruction
medium
DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractAutonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves tem
331core_reconstruction
medium
DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractSimulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when they were captured from different scenes. To overcome these limitations we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into photorealistic, temporally consistent outputs. At its core is a single-step temporally-conditioned enhancer that is converted from a pretrained multi-step imag
332core_reconstruction
medium
EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Video & Motion / Human Motion
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractHuman behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Based on the collected data, we empower three embodied AI tasks: monocular human-scene-reconstruct
333core_reconstruction
medium
EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractVolume electron microscopy (vEM) enables nanoscale 3D imaging of biological structures but remains constrained by acquisition trade-offs, leading to anisotropic volumes with limited axial resolution. Existing deep learning methods seek to restore isotropy by leveraging lateral priors; yet their assumptions break down for morphologically anisotropic structures. We present EMGauss, a general framework for 3D reconstruction from planar scanned 2D slices with applications in vEM, which circumvents the inherent limitations of isotropy-based approaches. Our key innovation is to reframe slice-to-3D reconstruction as a 3D dynamic scene rendering problem based on Gaussian splatting, where the progression of axial slices is modeled as the temporal evolution of 2D Gaussian point clouds. To enhance fidelity in data-sparse regimes, we incorporate a Teacher–Student bootstrapping mechanism that
334core_reconstruction
medium
FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractIn controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce FaithFusion, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on
335core_reconstruction
medium
FVGen: Scaling 3D Scene Datasets with Certainty-Aware Free-View Generation from Scene Geometry Reconstruction
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; generation_editing; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractThe development of generalizable Novel View Synthesis (NVS) models is critically limited by the scarcity of large-scale training data with diverse and accurate camera trajectories. While real-world captures are photorealistic, they are typically sparse and discrete. Conversely, synthetic data scales but suffers from a domain gap and often lacks realistic semantics. We introduce FVGen, a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. Our key insight is that an imperfect reconstructed scene serves as a rich geometric proxy, but naively sampling from it amplifies artifacts. To this end, we propose a certainty-aware free-view sampling strategy that identifies novel viewpoints which are both semantically meaningful and minimally affected by reconstruction errors. We demonstr
336core_reconstruction
medium
Gau-Occ: Geometry-Completed Gaussians for Multi-Modal 3D Occupancy Prediction
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmarkdirect reconstruction/3DGS/4D title linked to core representation cluster
abstract3D semantic occupancy prediction is crucial for autonomous driving, yet vision-only approaches suffer from weak geometric cues, and existing multi-modal frameworks often depend on dense voxel or BEV tensors that impose heavy computational cost. We present Gau-Occ, a multi-modal framework that models the scene as a compact collection of semantic 3D Gaussians, enabling geometry-guided fusion without dense volumetric processing. To enhance geometric completeness, a learned LiDAR Completion Diffuser (LCD) trained on real-world priors recovers missing structures from sparse LiDAR, and the completed points are encoded as semantic Gaussian anchors. To further integrate multi-view image semantics, we introduce Gaussian Anchor Fusion (GAF), a geometry-aligned aggregation module that performs anchor-guided 2D sampling, local neighborhood encoding, and cross-modal alignment. By construc
337core_reconstruction
medium
ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractParking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS-denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, while 3D reconstruction, which is crucial for capturing complex spatial geometry in parking scenarios, remains underexplored. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slot perception module. To address these limitations, we curate the first benchmark named ParkRecon3D, specifically designed for parking scene reconstruction. It includes sensor data from four surround-view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking
338core_reconstruction
medium
PoInit-of-View: Poisoning Initialization of Views Transfers Across Multiple 3D Reconstruction Systems
Robustness & Safety / Safety
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractPoisoning input views of 3D reconstruction systems has been recently studied. However, we identify that existing studies simply backpropagate adversarial gradients through the 3D reconstruction pipeline as a whole, without uncovering the new vulnerability rooted in specific modules of the 3D reconstruction pipeline. In this paper, we argue that the structure-from-motion (SfM) initialization, as the geometric core of many widely used reconstruction systems, can be targeted to achieve strong poisoning effects. To this end, we propose PoInit-of-View, which optimizes adversarial perturbations to intentionally introduce cross-view gradient inconsistencies at projections of corresponding 3D points. These inconsistencies disrupt keypoint detection and feature matching, thereby corrupting pose estimation and triangulation within SfM, eventually resulting in low-quality rendered views. We also pro
339core_reconstruction
medium
Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractMonocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions—where over 93% of voxels are empty and foreground classes are rare—poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; (2) a Foreground Modulation Strategy combining Foreground Dro
340core_reconstruction
medium
TGSFormer: Scalable Temporal Gaussian Splatting for Embodied Semantic Scene Completion
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractEmbodied 3D Semantic Scene Completion (SSC) infers dense geometry and semantics from continuous egocentric observations. Most existing Gaussian-based methods rely on random initialization of many primitives within predefined spatial bounds, resulting in redundancy and poor scalability to unbounded scenes. A recent depth-guided approach alleviates this issue but remains local, suffering from latency and memory overhead as scale increases. To overcome these challenges, we propose TGSFormer, a scalable Temporal Gaussian Splatting framework for embodied SSC. It maintains a persistent Gaussian memory for temporal prediction, without relying on image coherence or frame caches. For temporal fusion, a Dual Temporal Encoder jointly processes current and historical Gaussian features through confidence-aware cross-attention. Subsequently, a Confidence-aware Voxel Fusion module merges overlapping primiti
341core_reconstruction
medium
UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes
Remote Sensing & Earth / Remote Sensing
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractIllumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along
342core_reconstruction
medium
Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping; generation_editing; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractCreating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameter from unposed images. The predicted pose
343core_reconstruction
medium
Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractScalable robot learning is hindered by the high cost of acquiring diverse, high-quality embodied data. Existing data generation approaches partially mitigate this issue but typically depend on hard-to-access hardware and labor-intensive manual effort, with limited generalization to diverse scene configurations. To overcome these limitations, we propose Video2Robo, a framework that generates high-quality and diverse robot data directly from a single human demonstration video, enabling seamless deployment on physical robots. At its core, Video2Robo leverages 3D Gaussian Splatting (3DGS) as a powerful scene representation, enabling high-fidelity rendering and explicit 3D scene editing. The framework tracks temporally consistent motion trajectories of task-relevant objects from raw video footage and identifies key task skills, guiding robots to execute tasks kinematically plausibly under nov
344core_reconstruction
medium
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractThe rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has sparked interest in extending these models to 3D scenes, with the goal of human-like visual-spatial intelligence. However, achieving deep spatial understanding comparable to human capabilities remains challenging for both model design and data acquisition. Existing methods often rely on external depth sensors for geometry capture or off-the-shelf algorithms for pre-constructing 3D maps, which limits their scalability. In this work, we introduce VLM-3R, a framework for Vision-Language Models that couples 3D reconstructive instruction tuning with scalable training data curation and a new benchmark for temporal reasoning. Specifically, VLM-3R processes monocular video frames with a geometry encoder that derives implicit 3D tokens representing scene context (spatial tokens) and camera motion (view tokens). In
345core_reconstruction
medium
WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractEditable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation. Meanwhile, image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over the weather effects. To address these limitations, we propose WeatherCity, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with
346core_reconstruction
medium
RecEdit-Drive: 3D Reconstruction-Guided Spatiotemporal Video Editing for Autonomous Driving Scenes
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractHigh-quality video editing and processing are crucial in domains such as filmmaking and autonomous driving, where accurate visual refinement and data preparation are essential. However, it is challenging to achieve precise control over dynamic objects while maintaining spatiotemporal consistency. Current approaches typically utilize text prompts or 2D structural priors for video editing to ensure consistency, yet they struggle to effectively constrain the spatial variations of dynamic 3D objects. In this paper, we introduce RecEdit-Drive, a framework that integrates Spatial Feature Warping and Spatiotemporal Collaborative Modeling to effectively control 3D object variations and enhance video consistency. The spatial feature warping enhances precise control over the edited foreground 3D objects, enhancing spatial consistency in the generated videos; and th
347core_reconstruction
medium
MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; surface_occupancy; robotics_mapping3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractActive mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction. To address this limitation, we introduce MAGICIAN, a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a predicted scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of coverage gain for any novel viewpoint via fast volumetric rendering. The resulting speedup allows the integration of the gain metric into a tree-search algorithm for planning long-horizon paths. We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner. Our method achieves state-of-the-art performan
348core_reconstruction
medium
Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractWe propose Exact-GS, a novel mathematically rigorous and accurate 3D Gaussian Splatting model designed to perform 3D X-ray computed tomography (CT) reconstruction and novel view synthesis. Recently, 3D Gaussian Splatting achieved considerable progress at 3D representation. Unfortunately, due to the affine approximation of the projective transformation, previous 3DGS-based methods inevitably suffer from artifacts and projection inconsistencies. To address this problem, some ray tracing based methods perform integration along the ray across Gaussians. However, these methods are computationally inefficient on the forward and backward pass. We introduce a novel closed-form splatting solution for this problem with mathematically rigorous derivation. Our model is the first to achieve the same exact rendering quality as ray tracing based methods without any approximation under a splatting-based
349core_reconstruction
medium
RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection
Detection & Tracking / Detection
D. adjacent but useful contextgaussian_radiance; pose_calibration_localization; dynamic_4d; robotics_mappingdirect reconstruction/3DGS/4D title linked to core representation cluster
abstract4D millimeter-wave radar is a promising sensing modality for autonomous driving, yet effective 3D object detection from 4D radar and monocular images remains challenging. Existing fusion approaches either rely on instance proposals lacking global context or dense BEV grids constrained by rigid structures, lacking a flexible and adaptive representation for diverse scenes. To address this, we propose RaGS, the first framework that leverages 3D Gaussian Splatting (GS) to fuse 4D radar and monocular cues for 3D object detection. 3D GS models the scene as a continuous field of Gaussians, enabling dynamic resource allocation to foreground objects while maintaining flexibility and efficiency. Moreover, the velocity dimension of 4D radar provides motion cues that help anchor and refine the spatial distribution of Gaussians. Specifically, RaGS adopts a cascaded pipeline to construct and progressi
350core_reconstruction
medium
ST4R-Splat: Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting
Segmentation & Dense Prediction / Segmentation
D. adjacent but useful contextgaussian_radiance; dynamic_4d; robotics_mapping; data_benchmarkdirect reconstruction/3DGS/4D title linked to core representation cluster
abstractUnderstanding and segmenting objects in dynamic 4D environments from natural language is crucial yet underexplored. Existing works either perform referring segmentation in static 3D scenes or build open-vocabulary 4D language fields, but none of them supports grounding complex spatio-temporal referring descriptions in explicit 4D reconstructions. Based on 4D Gaussian Splatting (4DGS), we formalize this missing setting as Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting (STRS-4DGS): given a 4DGS representation of a dynamic scene and a referring expression, the goal is to identify the target object and segment it across both space and time, resolving where the described instance is and when it exhibits the queried state. To tackle this challenge, we propose ST4R-Splat, the first framework for STRS-4DGS. ST4R-Splat builds on deformable 4D Gaussians and introduces an Instance-A
351core_reconstruction
medium
Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkdirect reconstruction/3DGS/4D title linked to core representation cluster
abstractX-ray computed tomography (CT) reconstructs volumetric representations of objects from projection images obtained by transmitting X-rays through a target. Recent splat-based tomography, which represents a volume as a continuous distribution of 3D Gaussians, has demonstrated both high reconstruction quality and fast convergence in cone-beam sparse-view CT. However, when deployed in real CT systems with limited and non-uniform view distributions, we observe distinctive streak and strip artifacts that are far more pronounced than in conventional reconstruction methods. Through detailed analysis, we show that these artifacts primarily originate from pose inaccuracies in the acquisition geometry rather than from view sparsity itself. We revisit pose sensitivity in the splatting formulation and derive a stable gradient-based framework that jointly refines geometric parameters during reconstruc
352core_reconstruction
medium
DetectSCI: Toward Object-Guided ROI Reconstruction for High-Resolution Video Snapshot Compressive Imaging
Computational Imaging / Computational Imaging
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; data_benchmarkdirect reconstruction/3DGS/4D title linked to core representation cluster
abstractVideo snapshot compressive imaging (SCI) offers a promising alternative to high-speed cameras by encoding multiple frames into a single 2D measurement. However, SCI requires algorithms to reconstruct the high-speed video, and as resolution increases, reconstruction becomes computationally expensive and memory-intensive. Many of these resources are wasted on recovering large background regions that contain little useful information, highlighting the need for selective, object-driven reconstruction. Existing object detectors struggle to perform accurately on SCI measurements due to the spatial–temporal aliasing introduced by coded exposure. To address this challenge, we propose DetectSCI, the first framework enabling object-guided region-of-interest (ROI) reconstruction for high-resolution SCI. The inside detector comprises two key components: an encoder built from weight-sharing Mamba-Implicit Mo
353core_reconstruction
medium
Generative Diffusion Priors for 3D Mapping of the Dark Universe
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; robotics_mapping; data_benchmark3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractReconstructing the three-dimensional distribution of dark matter from weak-lensing observations is a central but highly ill-posed inverse problem in cosmology. Unlike standard 3D reconstruction with multiple viewpoints, we observe the universe from a single line of sight, through noisy shape distortions of galaxies with uncertain distances, so meaningful recovery of the 3D matter field requires strong prior assumptions. Existing methods either produce point estimates with handcrafted priors or use neural ensembles for approximate Bayesian uncertainty, and struggle to capture the non-Gaussian, filamentary structure of the cosmic web. With the advent of new high-resolution cosmological simulations, we now have an alternative source of prior knowledge that captures the nonlinear statistics of structure formation with far greater fidelity than analytic prescriptions. We leverage these simula
354core_reconstruction
medium
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting
Multimodal & Language / VLM / MLLM
D. adjacent but useful contextgaussian_radiance; pose_calibration_localization; surface_occupancy; generation_editingdirect reconstruction/3DGS/4D title linked to core representation cluster
abstractBridging the gap between complex human instructions and precise 3D object grounding remains a significant challenge in vision and robotics. Existing 3D segmentation methods often struggle to interpret ambiguous, reasoning-based instructions, while 2D vision-language models that excel at such reasoning lack intrinsic 3D spatial understanding. In this paper, we introduce REALM, an innovative MLLM-agent framework that enables open-world reasoning-based segmentation without requiring extensive 3D-specific post-training. We perform segmentation directly on 3D Gaussian Splatting representations, capitalizing on their ability to render photorealistic novel views that are highly suitable for MLLM comprehension. As directly feeding one or more rendered views to the MLLM can lead to high sensitivity to viewpoint selection, we propose a novel Global-to-Local Spatial Grounding strategy. Specifically
355core_reconstruction
medium
GaussianPile: A Unified Sparse Gaussian Splatting Framework for Slice-based Volumetric Reconstruction
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; surface_occupancy3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractSlice-based volumetric imaging is widely applied and it demands representations that compress aggressively while preserving internal structure for analysis. This paper introduces GaussianPile, unifying 3D Gaussian splatting with an imaging system-aware focus model to address this challenge. Our new method introduces three key innovations: (i) a slice‑aware piling strategy that positions anisotropic 3D Gaussians to model through‑slice contributions, (ii) a differentiable projection operator that encodes the finite‑thickness point spread function of the imaging acquisition system, and (iii) a compact encoding and joint optimization pipeline that simultaneously reconstructs and compresses the Gaussian sets. Our CUDA-based design retains the compression and real‑time rendering efficiency of Gaussian primitives while preserving high‑frequency internal volumetric detail. Experiments on microsc
356core_reconstruction
medium
Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancydirect reconstruction/3DGS/4D title linked to core representation cluster
abstractProspective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, a Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments
357core_reconstruction
medium
MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Driving
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgaussian_radiance; depth_correspondence; robotics_mapping; data_benchmarkdirect reconstruction/3DGS/4D title linked to core representation cluster
abstractGenerative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance. To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We introduce "MeanFlow Identity", which models the mean velocity fiel
358core_reconstruction
medium
Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgaussian_radiance; surface_occupancy; robotics_mapping3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractImplicit neural representation (INR) based methods learn a continuous mapping from a low-resolution (LR) target magnetic resonance (MR) image and a high-resolution (HR) reference image to achieve arbitrary-scale super-resolution (SR). However, their inherent spectral bias favors learning low-frequency (LF) components, often failing to capture the sharp transitions at anatomical boundaries and resulting in the loss of high-frequency (HF) details. Inspired by 3D Gaussian splatting, we propose GaussM²ASR (Gaussian Multi-contrast MRI Arbitrary-scale Super-Resolution), which converts the challenging task of HF anatomical reconstruction into a smoother parameter optimization problem by learning the parameters of anisotropic 2D Gaussian kernels. To handle inter-contrast discrepancies, we introduce an anatomy-guided pipeline comprising three core modules: a Structure Prior Modulation Fusion (SPM
359core_reconstruction
medium
TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextpose_calibration_localization; dynamic_4d; robotics_mapping3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstract3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving for a global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully pl
360core_reconstruction
medium
DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures
Multimodal & Language / Agentic AI
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstract3D Gaussian Splatting achieves real-time photo-realistic rendering but struggles when training images contain transient objects that violate multi-view consistency. Existing methods face a fundamental dilemma: accurate transient detection requires well-reconstructed static scenes, yet clean reconstruction depends on reliable transient masks. This circular dependency causes persistent artifacts when both components are jointly optimized from poor initialization. We present DualSplat, a two-stage framework which sidesteps this dilemma by first generating pseudo masks from reconstruction failures, then using them to guide clean scene optimization. We observe that transient objects manifest as incomplete fragments during initial training, since they appear in only a subset of views. We consolidate these failures into pseudo masks via instance-level thresholding and a feature-residual filter
361core_reconstruction
medium
RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks
Robustness & Safety / Safety
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractAs a mainstream technique for 3D reconstruction, 3D Gaussian splatting (3DGS) has been applied in a wide range of applications and services. Recent studies have revealed critical vulnerabilities in this pipeline and introduced computation cost attacks that lead to malicious resource occupancies and even denial-of-service (DoS) conditions, thereby hindering the reliable deployment of 3DGS. In this paper, we propose the first effective and comprehensive black-box defense framework, named RemedyGS, against such computation cost attacks, safeguarding 3DGS reconstruction systems and services. Our pipeline comprises two key components: a detector to identify the attacked input images with poisoned textures and a purifier to recover the benign images from their attacked counterparts, mitigating the adverse effects of these attacks. Moreover, we incorporate adversarial training into the purifier
362core_reconstruction
medium
Eulerian Gaussian Splatting using Hashed Probability Pyramids
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgaussian_radiance; robotics_mapping3D Vision & Geometry paper with direct reconstruction title and abstract signal
abstractWe introduce a probabilistic splat-based radiance field framework that retains the fast rasterization and test-time efficiency of 3D Gaussian Splatting (3DGS) while replacing heuristic primitive manipulation with gradient-based optimization of a volumetric probability density. Rather than relocating, splitting, or culling Gaussians via hand-tuned densification (e.g., ADC), we treat primitive locations as samples drawn from a persistent, learnable density. We instantiate this density with a novel, memory-efficient multi-scale hierarchical grid that enables end-to-end gradient-based control over primitive population density. To stabilize stochastic training, we derive an unbiased gradient estimator with control variates that markedly reduces variance. By allowing probability mass to flow to where the loss demands, our method eliminates brittle priors and naturally explores the volume, achi
363strong_bridge
medium
Clone Deterministic 3D Worlds
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mappingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractA world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds, but neglect the need for high-fidelity modeling of deterministic scenarios (such as fixed-map mazes and static space robot navigation). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. 1) Through diagnostic experiment, we quantitatively demonstrate that high-fidelity cloning is feasible and the primary bottleneck for long-horizon fidelity is the geometric st
364strong_bridge
medium
NeuROK: Generative 4D Neural Object Kinematics
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkdynamic/4D paper with direct reconstruction signal
abstractData-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics (realistic temporal deformations of static objects under various physical conditions) remains challenging and often ad hoc despite being critical for building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space of all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameteriza
365strong_bridge
medium
SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancydynamic/4D paper with direct reconstruction signal
abstractLearning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency. We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors with this geodesic correspondence field supervision, (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space. This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing. SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy–efficiency trad
366strong_bridge
medium
Spatia: Video Generation with Updatable Spatial Memory
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recongeneral_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; generation_editingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractExisting video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory–aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic–static disentanglement design enhances spatial consistency throughout the generation process while preserving the model’s ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
367strong_bridge
medium
D-Prism: Differentiable Primitives for Structured Dynamic Modeling
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancydynamic/4D paper with direct reconstruction signal
abstractCapturing both geometry and rigid motion for structured dynamic objects, like multi-part assemblies or jointed mechanisms, remains a key challenge. Existing dynamic methods, such as deformable meshes or 3DGS, rely on unstructured representations and fail to jointly model suitable geometry and articulated motion. Primitive-based methods excel at structured static scenes, but their dynamic potential is still unexplored. We propose D-Prism, the first framework to achieve high-fidelity structured dynamic modeling by extending differentiable primitives to the dynamic domain. Specifically, we bind 3DGS to primitive surfaces, leveraging their respective strengths in appearance and geometry. We introduce a deformation network to control primitive motion, ensuring it accurately matches the object's movement. Furthermore, we design a novel adaptive control strategy to dynamically adjust primitive c
368strong_bridge
medium
Dark3R: Learning Structure from Motion in the Dark
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkGaussian/radiance representation linked to pose/mapping/metric bridge
abstractWe introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below $-4$ dB, a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher–student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy–clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson–Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes $\sim$42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structur
369strong_bridge
medium
Perceptual 3D Simulation With Physical World Modeling
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractPredicting how a scene will evolve after a desired 3D transformation from images is a central goal in vision, graphics, and robotics. Yet unlike ideal simulators with full access to 3D geometry and dynamics, real-world systems must rely on perceptual inputs and local actions that are inherently partial and incomplete. In this work, we present P3Sim, a physical world modeling system that simulates future scene states under both partial observations and incomplete 3D transformation signals. P3Sim is composed of three interacting components: a learned physical world model, a geometric conditioning module, and a persistent scene memory. The world model interprets perception as probabilistic inference over multimodal scene variables, providing predictions of the distributions of any scene variable conditioned on any combination of others. The geometric conditioning module provides a partial 3
370strong_bridge
medium
Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmarkdynamic/4D paper with direct reconstruction signal
abstractExisting dynamic scene rendering methods often adopt rigid-body or direction-limited assumptions, yet real-world motion and contact routinely violate these, producing artifacts near occlusion boundaries. To address this, we introduce a unified, source-aware framework for dynamic rendering that enforces the consistency of Gaussian primitives under explicit manifold constraints. We project predicted velocities onto physically grounded priors via efficient, parallel inner solves: (i) a Helmholtz parameterization that separates divergence-free and potential-flow motion components; (ii) an anisotropic, compressible directional prior; and (iii) an affine family that disentangles rotation from isotropic scaling. Experiments on extensive benchmarks show consistent improvements over state-of-the-art methods in reconstruction fidelity and temporal coherence. Our approach ensures physically real
371strong_bridge
medium
SceneTok: A Compressed, Diffusable Token Space for 3D Scenes
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; surface_occupancy; generation_editingGaussian/radiance representation linked to pose/mapping/metric bridge
abstractWe present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a light-weight rectified flow decoder. A diffusion transformer enables scene generation on the compressed token space. We show that the compression is two orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including
372strong_bridge
medium
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shiftgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractVideo world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretr
373strong_bridge
medium
ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes
3D Vision & Geometry / 3D Gaussian Splatting
B. bridge: reconstruction becomes mapping/world modelgaussian_radiance; dynamic_4d; robotics_mapping; data_benchmarkdynamic/4D paper with direct reconstruction signal
abstractMulti-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It is also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we re
374strong_bridge
high
DROID-SLAM in the Wild
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmarkpose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractWe present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 8 FPS. The source code will be publicly
375strong_bridge
high
HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmarkpose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractVisual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusi
376strong_bridge
medium
StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelpose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mappingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractWe propose StreamVLO, a streaming visual–LiDAR odometry framework that performs unified spatio-temporal correlation with Mamba models and tackles the long-standing cumulative drift problem via an online Cumulative Drift Compensation scheme for localization in 4D dynamic environments. Specifically, StreamVLO introduces a unified spatio-temporal correlation module built on Mamba to fuse heterogeneous visual and LiDAR cues across multi-frame clips, overcoming the limited temporal exploration of prior pairwise methods. Furthermore, a Cumulative Drift Compensation module minimizes cumulative drift by iteratively learning residual corrections from multiple historical frames in a causal manner. To strengthen spatial feature representation on salient regions, we adopt a Keypoint-Aware Auxiliary Loss with a winner-takes-all strategy. StreamVLO achieves state-of-the-art performance on two commonly
377strong_bridge
high
VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelpose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmarkpose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractText-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird’s-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mecha
378strong_bridge
high
VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelpose_calibration_localization; depth_correspondence; robotics_mappingpose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractAccurate global localization is crucial for autonomous driving and robotics, especially in dense urban environments where GNSS is often unreliable due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. To address this challenge, we propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to r
379strong_bridge
high
Similarity-Consistent Likelihood Diffusion enables Hidden Person Detection from Wall Reflections
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelpose_calibration_localization; robotics_mappingpose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractThis paper studies passive non-line-of-sight corner-camera detection and human localization using faint indirect reflections on a visible wall. The challenge is twofold: multi-exposure wall observations are unstable and entangled with sensor nonlinearities, and mapping these observations to a hidden-view RGB image is severely underdetermined, making purely discriminative regressors brittle and unconstrained diffusion priors stochastic. To address these challenges, we introduce the Similarity-Likelihood Diffusion Network (SLD-Net), a two-stage framework that produces measurement-consistent, deterministic reconstructions. First, DeLi-Inversion forms an exposure-aware differential representation and jointly predicts an initial reconstruction and a pixel-wise precision map, yielding a heteroscedastic pseudo-likelihood. Second, SiCo-Diffusion injects this likelihood as precision-weighted ener
380strong_bridge
high
Uncertainty-Aware Modality Fusion for Unaligned RGB-T Salient Object Detection
3D Vision & Geometry / Pose Estimation
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localization; surface_occupancypose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractUnaligned RGB-T salient object detection (SOD) remains challenging due to severe cross-modal spatial discrepancies and unreliable feature fusion. Existing methods often assume perfect alignment or rely on geometric registration, which is computationally demanding and sensitive to cross-modal inconsistencies. To address these limitations, we propose an uncertainty-aware modality fusion network (UMFNet) that reformulates RGB-T SOD as an uncertainty-aware representation learning problem. Specifically, the proposed uncertainty alignment module (UAM) models pixel-wise features as Gaussian latent distributions to estimate local uncertainty and identify cross-modal consistency regions within the feature space, thereby achieving implicit alignment without explicit registration. Furthermore, the confidence-guided global modulation (CGM) mechanism leverages confidence maps derived from uncertainty
381strong_bridge
high
AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmarkpose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractSingle-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-fly rendered object features and obser
382strong_bridge
high
CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmarkpose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractScene Coordinate Regression (SCR) has emerged as a memory-efficient paradigm for visual localization. While SCR has demonstrated performance comparable to classic feature matching based approaches in small-scale scenes, it has consistently underperformed in large-scale environments. Large-scale localization is hampered by two challenges: sparse co-visibility and local appearance ambiguity. In this work, we propose **CoLoR**, a novel training framework tailored for large-scale SCR. First, we explicitly and efficiently partition scene points into multi-view and single-view sets and introduce a two-stage bootstrapping paradigm to provide complete and strong supervision for all points. Second, we propose a multi-granularity retrieval feature, which unifies the conventional global and local features as retrieval-oriented representations at the image and pixel levels, respectively, to enforce featu
383strong_bridge
medium
Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; dynamic_4d; generation_editing; data_benchmarkdynamic/4D paper with direct reconstruction signal
abstractWe introduce Diff4Splat, a feed-forward framework for dynamic scene generation from a single image. Our method synergizes the powerful generative priors of video diffusion models with geometric and motion constraints learned from a large-scale 4D dataset. Given a single image, a camera trajectory, and an optional text prompt, our model directly predicts a dynamic scene represented by a deformable 3D Gaussian field. This approach captures appearance, geometry, and motion in a single pass, eliminating the need for test-time optimization or post-hoc processing. At the core of our framework is a video latent transformer that enhances existing video diffusion models, enabling them to jointly model spatio-temporal dependencies and predict 3D Gaussian Primitives over time. Supervised by objectives targeting appearance fidelity, geometric accuracy, and motion consistency, Diff4Splat generates hi
384strong_bridge
medium
Event6D: Event-based Novel Object 6D Pose Tracking
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmarkdynamic/4D paper with direct reconstruction signal
abstractEvent cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclus
385strong_bridge
high
PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmarkpose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractWe present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering a live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from the core localization thread, ensuring low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps). This dataset enables the training of a
386strong_bridge
medium
ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractRecent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and VLMs to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strateg
387strong_bridge
medium
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; dynamic_4d; generation_editing; data_benchmarkdynamic/4D paper with direct reconstruction signal
abstractSignificant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect **SpatialVID**, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subseq
388strong_bridge
high
Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; surface_occupancypose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractLearning-based structure-from-motion methods such as ACE-Zero have demonstrated strong performance in estimating camera poses and scene coordinates from unordered image collections without requiring ground truth supervision. However, the lack of global and multi-view consistency constraints in ACE-Zero can lead to pose drift and misalignment, particularly in complex or ambiguous scenes. In this work, we propose a hybrid framework that integrates pose graph optimization (PGO) into ACE-Zero to refine camera poses and suppress incorrect refinements. We construct pose graphs directly from ACE-Zero outputs by extracting relative pose constraints from predicted scene coordinates. Furthermore, we introduce an uncertainty-aware optimization strategy by estimating confidence scores using geometric priors, including epipolar and optical flow consistencies across views. Our approach improves the ro
389strong_bridge
high
Sparse–View Localization via Online Neural 3D Regression
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondencepose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractWe present ON3R, an online-trained neural regressor addressing sparse-view structureless localization, where database images have limited visual overlap and no prebuilt 3D map. Given any sparse matches between a query and a $K$-tuple of posed database views, ON3R predicts 3D coordinates for matched query keypoints, supervised by database reprojection residuals and a monocular depth prior. Afterwards, the absolute pose of the query is estimated via P3P-RANSAC and refined with lightweight bundle adjustment. Across MegaDepth, Cambridge Landmarks, and a sparsified version of Aachen Day-Night, ON3R outperforms existing methods. ON3R is particularly effective when the data is extremely sparse: we focus on $K\leq10$ database images. The code, data splits, and SfM models will be made available for full reproducibility.
390strong_bridge
high
JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; surface_occupancypose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractIn this paper, JUMP-Hand is proposed as a novel method for multi-view 3D hand reconstruction, which is the first to introduce probabilistic joint-wise uncertainty as an explicit gating mechanism to fuse multi-view information. Existing approaches usually fuse multi-view information by naïve pooling or implicit attention. However, they overlook that each hand joint exhibits varying visibility and reliability across views, which may degrade performance by indiscriminately aggregating noisy or unreliable information. For instance, one joint may be clearly visible in one view, while another joint is occluded in that view but visible in a different view. In contrast, JUMP-Hand addresses this by introducing the core insight of Mixture of Experts (MoE) and regarding each 2D view as an expert. The key idea is that the reliability of each view expert is quantified through joint-wise uncertainty modeling,
391strong_bridge
medium
MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; dynamic_4d; data_benchmarkdynamic/4D paper with direct reconstruction signal
abstractWe present MoVieS, a motion-aware view synthesis model that reconstructs 4D dynamic scenes from monocular videos in one second. It represents dynamic 3D scenes with pixel-aligned Gaussian primitives and explicitly supervises their time-varying motions. This allows, for the first time, the unified modeling of appearance, geometry and motion from monocular videos, and enables reconstruction, view synthesis and 3D point tracking within a single learning-based framework. By bridging view synthesis with geometry reconstruction, MoVieS enables large-scale training on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive p
392strong_bridge
medium
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; dynamic_4d; surface_occupancydynamic/4D paper with direct reconstruction signal
abstractWe introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject's complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality,
393strong_bridge
medium
Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; dynamic_4d; data_benchmarkdynamic/4D paper with direct reconstruction signal
abstractForecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution. We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation. MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations. Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution. Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-ter
394strong_bridge
medium
EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; dynamic_4d; generation_editingdynamic/4D paper with direct reconstruction signal
abstractRecent photo-realistic 3D talking heads built with 3D Gaussian Splatting still have significant shortcomings in emotional expression manipulation, especially for fine-grained and expansive dynamic emotional editing using multi-modal control. This paper introduces a new editable 3D Gaussian talking head, i.e., EmoDiffTalk. Our key idea is a novel Emotion-aware Gaussian Diffusion, which includes an action unit (AU) prompt Gaussian diffusion process for fine-grained facial animator, and moreover an accurate text-to-AU emotion controller to provide accurate and expansive dynamic emotional editing using text input. Experiments on public EmoTalk3D and RenderMe-360 datasets demonstrate superior emotional subtlety, lip-sync fidelity, and controllability of our EmoDiffTalk over previous works, establishing a principled pathway toward high-quality, diffusion-driven, multimodal editable 3D talking-head synth
395strong_bridge
high
SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; surface_occupancypose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractImage-to-point cloud matching (2D-3D matching) establishes accurate correspondences between image keypoints and 3D points for 6-DoF camera pose estimation. Existing methods either suffer from poor generalization due to scene-specific coordinate regression requiring per-scene retraining, or incur high storage and maintenance costs from descriptor-based matching that relies on large descriptor sets. Consequently, descriptor-free approaches have gained attention by avoiding heavy storage while improving generalizability; however, most rely only on low-level geometric cues, which limits performance. Leveraging the benefits of semantics in providing context, resolving ambiguities, and enhancing robustness in challenging scenes, we propose the Semantic-Aware Guided Graph Neural Network (SAG-GNN), integrating high-level semantics into descriptor-free 2D-3D matching. Specifically, we design a co
396strong_bridge
high
HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localizationpose/localization bridge genus=Pose Estimation with reconstruction/map signal
abstractRecovering global human and camera motion from monocular video is essential for world-coordinate human reconstruction but remains challenging due to entangled motions in image space. Traditional SLAM methods estimate monocular camera motion but fail in scenes dominated by foreground objects such as humans. A common workaround is to mask out dynamic objects, yet this approach becomes brittle when humans occupy most of the view or the background is too noisy, leading to unstable tracking and loss of constraints. This paper takes the opposite stance and reintegrates human motion as informative landmarks. We introduce HumanBA, a human-aware bundle adjustment framework that transforms dynamic humans into usable constraints via motion decoupling. HumanBA subtracts the human-induced component from observed joint trajectories, isolating a camera-induced (pseudo-static) component that can be saf
397strong_bridge
medium
PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; dynamic_4ddynamic/4D paper with direct reconstruction signal
abstractPhysically plausible reconstruction of human–object dynamics from a single video remains under-explored in physics-based methods. Most prior approaches omit human-generated internal actuation by assuming motion driven solely by gravity and simple contacts. They also rely on idealized constitutive laws that underfit heterogeneous and anisotropic materials. We introduce PhysHO, which tightly couples SMPL-driven Linear Blend Skinning (LBS) with a Material Point Method (MPM) simulator to address these gaps. Our key insight is to use LBS as an interpretable actuation prior and MPM to propagate those forces through contact under physical constraints. Concretely, we derive targeted actuation with a PD controller guided by LBS trajectories and gate it per particle via a learnable LBS-impact factor so that only particles inside the SMPL volume are directly actuated. We model real materials with r
398strong_bridge
medium
CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractAutonomous driving must operate reliably across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields ~500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other publi
399strong_bridge
medium
Dexterous World Models
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractRecent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static—limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion model enabling embodied interaction within static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human–scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues in the egocentric view to model action-conditioned dynamics directly. We train our model on a synthetic h
400strong_bridge
medium
DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mappingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractVision-based autonomous driving has gained much attention due to its low costs and excellent performance. Compared with dense BEV (Bird’s Eye View) or sparse query models, Gaussian-centric method is a comprehensive yet sparse representation by describing scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic Gaussian-centric pre-training in autonomous driving using two stages. In the first stage, DLWM predicts 3D Gaussians from queries by self-supervised reconstructing multi-view semantic and depth images. Equipped with fine-grained contextual features, in the second stage, two latent world models are trained separately for temporal feature learning, including Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent pr
401strong_bridge
medium
GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractPhysics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We furth
402strong_bridge
medium
GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractReliable 3D perception from multi-view roadside sensors hinges on the robust fusion of camera and LiDAR data, a task complicated by geometric misalignments and sensor calibration errors. This paper presents GSV2X, a fusion framework that tackles these challenges through two core contributions. First, to achieve robustness against spatial uncertainty, we lift 2D image features into a unified Bird's-Eye-View (BEV) space by representing them as 3D Gaussian distributions. By incorporating learnable perturbations guided by camera geometry, our model explicitly accounts for potential calibration inaccuracies. Second, to maximize the synergy between modalities, we propose a new orthogonal fusion module. This module employs constrained attention to enforce orthogonality between camera and LiDAR features, effectively disentangling redundant information and promoting the learning of complementary
403strong_bridge
medium
NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractIn this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks.
404strong_bridge
medium
ORV: 4D Occupancy-centric Robot Video Generation
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractRecent embodied intelligence suffers from data scarcity, while conventional simulators lack visual realism. Controllable video generation is emerging as a promising data engine, yet current action-conditioned methods still fall short: generated videos are limited in fidelity and temporal consistency, poorly aligned with controls, and often constrained to single-view settings. We attribute these issues to the representational gap between sparse control inputs and dense pixel outputs. Thus, we introduce ORV, a 4D occupancy-centric framework for robot video generation that couples action priors with occupancy-derived visual priors. Concretely, we align chunked 7-DoF actions with video latents via an Action-Expert AdaLN modulation, and inject 2D renderings of 4D semantic occupancy into the generation process as soft guidance. Meanwhile, a central obstacle is the lack of occupancy data for emb
405strong_bridge
medium
Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; generation_editing; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractRecent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally,
406strong_bridge
medium
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkGaussian/radiance representation linked to pose/mapping/metric bridge
abstractRobust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for system validation and training purposes. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite that we refer to as the AV log, which includes multi-view camera images and LiDAR point
407strong_bridge
medium
Stereo World Model
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; generation_editingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractWe present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower com
408strong_bridge
medium
U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractModeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining a
409strong_bridge
medium
Unified Camera Positional Encoding for Controlled Video Generation
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; robotics_mapping; generation_editing; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractTransformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full con
410strong_bridge
medium
UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; robotics_mapping; generation_editing; data_benchmarkGaussian/radiance representation linked to pose/mapping/metric bridge
abstractRecent progress in 3D reconstruction, such as NeRFs and 3D Gaussian Splatting, has made it easy to recover geometry and appearance from images. However, these static representations remain blind to the physics that govern how objects deform and respond to forces. Building interactive 3D worlds therefore requires predicting not only shape but the underlying material properties. Prior approaches either rely on slow test-time optimization or, more recently, a fast feed-forward predictor such as Pixie. However, these models produce only a single point estimate of physical parameters and are limited to a single simulation backend, restricting both expressiveness and portability. We introduce UniPixie, a generative physics-from-pixels framework that overcomes both limitations. UniPixie predicts a controllable, continuous soft-to-stiff distribution of plausible material properties from a single
411strong_bridge
medium
WRIVINDER: Towards Spatial Intelligence for Geo-locating Ground Images onto Satellite Imagery
Remote Sensing & Earth / Remote Sensing
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractAligning ground-level imagery with geo-registered satellite maps is crucial for mapping, navigation, and situational awareness, yet remains challenging under large viewpoint gaps or when GPS is unreliable. We introduce Wrivinder, a zero-shot, geometry-driven framework that aggregates multiple ground photographs to reconstruct a consistent 3D scene and align it with overhead satellite imagery. Wrivinder combines SfM reconstruction, 3D Gaussian Splatting, semantic grounding, and monocular depth–based metric cues to produce a stable zenith-view rendering that can be directly matched to satellite context for metrically accurate camera geo-localization. To support systematic evaluation of this task—which lacks suitable benchmarks—we also release MC-Sat, a curated dataset linking multi-view ground imagery with geo-registered satellite tiles across diverse outdoor environments. Together, Wrivin
412strong_bridge
medium
HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgaussian_radiance; dynamic_4d; robotics_mapping; generation_editing; data_benchmarkGaussian/radiance representation linked to pose/mapping/metric bridge
abstractControllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose HorizonSuite, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian Splatting delivers substantially higher fidelity than a
413strong_bridge
medium
GEM: Generating LiDAR World Model via Deformable Mamba
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextdynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractWorld models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose GEM: a Generative LiDAR world model that leverages dEformable Mamba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separ
414strong_bridge
medium
An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractPanoptic occupancy prediction aims to jointly infer voxel-wise semantics and instance identities within a unified 3D scene representation. Nevertheless, progress in this field remains constrained by the absence of high-quality 3D mesh resources, instance-level annotations, and physically consistent occupancy datasets. Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. Specifically, we introduce ADMesh, the first unified 3D mesh library tailored for autonomous driving, which integrates over 15K high-quality 3D models with diverse textures and rich semantic
415strong_bridge
medium
Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; robotics_mapping; data_benchmarkGaussian/radiance representation linked to pose/mapping/metric bridge
abstractReproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning
416strong_bridge
medium
Multi-Scale Gaussian-Language Map for Embodied Navigation and Reasoning
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mappingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractUnderstanding the geometric and semantic structure of environments is essential for embodied agents. Existing semantic mapping methods trade off between explicit geometry and multi-scale semantics, and lack a native interface for large models, thus requiring additional training of feature projection for semantic alignment. To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance and region level concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. The 3D Gaussians enable compact storage and fast rendering of task-relevant images via Gaussian splatting. To enable efficient incremental construction, we further propose a Gaussian Estimator that analytically derives Gaussian par
417strong_bridge
medium
Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras
Computational Imaging / Computational Imaging
D. adjacent but useful contextpose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractConventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further intro
418strong_bridge
medium
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mappingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractAlthough multimodal large language models (MLLMs) have shown remarkable capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs within a unified framework remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework for autonomous driving, performing spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through joint optimization. We term it 4D MLLM as it outputs both 3D occupancy and flow, capturing fine-grained spatial-temporal dynamics. Specifically, to capture both precise geometric information and rich appearance, our approach integrates point clouds, multi-view images and language instructions within a single MLLM architecture. Remarkably, desp
419strong_bridge
medium
OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractSemantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, indu
420strong_bridge
medium
Spatial Retrieval Augmented Autonomous Driving
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractExisting autonomous driving systems rely on onboard sensors (cameras, LiDAR, IMU, etc.) for environmental perception. However, this paradigm is limited by the drive-time perception horizon and often fails under limited view scope, occlusion or extreme conditions such as darkness and rain. In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this "recall" ability, we propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. These images are easy to obtain from offline caches (e.g., Google Maps or stored autonomous driving datasets) without requiring additional sensors, making it a plug-and-play extension for existing AD stacks. For experiments, we first extend the nuScenes dataset with geographic images retrieved via Google Maps APIs and align the new data with ego-vehicle trajector
421strong_bridge
medium
LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4d; robotics_mappingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractAccurate metric depth is critical for autonomous driving perception and simulation, yet current approaches struggle to achieve high metric accuracy, multi-view and temporal consistency, and cross-domain generalization. To address these challenges, we present MVS-Pro, a novel multi-view stereo framework that reconciles these competing objectives through two key insights: (1) Sparse but metrically accurate LiDAR observations can serve as geometric prompts to anchor depth estimation in absolute scale, and (2) deep fusion of diverse cues is essential for resolving ambiguities and enhancing robustness, while a spatio-temporal decoder ensures consistency across frames. Built upon these principles, MVS-Pro embeds the LiDAR prompt in two ways: as a hard geometric prior anchoring the cost volume, and as soft feature-wise guidance fused by a triple cues combiner. As for temporal consistency, MVS-Pro
422strong_bridge
medium
Scene Reconstruction as Mapping Priors for 3D Detection
Detection & Tracking / Detection
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractIn autonomous driving, mapping is critical for motion planning but remains an under-utilized resource for perception tasks like 3D object detection. Maps can provide robust structural priors of the static environment, suited to resolving ambiguities and correcting for sensor data sparsity or noise — issues especially prevalent for distant objects or during adverse weather conditions. However, conventional High-Definition (HD) maps are resource-intensive to obtain and maintain, which presents a challenge for achieving efficient, large-scale deployment. In this paper, we propose a scalable solution to systemically leverage mapping to improve 3D detection by overcoming two primary challenges. First, we introduce a pipeline to automatically build dense mapping priors from aggregated sensor data, eliminating the need for human labeling. Second, we design a novel Mapping Prior Augmented 3D det
423strong_bridge
medium
URScenes: A Multi-scenario Dataset for Unstructured Road Environments
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractAs autonomous driving technology transitions from small-scale validation to large-scale deployment, its development in unstructured road environments has become a critical and inevitable trend. Autonomous vehicles increasingly rely on high-quality and diverse datasets for perception systems. However, existing public datasets predominantly focus on clear-weather and urban-road scenarios, leaving a significant gap in the coverage of unstructured road environments. To bridge this gap, we construct URScenes, the first multi-scenario, open-source perception dataset for unstructured road environments. The dataset consists of 472 scenes, each lasting 30 seconds, and provides over 28K annotated samples and 119K sweeps. URScenes, for the first time, covers eight typical scenarios, including rainy, snowy, foggy, dusty, glare, night, cloudy, and sunny conditions. Additionally, URScenes supports mul
424strong_bridge
medium
QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextdynamic_4d; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractLearning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we intr
425strong_bridge
medium
UniGeoRS: A Unified Benchmark for Tri-view Geo-Localization
Remote Sensing & Earth / Remote Sensing
D. adjacent but useful contextpose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractCross-view geo-localization (CVGL) aims to estimate an image’s geographic location by matching it with geo-referenced images from different viewpoints, supporting applications such as autonomous driving, UAV navigation, and visual surveillance. However, due to the high cost of image collection, current CVGL datasets often suffer from limited diversity in both drone and ground imagery, which constrains model generalization. Furthermore, existing methods primarily focus on either ground-to-satellite or drone-to-satellite matching, lacking a unified framework capable of handling image matching across all three platforms: satellite, drone, and ground. To this end, we introduce the Unified Geo-localization dataset with Real-world and Synthetic imagery (UniGeoRS), a comprehensive benchmark featuring satellite, drone, and ground-view images, with a particular emphasis on the richness and divers
426strong_bridge
medium
NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextdynamic_4d; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractEmbodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single, unified framework. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-t
427strong_bridge
medium
OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mappingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractOffline vectorized maps constitute critical infrastructure for high-precision autonomous driving and mapping services. Existing approaches rely predominantly on single ego-vehicle trajectories, which fundamentally suffer from viewpoint insufficiency: while memory-based methods extend observation time by aggregating ego-trajectory frames, they lack the spatial diversity needed to reveal occluded regions. Incorporating views from surrounding vehicles offers complementary perspectives, yet naive fusion introduces three key challenges: computational cost from large candidate pools, redundancy from near-collinear viewpoints, and noise from pose errors and occlusion artifacts. We present OptiMVMap, which reformulates multi-vehicle mapping as a select-then-fuse problem to address these challenges systematically. An Optimal Vehicle Selection (OVS) module strategically identifies a compact subset
428strong_bridge
medium
Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextdepth_correspondence; surface_occupancy; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstract3D occupancy prediction is crucial for autonomous driving perception, offering comprehensive geometric scene understanding and semantic recognition. However, existing methods struggle with geometric misalignment in view transformation due to lack of pixel-level accurate depth estimation, and severe spatial class imbalance where semantic categories exhibit strong spatial anisotropy. To address these challenges, we propose Dr.Occ, a depth- and region-guided occupancy prediction framework. Specifically, we introduce a depth-guided 2D-to-3D View Transformer (D$^2$-VFormer) that effectively leverages high-quality dense depth cues from MoGe-2 to construct reliable geometric priors, thereby enabling precise geometric alignment of voxel features. Moreover, inspired by the Mixture-of-Experts (MoE) framework, we propose a region-guided Expert Transformer (R/R$^2$-EFormer) that adaptively allocate
429strong_bridge
medium
Complementary Prototype Mapping for Efficient Multimodal Anomaly Detection
Detection & Tracking / Detection
D. adjacent but useful contextpose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractMultimodal unsupervised anomaly detection has garnered increasing attention for robust defect localization. Recent approaches rely on establishing cross-modal matching relationships under normal conditions without explicit guidance. However, in practice, a single modality may have multiple distinct representations corresponding to another modality, and such unconditional mappings struggle to adaptively capture these variations, resulting in mapping ambiguity and the misclassification of diverse yet normal variations as anomalies. Moreover, existing methods suffer from slow inference speed and high memory overhead, hindering their deployment in real-world production lines. To address these issues, we propose an efficient and effective Complementary Prototype Mapping (CPMAD) framework, which dynamically extracts consensus and supplementary prototypes to serve as complementary prior
430strong_bridge
medium
PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence
Recognition & Classification / Retrieval
D. adjacent but useful contextpose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractCross-view geo-localization is a critical task for UAV navigation, event detection, and aerial surveying, which establishes correspondence between drone-captured and satellite imagery. Most existing approaches embed cross-view data into a joint feature space to maximize similarity between paired images. However, these methods typically assume perfect alignment of image pairs in training data, an assumption that rarely holds in practical scenarios. In real-world conditions, factors such as urban canyon effects, electromagnetic interference, and adverse weather frequently induce GPS drift, resulting in systematic alignment shifts where only partial correspondences exist between image pairs. Despite its prevalence, this source of noisy correspondence has received limited attention in current research. To our best knowledge, this work presents the first systematic investigation of the Noi
431strong_bridge
medium
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractEnd-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensor data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that expert specialization enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. First, we introduce Drive-π0, a Vision-Language-Action (VLA) baseline adapted from Embodied AI for autonomous driving, which serves as the foundation model for DriveMoE. Building on this, we strengthen perception through a carefully designed Vision MoE, where a router adaptively selects context-relevant camera views. This mechanism is inspired by human driving cognition, in which
432strong_bridge
medium
Think Before You Drive: World Model-Inspired Multimodal Grounding
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextpose_calibration_localization; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractInterpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods in AD struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we presen
433strong_bridge
medium
NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks
Learning Algorithms / Optimization
D. adjacent but useful contextdepth_correspondence; surface_occupancy; robotics_mappingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractPartially Supervised Multi-Task Learning (PS-MTL) aims to leverage knowledge across tasks when annotations are incomplete. Existing approaches, however, have largely focused on the simpler setting of homogeneous, dense prediction tasks, leaving the more realistic challenge of learning from structurally diverse tasks unexplored. To this end, we introduce NexusFlow, a novel, lightweight, and plug-and-play framework effective in both settings. NexusFlow introduces a set of surrogate networks with invertible coupling layers to align the latent feature distributions of tasks, creating a unified representation that enables effective knowledge transfer. The coupling layers are bijective, preserving information while mapping features into a shared canonical space. This invertibility avoids representational collapse and enables alignment across structurally different tasks without reducing expres
434strong_bridge
medium
ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextdepth_correspondence; surface_occupancy; robotics_mappingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstract3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it
435strong_bridge
medium
Lipschitz Optimization for Formal Verification of Homographies
Robustness & Safety / Safety
D. adjacent but useful contextpose_calibration_localization; robotics_mapping; data_benchmarksystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractThe adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, aerospace, and autonomous vehicles. However, current approaches are confined to incomplete statistical verification, or robustness to $\ell_p$-norm or affine transforms which represent a limited subset of perturbations to the image formation process. In this paper, we present a formal verification approach when the capturing camera undergoes 3D motion perturbations. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. While our formulae are grounded in the vision-based landing problem, they gene
436strong_bridge
medium
WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextpose_calibration_localization; robotics_mappingsystem bridge signal: pose/localization/mapping/world-model plus reconstruction representation
abstractCollaborative perception is vital for autonomous driving yet remains constrained by tight communication budgets. Earlier work reduced bandwidth by compressing full feature maps with fixed-rate encoders, which adapts poorly to a changing environment, and it further evolved into spatial selection methods that improve efficiency by focusing on salient regions, but this object-centric approach often sacrifices global context, weakening holistic scene understanding. To overcome these limitations, we introduce WhisperNet, a bandwidth-aware framework that proposes a novel, receiver-centric paradigm for global coordination across agents. Senders generate lightweight saliency metadata, while the receiver formulates a global request plan that dynamically budgets feature contributions across agents and features, retrieving only the most informative features. A collaborative feature routing module t
437adjacent_context
low
AVGGT: Rethinking Global Attention for Accelerating VGGT
Learning Algorithms / Efficient Models
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; depth_correspondence; surface_occupancyadjacent genus=Efficient Models; useful only if manually connected to reconstruction
abstractSince DUSt3R, models such as VGGT and $\pi^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $\pi^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by sub
438adjacent_context
low
CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
Generative Models / Video Generation
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractCinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: By encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional con
439adjacent_context
low
Group Editing: Edit Multiple Images in One Go
Generative Models / Image Editing
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmarkadjacent genus=Image Editing with no direct reconstruction/SLAM/map signal in title
abstractIn this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two t
440adjacent_context
low
MuM: Multi-View Masked Image Modeling for 3D Vision
Learning Algorithms / Self-supervised
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondenceadjacent genus=Self-supervised with no direct reconstruction/SLAM/map signal in title
abstractSelf-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensive
441adjacent_context
low
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
Segmentation & Dense Prediction / Segmentation
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmarkadjacent genus=Segmentation; useful only if manually connected to reconstruction
abstractInstance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in t
442adjacent_context
medium
Any Resolution Any Geometry: From Multi-View To Multi-Patch
Robustness & Safety / Robustness
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; depth_correspondence; surface_occupancyeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractJoint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. We address this challenge by adapting the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth--normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sa
443adjacent_context
low
MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
Multimodal & Language / Grounding
A. thesis anchor: VGGT/feed-forward geometryvggt_lineage; general_reconstruction; surface_occupancy; data_benchmarkadjacent genus=Grounding; useful only if manually connected to reconstruction
abstractMost existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier,
444 | adjacent_context
low
Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
Recognition & Classification / Retrieval
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence | adjacent genus=Retrieval; useful only if manually connected to reconstruction
abstractCross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform geo-spatial tasks, CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effect
445 | adjacent_context
low
Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
Learning Algorithms / Efficient Models
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; general_reconstruction; depth_correspondence; data_benchmark | adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title
abstractEfficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, $\pi^3$ and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method acc
446 | adjacent_context
low
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Multimodal & Language / VLM / MLLM
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; general_reconstruction; surface_occupancy; generation_editing | adjacent genus=VLM / MLLM; useful only if manually connected to reconstruction
abstractVision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experim
447 | adjacent_context
low
LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging
Learning Algorithms / Efficient Models
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; general_reconstruction; data_benchmark | adjacent genus=Efficient Models; useful only if manually connected to reconstruction
abstract3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, it is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10× speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: 1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; 2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token’s geometric importance, optimizing anchor token selection to better pr
448 | adjacent_context
low
Rewis3d: Reconstruction for Weakly-Supervised Semantic Segmentation
Segmentation & Dense Prediction / Segmentation
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; general_reconstruction; surface_occupancy | adjacent genus=Segmentation; useful only if manually connected to reconstruction
abstractWe present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly-supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student–teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geo
449 | adjacent_context
medium
Sky2Ground: A Benchmark for Site Modeling under Varying Altitude
Remote Sensing & Earth / Remote Sensing
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; depth_correspondence; surface_occupancy; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractIn this work, we propose the problem of localizing cameras and producing renders of a scene, given multiple images captured from ground/aerial/satellite viewpoints. We introduce a dataset called Sky2Ground, which contains synthetic/real images across all 3 viewpoints, along with camera parameters, and dense depth-maps/surface-normals. Recent works have shown that transformer-based nets like VGGT are capable of inferring scene-parameters in a single-forward pass. However, we formally reveal that simply fine-tuning such models reduces performance, and can't be solved simply by bruteforce-scaling. We find the culprit to be satellite images, which inject too much noise during the learning process. Therefore, we propose SkyNet to enable learning using satellite-images. SkyNet is a two-stream neural-net, with one stream explicitly processing satellite, and another processing all modalities tog
450 | adjacent_context
medium
3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
3D Vision & Geometry / Point Cloud
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; surface_occupancy; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractDespite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds from unlabeled videos. We first introduce \data, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D
451 | adjacent_context
low
GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
Segmentation & Dense Prediction / Segmentation
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; general_reconstruction; surface_occupancy | adjacent genus=Segmentation with no direct reconstruction/SLAM/map signal in title
abstractWe introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation that casts the task as multi-view 2D mask prediction. Given a textureless object, we render normal and point maps from predefined viewpoints and accept simple 2D prompts—clicks or boxes—to guide part selection. These prompts are processed by a shared SAM2 backbone augmented with LoRA and residual geometry fusion, enabling view-specific reasoning while preserving pretrained priors. The predicted masks are back-projected to the object, aggregated across views. Our method enables fine-grained, part-specific control without requiring text prompts, per-shape optimization, or full 3D labels. In contrast to global clustering or scale-based methods, prompts are explicit, spatially grounded, and interpretable. We achieve state-of-the-art class-agnostic performance on PartObjaverse-Tiny and PartNetE, outperforming both s
452 | adjacent_context
medium
SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
Robotics & Embodied AI / Embodied AI
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; dynamic_4d; robotics_mapping | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractVision–Language–Action (VLA) models built on pretrained Vision–Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that incrementally extracts 4D features from 2D images. Then, to enhance the VLM’s ability to exploit both 2D images and 4D features, we introduce \textit{Fusion Tokens}, a set of learnable tokens
453 | adjacent_context
low
Co-Me: Confidence Guided Token Merging for Visual Geometric Transformers
Learning Algorithms / Efficient Models
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; general_reconstruction | adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title
abstractWe propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me employs a light-weight distilled confidence predictor to rank tokens and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
454 | adjacent_context
low
Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models
Learning Algorithms / Efficient Models
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; data_benchmark | adjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title
abstractWith the emergence of 3D foundation models, such as DUSt3R, VGGT, and their variants, there is a growing interest in fine-tuning them for various downstream tasks, where using LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in geometry, texture, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA sub-spaces associated with each type of variation? 2) Are these sub-spaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA sub-space associated with each type of variation. We show that these sub-spaces are approximately disentangled. Integrating them leads to a reduced LoRA
455 | adjacent_context
low
Towards Hierarchical 3D Spatial Understanding in Vision-Language Models
Multimodal & Language / VLM / MLLM
A. thesis anchor: VGGT/feed-forward geometry | vggt_lineage; data_benchmark | adjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title
abstractAchieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D space, and performing high-level spatial reasoning. In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex stages, from geometric perception to abstract spatial reasoning. Guided by this framework, we construct an automated pipeline that generates over 1 billion 3D spatial VQA pairs across diverse tasks and scenes for VLM supervised finetuning. We also develop an RGB-D VLM that incorporates metric-scale point maps as auxiliary inputs to further enhance spatial understanding. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple spatial understanding and reasoni
456 | adjacent_context
medium
Captain Safari: A Real-time World Engine
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWorld engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dyn
457 | adjacent_context
medium
ESAM++: Efficient Online 3D Perception on the Edge
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractOnline 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently capture
458 | adjacent_context
medium
Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy; generation_editing | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractDiffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization pattern
459 | adjacent_context
medium
Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractOne of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present a framework that learns pixel-accurate feature descriptors from videos, LILA. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We dem
460 | adjacent_context
medium
GeoWorld: Geometric World Models
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractEnergy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive expe
461 | adjacent_context
medium
Order Matters: 3D Shape Generation from Sequential VR Sketches
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractVR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD software. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for 3D shape generation from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates ordered VR sketches from arbitrary shapes, (ii) a dataset comprising over 20k synthetic and 900 hand-drawn sketch–shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work and generalizes effectively from synthetic to real sketches with minimal supervision. All data and models will
462 | adjacent_context
medium
RenderFlow: Single-Step Neural Rendering via Flow Matching
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractConventional physically-based rendering (PBR) pipelines generate photorealistic images through computationally expensive light transport simulations. Although recent deep learning approaches leverage diffusion model priors with geometry buffers (G-buffers) to produce visually compelling results without explicit scene geometry or light simulation, they remain constrained by two major limitations. First, the iterative nature of the diffusion process introduces substantial latency. Second, the inherent stochasticity of these generative models compromises physical accuracy and temporal consistency. In response to these challenges, we propose a novel, end-to-end, deterministic single-step neural rendering framework \textit{RenderFlow} built upon a flow matching paradigm. To further strengthen both rendering quality and generalization, we propose an efficient and effective module for sparse ke
463 | adjacent_context
medium
Spatial Matters: Position-Guided 3D Referring Expression Segmentation
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstract3D Referring Expression segmentation (3D-RES) is an emerging field that segments 3D objects in point cloud scenes based on given referring expressions. Although existing methods have achieved substantial progress, they primarily focus on semantic cues and often overlook spatial relations, which are essential for segmenting the referred objects in complex 3D scenes, especially those containing multiple visually similar instances. In this paper, we propose Position3D, a novel approach that explicitly incorporates spatial relation modeling into 3D-RES. Specifically, we introduce a spatial-aware query generation module that constructs point proxies by aggregating local context and incorporating spatial relations, from which the most text-relevant are selected as queries. Furthermore, we design a position-guided deformable attention in the decoder, which progressively refines attention to con
464 | adjacent_context
medium
SRGCD: Stability-Driven Region Growth Framework for 3D Change Detection
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWith the growing accessibility of large-scale 3D point clouds from LiDAR and photogrammetric techniques, 3D change detection (3DCD) has become essential for understanding dynamic scenes. Existing methods typically formulate this as segmentation, treating each point independently for binary classification. This leads to isolated misclassified noise points inside regions. Meanwhile, feature similarity at boundaries causes boundary ambiguity. The more severe class imbalance inherent to change detection further exacerbates this issue. To address these challenges, we propose SRGCD, a Stability-Driven Region Growth Framework that redefines 3DCD as region growing rather than segmentation. Our key insight is that progressively expanding from highly confident seeds avoids pitfalls of point-wise classification while elegantly alleviating class imbalance. Specifically, we first apply strict constra
465 | adjacent_context
medium
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractThe growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generat
466 | adjacent_context
medium
TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractCurrent football imitation research primarily aims to optimize reward-based objectives, such as goals scored or win rate proxies, paying less attention to accurately replicating real-world team tactical behaviors. We introduce TacSIm, a large-scale dataset and benchmark for Tactical Style Imitation in football. TacSIm imitates the actions of all 11 players in one team in the given broadcast footage of Premier League matches under a single broadcast view. Under an offensive or defensive broadcast footage, TacSIm projects the beginning positions and actions of all 22 players from both sides onto a standard pitch coordinate system. TacSIm offers an explicit style imitation task and evaluation protocols. Tactics style imitation is measured by using spatial occupancy similarity and movement vector similarity in defined time, supporting the evaluation of spatial and temporal similarities for on
467 | adjacent_context
medium
Tokenizing Vector Animation for Autoregresive Generation
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractDespite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometr
468 | adjacent_context
medium
Towards Visual Query Localization in the 3D World
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractVisual query localization (VQL) aims to predict a spatial-temporal response of the most recent occurrence from a sequence given a query. Currently, most research focuses on visual query localization from 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities including point clouds (PC), RGB and depth images to support flexible research. To ensure high-quality annotation, each sequence is manually annotated with multiple rounds of verification and refinement. To our best knowledge, 3DVQL is the first benchmark towards 3D multimodal visual query loc
469 | adjacent_context
medium
VABench: A Comprehensive Benchmark for Audio-Video Generation
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractRecent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization,
470adjacent_context
medium
VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy; generation_editing | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractThough rectified flow models have achieved remarkable performance in image, video, and 3D generation, their practical deployments are challenged by slow inference speeds. Previous acceleration methods rely on caching and reusing, neglecting the growing mismatch between static cached values and evolving input, leading to reduced generated content fidelity. This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating. VDE periodically anchors the model’s state with a full forward pass and estimates subsequent outputs analytically. VDE first decomposes the model’s velocity output into components parallel and orthogonal to the input, then exploiting the temporal predictability of the components' coefficients and the consistency of the orthogonal direction for precise, input-ad
471 | adjacent_context
medium
Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractRecent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet truly dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies with high fidelity, while sidestepping the quadratic overhead of full attention and reducing network total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally c
472 | adjacent_context
medium
MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: dynamic/4D recon | general_reconstruction; dynamic_4d; surface_occupancy | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstract3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing
473 | adjacent_context
medium
Dynamic-Static Decomposition for Novel View Synthesis of Dynamic Scenes with Spiking Neurons
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractNovel view synthesis for dynamic scenes remains challenging due to complex motion variations. Recent methods represent dynamic and static regions with separate Gaussians to improve efficiency and accuracy, but inaccurate assignment of static and dynamic Gaussian primitive still limits performance. We identify two key issues, namely inaccurate mask priors and improper tag representations, which lead to boundary artifacts, loss of fine-grained motion details, and overfitting on input views, resulting in degraded side-view synthesis. To address these problems, we propose a spatio-temporally fine-grained mask field and a discontinuous dynamic–static tagging field to achieve accurate assignment of dynamic and static Gaussian primitives, enabling high-quality novel view synthesis, especially in fine-grained motions, motion boundary regions, and side viewpoints. Experiments show that our method ac
474 | adjacent_context
medium
Feed-forward Gaussian Registration for Head Avatar Creation and Editing
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; generation_editing | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWe present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatars require time-consuming head tracking, which is followed by an expensive avatar optimization, often resulting in a total creation time that exceeds one day. MATCH instead directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in 0.5 seconds per frame. While the learned intra-subject correspondence across frames allows us to quickly build personalized head avatars, correspondence across subjects enables various applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We learn to establish such correspondences end-to-end, with a transformer-based model that predicts textures of Gaussian
475 | adjacent_context
medium
FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWe present a novel formulation for mesh-free, reduced-order simulation of deformable hyperelastic objects. Existing work in reduced-order elastodynamic simulation represents the input geometry by either meshes, which can be difficult to obtain due to challenges in scanning and triangulating complex shapes, or by neural fields that require per-shape optimization. We propose to adopt a Reproducing Kernel Particle Method (RKPM) representation, which enables the construction of reduced-order skinning weights by solving a generalized eigensystem on the Hessian matrix of the elastic energy. We demonstrate that this formulation not only leads to a 40× training speedup compared with the per-shape optimization of neural fields, but also achieves lower simulation error when evaluated against the converged results of finite element method. We show our simulation results on a wide variety of
476 | adjacent_context
medium
LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; robotics_mapping; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractOpen-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain impractically slow, memory-intensive, and overly complex due to iterative optimization and dense feature assignments for every Gaussian. To address these limitations, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantics only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. To further streamline inference and ensure semantic consistency, we cluster Gaussians in a single step by linking geometrically and semantically related masks in 3D. In evaluation, we assess our method on diverse benchm
477 | adjacent_context
medium
MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractMulti-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. Existing single-view approaches often ignore cross-view relationships, leading to inconsistent results, while multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallicity, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models traine
478 | adjacent_context
medium
OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation
3D Vision & Geometry / Pose Estimation
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractEstimating the 3D pose of unseen objects from a single image remains a fundamental yet challenging problem in computer vision, especially under a CAD model-free setting. Pioneering attempts address this issue by matching templates generated through Novel View Synthesis (NVS), which essentially aims to learn the geometric transformation from a reference to a target view. While promising, these methods can only approximate this transformation under pixel-level supervision, as the starting orientation remains undefined. In the absence of explicit geometric constraints to verify the correctness of the predicted transformation, existing methods often synthesize novel views with geometry-distorted structures or severely blurred local textures, leading to unreliable template matching and suboptimal pose estimation results. To this end, we propose OrienPose, a novel object pose estimation framew
479 | adjacent_context
medium
Photo3D: Advancing Photorealistic 3D Generation through Structure‑Aligned Detail Enhancement
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; generation_editing; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractAlthough recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich surface details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non-rigid motions of objects, and the limited precision of scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT-4o-Image model. Considering that the generated images can distort 3D structures due to their lack of multi-view consistency, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages percep
480 | adjacent_context
medium
PhysHead: Simulation-Ready Gaussian Head Avatars
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; surface_occupancy | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractRealistic digital avatars require expressive and dynamic hair motion, yet most existing head avatar methods assume rigid hair movement. These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods
481 | adjacent_context
medium
PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractExisting approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality wit
482 | adjacent_context
medium
Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWe present Relightable Holoported Characters (RHC), a novel person-specific method for free-view rendering and relighting of full-body and highly dynamic humans solely observed from sparse-view RGB videos at inference. In contrast to classical one-light-at-a-time (OLAT)-based human relighting, our transformer-based RelightNet predicts relit appearance within a single network pass, avoiding costly OLAT-basis capture and generation. For training such a model, we introduce a new capture strategy and dataset recorded in a multi-view lightstage, where we alternate frames lit by random environment maps with uniformly lit tracking frames, simultaneously enabling accurate motion tracking and diverse illumination as well as dynamics coverage. Inspired by the rendering equation, we derive physics-informed features that encode geometry, albedo, shading, and the virtual camera view from a coarse h
483 | adjacent_context
medium
REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; surface_occupancy; generation_editing; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractRecent generative models have shown strong performance in generating diverse 3D assets from 2D images, a fundamental research topic in computer vision and graphics. However, these models still struggle to generate voluminous 3D assets when the input is a flat image that provides limited 3D cues. We introduce REVIVE 3D, a two-stage, plug-and-play pipeline for generating voluminous 3D assets from flat images. In Stage 1, we construct an Inflated Prior by inflating the foreground silhouette to recover global volume and superimposing part-aware details to capture local structure. In Stage 2, 3D Latent Refinement injects Gaussian noise into the Inflated Prior's latent and then denoises it, guided by the prior's geometric cues and the backbone's pretrained 3D knowledge. By initializing the process with the encoded latent of a source mesh instead of the prior, the framework also supports 3D edi
484 | adjacent_context
medium
Scaling View Synthesis Transformers
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; surface_occupancy; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractRecently, geometry-free view synthesis transformers have achieved state-of-the-art results in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. However, the specific factors that govern how their performance scales with compute remain poorly understood. In this work, we conduct a rigorous analysis of the scaling laws for view synthesis transformers and elucidate a series of design choices for training compute-optimal NVS models. Most significantly, we find that an encoder–decoder architecture, which was previously found to be less scalable, can in fact be compute-optimal. We attribute the previously inferior performance of previous encoder–decoder methods to certain architectural choices and inconsistent training compute across comparisons. Across several compute levels, we demonstrate that our encoder–decoder architecture, which we
485 | adjacent_context
medium
Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractVideo Novel View Synthesis (VNVS) aims to render arbitrary novel viewpoints of dynamic scenes from a single-view video, but its algorithmic training faces a major challenge: the lack of large-scale multi-view video datasets. Prior methods often train on monocular data by framing it as an inpainting task, which typically leads to a train-inference gap and visual artifacts. While synthetic multi-view data can partially alleviate the data scarcity issue, its high acquisition costs and limited diversity restrict scalability. To address these problems, we propose Scaling4D, a novel strategy that theoretically avoids the train-inference gap while leveraging large-scale monocular videos for training. Specifically, we take a higher-level perspective on the problem, reformulating VNVS into a general correspondence-guided generation task. Furthermore, in conjunction with extensive real-world data,
486 | adjacent_context
medium
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWe present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised globa
487 | adjacent_context
medium
ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and Test-time Generative Adaptation
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; surface_occupancy | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWe introduce ELITE, an Efficient Gaussian head avatar synthesis from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic
488 | adjacent_context
medium
DiffBMP: Differentiable Rendering with Bitmap Primitives
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; surface_occupancy | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWe introduce DiffBMP, a scalable and efficient differentiable rendering engine for a collection of bitmap images. Our work addresses a limitation that traditional differentiable renderers are constrained to vector graphics, given that most images in the world are bitmaps. Our core contribution is a highly parallelized rendering pipeline, featuring a custom CUDA implementation for calculating gradients. This system can, for example, optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives all in under 1 min using a consumer GPU. We employ and validate several techniques to facilitate the optimization: soft rasterization via Gaussian blur, structure-aware initialization, noisy canvas, and specialized losses/heuristics for videos or spatially constrained images. We demonstrate DiffBMP is not just an isolated tool, but a practical one designed to integ
489 | adjacent_context
medium
WonderZoom: Multi-Scale 3D World Generation
3D Vision & Geometry / 3D Reconstruction
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; surface_occupancy | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWe present WonderZoom, a novel approach to generating 3D scenes with contents across multiple spatial scales from a single image. Existing 3D world generation models remain limited to single-scale synthesis and cannot produce coherent scene contents at varying granularities. The fundamental challenge is the lack of a scale-aware 3D representation capable of generating and rendering content with largely different spatial sizes. WonderZoom addresses this through two key innovations: (1) scale-adaptive Gaussian surfels for generating and real-time rendering of multi-scale 3D scenes, and (2) a progressive detail synthesizer that iteratively generates finer-scale 3D contents. Our approach enables users to "zoom into" a 3D region and auto-regressively synthesize previously non-existent fine details from landscapes to microscopic features. Experiments demonstrate that WonderZoom significantly
490 | adjacent_context
medium
Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
3D Vision & Geometry / 3D Gaussian Splatting
A. thesis anchor: representation shift | general_reconstruction; gaussian_radiance; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractGenerating large-scale 3D head avatars of non-existent identities with high-fidelity and strong multi-view consistency (MVC) is essential for applications such as synthetic crowds, digital twins, and large asset libraries. For high scalability, avatars must be generated from minimal resources, without costly MV studio captures or any 3D data. In this work, we target this challenging minimal-resource setting for 3D head generation. Second, we argue that the common strategy of enforcing MVC via intermediate MV image generation is both expensive and fundamentally fragile. Instead, we analyze how MVC can be induced by design, showing that intermediate view synthesis is unnecessary. To this end, we introduce MVCHead — a fast, single-shot state space model that directly predicts Gaussians, without intermediate generation. At its core, we propose a Hierarchical State Space (HiSS) block that enf
491 | adjacent_context
medium
Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual–Inertial Odometry
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world model | general_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractVisual-Inertial Odometry (VIO) is a critical component for robust ego-motion estimation, enabling foundational capabilities such as autonomous navigation in robotics and real-time 6-DoF tracking for augmented reality. Existing methods face a well-known trade-off: filter-based approaches are efficient but prone to drift, while optimization-based methods, though accurate, rely on computationally prohibitive Visual-Inertial Bundle Adjustment (VIBA) that is difficult to run on resource-constrained platforms. Rather than removing VIBA altogether, we aim to reduce how often and how heavily it must be invoked. To this end, we cast two key design choices in modern VIO, when to run the visual frontend and how strongly to trust its output, as sequential decision problems, and solve them with lightweight reinforcement learning (RL) agents. Our framework introduces a lightweight, dual-pronged RL polic
492 | adjacent_context
medium
Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world model | general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWe address the problem of making 3D scenes interactive by asking: what would objects feel like if touched in a virtual environment? State-of-the-art 3D rendering methods provide compelling visual realism, but they fall short in modeling physical interactions, such as haptic feedback. We propose a framework that learns the correspondence between user actions and tactile responses, enabling the generation of touch-based signals directly from simulated interactions in 3D scenes. Our approach leverages a neural field representation conditioned on geometry and action to synthesize material-specific tactile signals. Experiments show that the generated signals reliably convey material properties and interaction dynamics. This paves the way toward interactive, touch-aware virtual environments with realistic haptic feedback.
493 | adjacent_context
medium
MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world model | general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractGenerating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must satisfy both stylistic diversity, fine-grained, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language–driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorea
494 | adjacent_context
medium
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world model | general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstract3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193×, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning a
495 | adjacent_context
medium
SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world model | general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | editorial thesis/bridge bucket but weaker direct reconstruction signal
abstractReal-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simula
496adjacent_context
medium
SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmarkeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractObject pose estimation is a key task for embodied robots, enabling them to interact with objects effectively. Category-level object pose estimation provides a way for robots to estimate the pose of unknown objects. However, estimating object pose from point clouds alone remains challenging. In this paper, we introduce SEGPose, a novel category-level object pose estimation method based on point clouds. Unlike previous methods, SEGPose leverages geometric, topological information, and SE(3)-equivariance, enhancing the network's accuracy in pose prediction. To utilize geometric and topological features, we propose a constraint-based feature extraction and 3D reconstruction method, enabling effective object shape reconstruction. We also design an SE(3)-equivariance feature prediction network to handle pose transformations consistently across viewpoints, improving pose accuracy. Experimental
497adjacent_context
medium
Volumetric Functional Maps
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; depth_correspondence; surface_occupancy; robotics_mappingeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractThe computation of volumetric correspondences between 3D shapes is a prominent tool for medical and industrial applications. In this work, we pave the way for spectral volume mapping, extending for the first time the functional maps framework from the surface to the volumetric setting. We show that the eigenfunctions of the volumetric Laplace operator define a functional space that is suitable for high-quality signal transfer. We also experiment with various techniques that edit this functional space, porting them to volume domains. We validate our method on novel volumetric datasets and on tetrahedralizations of well-established surface datasets, also showcasing practical applications involving both discrete and continuous signal mapping, for segmentation transfer, mesh connectivity transfer, and solid texturing. Last but not least, we show that considering the volumetric spectrum great
498adjacent_context
medium
Deep Feature Deformation Weights
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; surface_occupancy; robotics_mapping; generation_editingeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractHandle-based mesh deformation has been a long-standing paradigm in computer graphics, enabling intuitive shape edits from sparse controls. Classic techniques offer precise and rapid deformation control. However, they solve an optimization problem with constraints defined by the choice of control handles, requiring a user to know a priori the ideal distribution of handles on the shape to accomplish the desired edit. The mapping from handle set to deformation behavior is often unintuitive and, importantly, non-semantic. Modern data-driven methods, on the other hand, leverage the data prior to obtain semantic edits, at the cost of fine-grained control and speed. We propose a technique that achieves the best of both worlds by leveraging the semantic prior of data and the precise control and speed of traditional frameworks. Our approach is surprisingly simple yet effective: deep feature proxim
499adjacent_context
medium
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; surface_occupancy; robotics_mapping; generation_editingeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractThe 3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires extensive and tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for automatic 3D scene synthesis. We present HOG-Layout that enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with ex
500adjacent_context
medium
SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarkeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractRealistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer t
501adjacent_context
medium
UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarkeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWe present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based models in the second stage to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we bypass the limitations of UV mapping by introducing a Large Texturing Model (LTM) that directly regresses textures in a unified 3D functional space. Moreover, to enable more effective and complete supervision of LTM, we propose to extend surface-defined textures into a continuous volumetric field to serve as an advanced training objective, which we refer to as Texture Functions (TF). Finally, we develop an advanced LoRA-based strategy for efficiently adapting large-scale 2D Diffusion Transformers (DiTs) for high-qual
502adjacent_context
medium
Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelpose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mappingeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractImage-to-point-cloud (I2P) registration aims to align 2D images with 3D point clouds by establishing reliable 2D-3D correspondences. The drastic modality gap between images and point clouds makes it challenging to learn features that are both discriminative and generalizable, leading to severe performance drops in unseen scenarios. We address this challenge by introducing a heterogeneous graph framework that jointly refines cross-modal features and correspondences within a unified architecture. The proposed graph represents a mapping between segmented 2D and 3D regions, which enhances cross-modal feature interaction and thus improves feature discriminability. In addition, modeling the consistency among vertices and edges within the graph enables pruning of unreliable correspondences. Building on these insights, we propose a heterogeneous graph embedded I2P registration method, termed Hg-
503adjacent_context
medium
Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
3D Vision & Geometry / 3D Gaussian Splatting
B. bridge: reconstruction becomes mapping/world modelgaussian_radiance; dynamic_4d; robotics_mappingeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractNovel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We utilize NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above NeRF-rendered HDR pixel values with the sensor-recorded LD
504adjacent_context
medium
Lifting Unlabeled Internet-scale Data for 3D Scene Understanding
3D Vision & Geometry / 3D Reconstruction
B. bridge: reconstruction becomes mapping/world modelgeneral_reconstruction; surface_occupancy; robotics_mappingeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractAnnotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We systematically identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonst
505adjacent_context
medium
CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelpose_calibration_localization; depth_correspondence; robotics_mappingeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractWe present Consistent–Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework that learns feature flow for robust cross-modal registration. CRFT learns a modality-consistent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coheren
506adjacent_context
medium
RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelpose_calibration_localization; robotics_mapping; data_benchmarkeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractEstimating robot pose from a monocular RGB image is a challenge in robotics and computer vision. Existing methods typically build networks on top of 2D visual backbones and depend heavily on labeled data for training, which is often scarce in real-world scenarios, causing a sim-to-real gap. Moreover, these approaches reduce the 3D-based problem to 2D domain, neglecting the 3D priors. To address these, we propose Robot Topological Alignment Graph (RoboTAG), which incorporates a 3D branch to inject 3D priors while enabling co-evolution of the 2D and 3D representations, alleviating the reliance on labels. Specifically, the RoboTAG consists of a 3D branch and a 2D branch, where nodes represent the states of the camera and robot system, and edges capture the dependencies between these variables or denote alignments between them. Closed loops are then defined in the graph, on which a consistenc
507adjacent_context
medium
DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
3D Vision & Geometry / Pose Estimation
B. bridge: reconstruction becomes mapping/world modelpose_calibration_localization; robotics_mapping; data_benchmarkeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractArticulated object pose estimation is a core task in embodied AI and computer vision. Existing methods typically regress poses in a continuous space, but often struggle with 1) navigating a large, complex search space and 2) failing to incorporate intrinsic kinematic constraints. In this paper, we introduce DICArt (DIsCrete Diffusion for Articulated Object Pose Estimation), a novel framework that formulates pose estimation as a conditional discrete diffusion process. Instead of operating in a continuous domain, DICArt progressively denoises a noisy pose representation through a learned reverse diffusion procedure to recover the ground-truth pose. To improve modeling fidelity, we propose a flexible flow decider that dynamically determines whether each token should be denoised or reset, effectively balancing the real and noise distributions during diffusion. Additionally, we incorporate a h
508adjacent_context
medium
PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models
3D Vision & Geometry / Point Cloud
B. bridge: reconstruction becomes mapping/world modelsurface_occupancy; robotics_mappingeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractThe development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud representations to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By trainin
509adjacent_context
medium
DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
3D Vision & Geometry / Pose Estimation
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localization; dynamic_4d; data_benchmarkeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractImage alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aw
510adjacent_context
medium
FMPose: 3D Pose Estimation via Flow Matching
3D Vision & Geometry / Pose Estimation
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localization; depth_correspondenceeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractMonocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have demonstrated strong performance, but their iterative denoising process typically requires many time steps for each prediction, making inference computationally expensive. In contrast, Flow Matching (FM) learns an ODE-based velocity field, enabling efficient generation of 3D pose samples with only a few integration steps. Inspired by this capability, we propose a novel generative pose estimation framework, FMPose, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned on 2D inputs. While the ODE trajectories are dete
511adjacent_context
medium
Landscape-Awareness for Geometric View Diffusion Model
3D Vision & Geometry / Pose Estimation
B. bridge: representation meets metric posegaussian_radiance; pose_calibration_localizationeditorial thesis/bridge bucket but weaker direct reconstruction signal
abstractAccurate camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123, which synthesize novel views conditioned on relative viewpoint, and have demonstrated promising performance when repurposed for viewpoint estimation via optimization with MSE loss. However, existing methods often suffer from a non-convex loss landscape with numerous local minima, which makes them sensitive to initialization and reliant on naïve multi-start strategies to achieve reasonable results. We analyze these optimization challenges and visualize failure cases, showing that ambiguities in object geometry, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization lands
512adjacent_context
low
FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractRecent work in 3D scene understanding has begun to shift from purely spatial analysis to the more complex challenge of functional scene understanding. However, existing methods often consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependencies that humans use to resolve ambiguity. We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relations. These candidates are converted into factor graph variables and constrained by both LLM-derived common-sense priors and geometric priors. This formulation enables joint probabilistic inference over all functional edges and their uncertainties, yielding substantially better-calibrated con
513adjacent_context
low
FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy3D Vision & Geometry with weak but relevant signal
abstractRegistration of multiview point clouds typically depends on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and ill-posed without holistic geometric constraints. In this paper, we propose FUSER, the first feed-forward multi-view registration transformer that processes all scans jointly in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER employs a sparse 3D CNN to encode each scan into low-resolution superpoint features preserving absolute translation cues, followed by a Geometric Alternating Attention module for efficient intra- and inter-scan reasoning. Particularly, we transfer 2D attention priors from off-the-shelf foundation models (i.e., $\pi^3$) to enhance 3D feature attention. Building upon FUSER and its estimates, we further introduce
514adjacent_context
low
Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; surface_occupancy3D Vision & Geometry with weak but relevant signal
abstractRecent advances in self-supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to localization, and often require full finetuning for strong performance. Accurate localization is a fundamental component of 3D perception, thus bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, a localization-oriented self-supervised framework that enriches point cloud representations through geometry-aware learning. PointINS employs an orthogonal localization branch to jointly learn high-level semantic understanding and geometric reasoning, yield
515adjacent_context
low
MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence3D Vision & Geometry with weak but relevant signal
abstractEstablishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate in a pairwise manner, often producing fragmented and geometrically inconsistent tracks when their predictions are chained across views. We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient model architecture which avoids high computational cost of full cross-attention for multi-view feature interaction: (i) multi-view encoder that leverages pair-wise matching results as a geometric prior, and (ii) multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model's consistent multi-view correspon
516adjacent_context
low
GazeShift: Unsupervised Gaze Estimation and Dataset for VR
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractGaze estimation is instrumental in modern virtual reality (VR) systems. Despite significant progress in remote-camera gaze estimation, VR gaze research remains constrained by data scarcity—particularly the lack of large-scale, accurately labeled datasets captured with the off-axis camera configurations typical of modern headsets. Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze—the first large-scale off-axis gaze estimation dataset for VR—comprising 2.1 million near-eye infrared images collected from 68 participants. We further propose GazeShift, an attention-guided unsupervised framework for learning gaze representations without labeled data. Unlike prior redirection-based methods that rely on multi-view or 3D geometry, GazeShift is tailored to near-eye infrared imagery, achieving effective gaze–appear
517adjacent_context
low
KV-Tracker: Real-Time Pose Tracking with Transformers
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence3D Vision & Geometry with weak but relevant signal
abstractMulti-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $\pi^3$ with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to $15\times$ speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-f
518adjacent_context
low
MoBind: Motion Binding for Fine-Grained IMU–Video Pose Alignment
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; dynamic_4d3D Vision & Geometry with weak but relevant signal
abstractWe aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchica
519adjacent_context
low
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; dynamic_4d3D Vision & Geometry with weak but relevant signal
abstractEnhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs’ temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we c
520adjacent_context
low
Zoo3D: Zero-Shot 3D Object Detection at Scene Level
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmark3D Vision & Geometry with weak but relevant signal
abstract3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing $Zoo3D$, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. $Zoo3D$ operates in two modes: the zero-shot $Zoo3D_{0}$, which requires no training at all, and the self-supervised $Zoo3D_{1}$, which refines 3D box prediction by training a class-agnostic detector on $Zoo3D_{0}$-generated pseu
521adjacent_context
low
TESO: Online Tracking of Essential Matrix by Stochastic Optimization
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractReliable perception of autonomous systems relies on fusion of data from multiple sensors, which requires maintaining accurate geometric calibration during operation. This work aims to track the drift of the calibration parameters caused by mechanical stress, thermal effects, or minor accidents. We focus on five parameters of the essential matrix and propose TESO, whose core mechanisms are: 1) a robust loss function based on kernel correlation over tentative correspondences instead of robust matching and estimators, 2) an adaptive online stochastic optimization on the essential manifold. Both contribute to reduced CPU and memory requirements. TESO relies on a few hyperparameters and eliminates the need for data-driven training, enabling use in resource-constrained online perception systems. We evaluated TESO based on the geometric precision of the tracked extrinsic parameters, the rectifi
522adjacent_context
low
TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; surface_occupancy; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractTables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines separately model table structure and content, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition), which improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a “perceive-then-fuse” strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then int
523adjacent_context
low
ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; surface_occupancy3D Vision & Geometry with weak but relevant signal
abstractCategory-level object pose estimation aims to predict the pose and size of arbitrary objects in specific categories. Existing methods struggle with the inherent incompleteness of observed point clouds, which limits their ability to capture complete object shapes for robust pose reasoning. While point cloud completion offers a promising solution, naively treating it as a separate preprocessing step for partial observations introduces compounding errors and additional computational overhead, ultimately hindering both accuracy and efficiency. To address these challenges, we propose ComPose, a novel unified framework that tightly integrates shape completion to provide complete geometric cues for enhanced pose estimation. At the core of ComPose is a keypoint-based progressive completion module, which recovers full shape representations by progressively predicting a sparse set of keypoints and
524adjacent_context
low
UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractEven though industrial inspection systems should be capable of recognizing unprecedented defects, most existing approaches operate under a closed-set assumption, which prevents them from detecting novel anomalies. While the visual prompting approach provides a scalable alternative, it struggles in industrial settings where subtle inter-class differences and high intra-class variance make prompt-to-region matching ambiguous and cause prompt embeddings to collapse, limiting the effectiveness of existing methods. To address these challenges, we introduce UniSpector, a Universal Inspector for open-set defect detection and segmentation. To empower defect prompt embeddings for robust recognition of novel defects, it comprises two key components: the Spatial–Spectral Prompt Encoder (SSPE) and the Contrastive Prompt Encoder (CPE). SSPE extracts orientation-invariant frequency cues and fuses them
525adjacent_context
low
Enhancing Accuracy of Uncertainty Estimation in Appearance-based Gaze Tracking with Probabilistic Evaluation and Calibration
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractAccurate uncertainty estimation is essential for reliable appearance-based gaze tracking. However, domain shifts between training and testing often lead to incorrect uncertainty estimates, which is a problem overlooked in existing uncertainty-aware gaze tracking models. To overcome this problem efficiently, we formulate uncertainty estimation as a conditional distribution problem and treat the correction process as an output-level conditional distribution matching task. We therefore introduce a data-efficient post-hoc calibration method to align the predicted, high-error conditional distribution with the empirically observed distribution extracted from a small set of calibration samples. To more faithfully assess the accuracy of the resulting uncertainty estimates, we further introduce a new metric, Coverage Probability Error (CPE), to quantify the distribution-level mismatch between pre
526adjacent_context
low
Global-Aware Edge Prioritization for Pose Graph Initialization
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization3D Vision & Geometry with weak but relevant signal
abstractThe pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initializ
527adjacent_context
low
Minimal Constraint Relaxation for Multiview Autocalibration
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization3D Vision & Geometry with weak but relevant signal
abstractPolynomial systems in multiview geometry are often highly over-constrained, and naïve subsampling or elimination can lead to unstable or inconsistent estimation. We revisit this issue through the lens of \emph{constraint relaxation}: the selective removal of equations to recover a finite and well-conditioned solution space. Focusing on the Kruppa equations for camera autocalibration, we introduce the notion of \emph{minimal relaxation}, a principled framework for identifying constraint subsets that preserve geometric validity while restoring solvability. Through symbolic analysis of the full three-view Kruppa system, we enumerate and classify all relaxation patterns, revealing algebraically minimal families that yield finite, well-conditioned problems. Comprehensive experiments validate this analysis across symbolic and numerical settings. Using homotopy continuation and synthetic perturbat
528adjacent_context
low
Parallel Rigidity Matters for Bundle Adjustment
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization3D Vision & Geometry with weak but relevant signal
abstractBundle adjustment is a long-standing problem in computer vision that solves for camera parameters and 3D point coordinates from 2D image observations. While there has been much work on various aspects, like adaptation to different camera models and sensors, and considerations for solving the optimization problem, in this paper, we deal with a fundamental and distinct aspect of the uniqueness of its solution. In particular, we examine the unique solvability of the 3D reconstruction problem using parallel rigidity theory. We design an algorithm to ensure that the topology of the bipartite graph formed by the camera-3D point relations in bundle adjustment does not result in independent scaling of the edges in its subgraphs. To tackle the generally large-sized bipartite graph, we leverage camera-camera relationships in 3D reconstruction problems for efficiency. We demonstrate the benefits of
529adjacent_context
low
QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization3D Vision & Geometry with weak but relevant signal
abstractIn structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover $n$ cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of (4,4,4,4) independent of $n$. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly syn
530adjacent_context
low
Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding
3D Vision & Geometry / Point Cloud
C. cluster representativepose_calibration_localization; surface_occupancy; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractWhile recent Transformer and Mamba architectures have advanced point cloud representation learning, they are typically developed for single-task or single-domain settings. Directly applying them to multi-task domain generalization (DG) leads to degraded performance. Transformers effectively model global dependencies but suffer from quadratic attention cost and lack explicit structural ordering, whereas Mamba offers linear-time recurrence yet often depends on coordinate-driven serialization, which is sensitive to viewpoint changes and missing regions, causing structural drift and unstable sequential modeling. In this paper, we propose Structure-Aware Domain Generalization (SADG), a Mamba-based In-Context Learning framework that preserves structural hierarchy across domains and tasks. We design structure-aware serialization (SAS) that generates transformation-invariant sequences using cent
531adjacent_context
low
Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; dynamic_4d3D Vision & Geometry with weak but relevant signal
abstractDense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temp
532adjacent_context
low
UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; dynamic_4d3D Vision & Geometry with weak but relevant signal
abstractWith the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve a comprehensive and multi-grained video perception. To bridge this gap, we introduce $\textbf{UFVideo}$, the first Video LLM with $\textbf{unified multi-grained cooperative understanding}$ capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct the $\textbf{UFVideo-Ben
533adjacent_context
low
Deformation-based In-Context Learning for Point Cloud Understanding
3D Vision & Geometry / Point Cloud
C. cluster representativepose_calibration_localization; surface_occupancy; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractRecent advances in point cloud In-Context Learning (ICL) have demonstrated strong multitask capabilities. Existing approaches typically adopt a Masked Point Modeling (MPM)-based paradigm for point cloud ICL. However, MPM-based methods directly predict the target point cloud from masked tokens without leveraging geometric priors, requiring the model to infer spatial structure and geometric details solely from token-level correlations via transformers. Additionally, these methods suffer from a training–inference objective mismatch, as the model learns to predict the target point cloud using target-side information that is unavailable at inference time. To address these challenges, we propose DeformPIC, a deformation-based framework for point cloud ICL. Unlike existing approaches that rely on masked reconstruction, DeformPIC learns to deform the query point cloud under task-specific guidanc
534adjacent_context
low
Fusion of Depth and Semantic for Probabilistic Floorplan Localization
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence3D Vision & Geometry with weak but relevant signal
abstractFloorplan localization aims to estimate the camera pose of a query image with respect to a 2D floorplan, providing a lightweight and long-term stable alternative to localization based on 3D maps or large image databases for indoor robotics and AR. Recent methods frame the problem as ray-based matching, representing the image as a set of rays annotated with depth or semantic labels and aligning them with the floorplan. However, they still face challenges in addressing the complexity of indoor environments, which can be decomposed into environmental, geometric, and semantic ambiguities. To address these ambiguities, we propose a floorplan-aware probabilistic fusion framework that models both depth and semantic information within a unified architecture. Our framework also combines a distribution-based ray confidence estimator, which down-weights uncertain geometric hypotheses, with a probabi
535adjacent_context
low
PlanaReLoc: Camera Relocalization in 3D Planar Primitives via Region-based Structure Matching
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence3D Vision & Geometry with weak but relevant signal
abstractWhile structure-based relocalizers have long strived for *point* correspondences when establishing or regressing query-map associations, in this paper, we pioneer the use of **planar primitives** and planar 3D maps for lightweight 6-DoF camera relocalization in structured environments. Planar primitives, beyond being fundamental entities in projective geometry, also serve as region-based representations that encapsulate both structural and semantic richness. This motivates us to introduce *PlanaReLoc*, a streamlined "plane-centric" paradigm where a deep matcher associates planar primitives across the query image and the map within a learned unified embedding space, after which the 6-DoF pose is solved and refined under a robust framework. Through extensive experiments on the *ScanNet* and *12Scenes* datasets across hundreds of scenes, our method demonstrates the superiority of planar primitives i
536adjacent_context
low
LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence3D Vision & Geometry with weak but relevant signal
abstractLiDAR relocalization has attracted increasing attention as it can deliver accurate 6-DoF pose estimation in complex 3D environments. Recent learning-based regression methods offer efficient solutions by directly predicting global poses without the need for explicit map storage. However, these methods often struggle in challenging scenes due to their equal treatment of all predicted points, which is vulnerable to noise and outliers. In this paper, we propose **LEADER**, a robust LiDAR-based localization framework enhanced by a simple, yet effective geometric encoder. Specifically, a Robust Projection-based Geometric Encoder architecture which captures multi-scale geometric features is first presented to enhance descriptiveness in geometric representation. A Truncated Relative Reliability loss is then formulated to model point-wise ambiguity and mitigate the influence of unreliable predict
537adjacent_context
low
Gaze Target Estimation with Concepts
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractEstimating human gaze targets from images in-the-wild is an important and formidable task. Existing approaches primarily employ brittle, multi-stage pipelines that require explicit inputs, like head bounding boxes and human pose, in order to identify the subject of gaze analysis. As a result, detection errors can cascade and lead to failure. Moreover, these prior works lack the flexibility of specifying the gaze analysis task via natural language prompting, an approach which has been shown to have significant benefits in convenience and scalability for other image analysis tasks. To overcome these limitations, we introduce the **Promptable Gaze Target Estimation (PGE)** task, a new end-to-end, concept-driven paradigm for gaze analysis. PGE conditions gaze prediction on flexible user text or visual prompts (e.g., "the boy in the red shirt" or "person in point [0.52, 0.48]") to identify a
538adjacent_context
low
AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence3D Vision & Geometry with weak but relevant signal
abstractPrecise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers. We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, para
539adjacent_context
low
CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
3D Vision & Geometry / Point Cloud
C. cluster representativedepth_correspondence; surface_occupancy; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractRecent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce $\textbf{CLIPoint3D}$, the first framework for $\textit{few-shot unsupervised 3D point cloud domain adaptation}$ built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design
540adjacent_context
low
LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence3D Vision & Geometry with weak but relevant signal
abstractAdvanced autonomous systems rely on multi-sensor fusion for safer and more robust perception. To enable effective fusion, calibrating directly from natural driving scenes (i.e., target-free) with high accuracy is crucial for precise multi-sensor alignment. Existing learning-based calibration methods are typically designed for only a single pair of sensor modalities (i.e., a bi-modal setup). Unlike these methods, we propose LiREC-Net, a target-free, learning-based calibration network that jointly calibrates multiple sensor modality pairs, including LiDAR, RGB, and event data, within a unified framework. To reduce redundant computation and improve efficiency, we introduce a shared LiDAR representation that leverages features from both its 3D nature and projected depth map, ensuring better consistency across modalities. Trained and evaluated on established datasets, such as KITTI and DSEC,
541adjacent_context
low
SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractRobust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions---implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satel
542adjacent_context
low
Latent Action Pretraining Meets Pose Estimation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractThis paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations, similar to Genie, from large-scale driving videos. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world-models or as proxies of robot action parameters in policy networks. Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks
543adjacent_context
low
LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractWe present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 [89] achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces $\textbf{InsLoD-Loc}$ - the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance-level building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance-level silhouette alignment, which significantly reduces pose e
544adjacent_context
low
Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractDense scenes containing numerous tiny objects pose a fundamental challenge for segmentation models, where small localization errors can significantly degrade downstream measurements. We present Structure-Aware Representation Distillation (SARD), a teacher-compatible framework that transfers structural knowledge from a large teacher to a compact student via feature-space alignment rather than mask imitation. SARD constructs a structure-importance map that combines boundary salience, local density, and teacher confidence, and uses it to weight a unified representation loss integrating feature consistency, distribution alignment, and structural contrast. This encourages the student to allocate capacity to geometrically informative regions while preserving global context. Experiments on Cityscapes, ADE20K, and a challenging rock fragmentation benchmark (RockFrag) show that SARD consistently
545adjacent_context
low
UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; generation_editing3D Vision & Geometry with weak but relevant signal
abstractPersonalized visual understanding has advanced significantly, yet existing approaches struggle to localize and extract specific concepts when input images contain multiple objects. Many prior methods rely heavily on segmentation-based supervision or exhibit poor compositional generalization, limiting their ability to accurately disentangle and manipulate individual concepts. In this work, we propose UniVerse, a Unified Modulation Framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers. Our method allows for composable and decomposable concept extraction, enabling fine-grained localization and representation of target objects without explicit segmentation masks. UniVerse learns to decompose complex scenes into concept-specific representations and then compose them in a unified manner, enabling robust personalization across diverse visual conte
546adjacent_context
low
Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization3D Vision & Geometry with weak but relevant signal
abstractVisual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model’s focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36%. In addition, VGA is fully
547adjacent_context
low
TLMA: Mitigating the Impact of Weakly Labeled Information for Video Anomaly Detection
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization3D Vision & Geometry with weak but relevant signal
abstractWeakly Supervised Video Anomaly Detection (WSVAD) aims to localize abnormal segments using only video-level labels during training. Although the paradigm significantly reduces annotation costs, the coarse-grained labels fail to precisely describe the full videos, resulting in the introduction of substantial Weakly Labeled Information (WLI) during training. The presence of WLI makes it difficult for the model to accurately learn the boundary between normal and abnormal behaviors, leading to misclassifications and compromising the precision of anomaly localization. To tackle the challenges posed by WLI, we propose a triplet learning strategy that selects hard segments from normal videos as anchors. By combining contrastive learning with a Multiple Instance Learning (MIL) strategy, we increase the projection distance between abnormal segments and anchor samples, to reduce the interference of W
548adjacent_context
low
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization3D Vision & Geometry with weak but relevant signal
abstractMultimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fin
549adjacent_context
low
A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
3D Vision & Geometry / Point Cloud
C. cluster representativepose_calibration_localization; surface_occupancy3D Vision & Geometry with weak but relevant signal
abstract3D anomaly detection targets the detection and localization of defects in 3D point clouds trained solely on normal data. While a unified model improves scalability by learning across multiple categories, it often suffers from Inter-Category Entanglement (ICE)—where latent features from different categories overlap, causing the model to adopt incorrect semantic priors during reconstruction and ultimately yielding unreliable anomaly scores. To address this issue, we propose the Semantically Disentangled Unified Model for 3D Anomaly Detection, which reconstructs features conditioned on disentangled semantic representations. Our framework consists of three key components: (i) Coarse-to-Fine Global Tokenization for forming instance-level semantic identity, (ii) Category-Conditioned Contrastive Learning for disentangling category semantics, and (iii) a Geometry-Guided Decoder for semantically
550adjacent_context
low
SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization3D Vision & Geometry with weak but relevant signal
abstractExisting methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. Our SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonica
551adjacent_context
low
Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
3D Vision & Geometry / Point Cloud
C. cluster representativesurface_occupancy; data_benchmark3D Vision & Geometry with weak but relevant signal
abstract3D shape anomaly detection is a crucial task for industrial inspection and geometric analysis. Existing deep learning approaches typically learn representations of normal shapes and identify anomalies via out-of-distribution feature separation or decoder-based reconstruction. They often fail to generalize across diverse anomaly types and scales, such as global geometric errors (e.g., planar shifts, surface misalignments), and are sensitive to noisy or incomplete local points during training. To address these limitations, we propose a hierarchical point–patch anomaly scoring network that jointly models regional part features and local point features for robust anomaly reasoning. An adaptive patchification module integrates self-supervised decomposition to capture complex structural deviations. Beyond evaluations on public benchmarks (Anomaly-ShapeNet and Real3D-AD), we release an industri
552adjacent_context
low
PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration
3D Vision & Geometry / Point Cloud
C. cluster representativesurface_occupancy; data_benchmark3D Vision & Geometry with weak but relevant signal
abstractPoint clouds are a fundamental 3D representation in computer vision, enabling a wide range of perception tasks. However, real-world point clouds often suffer from degradations such as incompleteness, noise, outliers, and irregular density, caused by sensor limitations or occlusions. Recovering clean and detailed shapes from such degraded data is crucial for downstream applications. While existing learning-based methods achieve progress on individual tasks like completion or denoising, they typically rely on global bottleneck features, which lose fine-grained geometry and remain sensitive to varying input quality. We propose a unified 3D restoration network that directly takes point clouds as input and adaptively reconstructs high-quality geometry under diverse degradation scenarios. At the core of our approach is a Pseudo-Query module, implemented within a Transformer backbone, which ref
553adjacent_context
low
BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
3D Vision & Geometry / Point Cloud
C. cluster representativesurface_occupancy3D Vision & Geometry with weak but relevant signal
abstractWe introduce BuildAnyPoint, a novel generative framework for structured 3D building reconstruction from point clouds with diverse distributions, such as those captured by airborne LiDAR and Structure-from-Motion. To recover artist-created building abstraction in this highly underconstrained setting, we capitalize on the role of explicit 3D generative priors in autoregressive mesh generation. Specifically, we design a Loosely Cascaded Diffusion Transformer (Loca-DiT) that initially recovers the underlying distribution from noisy or sparse points, followed by autoregressively encapsulating them into compact meshes. We first formulate distribution recovery as a conditional generation task by training latent diffusion models conditioned on input point clouds, and then tailor a decoder-only transformer for conditional autoregressive mesh generation based on the recovered point clouds. Our method
554adjacent_context
low
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractExisting methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, v
555adjacent_context
low
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Multimodal & Language / VLM / MLLM
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmarkadjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title
abstractDespite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchma
556adjacent_context
low
4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractWorld Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, Embodied Intelligence, and content creation. However, prior benchmarks each emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition–4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as
557adjacent_context
low
BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractText-guided dynamic 3D character generation has advanced rapidly, yet producing high-quality motion that faithfully reflects rich textual descriptions remains challenging. Existing methods tend to generate limited sub-actions or incoherent motion due to fixed-length temporal inputs and discrete frame-wise representations that fail to capture rich motion semantics. We address these limitations by representing motion with continuous differentiable B-spline curves, enabling more effective motion generation without modifying the capabilities of the underlying generative model. Specifically, our closed-form, Laplacian-regularized B-spline solver efficiently compresses variable-length motion sequences into compact representations with a fixed number of control points. Further, we introduce a normal-fusion strategy for input shape adherence along with correspondence-aware and local-rigidity
558adjacent_context
low
Breaking the 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch for Category 6D Pose Estimation and Robotic Grasping
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; generation_editing; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractWhile 2D vision has been revolutionized by large-scale datasets like ImageNet, 3D vision remains constrained by the scarcity of high-quality, canonically aligned data. We introduce the first scalable, automated framework that generates complete category-level 6D pose datasets directly from text prompts, bypassing the need for existing 3D assets. Our method overcomes key challenges by: (1) ensuring reliable, scalable asset generation via a controlled text-to-image-to-3D pipeline; (2) enforcing built-in canonical alignment through depth-conditioned generation, achieving a 96% pose consistency rate; and (3) enabling large-scale 6D annotation via mixed reality rendering. The pipeline produces high-quality, aligned 3D meshes in under 3 minutes per object, a 5–20× speedup over traditional scanning. We generate over 1,000 instances for each of the 153 categories in the Omni6Dpose benchma
559adjacent_context
low
Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation
Generative Models / Image Editing
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editingadjacent genus=Image Editing with no direct reconstruction/SLAM/map signal in title
abstractRecent advances in 3D scene editing using NeRF and 3DGS enable high-quality static scene editing. In contrast, dynamic scene editing remains challenging, as methods that directly extend 2D diffusion models to 4D often produce motion artifacts, temporal flickering, and inconsistent style propagation. We introduce Catalyst4D, a framework that transfers high-quality 3D edits to dynamic 4D Gaussian scenes while maintaining spatial and temporal coherence. At its core, Anchor-based Motion Guidance (AMG) builds a set of structurally stable and spatially representative anchors from both original and edited Gaussians. These anchors serve as robust region-level references, and their correspondences are established via optimal transport to enable consistent deformation propagation without cross-region interference or motion drift. Complementarily, Color Uncertainty-guided Appearance Refinement (CUA
560adjacent_context
low
Charge: A Comprehensive Benchmark and Dataset for Dynamic Novel View Synthesis
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractThis paper presents a new dataset for Novel View Synthesis, generated from a high-quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting-edge 4D scene reconstruction and novel view generation models. In addition to high-fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi-view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high-qualit
561adjacent_context
low
DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractGenerating dynamic and interactive 3D trees has wide applications in virtual reality, games, and world simulation. However, existing methods still face various challenges in generating structurally consistent and realistic 4D motion for complex real trees. In this paper, we propose DynamicTree, the first framework that can generate long-term, interactive 3D motion for 3DGS reconstructions of real trees. Unlike prior optimization-based methods, our approach generates dynamics in a fast feed-forward manner. The key success of our approach is the use of a compact sparse voxel spectrum to represent the tree movement. Given a 3D tree from Gaussian Splatting reconstruction, our pipeline first generates mesh motion using the sparse voxel spectrum and then binds Gaussians to deform the mesh. Additionally, the proposed sparse voxel spectrum can also serve as a basis for fast modal analysis under
562adjacent_context
low
Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractWe present Ego-1K, a large-scale, time-synchronized collection of egocentric multiview videos designed to advance neural 3D video synthesis, dynamic scene understanding, and embodied perception. The dataset contains nearly 1,000 short egocentric videos taken with a custom rig with 12 synchronous cameras surrounding a VR headset worn by the user. Scene content focuses on hand motions and hand-object interactions in different settings. We describe rig design, data processing, and calibration. Our dataset enables new ways to benchmark egocentric scene reconstruction methods. We believe this is an important area of research as smart glasses with multiple cameras become omnipresent. Our experiments demonstrate that our dataset presents unique challenges for existing 3D and 4D novel view synthesis methods due to high disparities and image motion caused by close dynamic objects and rig ego
563adjacent_context
low
Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractScene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency, or iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediary outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce "geometry-as-context". It iteratively completes the following steps using an autoregressive camera-controlled video generation model: (1) estimates the geometry of the current view necessary for 3D reconstruction, and (2) simulates and restores novel view images rendered by the 3D scene. Under this multi-task framework, we develop the camera gated attention module to enhance the model's capability to effectively leverage camera poses. During the training phase, text contexts a
564adjacent_context
low
GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry
Segmentation & Dense Prediction / Segmentation
D. adjacent but useful contextgeneral_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancyadjacent genus=Segmentation with no direct reconstruction/SLAM/map signal in title
abstractMotion segmentation in dynamic scenes is highly challenging, as conventional methods heavily rely on estimating camera poses and point correspondences from inherently noisy motion cues. Existing statistical inference or iterative optimization techniques that struggle to mitigate the cumulative errors in multi-stage pipelines often lead to limited performance or high computational cost. In contrast, we propose a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation. Our key insight is to bypass explicit correspondence estimation and instead let the model learn to implicitly disentangle object and camera motion. Supported by recent advances in 4D scene geometry reconstruction (e.g., π³), the proposed method leverages reliable camera poses and rich spatial
565adjacent_context
low
GM-R$^2$: Generative Matching Learning for Unsupervised Geometric Representation and Registration
Learning Algorithms / Self-supervised
D. adjacent but useful contextgaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mappingadjacent genus=Self-supervised with no direct reconstruction/SLAM/map signal in title
abstractThis paper proposes GM-R^2, a novel Generative Matching Learning framework for unsupervised geometric descriptor learning and correspondence matching. By reformulating descriptor learning as geometry-conditioned cross-view image generation, GM-R^2 leverages the proxy supervisory signal from structurally aligned view synthesis to implicitly enforce feature consistency across correspondence, enabling robust 3D matching. To instantiate GM-R^2, we introduce Denoising-Agnostic Coupled ControlNet conditioned on depth maps as the required geometry-conditioned cross-view generator. It effectively extends the single-view generation of naive ControlNet to the cross-view setting via a coupled depth-map input design and further removes the latent noise dependency to support geometry-only inference (expected by 3D matching). Moreover, we present Zoomable Equirectangular Projection for intrinsics-free point clou
566adjacent_context
low
MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancyadjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title
abstractDeformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relat
567adjacent_context
low
MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; generation_editing; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractExisting 4D human datasets fall short for fashion-specific research, lacking either realistic garment dynamics or task-specific annotations. Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. MV-Fashion features 3,273 sequences (72.5 million frames) from 80 diverse subjects wearing 3-10 outfits each. It is designed to capture complex, real-world garment dynamics, including multiple layers and varied styling (e.g., tucked shirts, rolled sleeves). A core contribution is a rich data representation that includes pixel-level semantic annotations, ground-truth material properties like elasticity, and 3D point clouds. Crucially for VT
568adjacent_context
low
PhysGaia: A Physics-aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractWe introduce PhysGaia, a novel physics-aware dataset specifically designed for Dynamic Novel View Synthesis (DyNVS), encompassing both structured objects and unstructured physical phenomena. Unlike existing datasets that primarily focus on photorealistic reconstruction, PhysGaia is created to actively support physics-aware dynamic scene modeling. Our dataset provides complex dynamic scenarios with rich interactions among multiple objects, where they realistically collide with each other and exchange forces. Furthermore, it contains a diverse range of physical materials, such as liquid, gas, textile, and rheological substances, which moves beyond the rigid bodies prevalent in existing datasets. All scenes in PhysGaia are faithfully generated to strictly adhere to physical laws, leveraging carefully selected material-specific physics solvers. To enable quantitative evaluation of physical mode
569adjacent_context
low
PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractDespite advances in physics-based 3D motion synthesis, current methods face key limitations: reliance on pre-reconstructed 3D Gaussian Splatting (3DGS) built from dense multi-view images with time-consuming per-scene optimization; physics integration via either inflexible, hand-specified attributes or unstable, optimization-heavy guidance from video models using Score Distillation Sampling (SDS); and naïve concatenation of prebuilt 3DGS with physics modules, which ignores physical information embedded in appearance and yields suboptimal performance. To address these issues, we propose PhysGM, a feed-forward framework that jointly predicts 3D Gaussian representation and physical properties from a single image, enabling immediate simulation and high-fidelity 4D rendering. Unlike slow appearance-agnostic optimization methods, we first pre-train a physics-aware reconstruction model that dire
570adjacent_context
low
RAYNOVA: Geometry-Free Auto-Regressive 4D World Modeling with Unified Spatio-Temporal Representation
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractWorld foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RayNova, a geometry-free world model that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RayNova constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RayNova achieves state-of-the-art multi-vi
571adjacent_context
low
ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
Segmentation & Dense Prediction / Segmentation
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmarkadjacent genus=Segmentation with no direct reconstruction/SLAM/map signal in title
abstractIndoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining consistent instance identities across intermittently captured 3D scans with unobserved change or, equivalently, performing 4D indoor semantic instance segmentation (SIS): the joint task of segmenting, identifying, and temporally associating object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and 4D LiDAR approaches, which show limited performance due to their reliance on continuous temporal measurements that are uncommon in indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores temporal fusion strategies to share information across observations, demonstrating that this shared context
572adjacent_context
low
STUR3D: Spatio-Temporal Unified Representation Learning for 3D Object Detection
Detection & Tracking / Detection
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkadjacent genus=Detection with no direct reconstruction/SLAM/map signal in title
abstractSurrounding-view 3D object detection is a fundamental task in autonomous driving, which aims to locate 3D objects from multiple camera views. Existing methods predominantly followed a 2D-to-3D pipeline, leveraging 2D detectors to enhance 3D detection performance. However, these methods ignored the inherent disparities in both temporal and feature dimensional representations between 2D and 3D detection, resulting in the positional deviations in 3D space. Furthermore, the absence of temporal information in 2D detection leads to object omission in occluded scenarios. To address these limitations, we propose STUR3D, a unified framework that builds spatio-temporal alignment between 2D and 3D perception. First, we project historical 3D detection features onto the 2D image plane, guiding the 2D detector to distill the requisite representations for 3D detection, thereby harmonizing feature repre
573adjacent_context
low
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4d; generation_editing; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractThe convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as
574adjacent_context
low
Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
Data & Evaluation / Benchmark
D. adjacent but useful contextpose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractLiDAR has become an essential sensing modality in autonomous driving, robotics, and smart-city applications. However, ghost points (or ghost), which are false reflections caused by multi-path laser returns from glass and reflective surfaces, severely degrade 3D mapping and localization accuracy. Prior ghost removal methods rely on geometric consistency in dense point clouds, failing on mobile LiDAR's sparse, dynamic data. We address this by exploiting full-waveform LiDAR (FWL), which captures complete temporal intensity profiles rather than just peak distances, providing crucial cues for distinguishing ghosts from genuine reflections in mobile scenarios. As this is a new task, we present Ghost-FWL, the first and largest annotated mobile FWL dataset for ghost detection and removal. Ghost-FWL comprises 24K frames across 10 diverse scenes with 7.5 billion peak-level annotations, which is 100$\times
575adjacent_context
low
Learning Multi-View Spatial Reasoning from Cross-View Relations
Multimodal & Language / VLM / MLLM
D. adjacent but useful contextgeneral_reconstruction; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmarkadjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title
abstractVision-language models (VLMs) have achieved impressive results on single-view vision tasks, but lack the multi-view spatial reasoning capabilities essential for embodied AI systems to understand 3D environments and manipulate objects across different viewpoints. In this work, we introduce Cross-View Relations (XVR), a large-scale dataset designed to teach VLMs spatial reasoning across multiple views. XVR comprises 100K vision-question-answer samples derived from 18K diverse 3D scenes and 70K robotic manipulation trajectories, spanning three fundamental spatial reasoning tasks: Correspondence (matching objects across views), Verification (validating spatial relationships), and Localization (identifying object positions). VLMs fine-tuned on XVR achieve substantial improvements on established multi-view and robotic spatial reasoning benchmarks (MindCube and RoboSpatial). When integrated as
576adjacent_context
low
MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractUnderstanding animal behavior requires modeling how bodies, objects, and other agents interact over time, not simply detecting isolated actions or estimating pose frame by frame. Existing animal video datasets target pose estimation or coarse, passively observed actions, and rarely provide the structured, multi-entity interaction annotations needed to study behavioral dynamics. We introduce MooCap, a multi-view video benchmark for animal-object-human interaction understanding under controlled experimental protocols. MooCap contains 42 hours of synchronized multi-camera video from 43 individually tested cows across seven standardized interaction scenarios, including novel environment, novel object, novel human, human approach, unfamiliar conspecifics (restricted and unrestricted) and Dam reunion (restricted and unrestricted). Recordings are densely annotated with 23 fine-grained behavior
577adjacent_context
low
EMMA: Extracting Multiple physical parameters from Multimodal Data
Multimodal & Language / VLM / MLLM
D. adjacent but useful contextgaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; data_benchmarkadjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title
abstractWe introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under for
578adjacent_context
low
Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos
Low-level Vision / IQA
D. adjacent but useful contextgaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmarkadjacent genus=IQA with no direct reconstruction/SLAM/map signal in title
abstractHigh Dynamic Range (HDR) user-generated (UGC) videos are rapidly proliferating across social platforms, yet most perceptual video quality assessment (VQA) systems remain tailored to Standard Dynamic Range (SDR). HDR’s higher bit depth, wide color gamut, and elevated luminance range expose distortions such as near-black crushing, highlight clipping, banding, and exposure flicker that amplify UGC artifacts and challenge SDR models. To catalyze progress, we curate HDR-UGC-44K, a large-scale subjective dataset of ~44K videos from 6.5K sources with >1.5M crowd ratings, spanning diverse scenes, capture conditions, and compression settings. We further introduce HDR-Q, the first Multimodal Large Language Model (MLLM) for HDR-UGC VQA. We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL fin
579adjacent_context
low
XR-Poser: Accurate Egocentric Human Motion Estimation for AR/VR
Video & Motion / Human Motion
D. adjacent but useful contextgeneral_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmarkadjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title
abstractEgocentric 3D human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present XR-Poser, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. The proposed model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget. The proposed auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher–st
580adjacent_context
low
Choreographing a World of Dynamic Objects
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractDynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we study a universal generative pipeline for synthesizing this type of phenomenon. Traditional rule-based graphics pipelines to create these dynamics are based on category-specific heuristics, yet are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories of interest. Our approach instead inherits the universality from the video generative models by proposing a distillation-based pipeline to extract the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a divers
581adjacent_context
low
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
Detection & Tracking / Detection
D. adjacent but useful contextpose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mappingadjacent genus=Detection with no direct reconstruction/SLAM/map signal in title
abstractThe 4D radar–camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated T
582adjacent_context
low
HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractRecent methods have made notable progress in the visual quality of hand-object interaction video synthesis. However, most approaches rely on 2D control signals that lack spatial expressiveness and limit the utilization of synthetic 3D conditional data. To address these limitations, we propose HVG-3D, a unified framework for 3D-aware hand-object interaction (HOI) video synthesis conditioned on explicit 3D representations. To achieve this, we adopt a diffusion-based architecture augmented with a 3D ControlNet, which encodes geometric and motion cues from 3D inputs to enable explicit 3D reasoning during video synthesis, together with the corresponding training and inference setting. To achieve high-quality synthesis, HVG-3D is designed with two core components: (i) a 3D-aware HOI video generation diffusion architecture that encodes geometric and motion cues from 3D inputs for explicit 3D reasoning; and (i
583adjacent_context
low
WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
Generative Models / Diffusion
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; data_benchmarkadjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title
abstractWe present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments, where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, causing ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-
584adjacent_context
low
Generating Humanless Environment Walkthroughs from Egocentric Walking Tour Videos
Generative Models / Video Generation
D. adjacent but useful contextgaussian_radiance; dynamic_4d; robotics_mapping; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractEgocentric "walking tour" videos provide a rich source of image data to develop rich and diverse visual models of environments around the world. However, the significant presence of humans in frames of these videos due to crowds and eye-level camera perspectives diminishes their usefulness in environment modeling applications. We focus on addressing this challenge by developing a generative algorithm that can realistically remove (i.e., inpaint) humans and their associated shadow effects from walking tour videos. Key to our approach is the construction of a rich semi-synthetic dataset of video clip pairs to train this generative model. Each pair in the dataset consists of an environment-only background clip, and a composite clip of walking humans with simulated shadows overlaid on the background. We randomly sourced both foreground and background components from real egocentric walking
585adjacent_context
low
GauMVC: Generative Decoupled Gaussian Representation for Human-centric Multi-view Video Compression
Learning Algorithms / Efficient Models
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; generation_editingadjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title
abstractHuman-centric multi-view video has a clear semantic structure: a static background and dynamic human motion. We propose a generative compression framework that explicitly decouples these components. The background is modeled once with 3D Gaussian Splatting, while the human is represented by a personalized Gaussian avatar reconstructed from a sparse set of key views that are transmitted only once and driven by compact per-frame pose parameters from the Skinned Multi-Person Linear (SMPL) model. The encoder sends only three elements: the background, the key views, and the SMPL parameters, enabling high-fidelity multi-viewpoint synthesis at dramatically reduced bitrates. This shifts compression from low-level redundancy removal to semantics-aware generative modeling. Experiments across multiple human-centric datasets demonstrate superior rate–distortion performance, particularly for long and
586adjacent_context
low
PAM: A Pose–Appearance–Motion Engine for Sim-to-Real HOI Video Generation
Generative Models / Video Generation
D. adjacent but useful contextpose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mapping; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractHand–object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of previous work, we think that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. Thus we introduce PAM: a Pose–Appearance–Motion Engine for controllable HOI video generation. The performance of our engine is validated by: (1) On DexYCB, w
587adjacent_context
low
GH-NAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; surface_occupancy; data_benchmarkadjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title
abstractThe advent of hash encodings has evolved neural radiance fields (NeRF)-based methods into fast and efficient 3D reconstruction techniques. In medical imaging, this framework has been extended to CT/CBCT reconstruction through neural attenuation fields (NAF), which directly model attenuation properties from projection data. Existing NeRF-based attenuation fields typically assume an idealized monoenergetic CBCT setting and therefore fail to model real-world projection inconsistencies such as scatter and noise contamination. Moreover, uniformly concatenating multi-resolution hash-grid features blends heterogeneous frequency components and noise into a single representation, causing artifacts: homogeneous regions acquire spurious high-frequency patterns, structural boundaries become blurred, and projection-induced bias propagates throughout the learned field. Given these limitations, we intr
588adjacent_context
low
Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding
Multimodal & Language / VLM / MLLM
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; data_benchmarkadjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title
abstractHuman perception of social environments is inherently a multi-view synthesis problem, requiring the integration of complementary and often occluded information across space and time. However, existing benchmarks for Multimodal Large Language Models (MLLMs) are overwhelmingly predicated on a "sufficient-view" assumption, rewarding single-view pattern recognition while failing to evaluate cross-view fusion. To address this critical gap, we introduce CVBench, a large-scale, multi-task benchmark for cross-view human understanding. CVBench comprises 3,000 challenging questions across 12 spatial and temporal tasks, where every item is designed with verifiable single-view insufficiency, mandating that models synthesize disparate evidence to resolve ambiguities. Our comprehensive evaluation of state-of-the-art open and closed-source MLLMs (from InternVL to Gemini 2.5 Pro) revea
589adjacent_context
low
Personalized Audio-driven Whole-body Talking Avatars
Video & Motion / Human Motion
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; generation_editingadjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title
abstractPrior conversational 3D avatar systems map audio to parametric poses and then render, creating a lossy bottleneck where quantization, retargeting, and tracking errors accumulate. This degrades audio–motion synchronization and suppresses micro-articulations critical for realism—such as bilabial closures, cheek inflation, nasolabial motion, blinks, and fine hand gestures—especially under single-image personalization. We propose an end-to-end framework that builds a full-body, photorealistic 3D conversational avatar from a single image and drives it directly from audio, bypassing intermediate pose prediction. The avatar is modeled as a particle-based deformation field of 3D Gaussian primitives in a canonical space, with an audio-conditioned dynamics module that outputs per-particle trajectories for face, hands, and body, enabling localized high-frequency control with globally coherent motio
590adjacent_context
low
Changes in Real Time: Online Scene Change Detection with Multi-View Fusion
Detection & Tracking / Detection
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localization; data_benchmarkadjacent genus=Detection with no direct reconstruction/SLAM/map signal in title
abstractOnline Scene Change Detection (SCD) is an extremely challenging problem that requires an agent to detect relevant changes on the fly while observing the scene from unconstrained viewpoints. Existing online SCD methods are significantly less accurate than offline approaches. We present the first online SCD approach that is pose-agnostic, label-free, and ensures multi-view consistency, while operating at over 10 FPS and achieving new state-of-the-art performance, surpassing even the best offline approaches. Our method introduces a new self-supervised fusion loss to infer scene changes from multiple cues and observations, PnP-based fast pose estimation against the reference scene, and a fast change-guided update strategy for the 3D Gaussian Splatting scene representation. Extensive experiments on complex real-world datasets demonstrate that our approach outperforms both online and offlin
591adjacent_context
low
Describe Anything Anywhere At Any Moment
Multimodal & Language / VLM / MLLM
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; data_benchmarkadjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title
abstractComputer vision and robotics applications ranging from augmented reality to robot autonomy in large-scale environments require spatio-temporal memory frameworks that capture both geometric structure for accurate language-grounding as well as semantic detail. Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D. To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding. DAAAM introduces a novel optimization-based frontend to infer detailed semantic descriptions from localized captioning models, such as the Describe Anything Model (DAM), leveraging batch processing to speed up inference by an order of magnitude for online processing. It leve
592adjacent_context
low
DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video
Detection & Tracking / Detection
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; data_benchmarkadjacent genus=Detection with no direct reconstruction/SLAM/map signal in title
abstractReliable 4D object detection, which refers to 3D object detection in streaming video, is crucial for perceiving and understanding the real world. Existing open-set 4D object detection methods typically make predictions on a frame-by-frame basis without modeling temporal consistency, or rely on complex multi-stage pipelines that are prone to error propagation across cascaded stages. Progress in this area has been hindered by the lack of large-scale datasets that capture continuous reliable 3D bounding box (b-box) annotations. To overcome these challenges, we first introduce DA4D, a large-scale 4D detection dataset containing over 280k sequences with high-quality b-box annotations collected under diverse conditions. Building on DA4D, we propose DetAny4D, an open-set end-to-end framework that predicts 3D b-boxes directly from sequential inputs. DetAny4D fuses multi-modal features from pre-t
593adjacent_context
low
Enhancing Vision Language Models for 4D Perception
Multimodal & Language / VLM / MLLM
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; data_benchmarkadjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title
abstractDespite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about 3D motion, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection on 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these, we present a QA generation pipeline that focuses on motion-related scene understanding. We take particular care of the entanglement of camera and object motion by casting tracking in both the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale set of 400K training samples and a 2.2K-sample benchmark. Training existing models on our dataset yields performance improvements on an external benchmark
594adjacent_context
low
Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals
Generative Models / Video Generation
D. adjacent but useful contextdynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractRecent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives—such as elastic collisions and falling dominos—teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, inc
595adjacent_context
low
Grounded Latents for Entity-Centric 4D Scene Generation
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractAlthough recent work has explored generative modeling of 3D or 4D driving scenes, most approaches operate on dense voxel-based representations, which are computationally expensive and struggle to maintain temporal or structural consistency. These methods often produce blurred or merged entities (i.e., cars, trucks, pedestrians) and lack fine-grained control over individual scene elements. We propose to perform generative modeling in a compact, entity-centric latent space, where each grounded 3D latent represents a semantically meaningful local region of the scene. This formulation enables precise, consistent control of both foreground and background elements while preserving geometric detail. We further extend this representation to 4D by learning a motion diffusion model for both ego and dynamic actors, conditioned on the generated 3D scene, and by propagating the grounded latents throu
596adjacent_context
low
ORBIT: Benchmarking SfM in the Wild with 360° Video
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; pose_calibration_localization; dynamic_4d; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractStructure-from-Motion (SfM) is a cornerstone of 3D perception, yet current methods often fail when applied to complex videos involving challenging camera motions or dynamic scenes. Compounding the problem, the field lacks reliable ground-truth benchmarks for such difficult scenarios, making it hard to gauge real-world progress, or pinpoint where improvements are most needed. To address this gap, we introduce a new benchmark for evaluating camera pose estimation. Our key insight is to leverage online panoramic 360° videos as a source of data from which to construct challenging clips, while still enabling robust ground-truth trajectory recovery. The panoramic nature of these videos provides richer visual context for tracking camera motion, even when parts of the view are affected by blur, motion, or dynamic objects. By tracking camera motion across full 360° videos, we crop and reproject selected port
597adjacent_context
low
Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding
Multimodal & Language / Grounding
D. adjacent but useful contextgeneral_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mappingadjacent genus=Grounding with no direct reconstruction/SLAM/map signal in title
abstractAccurate 3D scene description is fundamental to robotic navigation and augmented reality, yet current dense captioning methods face significant limitations in processing sparse point cloud data. Existing approaches that apply Euclidean embedding spaces struggle to simultaneously preserve fine-grained local geometric details and model exponentially growing global semantic hierarchies, leading to either inaccurate localization or disjointed, shallow scene descriptions. In this work, we propose a novel Curvature-Aware Captioning framework, integrating novel non-Euclidean geodesic attention mechanisms, to resolve the localization-contextualization conflict. Specifically, self-attention within Oblique space enforces dimensional homogeneity while establishing long-range dependencies. Bidirectional geodesic cross-attention within Lorentz space models hierarchical semantic rela
598adjacent_context
low
Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
Video & Motion / Human Motion
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4d; data_benchmarkadjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title
abstractUnderstanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information, such as contact forces and motion dynamics, and are prone to frequent occlusions. To address these challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove data in HOI videos into photorealistic bare-hand representations, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures both temporal and multi-view rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we introduce HandSense, the first multi-modal HOI dataset featuring multi-view
599adjacent_context
low
ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractGenerating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two
600adjacent_context
low
Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractGenerating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial–temporal consistency—constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffu
601adjacent_context
low
PerpetualWonder: Long-horizon Action-conditioned 4D Scene Generation
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractWe introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from lon
602adjacent_context
low
SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractImages and videos are discrete 2D projections of the 4D world (3D space + time). Most visual understanding, prediction, and generation operate directly on 2D observations, leading to suboptimal performance. We propose SeeU, a novel approach that learns the continuous 4D dynamics and generates the unseen visual content. The principle behind SeeU is a new 2D$\to$4D$\to$2D learning framework. SeeU first reconstructs the 4D world from sparse and monocular 2D frames (2D$\to$4D). It then learns the continuous 4D dynamics on a low-rank representation and physical constraints (discrete 4D$\to$continuous 4D). Finally, SeeU rolls the world forward in time, re-projects it back to 2D at sampled times and viewpoints, and generates unseen regions based on spatial-temporal context awareness (4D$\to$2D). By modeling dynamics in 4D, SeeU achieves continuous and physically-consistent novel visual generati
603adjacent_context
low
Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance
Computational Imaging / Computational Imaging
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4d; surface_occupancyadjacent genus=Computational Imaging with no direct reconstruction/SLAM/map signal in title
abstractWe present a differentiable framework to adaptively compute 4D illumination conditions with respect to an object, for efficient, high-quality simultaneous acquisition of its shape and reflectance, with a unified spatial-angular structured light and a single camera. Using a simple histogram-based pixel-level probability model for depth and reflectance, we differentiably link the next illumination condition(s) with a loss that encourages the reduction in depth uncertainty. As new structured illumination is cast, corresponding image measurements are used to update the uncertainty at each pixel. Finally, a fine-tuning-based approach reconstructs the depth map and reflectance parameter maps, by minimizing the differences between all physical measurements and their simulated counterparts. The effectiveness of our framework is demonstrated on physical objects with wide variations in shape and a
604adjacent_context
low
S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
Multimodal & Language / VLM / MLLM
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; surface_occupancy; robotics_mappingadjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title
abstract3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle with understanding 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly utilize viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that enhances spatial reasoning in MLLMs through implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstr
605adjacent_context
low
Endless World: Real-Time 3D-Aware Long Video Generation
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractProducing long, coherent video sequences with stable 3D structure remains a major challenge, particularly in streaming scenarios. Motivated by this, we introduce Endless World, a real-time framework for infinite, 3D-consistent video generation. To support infinite video generation, we introduce a conditional autoregressive training strategy that aligns newly generated content with existing video frames. This design preserves long-range dependencies while remaining computationally efficient, enabling real-time inference on a single GPU without additional training overhead. Moreover, our Endless World integrates global 3D-aware attention to provide continuous geometric guidance across time. Our 3D injection mechanism enforces physical plausibility and geometric consistency throughout extended sequences, addressing key challenges in long-horizon and dynamic scene synthesis. Extensive experi
606adjacent_context
low
LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors
Generative Models / Image Editing
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondence; generation_editingadjacent genus=Image Editing with no direct reconstruction/SLAM/map signal in title
abstractThe task of multi-view inpainting necessitates 3D consistency in the inpainted images. Most prior methods first employ single-view 2D inpainting and then enforce multi-view consistency in a post-hoc 3D optimization stage, which leads to undesirable artifacts and lengthy optimization times. The existing single-stage method, MVInpainter, uses video priors and is pose-free, making it less suitable for inputs beyond video sequences. In this paper, we propose a framework that trains an inpainting model to condition on the explicit and reliable multi-view correspondences from a 3D foundation model. Central to our framework is a cross-view conditioning architecture, LaRP, carefully designed to utilize both the generative prior of a pretrained diffusion inpainting model and the reprojected cross-view appearance latents. We additionally propose a scalable data pipeline for stable training of LaRP
607adjacent_context
low
ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; pose_calibration_localization; dynamic_4d; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractWe present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.
608adjacent_context
low
Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancy; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractVideo diffusion models have rich world priors, but their use in spatial tasks is limited by poor control, spatiotemporally inconsistent results, and entangled scene-camera dynamics. Current approaches, such as per-task fine-tuning or post-process warping strategies, are insufficient, often introducing visual artifacts, failing to generalize, or incurring high computational costs. We introduce a novel, training-free framework that operates purely at inference time to resolve these issues. Our method comprises three synergistic components. First, an intra-step refinement loop injects fine-grained motion guidance during the denoising process, iteratively correcting the output to ensure strict adherence to the target camera path. Second, an optical flow-based analysis identifies and isolates motion-related channels within the latent space. This allows our framework to selectively apply
609adjacent_context
low
ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractImage-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V g
610adjacent_context
low
From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction
Video & Motion / Human Motion
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; surface_occupancy; robotics_mappingadjacent genus=Human Motion; useful only if manually connected to reconstruction
abstractTwo-hand reconstruction from monocular images is hampered by complex poses and severe occlusions, which often cause interaction misalignment and two-hand penetration. We address this by decoupling the problem into 2D structural alignment and 3D spatial interaction alignment, each handled by a tailored component. For 2D alignment, we pioneer the attempt to unify heterogeneous structural priors (keypoints, segmentation, and depth) from vision foundation models as complementary structured guidance for two-hand recovery. Instead of extracting prior predictions as explicit inputs, we propose a fusion-alignment encoder that absorbs their structural knowledge implicitly, achieving foundation-level guidance without foundation-level cost. For 3D spatial alignment, we propose a two-hand diffusion model that learns a generative mapping from interpenetrated poses to realistic, collision-free configu
611adjacent_context
low
Align Images Before You Generate
Generative Models / Diffusion
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4d; generation_editingadjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title
abstractMulti-image diffusion models can generate images like multi-views or videos to describe static or dynamic scenes, yet texture and structure drift persist, severely undermining the spatiotemporal consistency. Addressing this issue remains challenging, especially without any external geometric or semantic priors during the pure generative inference. In this paper, we introduce CorrAdapter, a plug-and-play adapter that discovers and exploits an innate property of the multi-image diffusion itself, aligning all output images before they are in fact generated. Specifically, CorrAdapter designs a bypass branch for transformer blocks in the multi-image diffusion model, encompassing a native correspondence constructor that builds reliable correspondences from the diffusion model's intermediate features, and an aligned area aggregator that integrates messages from only matching regions to avoid am
612adjacent_context
low
Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval
Recognition & Classification / Retrieval
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondence; data_benchmarkadjacent genus=Retrieval with no direct reconstruction/SLAM/map signal in title
abstractVisual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model’s dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel viewpoint-pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. F
613adjacent_context
low
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextpose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmarkadjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title
abstractFine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue–Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question → Option → Knowledge → Clue → Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants
614adjacent_context
low
PAVAS: Physics-Aware Video-to-Audio Synthesis
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractRecent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical r
615adjacent_context
low
Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields
Computational Imaging / Computational Imaging
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; dynamic_4dadjacent genus=Computational Imaging with no direct reconstruction/SLAM/map signal in title
abstractWith the success of static black-hole imaging, the next frontier is the dynamic and 3D imaging of black holes. Recovering the dynamic 3D gas near a black hole would reveal previously-unseen parts of the universe and inform new physics models. However, only sparse radio measurements from a single viewpoint are possible, making the dynamic 3D reconstruction problem significantly ill-posed. Previously, BH-NeRF addressed the ill-posed problem by assuming Keplerian dynamics of the gas, but this assumption breaks down near the black hole, where the strong gravitational pull of the black hole and increased electromagnetic activity complicate fluid dynamics. To overcome the restrictive assumptions of BH-NeRF, we propose PINeRF, a physics-informed approach that uses differentiable neural rendering to fit a 4D (time + 3D) emissivity field given EHT measurements. Our approach jointly reconstructs
616adjacent_context
low
FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractWe introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference
617adjacent_context
low
SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; generation_editing; data_benchmarkadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractWe present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter both the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video’s motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This simple yet crucial strategy enables the model to learn temporal control, directly producing the observed spa
618adjacent_context
low
Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis
Computational Imaging / Computational Imaging
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; surface_occupancyadjacent genus=Computational Imaging with no direct reconstruction/SLAM/map signal in title
abstractWe present GRaF, Generalizable Radio-Frequency (RF) Radiance Fields, a framework that models RF signal propagation to synthesize spatial spectra at arbitrary transmitter or receiver locations, where each spectrum measures signal power across all surrounding directions at the receiver. Unlike state-of-the-art methods that adapt vanilla Neural Radiance Fields (NeRF) to the RF domain with scene-specific training, GRaF generalizes across scenes to synthesize spectra. To enable this, we prove an interpolation theory in the RF domain: the spatial spectrum from a transmitter can be approximated using spectra from geographically proximate transmitters. Building on this theory, GRaF comprises two components: (i) a geometry-aware Transformer encoder that captures spatial correlations from neighboring transmitters to learn a scene-independent latent RF radiance field, and (ii) a neural ray tracing
619adjacent_context
low
Real2Sim2Real: RetinalDepth-64K for Depth Estimation in Posterior Segment Ophthalmic Surgery
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; robotics_mapping; data_benchmarkadjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title
abstractAccurate depth estimation is crucial for 3D reconstruction and precise navigation in ophthalmic fundus surgery. However, acquiring annotated data remains challenging due to the impracticality of depth sensors under surgical microscopes. To overcome this limitation, we introduce RetinalDepth-64K, a novel synthetic dataset comprising 64,000 stereo image pairs across 1,280 diverse scenes, developed through a Real2Sim2Real pipeline that transforms real-world fundus surgery videos into synthetic data and facilitates model deployment in real scenarios. We analyzed key characteristics such as intricate retinal textures from real-world videos to guide the Real-to-Sim phase, enabling realistic data synthesis. To improve dataset fidelity for depth estimation, we created 3D eye models using Blender with ultra-wide-field retinal textures, glass-modeled aqueous humor, and dynamic instrument trajector
620adjacent_context
low
Wavelet-Driven 3D Anomaly Detection under Pose-Agnostic and Sparse-View
Detection & Tracking / Detection
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; pose_calibration_localizationadjacent genus=Detection with no direct reconstruction/SLAM/map signal in title
abstractPose-agnostic anomaly detection (PAD) achieves strong performance in localizing anomalies from arbitrary viewpoints when trained on densely sampled normal data. However, under sparse-view conditions, existing methods face two key challenges: (1) sparse observations lead to overfitting and geometric detail loss in 3D reconstruction; (2) limited visual cues lead to inaccurate pose estimation, compromising the reliability of subsequent anomaly localization. To address these challenges, we propose Wave-Pose3D, a wavelet-driven 3D anomaly detection framework tailored for PAD under sparse-view conditions. First, we design a structure-aware and wavelet-optimized Gaussian modeling strategy that dynamically filters unreliable regions via structural priors to mitigate overfitting and leverages high-frequency supervision to restore fine-grained geometric details. Second, to improve pose estimation
621adjacent_context
low
RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation
Multimodal & Language / Grounding
D. adjacent but useful contextpose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmarkadjacent genus=Grounding with no direct reconstruction/SLAM/map signal in title
abstractVisual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations, we propose the RealVLG framework, which integrates the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. RealVLG-11B dataset provides multi-granularity annotations including bounding boxes, segmentation masks, grasp poses, contact points, and human-verified fine-grained language descriptions, covering approximately 165,000 images, ove
622adjacent_context
low
RPGFusion: 4D Radar Prior-Guided Multi-Modal Fusion for 3D Detection
Detection & Tracking / Detection
D. adjacent but useful contextdepth_correspondence; dynamic_4d; surface_occupancy; robotics_mappingadjacent genus=Detection with no direct reconstruction/SLAM/map signal in title
abstractAccurate 3D object detection in autonomous driving relies on effectively combining complementary information from multiple sensors. 4D millimeter-wave radar provides sparse yet physically reliable measurements, whose potential for enhancing sensor fusion has not been fully utilized. In this work, we propose Radar Prior Guided Fusion (RPGFusion), a practical 4D radar–camera fusion framework. We first generate radar prior maps that encode spatial confidence and depth cues. These priors guide image feature sampling while preventing the uneven BEV feature distribution (near-dense, far-sparse) caused by Lift-Splat-Shoot view transformation. To address the sparsity and noise inherent in point clouds, we adopt a hybrid robust encoding and sparse-to-dense feature propagation. We further introduce spatial alignment and semantic fusion modules to reconc
623adjacent_context
low
Scene Grounding in the Wild
Multimodal & Language / Grounding
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; data_benchmarkadjacent genus=Grounding with no direct reconstruction/SLAM/map signal in title
abstractReconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains
624adjacent_context
low
Correspondence-Attention Alignment for Multi-view Diffusion Models
Generative Models / Diffusion
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondenceadjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title
abstractMulti-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise co
625adjacent_context
low
Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
Detection & Tracking / Detection
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancyadjacent genus=Detection with no direct reconstruction/SLAM/map signal in title
abstractVision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated sc
626adjacent_context
low
RAM: Recover Any 3D Human Motion in-the-Wild
Video & Motion / Human Motion
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; surface_occupancyadjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title
abstractRecovering 3D human motion from monocular videos in-the-wild remains challenging due to occlusions, rapid movements, and viewpoint variations. To address these challenges, we introduce Recover-Anyone Module (RAM), a unified framework for real-time and accurate 3D human motion reconstruction. RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks su
627adjacent_context
low
Unified Video Editing as Temporal Reasoner
Generative Models / Image Editing
D. adjacent but useful contextpose_calibration_localization; dynamic_4d; robotics_mapping; generation_editingadjacent genus=Image Editing with no direct reconstruction/SLAM/map signal in title
abstractExisting video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "seeing, reasoning, then editing" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning toke
628adjacent_context
low
HUMAPS-4D : A Multimodal Dataset for HUman Motion Analysis with Physiological and Semantic informations
Data & Evaluation / Benchmark
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractCurrent advancements in human motion understanding are strongly reliant on video data. Nevertheless, privacy regulations and operational constraints increasingly restrict the use of visual data in real-world scenarios. Inferring posture through wearable sensors, such as instrumented insoles measuring plantar activation, presents itself as a promising alternative. However, the absence of large-scale multimodal datasets hinders the rigorous benchmarking of these methodologies. We introduce HUMAPS-4D, a novel multimodal dataset designed for human motion analysis, effectively bridging computer vision and biomechanics. This dataset integrates synchronized motion capture, multi-view video, IMUs, plantar pressure signals, sEMG activation patterns, and high-level semantic annotations. The data was collected from 32 subjects performing 30 actions over a total duration of 14 hours. Participants de
629adjacent_context
low
Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining
Video & Motion / Human Motion
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; data_benchmarkadjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title
abstractHuman motion recovery for real-world interaction demands both precise action details and metric-scale trajectories. Recovering absolute human pose from monocular input presents a viable solution, but faces two main challenges: (1) models' reliance on 3D training data from constrained environments limits their out-of-distribution generalization; and (2) the inherent difficulty of estimating metric-scale poses from monocular observations. This paper introduces Mocap-2-to-3, a novel framework that differs from prior HMR methods by recovering absolute poses from monocular input and leveraging abundant 2D data to enhance 3D motion recovery. To effectively utilize the action priors and diversity in large-scale 2D datasets, we reformulate 3D motion as a multi-view synthesis process and divide the training into two stages: a single-view diffusion model is first pre-trained on extensive 2D data,
630adjacent_context
low
SHands: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; data_benchmarkadjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title
abstractIn surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, and difficult to scale, with expertise confined to institutions that have specialists available. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHands), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. SHands captures linear incision and suturing using five RGB cameras from complementary viewpoints, performed by 52 participants (20 experts and 32 trainees) each completing three standardized trials per procedure. The videos are annotated at the frame level with 15 ges
631adjacent_context
low
240FPS Stereo Vision from Monocular Mixed Spikes
Computational Imaging / Computational Imaging
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4dadjacent genus=Computational Imaging with no direct reconstruction/SLAM/map signal in title
abstractStereo vision is fundamental for enabling machines to perceive and interact with the world. While monocular stereo methods offer hardware compactness, they struggle with generalization due to reliance on data-driven priors. Binocular and multi-view systems improve accuracy but incur higher hardware complexity and data inefficiency. In this paper, we introduce a monocular solution for high-frame-rate stereo vision via temporal optical modulation. The modulation directs light from two views in a mixed manner while periodically attenuating one view at 60Hz. To capture the temporal variations introduced by this modulation, we employ a high-speed spike camera that records the mixed scene as temporally dense spikes. The high temporal resolution of these spikes enables the construction of a linear system for efficient binocular video decoupling. Consequently, we introduce a two-stage decoding
632adjacent_context
low
EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
Multimodal & Language / VLM / MLLM
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mappingadjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title
abstractRecent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for “thinking with images” (e.g., ChatGPT–o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present **EagleVision**, a dual-stage framework for progressive spatial cognition through macro percep
633adjacent_context
low
Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; dynamic_4dadjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title
abstractEndoscopic video analysis is crucial for early gastrointestinal screening, but its progress is constrained by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods designed for natural videos tend to prioritize dense spatio-temporal modeling and exhibit motion bias, neglecting the static, structured semantics that are critical for clinical decision-making. To address this challenge, we propose **F**ocus-to-**P**erceive **R**epresentation **L**earning (***FPRL***), a cognition-inspired hierarchical framework that emulates the clinical examination process of endoscopic videos. ***FPRL*** first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, ***FPRL*** employs a hierarchical semantic modeling mechanism that explicitly
634adjacent_context
low
Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals
Computational Imaging / Computational Imaging
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mappingadjacent genus=Computational Imaging with no direct reconstruction/SLAM/map signal in title
abstractRobust 3D environmental perception is critical for applications like autonomous navigation and robotics, yet existing optical sensors like cameras and LiDAR fail in adverse conditions such as smoke, fog, and non-ideal lighting. While specialized radar systems can operate in these conditions, their reliance on bespoke, ultra-wideband hardware and licensed spectrum limits their scalability and cost-effectiveness. This paper introduces Rascene, a novel framework that enables high-fidelity 3D imaging by repurposing ubiquitous mmWave OFDM communication signals. Recognizing that a single-frame RF signal is inherently sparse, noisy, and highly ambiguous, the key innovation of Rascene is a multi-frame 3D imaging framework designed to fuse information from signals captured across multiple, arbitrary poses. This framework leverages a spatially adaptive fusion mechanism to find geometric consensus
635adjacent_context
low
Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
Multimodal & Language / VLM / MLLM
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mappingadjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title
abstractSpatial reasoning is the process of locating target objects based on spatial relations in 3D scenes, which plays a crucial role in developing intelligent embodied agents. Due to the limited availability of 3D scene-language paired data, it is challenging to train models with strong reasoning ability from scratch. Previous approaches have attempted to inject 3D scene representations into the input space of Large Language Models (LLMs) and leverage the pretrained comprehension and reasoning abilities for spatial reasoning. However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length th
636adjacent_context
low
Action Motifs: Self-Supervised Hierarchical Representation of Human Body Movements
Video & Motion / Human Motion
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; data_benchmarkadjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title
abstractEffective human behavior modeling requires a representation of the human body movement that capitalizes on its compositionality. We propose a hierarchical representation consisting of Action Atoms which capture the atomic joint movements and Action Motifs which are formed by their temporal compositions and encode similar body movements found across different overall human actions. We derive A4Mer, a nested latent Transformer to learn this hierarchical representation from human pose data in a fully self-supervised manner. A4Mer splits a 3D pose sequence into variable-length segments and represents each segment as a single latent token (Action Atoms). Through bottom-up representation learning, temporal patterns composed of these Action Atoms, which capture meaningful temporal spans of reusable, semantic segments of body movements, naturally emerge (Action Motifs). A4Mer achieves this with
637adjacent_context
low
Gloria: Consistent Character Video Generation via Content Anchors
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractDigital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the "memory", leading to suboptimal consistency. Recognizing that character video generation inherently resembles an "outside-looking-in" scenario, we propose representing the character’s visual attributes through a compact set of anchor frames. This design provides stable references for consistency, while reference-based video generation inherently faces challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, providing intra- and extra-training clip cues to prevent duplication, and RoPE as Weak Conditio
638adjacent_context
low
Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
Detection & Tracking / Detection
D. adjacent but useful contextpose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmarkadjacent genus=Detection with no direct reconstruction/SLAM/map signal in title
abstractIntegrating LiDAR and camera inputs into a unified Bird’s-Eye-View (BEV) representation is crucial for enhancing 3D perception capabilities of autonomous vehicles. However, existing methods suffer from spatial misalignment between LiDAR and camera features, which causes inaccurate depth supervision in the camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from calibration inaccuracies and the rolling shutter effect. The key insight of this work is that locations of these projection errors are not random but highly predictable, as they are concentrated at object-background boundaries which 2D detectors can reliably identify. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calib
639adjacent_context
low
In Pursuit of Pixel Supervision for Visual Pre-training
Learning Algorithms / Self-supervised
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; robotics_mappingadjacent genus=Self-supervised with no direct reconstruction/SLAM/map signal in title
abstractPixels provide a lightweight, scalable way to encode the physical world, preserving rich visual information with minimal human inductive bias. We demonstrate that visual pre-training using pixel supervision alone can learn desirable visual properties and produce strong representations, while remaining simple, stable, and efficient. We present Pixo, a capable self-supervised model trained by purely predicting pixels. It is instantiated on the masked autoencoding (MAE) framework, but enhances MAE with a deeper decoder, larger-block masking, and additional class tokens. It is trained on 2B web-crawled images with a self-curated strategy. Pixo performs well on many downstream tasks, covering monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), object segmentation (e.g., SAM 2), and embodied AI. We will release the training code and pre-traine
640adjacent_context
low
Fast Reasoning Segmentation for Images and Videos
Segmentation & Dense Prediction / Segmentation
D. adjacent but useful contextdepth_correspondence; surface_occupancy; robotics_mapping; data_benchmarkadjacent genus=Segmentation with no direct reconstruction/SLAM/map signal in title
abstractReasoning segmentation enables open-set object segmentation via implicit text queries, therefore serving as a foundation for embodied agents that should operate autonomously in real-world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters that exceed the computational capabilities of edge devices that typically deploy the embodied AI systems. Distillation offers a pathway to compress these models while preserving their capabilities. Yet, existing distillation approaches fail to transfer the multi-step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re-frami
641adjacent_context
low
GenMatter: Perceiving Physical Objects with Generative Matter Models
Segmentation & Dense Prediction / Segmentation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; data_benchmarkadjacent genus=Segmentation with no direct reconstruction/SLAM/map signal in title
abstractHuman visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion and appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates
642adjacent_context
low
InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
Generative Models / Image Editing
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; generation_editingadjacent genus=Image Editing with no direct reconstruction/SLAM/map signal in title
abstractWe address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a spec
643adjacent_context
low
MatLat: Material Latent Space for PBR Texture Generation
Generative Models / Diffusion
D. adjacent but useful contextdepth_correspondence; surface_occupancy; robotics_mapping; data_benchmarkadjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title
abstractWe propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, **MatLat**, through targeted fine-tuning. Unlike prior methods that freeze the embedding network and thus lead to distribution shifts when encoding additional PBR channels and hinder subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops
644adjacent_context
low
Plenoptic Video Generation
Generative Models / Video Generation
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; generation_editingadjacent genus=Video Generation with no direct reconstruction/SLAM/map signal in title
abstractCamera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view settings, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against l
645adjacent_context
low
MambaSIC: Mamba-based Stereo Image Compression with Bi-directional Multi-reference Entropy Model
Learning Algorithms / Efficient Models
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; robotics_mappingadjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title
abstractStereo image compression (SIC) has become increasingly vital with its applications surging in fields such as 3D reconstruction and autonomous navigation. Previous methods leverage cross-attention to model inter-view redundancy and employ autoregressive entropy models to predict probability distributions, achieving impressive rate-distortion performance. However, they suffer from slow coding speed due to the quadratic complexity of cross-attention mechanisms and the spatial autoregressive iterations of the entropy models. To address these limitations, we propose MambaSIC, which introduces two key innovations. First, we propose a Mamba-based stereo visual state space block (stereo VSSB) that leverages its linear complexity and long-range modeling capabilities to more rapidly and efficiently capture redundancy information between the two views. Second, to accelerate the compression process
646adjacent_context
low
Learning Diffeomorphism for Medical Image Registration with Time-Embedded Architectures Using Semigroup Regularization
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextpose_calibration_localization; dynamic_4d; robotics_mappingadjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title
abstractDiffeomorphic image registration (DIR) seeks topology-preserving transformations and is fundamental in medical imaging. Existing DIR methods rely on integration schemes (e.g., scaling-and-squaring) and multiple regularizers to enforce invertibility. We introduce **SGDIR**, a continuous-time registration framework, parameterized by known time-embedded backbones, that models diffeomorphisms using only a single semigroup-based regularization, eliminating explicit integration and auxiliary constraints. We mathematically prove that this formulation directly learns the flow of an underlying ODE, inherently enforcing inverse and cycle consistencies. We evaluate on eight 2D and 3D MR and CT datasets. Under strict semigroup enforcement, our model achieves near-perfect diffeomorphism (near-zero folding) and significantly outperforms existing diffeomorphic methods, while remaining competitive with
647adjacent_context
low
SPREAD: Spatial-Physical Reasoning via gEometry Aware Diffusion
Generative Models / Diffusion
D. adjacent but useful contextsurface_occupancy; robotics_mapping; generation_editing; data_benchmarkadjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title
abstractAutomated 3D scene generation is pivotal for applications spanning virtual reality, digital content creation, and Embodied AI. While computer graphics prioritizes aesthetic layouts, vision and robotics demand scenes that mirror real-world complexity, which current data-driven methods struggle to achieve due to limited unstructured training data and insufficient spatial and physical modeling. We propose SPREAD, a diffusion-based framework that jointly learns spatial and physical relationships through a graph transformer, explicitly conditioning on posed scene point clouds for geometric awareness. Moreover, our model integrates differentiable guidance for collision avoidance, relational constraint, and gravity, ensuring physically coherent scenes without sacrificing relational context. Our experiments on 3D-FRONT and ProcTHOR datasets demonstrate state-of-the-art performance in spatial-rela
648adjacent_context
low
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
Multimodal & Language / VLM / MLLM
D. adjacent but useful contextgeneral_reconstruction; robotics_mapping; data_benchmarkadjacent genus=VLM / MLLM with no direct reconstruction/SLAM/map signal in title
abstractVision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving an
649adjacent_context
low
Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation
Generative Models / Diffusion
D. adjacent but useful contextgaussian_radiance; robotics_mapping; generation_editingadjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title
abstractIn stroke-based rendering, search methods often get trapped in local minima due to discrete stroke placement, while differentiable optimizers lack structural awareness and produce unstructured layouts. To bridge this gap, we propose a dual representation that couples discrete polylines with continuous Bézier control points via a bidirectional mapping mechanism. This enables collaborative optimization: local gradients refine global stroke structures, while content-aware stroke proposals help escape poor local optima. Our representation further supports Gaussian-splatting-inspired initialization, enabling highly parallel stroke optimization across the image. Experiments show that our approach reduces the number of strokes by 30–50%, achieves more structurally coherent layouts, and improves reconstruction quality, while cutting optimization time by 30–40% compared to existing differentiab
650adjacent_context
low
AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
Data & Evaluation / Benchmark
D. adjacent but useful contextpose_calibration_localization; robotics_mapping; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractAssembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking shape complexities and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and part assembly trajectories. We also propose a transformer-based model, AssemblyDyno, which uses the instructional manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior works in both assembly pose estimation and trajectory feasibility, where the latter is evaluated by our physics-based simulations.
651adjacent_context
low
SceMoS: Local Scene-Aware Human Motion Synthesis by Planning with Geometry-Grounded Tokens
Video & Motion / Human Motion
D. adjacent but useful contextsurface_occupancy; robotics_mapping; data_benchmarkadjacent genus=Human Motion with no direct reconstruction/SLAM/map signal in title
abstractSynthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent (“walk to the couch”) and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a top-down bird’s-eye-view (BEV) image of the scene, encoded with DINOv2 features, as the scene representation, a
652adjacent_context
low
PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
Low-level Vision / IQA
D. adjacent but useful contextgeneral_reconstruction; gaussian_radianceadjacent genus=IQA with no direct reconstruction/SLAM/map signal in title
abstractDiffusion models are promising for sparse-view novel view synthesis (NVS), as they can generate pseudo-ground-truth views to aid 3D reconstruction pipelines like 3D Gaussian Splatting (3DGS). However, these synthesized images often contain photometric and geometric inconsistencies, and their direct use for supervision can impair reconstruction. To address this, we propose Partial-Reference Image Quality Assessment (PR-IQA), a framework that evaluates diffusion-generated views using reference images from different poses, eliminating the need for ground truth. PR-IQA first computes a geometrically consistent partial quality map in overlapping regions. It then performs quality completion to inpaint this partial map into a dense, full-image map. This completion is achieved via a cross-attention mechanism that incorporates reference-view context, ensuring cross-view consistency and enabling t
653adjacent_context
low
Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields
Learning Algorithms / Efficient Models
D. adjacent but useful contextgeneral_reconstruction; gaussian_radianceadjacent genus=Efficient Models with no direct reconstruction/SLAM/map signal in title
abstractNovel-view synthesis plays a crucial role in computer vision with applications in 3D reconstruction, mixed reality, and robotics. Recent approaches, such as 3D Gaussian Splatting (3DGS), have emerged as state-of-the-art solutions, offering high-quality novel view synthesis in real time. However, training 3DGS models remains slow, particularly for high-resolution images, often requiring hours to fit a scene with 200 views. In this work, we aim to accelerate the fitting process by reducing computational overhead and improving learning efficiency. Specifically, we introduce a dilated rendering technique that renders only a subset of pixels instead of the full image, significantly reducing computational costs. To enhance learning efficiency, we develop a convergence-aware budget control mechanism that balances the addition of new Gaussians with the optimization of existing ones. Additionally
654adjacent_context
low
ENC-Bench: A Benchmark for Evaluating Multimodal Large Language Models in Electronic Navigational Chart Understanding
Data & Evaluation / Benchmark
D. adjacent but useful contextpose_calibration_localization; robotics_mapping; data_benchmarkadjacent genus=Benchmark with no direct reconstruction/SLAM/map signal in title
abstractElectronic Navigational Charts (ENCs) are the safety-critical backbone of modern maritime navigation, yet it remains unclear whether multimodal large language models (MLLMs) can reliably interpret them. Unlike natural images or conventional charts, ENCs encode regulations, bathymetry, and route constraints via standardized vector symbols, scale-dependent rendering, and precise geometric structure---requiring specialized maritime expertise for interpretation. We introduce ENC-Bench, the first benchmark dedicated to professional ENC understanding. ENC-Bench contains 20,490 expert-validated samples from 840 authentic National Oceanic and Atmospheric Administration (NOAA) ENCs, organized into a three-level hierarchy: Perception (symbol and feature recognition), Spatial Reasoning (coordinate localization, bearing, distance), and Maritime Decision-Making (route legality, safety assessment, em
655adjacent_context
low
Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments
Detection & Tracking / Detection
D. adjacent but useful contextsurface_occupancy; robotics_mapping; data_benchmarkadjacent genus=Detection with no direct reconstruction/SLAM/map signal in title
abstractIncremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of poin
656adjacent_context
low
Translating Signals to Languages for sEMG-Based Activity Recognition
Recognition & Classification / Classification
D. adjacent but useful contextsurface_occupancy; robotics_mapping; data_benchmarkadjacent genus=Classification with no direct reconstruction/SLAM/map signal in title
abstractSurface electromyography (sEMG) signal-based activity recognition has attracted increasing research attention in recent years. To develop accurate sEMG signal-based activity recognizers, numerous approaches have been proposed. Some studies focus on designing larger and more expressive model architectures to enhance the representational capacity of sEMG signals, while others aim to enrich model priors through large-scale pretraining, thereby improving recognition performance. Recently, large language models (LLMs) have shown remarkable generalization and reasoning capabilities in natural language processing, whose implicit knowledge, learned from extensive linguistic descriptions of actions, opens new possibilities for interpreting sEMG signals and inferring activity intentions. Motivated by this, we propose LLM-sEMG, a novel framework that leverages LLMs as sEMG activity recognizers. Wit
657adjacent_context
low
Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images
Generative Models / Diffusion
D. adjacent but useful contextgaussian_radiance; robotics_mappingadjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title
abstractWe present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground-level AND by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis, compared to ground imagery alone. We train on curated georeferenced data sets and paired satellite--terrain data, mined from open mapping services.
658adjacent_context
low
PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation
Medical & Scientific Imaging / Medical Imaging
D. adjacent but useful contextgaussian_radiance; pose_calibration_localizationadjacent genus=Medical Imaging with no direct reconstruction/SLAM/map signal in title
abstractBrain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided Region Network)—an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions
659adjacent_context
low
Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation
Generative Models / Diffusion
D. adjacent but useful contextgaussian_radiance; robotics_mappingadjacent genus=Diffusion with no direct reconstruction/SLAM/map signal in title
abstractWe propose Decoupled Residual Denoising Diffusion models (DRDD) for unified and data-efficient image-to-image (I2I) translation. While diffusion models have advanced I2I translation in terms of quality and diversity, we uncover a previously under-explored property in diffusion models. Crucially, beyond its conventional role of manifold lifting (i.e., moving data off low-dimensional manifolds), injecting Gaussian noise facilitates domain harmonization by implicitly aligning feature distributions across domains, a property particularly advantageous for unified I2I translation. However, existing diffusion models prematurely erode this harmonization effect, as noise and residuals are simultaneously removed in a single coupled diffusion process. To address this, DRDD decouples the diffusion process into two sequential and independent diffusion stages: (1) a stochastic noise diffusion for doma
660likely_noise
low
3D-Object Perception Transformer (3PT)
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractCurrent approaches to zero-shot 3D-object perception typically rely on ensembles of frozen foundation models. This limits deep object understanding and cross-domain generalization, making performance inadequate for real-world deployment. The 3D-Object Perception Transformer (3PT) addresses this limitation by unifying detection, segmentation, and 6DoF pose estimation in a single framework, directly trained for 3D-object perception. Based on two large-scale trained Transformers that specialize in 2D and 3D object-centric scene understanding respectively, 3PT continuously refines its object representations without depth input, enhancing 3D understanding by incorporating multi-view information. 3PT surpasses task-specialized models for detection and pose estimation, often achieving double-digit percentage improvements on the diverse BOP-benchmarks. Achieving high accuracy and robustness, \alg
661likely_noise
low
ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractSymmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane's orientation. In this paper, we address these limitations by presenting the first framework for detecting *3D-grounded reflectional symmetries* from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchS
662likely_noise
low
Extend3D: Town-scale 3D Generation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancy; generation_editingweak or indirect keyword match
abstractIn this paper, we propose Extend3D, a novel training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces of object-centric models in representing wide scenes, we extend the latent space in $x$ and $y$ directions. Then, by dividing the extended latent into overlapping patches, we use the object-centric 3D generative model on each patch and couple them at each time step. Since object-centric models are sub-optimal for sub-scene generation, we use the input image and point cloud extracted from a depth estimator as priors to enable this process. Using the point cloud prior, we initialize the scene structure and refine the occluded region iteratively with under-noised SDEdit. Also, both priors are used to optimize the extended latent during the denoising process so that the denoisi
663likely_noise
low
Fast Markov Random Field Optimisation for Topologically Noisy 3D Shape Matching
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractIn many real world applications of non-rigid shape matching, the shapes are subject to topological noise (i.e. varying genus). In this paper, we propose a novel formulation based on Markov Random Fields (MRF) that can handle these cases with topological noise. The solutions to our optimisation problem can be approximated efficiently using the alpha expansion algorithm, which gives rise to theoretical approximation guarantees. In particular, we cast non-rigid 3D shape matching as a multi-labelling problem in which each triangle of the source shape is assigned a label that represents the matching to a specific surface element on the target shape. We propose a novel pairwise term that imposes that our matching prefers solutions in which neighbouring triangles on the source shape remain close on the target shape. Further, by exploiting the specific structure of our label space, we show that
664likely_noise
low
From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancyweak or indirect keyword match
abstractShape matching is a fundamental task in computer graphics and vision, with deep functional map methods emerging as a preferred solution. However, existing approaches primarily focus on learning informative feature representations by constraining both pointwise and functional maps, while overlooking the optimization of a crucial component: the spectral basis, which plays a key role in the (deep) functional maps pipeline. This oversight leads to suboptimal matching performance. Furthermore, these approaches mostly rely on conventional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational overhead. To address these issues, we introduce Advanced Functional Maps, which generalizes standard functional maps from fixed basis functions to learnable basis functions, supported by rigorous theoretical guarantees. In this framework, the spectral basi
665likely_noise
low
Native and Compact Structured Latents for 3D Generation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancy; generation_editing; data_benchmarkkeyword noise pattern without direct reconstruction signal
abstractRecent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models compris
666likely_noise
low
Pano360: Perspective to Panoramic Vision with Geometric Consistency
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractPrior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we collected a large-scale data
667likely_noise
low
PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmarkweak or indirect keyword match
abstract6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and ge
668likely_noise
low
Rethinking 2D-3D Registration: A Novel Network for High-Value Zone Selection and Representation Consistency Alignment
3D Vision & Geometry / Pose Estimation
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancyweak or indirect keyword match
abstractBoth detection-then-match and detection-free methods have been extensively studied for image-to-point cloud registration, yet they still face significant challenges. The detection-then-match approach emphasizes high-quality correspondences but is limited by the availability of repeatable keypoints, making it susceptible to errors from incorrect matches. In contrast, detection-free methods aim for dense correspondences using a coarse-to-fine strategy to mitigate matching errors. However, non-overlapping regions and low-quality matches still introduce inaccuracies, and the differences between image texture and point cloud structure cause inconsistent region representations, increasing the likelihood of incorrect matches. To address these challenges, we propose two innovative modules: the High-Value Zone Reinforced Selection Module (HZRS) and the Zone Representation Consistency Alignment Mod
669likely_noise
low
SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; surface_occupancy; generation_editing; data_benchmarkweak or indirect keyword match
abstractWe propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor
670likely_noise
low
UniCorn: Unified Correspondence Transformer Across 2D and 3D
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancyweak or indirect keyword match
abstractVisual correspondence across image-to-image (2D-2D), image-to-point cloud (2D-3D), and point cloud-to-point cloud (3D-3D) geometric matching forms the foundation for numerous 3D vision tasks. Despite sharing a similar problem structure, current methods use task-specific designs with separate models for each modality combination. We present UniCorn, the first correspondence model with shared weights that unifies geometric matching across all three tasks. Our key insight is that Transformer attention naturally captures cross-modal feature similarity. We propose a dual-stream decoder that maintains separate appearance and positional feature streams. This design enables end-to-end learning through stack-able layers while supporting flexible query-based correspondence estimation across heterogeneous modalities. Our architecture employs modality-specific backbones followed by shared encoder an
671likely_noise
low
Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editing; data_benchmarkweak or indirect keyword match
abstractRealistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical “City–District–Grid” structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a pr
672likely_noise
low
SonoWorld: From One Image to a 3D Audio-Visual Scene
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editing; data_benchmarkweak or indirect keyword match
abstractTremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation.
673likely_noise
low
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractSpatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dat
674likely_noise
low
Lafite : A Generative Latent Field for 3D Native Texturing
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editing; data_benchmarkweak or indirect keyword match
abstractGenerating detailed and seamless textures for 3D meshes remains an open challenge. Recent image and video generation models, empowered by large-scale visual priors, are capable of producing highly detailed images and are thus promising for multi-view texture synthesis. However, evaluating texture quality involves multiple dimensions beyond visual fidelity. Multi-view back-projection often introduces seams and inconsistencies between different views or near occluded regions, while direct generation on UV-unwrapped maps suffers from UV distortions and ambiguities. Generating textures directly in 3D space offers an inherent advantage in ensuring continuity and spatial coherence, making it a critical and worthwhile research direction. Therefore, we systematically investigate 3D-native texture generation from the perspectives of representation and generation, and present current best practices
675likely_noise
low
MatE: Material Extraction from Single-Image via Geometric Prior
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractThe creation of high-fidelity, physically-based rendering (PBR) materials remains a bottleneck in many graphics pipelines, typically requiring specialized equipment and expert-driven post-processing. To democratize this process, we present MatE, a novel method for generating tileable PBR materials from a single image taken under unconstrained, real-world conditions. Given an image and a user-provided mask, MatE first performs coarse rectification using an estimated depth map as a geometric prior, and then employs a dual-branch diffusion model. Leveraging a learned consistency from rotation-aligned and scale-aligned training data, this model further rectifies residual distortions from the coarse result and translates it into a complete set of material maps, including albedo, normal, roughness and height. Our framework achieves invariance to the unknown illumination and perspective of the inp
676likely_noise
low
Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractEstablishing semantic correspondence without supervision is essential for handling diverse in-the-wild images where annotations are scarce. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving the abovementioned ambiguity. However, since FGW is a computationally prohibitive quadrati
677likely_noise
low
3D Instance Models are Implicit Generalizable Spatial Learners
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editing; data_benchmarkweak or indirect keyword match
abstractGeneralization remains the central challenge for interactive 3D scene generation. Existing learning‑based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre‑trained 3D instance generator to act as a scene‑level learner via, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator’s transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing widely used canonical space, we instantiate this insight with a view‑centric formul
678likely_noise
low
Velox: Learning Representations of 4D Geometry and Appearance
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; dynamic_4d; surface_occupancyweak or indirect keyword match
abstractWe introduce a framework for learning latent representations of 4D objects which are descriptive, faithfully capturing object geometry and appearance; compressive, aiding in downstream efficiency; and accessible, requiring minimal input, i.e., an unstructured dynamic point cloud, to construct. Specifically, Velox trains an encoder to compress spatiotemporal color point clouds into a set of *dynamic shape tokens*. These tokens are supervised using two complementary decoders: a 4D surface decoder, which models the time-varying surface distribution capturing the geometry; and a Gaussian decoder, which maps the tokens to 3D Gaussians, helping learn appearance. To demonstrate the utility of our representation, we evaluate it across three downstream tasks—video-to-4D generation, 3D tracking, and cloth simulation via image-to-4D generation—and observe strong performances in all settings.
679likely_noise
low
Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editing; data_benchmarkweak or indirect keyword match
abstractEditing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explici
680likely_noise
low
HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editing; data_benchmarkweak or indirect keyword match
abstractThe emergence of virtual reality has necessitated the generation of detailed and customizable 3D hand models for interaction in the virtual world. However, the current methods for 3D hand model generation are both expensive and cumbersome, offering very little customizability to the users. While recent advancements in zero-shot text-to-3D synthesis have enabled the generation of diverse and customizable 3D models using Score Distillation Sampling (SDS), they do not generalize very well to 3D hand model generation, resulting in unnatural hand structures, view-inconsistencies and loss of details. To address these limitations, we introduce HandDreamer, the first method for zero-shot 3D hand model generation from text prompts. Our findings suggest that view-inconsistencies in SDS are primarily caused by the ambiguity in the probability landscape described by the text prompt, resulting in
681likely_noise
low
MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstract3D pose transfer aims to transfer the pose-style of a source mesh to a target character while preserving both the target's geometry and the source's pose characteristic. Existing methods are largely restricted to characters with similar structures and fail to generalize to category-free settings (e.g., transferring a humanoid's pose to a quadruped). The key challenge lies in the structural and transformation diversity inherent in distinct character types, which often leads to mismatched regions and poor transfer quality. To address these issues, we first construct a million-scale pose dataset across hundreds of distinct characters. We further propose MimiCAT, a cascade-transformer model designed for category-free 3D pose transfer. Instead of relying on strict one-to-one correspondence mappings, MimiCAT leverages semantic keypoint labels to learn a novel soft correspondence that enables f
682likely_noise
low
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; dynamic_4d; surface_occupancyweak or indirect keyword match
abstractPredicting scene dynamics from visual observations is challenging. Existing methods capture dynamics only within observed boundaries failing to extrapolate far beyond the training sequence. Node-RF (Neural ODE-based NeRF) overcomes this limitation by integrating Neural Ordinary Differential Equations (NODEs) with dynamic Neural Radiance Fields (NeRFs), enabling a continuous-time, spatiotemporal representation that generalizes beyond observed trajectories at constant memory cost. From visual input, Node-RF learns an implicit scene state that evolves over time via an ODE solver, propagating feature embeddings via differential calculus. A NeRF-based renderer interprets calculated embeddings to synthesize arbitrary views for long-range extrapolation. Training on multiple motion sequences with shared dynamics allows for generalization to unseen conditions. Our experiments demonstrate that Nod
683likely_noise
low
Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editing; data_benchmarkweak or indirect keyword match
abstractConstructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Models (LLMs) have inspired the LLM-based CAD generation by representing CAD as command sequences. But these methods struggle in practical scenarios because command sequence representation does not support entity selection (e.g. faces or edges), limiting its ability to support complex editing operations such as chamfer or fillet. Further, the discretization of a continuous variable during sketch and extrude operations may result in topological errors. To address these limitations, we present Pointer-CAD, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. In particular, Pointer-CAD decomposes CA
684likely_noise
low
MHopReg: Efficient Hierarchical Multi-Hop Graph Search for Point Cloud Registration
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractOutlier rejection for correspondence-based point cloud registration confronts two fundamental challenges in real-world scenarios. First, low-overlap regions yield sparse and fragmented inlier distributions that are difficult to discover using conventional one-step global search strategies. Second, large-scale scenes present dense correspondence inputs that impose stringent requirements on the accuracy-efficiency trade-off of search algorithms. To this end, we propose a hierarchical multi-hop graph search framework that progressively refines correspondences to address these challenges. Our method constructs a compatibility graph with transformation-invariant embeddings to predict correspondence confidence, establishing the foundation for cluster-balanced seed sampling that ensures comprehensive coverage across fragmented regions. These strategically selected seeds subsequently drive hiera
685likely_noise
low
Universal 3D Shape Matching via Coarse-to-Fine Language Guidance
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancyweak or indirect keyword match
abstractEstablishing dense correspondences between shapes is a crucial task in computer vision and graphics, while prior approaches depend on near-isometric assumptions and homogeneous subject types (i.e., only operate for human shapes). However, building semantic correspondences for cross-category objects remains challenging and has received relatively little attention. To achieve this, we propose UniMatch, a semantic-aware, coarse-to-fine framework for constructing dense semantic correspondences between strongly non-isometric shapes without restricting object categories. The key insight is to lift "coarse" semantic cues into "fine" correspondence, which is achieved through two stages. In the "coarse" stage, we perform class-agnostic 3D segmentation to obtain non-overlapping semantic parts and prompt multimodal large language models (MLLMs) to identify part names. Then, we employ pretrained vis
686likely_noise
low
Homaloidal parametrization for detecting critical two-view configurations
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractWe consider the problem of identifying degenerate configurations while estimating the fundamental matrix from (at least) 8 point correspondences. It is known that such configurations correspond to an ill-posed estimation of the fundamental matrix, so it is important to identify them in practice. So far, a practical degeneracy test is only available for the cases of planar scenes and pure rotation, while the case of the general critical surface (e.g., a hyperboloid/cone/cylinder containing 3D points and camera centres) is less studied, and the only available method is highly unstable, involving a pre-computed fundamental matrix. In this paper, we propose a novel degeneracy test for detecting points on the critical surface. By exploiting the geometry of the so-called "homaloidal net of conics", we are able to design a simple and very practical test that requires the linear estimation o
687likely_noise
low
Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editingweak or indirect keyword match
abstractReinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization,
688likely_noise
low
Beyond Reassembly: Fractured Object Recovery with Missing Parts
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractWe propose a novel learning-based task named fractured object recovery. Unlike previous fractured object reassembly task that only targets aligning existing parts with overlaps, our task aims to not only reassemble irrelevant parts but also predict missing parts, resulting in a complete shape recovery immediately. Our task coincides with practical experiences, where the prior knowledge of similar shapes can be leveraged in the reassembly process, such that even non-overlapping parts can be reasoned into adequate locations. We also present the first learning model for the proposed task by correlating features of both existing and missing parts using a transformer, where the latter is naturally represented as missing tokens. Hence, our model can jointly estimate the poses of the existing parts and predict the shapes of the missing parts. To facilitate the task, we introduce a new dataset ba
689likely_noise
low
CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editingweak or indirect keyword match
abstractControllable, high-fidelity mesh editing remains a significant challenge in the domain of 3D content creation. Existing generative methods often struggle with complex geometries and fail to preserve fine-scale details. We propose CraftMesh, a novel framework for high-fidelity generative mesh manipulation based on Poisson Seamless Fusion. Our key insight is to decompose mesh editing into a pipeline that leverages the strengths of 2D image editing and 3D generative modeling: we first edit a 2D reference image, then generate a 3D mesh corresponding to the edited region, and fuse it seamlessly into the original mesh through a Joint Geometry and Appearance Fusion framework built on a hybrid SDF/Mesh representation to enable Poisson Geometry Blending and Poisson Texture Harmonization. Experimental results demonstrate that CraftMesh outperforms state-of-the-art methods, delivering improved stru
690likely_noise
low
Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractHigh-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we pre
691likely_noise
low
Parallelised Differentiable Straightest Geodesics for 3D Meshes
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancyweak or indirect keyword match
abstractMachine learning has been progressively generalised to operate within non-Euclidean domains, but geometrically accurate methods for learning on surfaces are still falling behind. The lack of closed-form Riemannian operators, the non-differentiability of their discrete counterparts, and poor parallelisation capabilities have been the main obstacles to the development of the field on meshes. A principled framework to compute the exponential map on Riemannian surfaces discretised as meshes is straightest geodesics, which also allows to trace geodesics and parallel-transport vectors as a by-product. We provide a parallel GPU implementation and derive two different methods for differentiating through the straightest geodesics, one leveraging an extrinsic proxy function and one based upon a geodesic finite differences scheme. After proving our parallelisation performance and accuracy, we demon
692likely_noise
low
PhysGen: Physically Grounded 3D Shape Generation for Industrial Design
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancyweak or indirect keyword match
abstractExisting generative models for 3D shapes can synthesize high-fidelity and visually plausible shapes. For certain classes of shapes that have undergone an engineering design process, the realism of the shape is tightly coupled with the underlying physical properties, e.g., aerodynamic efficiency for automobiles. Since existing methods lack knowledge of such physics, they are unable to use this knowledge to enhance the realism of shape generation. Motivated by this, we propose a unified physics-based 3D shape generation pipeline, with a focus on industrial design applications. Specifically, we introduce a new flow matching model with explicit physical guidance, consisting of an alternating update process. We iteratively perform a velocity-based update and a physics-based refinement, progressively adjusting the latent code to align with the desired 3D shapes and physical properties. We furt
693likely_noise
low
XPaintNet: An eXtreme Lightweight Framework for Stereoscopic Conversion without Inpainting Network
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; depth_correspondence; surface_occupancyweak or indirect keyword match
abstractWith the rapid growth of stereoscopic 3D devices, real-time stereoscopic conversion has become increasingly essential. However, most existing approaches rely on depth estimation, forward warping, and heavy inpainting networks, resulting in high computational cost and artifacts near occlusion boundaries. Diffusion-based models have also been explored, but they suffer from iterative sampling and geometric inconsistency, making them unsuitable for real-time deployment. To address these issues, we propose Bi-Warp, a simple yet effective approach that synthesizes the right view without inpainting network by leveraging warping operations. Our approach estimates backward flow, approximates the corresponding forward flow, and generates two candidate right views via bidirectional warping. A learnable mask adaptively fuses the candidates, preserving left–right geometric consistency. Building on Bi-Wa
694likely_noise
low
FILTR: Extracting Topological Features from Pretrained 3D Models
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractRecent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape's multiscale structure. In this paper we pose the question whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages
695likely_noise
low
FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractAutoregressive models can generate high-quality 3D meshes by sequentially producing vertices and faces, but their token-by-token decoding results in slow inference, limiting practical use in interactive and large-scale applications. We present FlashMesh, a fast and high-fidelity mesh generation framework that rethinks autoregressive decoding through a predict-correct-verify paradigm. The key insight is that mesh tokens exhibit strong structural and geometric correlations that enable confident multi-token speculation. FlashMesh leverages this by introducing a speculative decoding scheme tailored to the commonly used hourglass transformer architecture, enabling parallel prediction across face, point, and coordinate levels. Extensive experiments show that FlashMesh achieves up to a 2$\times$ speedup over standard autoregressive models while also improving generation fidelity. Our results demo
696likely_noise
low
FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractWe introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and fl
697likely_noise
low
LATTICE: Democratize High-Fidelity 3D Generation at Scale
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editingkeyword noise pattern without direct reconstruction signal
abstractWe present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. While 2D image synthesis benefits from fixed spatial grids and well-established transformer architectures, 3D generation remains fundamentally more challenging due to the need to predict both spatial structure and detailed geometric surfaces from scratch. These challenges are exacerbated by the computational complexity of existing 3D representations and the lack of structured and scalable 3D asset encoding schemes. To address this, we propose VoxSet, a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid, enabling efficient and position-aware generation. VoxSet retains the simplicity and compression advantages of prior VecSet methods while introducing explicit st
698likely_noise
low
POLAR: A Portrait OLAT Dataset and Generative Framework for Illumination-Aware Face Modeling
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractFace relighting aims to synthesize realistic portraits under novel illumination while preserving identity and geometry. However, progress remains constrained by the limited availability of large-scale, physically consistent illumination data. To address this, we introduce POLAR, a large-scale and physically calibrated One-Light-at-a-Time (OLAT) dataset containing over 200 subjects captured under 156 lighting directions, multiple views, and diverse expressions. Building upon POLAR, we develop a flow-based generative model POLARNet that predicts per-light OLAT responses from a single portrait, capturing fine-grained and direction-aware illumination effects while preserving facial identity. Unlike diffusion or background-conditioned methods that rely on statistical or contextual cues, our formulation models illumination as a continuous, physically interpretable transformation between lighti
699likely_noise
low
Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editingweak or indirect keyword match
abstractWe often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, using renders of synthetic 3D assets, where annotations for control signals are available. While this approach can learn the desired controls, it often compromises the realism of the images due to domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework that decouples controls and visual domain. The key idea is to explicitly learn visual domain, real or synthetic, separately from other control signals by introducing a co-varia
700likely_noise
low
Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editingweak or indirect keyword match
abstractRecently, generating 3D assets using visual priors from pretrained diffusion models has shown remarkable results. However, due to the inherent lack of 3D geometric priors in 2D diffusion, the synthesized results often suffer from spatial hallucination and multi-view inconsistency. To address this limitation, we propose Thoughtful3D, a novel framework that enhances 3D content generation quality by introducing structural chain-of-thought (CoT) reasoning to alleviate inconsistent issues and mitigate hallucinations. Specifically, we design a dual-phase structural CoT strategy: (1) 3DBlueprint-CoT explicitly plans the 3D generation process through textual semantic parsing and logical deduction during the initialization phase. (2) 3DRefine-CoT dynamically evaluates latent inconsistencies by analyzing multiple renderings, employing a multi-round iterative refinement mechanism to suppress halluc
701likely_noise
low
UniPart: Part-Level 3D Generation with Unified 3D Geom–Seg Latents
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editingweak or indirect keyword match
abstractPart-level 3D generation is essential for applications requiring decomposable and structured 3D synthesis. However, existing methods either rely on implicit part segmentation with limited granularity control or depend on strong external segmenters trained on large annotated datasets. In this work, we observe that part awareness emerges naturally during whole-object geometry learning and propose Geom-Seg VecSet, a unified geometry–segmentation latent representation that jointly encodes object geometry and part-level structure. Building on this representation, we introduce UniPart, a two-stage latent diffusion framework for image-guided part-level 3D generation. The first stage performs joint geometry generation and latent part segmentation, while the second stage conditions part-level diffusion on both whole-object and part-specific latents. A dual-space generation scheme further enhances
702likely_noise
low
VLM-Guided Group Preference Alignment for Diffusion-based Human Mesh Recovery
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractHuman mesh recovery (HMR) from a single RGB image is inherently ambiguous, as multiple 3D poses can correspond to the same 2D observation. Recent probabilistic and diffusion-based methods tackle this ambiguity by generating various hypotheses, but often sacrifice accuracy. They yield predictions that are either physically implausible or drift from the input image, especially under occlusion or in cluttered, in-the-wild scenes. To address this issue, we introduce a dual-memory augmented HMR critique agent with self-reflection to produce context-aware quality scores for predicted meshes. These scores distill fine-grained cues about 3D human motion structure, physical feasibility, and alignment with the input image. We use these scores to build a group-wise HMR preference dataset. Building upon this dataset, we propose a group preference alignment framework for finetuning diffusion-based HM
703likely_noise
low
Learning Hierarchical Hyperbolic Mixture Model for Part-aware 3D Generation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editingweak or indirect keyword match
abstract3D shape generation has become increasingly important for graphics and vision applications. Current part-aware 3D generation usually overlooks hierarchical part relations or inefficiently encodes multi-level semantics in Euclidean space. Thus we propose a novel framework for hierarchical and efficient part-aware 3D generation in hyperbolic space. Our contributions are three-fold: (1) Hierarchical Hyperbolic Mixture Model (H$^2$MM): We propose part-aware semantic representation of objects within a hyperbolic manifold, providing a high-fidelity hierarchical part-aware representation of object details and semantics. (2) Hyperbolic Semantically Consistent Diffusion Model: We design the geodesic diffusion process that preserves the hierarchical and semantic structure of H$^{2}$MM, and progressively generates semantics from conditions and generates object under their joint guidance. We use an
704likely_noise
low
Nestwork: Conditional 3D Furnished House Layout Generation through Latent Heterogeneous Graph Diffusion
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractThis paper introduces Nestwork, a unified latent-diffusion framework for conditional 3D furnished house layout generation using a heterogeneous graph of rooms and furniture. Designing reasonable and controllable 3D layouts that reflect the underlying semantic structure of a house is a key challenge in AI-assisted architectural design. Existing graph-based methods either produce unfurnished multi-room layouts or generate furnished scenes one room at a time, preventing joint reasoning over room structure and furniture placement. Nestwork represents an entire house as a heterogeneous graph with typed room and furniture nodes and multiple spatial relations. A single unconditional autoencoder based on a heterogeneous graph attention network embeds this graph into a compact latent space, and a low-rank relational field compensates for missing geometric edge information at test time. A diffusio
705likely_noise
low
SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstract3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies hindering the model’s ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate–based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-c
706likely_noise
low
Thermal is Always Wild: Characterizing and Addressing Challenges in Thermal-Only Novel View Synthesis
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; depth_correspondence; data_benchmarkweak or indirect keyword match
abstractThermal cameras provide reliable visibility in darkness and adverse conditions, but thermal imagery remains significantly harder to use for novel view synthesis (NVS) than visible-light images. This difficulty stems primarily from two characteristics of affordable thermal sensors. First, thermal images have extremely low dynamic range, which weakens appearance cues and limits the gradients available for optimization. Second, thermal data exhibit rapid frame-to-frame photometric fluctuations together with slow radiometric drift, both of which destabilize correspondence estimation and create high-frequency floater artifacts during view synthesis, particularly when no RGB guidance is available. Guided by these observations, we introduce a lightweight preprocessing and splatting pipeline that expands usable dynamic range and stabilizes per-frame photometry. Our approach achieves state-of-the
707likely_noise
low
Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; generation_editingweak or indirect keyword match
abstractArchitectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show how the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.
708likely_noise
low
Towards Intrinsic-Aware Monocular 3D Object Detection
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractMonocular 3D object detection (Mono3D) aims to infer object locations and dimensions in 3D space from a single RGB image. Despite recent progress, existing methods remain highly sensitive to camera intrinsics and struggle to generalize across diverse settings, since intrinsic changes reshape how 3D scenes are projected onto the image plane. We propose MonoIA, a unified intrinsic-aware framework that models and adapts to intrinsic variation through a language-grounded representation. The key insight is that intrinsic variation is not a numeric difference but a perceptual transformation that alters apparent scale, perspective, and spatial geometry. To capture this effect, MonoIA employs large language models and vision–language models to generate intrinsic embeddings that encode the visual and geometric implications of camera parameters. These embeddings are hierarchically integrated into the d
709likely_noise
low
VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; dynamic_4dweak or indirect keyword match
abstractThe rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual–inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba’s dynamic parameterization for temporal modeling and Attention for spatial
710likely_noise
low
Synthetic Knowledge-Guided Learning via Target-Region Gradients
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; dynamic_4d; generation_editingkeyword noise pattern without direct reconstruction signal
abstractTraining with synthetic data has become a standard strategy for improving robustness to distribution shifts. However, most existing approaches exploit synthetic samples only indirectly (for example, by enriching backgrounds, contexts, or negative examples) while providing no explicit signal about where the true target content resides. As a result, models can continue to rely on spurious correlations, which ultimately limit their robustness. In this work, we convert a basic but under-utilized provenance of synthetic data into explicit supervision: during synthesis, we know which pixels or elements originate from which source instances. We formalize this provenance as synthetic knowledge and propose a Synthetic Knowledge-Guided (SKG) training framework that uses it to shape gradients toward target regions and away from irrelevant ones via a Gradient Guide Loss. Our framework is generic an
711likely_noise
low
ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; dynamic_4d; generation_editingweak or indirect keyword match
abstractPose estimation remains challenging under sparse views, especially when visual overlap across images is extremely limited. Recent advances in video generation models offer a promising solution by enabling keyframe interpolation, which can enrich contextual cues and improve pose estimation performance. However, existing video generation models often lack 3D consistency, producing temporally plausible but spatially inconsistent frames that degrade downstream pose estimation. In this paper, we propose a framework ExPose that directly addresses 3D inconsistency when applying video generation to pose estimation in extreme-view settings. Specifically, we fine-tune a video generation model using Group Relative Preference Optimization (GRPO), aligning its outputs with 3D-consistent supervisory signals derived from pose estimation objectives. Our approach not only enhances the quality of temporal
712likely_noise
low
Globally Optimal Pose from Silhouettes
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; surface_occupancyweak or indirect keyword match
abstractWe solve the problem of determining the pose of known shapes in $\mathbb{R}^3$ from their unoccluded silhouettes. The pose is determined up to global optimality using a simple yet under-explored property of the area-of-silhouette: its continuity w.r.t. trajectories in the rotation space. The proposed method utilises pre-computed silhouette-signatures, modelled as a response surface of the area-of-silhouettes. Querying this silhouette-signature response surface for pose estimation leads to a strong branching of the rotation search space, making resolution-guided candidate search feasible. Additionally, we utilise the aspect ratio of 2D ellipses fitted to projected silhouettes as an auxiliary global shape signature to accelerate the pose search. This combined strategy forms the first method to efficiently estimate globally optimal pose from just the silhouettes, without being guided by corr
713likely_noise
low
MGDHand: Multi-Granularity Prior-to-Inertial Distillation Framework for Sequential 3D Hand Pose Estimation from Sparse IMUs
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; dynamic_4d; data_benchmarkweak or indirect keyword match
abstract3D hand pose estimation (HPE) from sparse inertial measurement units (IMUs) has shown great potential in human-computer interaction. However, due to the significant semantic gap between sparse local motion information and structured global pose information, estimating the hand poses from sparse IMU signals is ambiguous and challenging. Knowledge distillation can transfer rich knowledge from the stronger teacher to the student, so that the student enhances performance. Existing approaches distill morphological priors into the IMU-based student model, effectively improving its accuracy in complex scenarios. Nevertheless, overlooking the visual-inertial inherent semantic mismatch and information density difference leads to difficulties for students to learn coupled priors. In this paper, we propose a Multi-Granularity Prior-to-Inertial Distillation Framework for S
714likely_noise
low
Δynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; data_benchmarkweak or indirect keyword match
abstractInferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, which are unable to generalize to complex real-world settings. We introduce Δynamics, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, Δynamics generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, Δynamics achieves a segmentation IoU of $0.30$, a $7\times$ improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Further, test-time sampling a
715likely_noise
low
ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; data_benchmarkweak or indirect keyword match
abstractObject pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language-model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object or dataset-specific training, our approach achieves state-of-the-art results on common zero shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S)
716likely_noise
low
Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; data_benchmarkweak or indirect keyword match
abstractMultimodal image registration is a fundamental task for multimodal imagery and a prerequisite for downstream cross-modal analysis. Despite recent progress with shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, so modality-private cues can still leak into the shared space. Second, most multi-scale frameworks support only one transformation type, which limits their applicability in real-world scenarios where global misalignment and local deformation coexist. To address these issues, we view hybrid multimodal registration as jointly constructing a stable shared feature space and a unified hybrid transformation within that space. Building on this perspective, we introduce HRNet, a Hybrid Registration Network that couples representation disentanglement with
717likely_noise
low
UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondence; data_benchmarkweak or indirect keyword match
abstractLow-visibility scenarios, such as low-light conditions, pose significant challenges to human pose estimation due to the scarcity of annotated low-light datasets and the loss of visual information under poor illumination. Recent domain adaptation techniques attempt to utilize well-lit labels by augmenting well-lit images to mimic low-light conditions. But handcrafted augmentations oversimplify noise patterns, while learning-based methods often fail to preserve high-frequency low-light characteristics, producing unrealistic images that lead pose models to generalize poorly to real low-light scenes. Moreover, recent pose estimators rely on image cues through image-to-keypoint cross-attention, but these cues become unreliable under low-light conditions. To address these issues, we propose Unsupervised Domain Adaptation for Pose Estimation (UDAPose), a novel framework that synthesizes realistic
718likely_noise
low
S$^2$AM3D: Scale-controllable Part Segmentation of 3D Point Clouds
3D Vision & Geometry / Point Cloud
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractPart-level point cloud segmentation has recently attracted significant attention in 3D computer vision. Nevertheless, existing research is constrained by two major challenges: native 3D models lack generalization due to data scarcity, while introducing 2D pre-trained knowledge often leads to inconsistent segmentation results across different views. To address these challenges, we propose S$^2$AM3D, which incorporates 2D segmentation priors with 3D consistent supervision. We design a point-consistent part encoder that aggregates multi-view 2D features through native 3D contrastive learning, producing globally consistent point features. A scale-aware prompt decoder is then proposed to enable real-time adjustment of segmentation granularity via continuous scale signals. Simultaneously, we introduce a large-scale, high-quality part-level point cloud dataset with more than 100k samples, providi
719likely_noise
low
Generalized-CVO: Fast and Correspondence-Free Point Cloud Registration in RKHS with Second Order Riemannian Optimization
3D Vision & Geometry / Point Cloud
C. cluster representativepose_calibration_localization; depth_correspondence; surface_occupancyweak or indirect keyword match
abstractWe propose a fast and correspondence-free point cloud registration method that leverages local geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The proposed method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order methods used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR registration task in the driving domain, we achieve a reduction of $>55\%$ in both translati
720likely_noise
low
VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement
3D Vision & Geometry / Point Cloud
C. cluster representativegeneral_reconstruction; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractWe propose VIAFormer, a Voxel-Image Alignment Transformer model designed for Multi-view Conditioned Voxel Refinement—the task of repairing incomplete noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts on the voxel shape obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate VIAFormer as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to th
721likely_noise
low
Voxify3D: Pixel Art Meets Volumetric Rendering
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractVoxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with co
722likely_noise
low
Image-Guided Geometric Stylization of 3D Meshes
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractRecent generative models can create visually plausible 3D representations of objects. However, the generation process often allows for implicit control signals, such as contextual descriptions, and rarely supports bold geometric distortions beyond existing data distributions. We propose a geometric stylization framework that deforms a 3D mesh, allowing it to express the style of an image. While style is inherently ambiguous, we utilize pre-trained diffusion models to extract an abstract representation of the provided image. Our coarse-to-fine stylization pipeline can drastically deform the input 3D model to express a diverse range of geometric variations while retaining the valid topology of the original mesh and part-level semantics. We also propose an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings. Extensive experiments demonstrate that our
723likely_noise
low
OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractWe propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for the open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel successfully builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates
724likely_noise
low
Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractImplicit Neural Representations (INRs) have emerged as a powerful paradigm for various signal processing tasks, but their inherent spectral bias limits the ability to capture high-frequency details. Existing methods partially mitigate this issue by using Fourier-based features, which usually rely on fixed frequency bases. This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. CAFE can explicitly and efficiently synthesize a broader range of frequency bases, while the learned weights enable the selection of task-relevant frequencies. Furthermore, we extend this framework to CAFE+, which incorporates Chebyshev fea
725likely_noise
low
FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancykeyword noise pattern without direct reconstruction signal
abstractLearning directly from boundary representations (B-reps) has significantly advanced 3D CAD analysis. However, state-of-the-art B-rep learning methods rely on absolute coordinates and normals to encode global context, making them highly sensitive to rotations. Our experiments reveal that models achieving over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary SO(3) rotations. To address this, we introduce FoV-Net, the first B-rep learning framework that captures both local surface geometry and global structural context in a rotation-invariant manner. Each face is represented by a Local Reference Frame (LRF) UV-grid that encodes its local surface geometry, and by Field-of-View (FoV) grids that capture the surrounding 3D context by casting rays and recording intersections with neighboring faces. Lightweight CNNs extract per-face features, which are propagated o
726likely_noise
low
HiFi-Brep: High-Fidelity B-Rep Latent Representation and Robust Generation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractBoundary representation (B-rep) generation is a fundamental task in Computer-Aided Design (CAD), enabling automated modeling of 3D geometries. However, the direct synthesis of valid and high-quality B-reps remains a major challenge. Existing deep generative methods suffer from brittle representation and generation paradigms, due to: (1) representation noise from padding variable-length sequences and feature contamination between distant primitives, and (2) fragile generation pipelines marked by cascaded decoding error propagation and a train-inference mismatch from deferred validity enforcement. To address this, we propose HiFi-Brep. Our core insight is that robust, high-validity generation requires: first, building upon a compact and high-fidelity latent representation; and second, reformulating validity constraints as differentiable inductive biases within a single-stage generation proce
727likely_noise
low
HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractIn this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first one is to leverage existing rigged assets and animate them with extensive poses from daily life. The second strategy is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than
728likely_noise
low
Mirror Illusion Art
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractMirror Illusion Art is a novel reflection-conditioned 3D illusion where one object yields two target appearances (front and mirror). The task is formulated as inverse design from two target 2D images (front and mirror) to a printable 3D object with geometry and texture. Prior topology-driven and shadow-based approaches demand substantial manual effort, optimize shape only, and often yield non-smooth or incomplete geometry. To address these challenges, we propose AutoMIA, an automated Mirror Illusion Art design pipeline that jointly optimizes shape and color. To stabilize optimization and suppress artifacts, four mechanisms are introduced: (1) projection-alignment component (PAC) selection to reduce surface noise, (2) position-weighted adaptive (PWA) suppression for background noise, (3) internal voxel preservation (IVP) to prevent internal fractures, and (4) shape-color decoupled (SCD)
729likely_noise
low
PaNDaS: Learnable Shape Interpolation Modeling with Localized Control
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractWe present PaNDaS, a novel deep learning framework for Partial Non-Rigid Deformations and interpolations of Surfaces (PaNDaS). PaNDaS learns a per-face feature field on the source mesh and fuses it with a global encoding of the target. A deformation generator predicts a Jacobian field and recovers a smooth displacement, enabling precise regional control, pose mixing, and transferable local edits. Unlike previous approaches, our method can restrict the deformations to specific parts of the shape in a versatile way. Across various human body part datasets, PaNDaS achieves state-of-the-art interpolation accuracy and stronger locality than methods based on global shape codes or handles, while remaining robust to remeshing. We demonstrate several localized shape manipulation tasks and show that our method can generate new shapes by combining different input deformations.
730likely_noise
low
ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; surface_occupancyweak or indirect keyword match
abstractIn practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive, streamable 3D representation method is needed that can be immediately deployed and continuously optimized as resources increase. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face‑local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. ProgressiveAvatars supports incremental loading rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths
731likely_noise
low
Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision–Language Understanding
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractRecent advances in 3D vision–language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next,
732likely_noise
low
SuP: Sub-cloud Driven Point Cloud Registration
3D Vision & Geometry / Point Cloud
C. cluster representativepose_calibration_localization; depth_correspondence; surface_occupancyweak or indirect keyword match
abstractWhile existing point-cloud-registration methods can well handle high-overlap scenarios of two point clouds, they often struggle with low-overlap scenarios, due to inevitable geometric/semantic ambiguities in the non-overlapping regions. In this paper, we introduce SuP, a novel framework that reformulates low-overlap registration as a high-overlap sub-cloud pairs (anchor pairs) mining problem. Central to SuP is our Dual-phase Sub-cloud Anchor Mining (DSAM) module, which first subdivides the source and target point clouds into multiple sub-clouds, followed by introducing a dual-phase weighting pipeline: 1) an efficient overlap-guided prior-weighting scheme (OPS) that leverages feature salience to identify candidate anchor pairs, and 2) a multi-scale post-weighting network (MPN) that exploits neighborhood feature consensus to further identify anchor pairs. Subsequently, final corresponde
733likely_noise
low
UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractZero-Shot 3D Visual Grounding (Zero-Shot 3DVG) aims to localize target objects in 3D scenes from natural language descriptions without relying on instance-wise description annotations. Existing methods rely on extra 2D images during inference and/or require multi-turn interactions with large language models (LLMs) or vision-language models (VLMs), which increase latency, computational cost, and deployment complexity. To overcome these limitations, we propose Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions (UZ3DVG), which is fed with 3D point clouds and textual descriptions only during inference and does not depend on external models. This is a new training paradigm: a VLM is employed solely to produce object-wise descriptions (pseudo labels) and reasoning chains for training a lightweight 3DVG model with robust spatial reasoning. Specifically, the introduced Open
734likely_noise
low
AutoRegressive Generation with B-rep Holistic Token Sequence Representation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractPrevious representation and generation approaches for the B-rep relied on graph-based representations that disentangle geometric and topological features through decoupled computational pipelines, thereby precluding the application of sequence-based generative frameworks, such as transformer architectures that have demonstrated remarkable performance. In this paper, we propose BrepARG, the first attempt to encode B-rep's geometry and topology into a holistic token sequence representation, enabling sequence-based B-rep generation with an autoregressive architecture. Specifically, BrepARG encodes B-rep into 3 types of tokens: geometry and position tokens representing geometric features, and face index tokens representing topology. Then the holistic token sequence is constructed hierarchically, starting with constructing the geometry blocks (i.e., faces and edges) using the above tokens, fo
735likely_noise
low
CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractDespite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency -- textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling. To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure. It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity. Coupled with a two-stage diffusion transformer, CaliTex makes geo
736likely_noise
low
Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractRecent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene layout estimation. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structural language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability. Furthermore, we design a pa
737likely_noise
low
GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radiance; depth_correspondenceweak or indirect keyword match
abstractRecent advances in generative modeling have substantially enhanced novel view synthesis, yet maintaining consistency across viewpoints remains challenging. Diffusion-based models rely on stochastic noise-to-data transitions, which obscure deterministic structures and yield inconsistent view predictions. We propose a Data-to-Data Flow Matching framework that learns deterministic transformations directly between paired views, enhancing view-consistent synthesis through explicit data coupling. To further enhance geometric coherence, we introduce Probability Density Geodesic Flow Matching (PDG-FM), which constrains flow trajectories using geodesic interpolants derived from probability density metrics of pretrained diffusion models. Such alignment with high-density regions of the data manifold promotes more realistic interpolants between samples. Empirically, our method surpasses diffusion-based
738likely_noise
low
LAM: Language Articulated Object Modelers
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancyweak or indirect keyword match
abstractWe introduce LAM, a system that explores the collaboration of large-language models and vision-language models to generate articulated objects from text prompts. Our approach differs from previous methods that either rely on input visual structure (e.g., an image) or assemble articulated models from pre-built assets. In contrast, we formulate articulated object generation as a unified code generation task, where geometry and articulations can be co-designed from scratch. Given an input text, LAM coordinates a team of specialized modules to generate code to represent the desired articulated object procedurally. The LAM first reasons about the hierarchical structure of parts (links) with Link Designer, then writes code, compiles it, and debugs it with Geometry & Articulation Coders and self-corrects with Geometry & Articulation Checkers. The code serves as a structured and interpretable bridge be
739likely_noise
low
PP-Brep: Few-Shot B-rep Classification with Hybrid Graph Representation
3D Vision & Geometry / 3D Reconstruction
C. cluster representativegeneral_reconstruction; surface_occupancykeyword noise pattern without direct reconstruction signal
abstractIn industrial settings, classification of 3D CAD models is critical for efficient manufacturing. However, the limited availability of annotated CAD models presents an obstacle to achieving rapid adaptation in few-shot part classification scenarios. In this paper, we propose a hybrid graph representation and a pre-training and graph prompt framework for B-rep few-shot classification. Specifically, hybrid graph representation captures comprehensive and multi-level structural information of B-rep models by constructing local topology graph, global parallel graph and regional association hypergraph. A hierarchical graph network then fuses component-level structures with topological details in the hybrid graph. Reinforcement-augmented contrastive pre-training produces robust universal representations while in-place perturbation reduces training time. Structure-aware graph prompts finally pro
740likely_noise
low
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation
3D Vision & Geometry / Point Cloud
C. cluster representativepose_calibration_localization; surface_occupancy; data_benchmarkweak or indirect keyword match
abstract3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly in the realm of RGB-based methods. However, the use of RGB images is often limited by issues such as occlusion and privacy constraints. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance. In this work, we introduce a novel balanced multi-modal learning method for 3D HPE, which harnesses the power of RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to assess the contribution of each modality and detect modality imbalance. To address this imbalance, we design a modality learning regulation strategy that decelerates the learning process during the early stages of training. We conduct extensive experiments on the
741likely_noise
low
4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis
3D Vision & Geometry / Point Cloud
C. cluster representativedynamic_4d; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractRotation invariance remains a core challenge in point cloud analysis, where existing methods often struggle with structural ambiguities and insufficient global context. Most rotation-invariant (RI) representations are derived from local coordinate systems, which inherently suffer from point-pair ambiguities and fail to capture discriminative features in symmetric or repetitive structures, while discarding informative global pose cues. To overcome these limitations, we propose Ga4DPF, a novel framework that offers a robust, global-aware RI representation by converting rotation-equivariant geometric representations into invariant ones, while concurrently integrating global pose awareness. Specifically, Ga4DPF introduces a learnable steerable transform that equivariantly lifts point clouds into 4D space, facilitating robust local feature construction and mitigating point-pair ambiguities. C
742likely_noise
low
Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis
3D Vision & Geometry / Point Cloud
C. cluster representativepose_calibration_localization; surface_occupancy; data_benchmarkkeyword noise pattern without direct reconstruction signal
abstractA symmetry on rigid motion is one of the salient factors in efficient learning of 3D point cloud problems. Group convolution has been a representative method to extract equivariant features, but its realizations have struggled to retain both rigorous symmetry and scalability simultaneously. We advocate utilizing the intertwiner framework to resolve this trade-off, but previous works on it, which did not achieve complete SE(3) symmetry or scalability to large-scale problems, necessitate a more advanced kernel architecture. We present Equivariant Coordinate-based Kernel Convolution, or ECKConv. It acquires SE(3) equivariance from the kernel domain defined in a double coset space, and its explicit kernel design using coordinate-based networks enhances its learning capability and memory efficiency. The experiments on diverse point cloud tasks, e.g., classification, pose registration, part se
743likely_noise
low
Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence
3D Vision & Geometry / Point Cloud
C. cluster representativedepth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractUnsupervised non-rigid point cloud correspondence aims to predict point-to-point correspondences without annotations. Existing methods leverage the spatial-relation-based feature propagation strategy that includes non-physical connections, which are sensitive to non-rigid deformation. To address this issue, we advocate to learn shape topology robust to non-rigid deformation, and propose the topology-aware feature propagation module integrated into a coarse-to-fine propagation and optimization pipeline. To extract point features robust to non-rigid deformation, we estimate keypoints as superpoints and encode superpoint features with topology weights, which learns reasonable topologies under non-rigid deformation. The vector quantization codebook is leveraged to enhance the original superpoint features with stored representative features across the dataset, improving feature robustness aga
744likely_noise
low
Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; dynamic_4dweak or indirect keyword match
abstractHuman motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between "perception" models that understand motion from video but only output text, and "generation" models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Mo
745likely_noise
low
COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondenceweak or indirect keyword match
abstractEstimating the 6DoF pose of a novel object with a single reference view is challenging due to occlusions, viewpoint changes, and outliers. A core difficulty lies in finding robust cross-view correspondences, as existing methods often rely on discrete one-to-one matching that is non-differentiable and tends to collapse onto sparse keypoints. We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem. COG produces balanced soft correspondences by predicting point-wise confidences and injecting them as target marginals, naturally suppressing non-overlapping regions. Semantic priors from vision foundation model features further regularize the correspondences, leading to stable pose estimation. This design integrates confidence into the end-to-end correspondence finding and
746likely_noise
low
Exploring 6D Object Pose Estimation with Deformation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; data_benchmarkweak or indirect keyword match
abstractWe present DeSOPE, a large-scale dataset designed for Deformed Six-DoF Object Pose Estimation. Most existing 6D object pose approaches assume rigid or articulated objects, leaving deformed daily objects largely unexplored. This gap limits the realism and robustness of current pose estimation methods, which often fail when objects deviate from their canonical shapes due to wear, collision, or deformation. To address this issue, we present DeSOPE, a large-scale real-world dataset specifically designed for deformed object pose estimation. DeSOPE contains two major components: (1) a collection of high-fidelity 3D scans of 26 common object categories, each captured in one canonical and three deformed states using a non-rigid alignment framework; and (2) a real-scene RGB-D dataset comprising 133K frames and 665K pose annotations across 104 deformed instances, recorded in both static and dynami
747likely_noise
low
DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondenceweak or indirect keyword match
abstractNoisy, partially overlapping data and the need for real-time processing pose major challenges for rigid registration. Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. First, we introduce an efficient filtering mechanism that incorporates a computationally lightweight single-point RANSAC algorithm followed by a refinement module to eliminate unreliable feature-based correspondences. Subsequently, we treat filtered correspondences as anchor points, extract geometric proxies, and formulate an effective objective function with a tailored solver to estimate the transformation. Experiments verify our method'
748likely_noise
low
mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds
3D Vision & Geometry / Point Cloud
C. cluster representativedepth_correspondence; surface_occupancy; data_benchmarkweak or indirect keyword match
abstractMillimeter-wave (mmWave) point clouds have attracted growing interest in human sensing due to their robustness, privacy preservation, and low cost. However, their practical use is hindered by the inherent sparsity of data and the lack of large-scale data. We revisit generative modeling for mmWave point clouds and propose a unified flow-matching framework, mmWaveFlow, that unifies enhancement and generation by learning an invertible transport between dense and sparse point clouds. We leverage paired data and a latent-alignment module to enforce semantic alignment and bridge the modality gap. We find that condition-free flow matching is more vulnerable to latent path crossings, which impair bidirectional transport. Therefore, we propose Origin-Aware Flow Matching (OA-Flow), which conditions transport on the origin of the path to mitigate ambiguity in bidirectional transport. Results of exper
749likely_noise
low
RINO: Rotation-Invariant Non-Rigid Correspondences
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondenceweak or indirect keyword match
abstractDense 3D shape correspondence remains a central challenge in computer vision and graphics as many deep learning approaches still rely on intermediate geometric features or handcrafted descriptors, limiting their effectiveness under non-isometric deformations, partial data, and non-manifold inputs. To overcome these issues, we introduce RINO, an unsupervised, rotation-invariant dense correspondence framework that effectively unifies rigid and non-rigid shape matching. The core of our method is the novel RINONet, a feature extractor that integrates vector-based SO(3)-invariant learning with orientation-aware complex functional maps to extract robust features directly from raw geometry. This allows for a fully end-to-end, data-driven approach that bypasses the need for shape pre-alignment or handcrafted features. Extensive experiments show unprecedented performance of RINO across challengi
750likely_noise
low
Scalable Feature Matching via State Space Modeling and Sparse Correlation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondenceweak or indirect keyword match
abstractEfficient and robust feature matching is crucial for latency-sensitive and resource-constrained applications. However, current semi-dense feature matching approaches commonly suffer from quadratic complexity in spatial resolution due to transformer-based long-range context modeling or redundant full correlation computations. To overcome these limitations, we present a novel scalable feature matching method that delivers reliable correspondences with low memory footprint and latency, especially at high resolutions. Our approach introduces three key innovations: (1) a hybrid Conv-Mamba backbone for efficient cross-scale and cross-view feature extraction with linear complexity, (2) a training-free norm-based feature filtering mechanism, enabling sparse correlation that significantly reduces computation overhead during inference, and (3) a lightweight recurrent coordinate refinement that surpas
751likely_noise
low
KASALv2: Fully Automatic 3D Rotational Symmetry Classification and Axis Localization
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; data_benchmarkkeyword noise pattern without direct reconstruction signal
abstractRotational symmetry is an important prior in 6D pose estimation, improving pose accuracy and ensuring the consistency of symmetry-aware evaluation metrics. However, current symmetry annotations for 3D objects are still largely manual or semi-automatic, often requiring predefined symmetry types or rotational orders and thus limiting scalability. This work introduces a fully automatic and reference-free framework that performs symmetry-type classification, rotational-order identification, and full-axis localization across all eight canonical 3D rotational symmetry types. The method localizes a dominant high-order axis, infers its rotational order through self-consistency analysis, and reconstructs the complete symmetry structure under a hierarchy-guided geometric formulation. A texture-aware extension further models appearance-induced reductions in rotational order while preserving axis or
752likely_noise
low
Affine Perspective-Three-Point Problem
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondenceweak or indirect keyword match
abstractThis paper addresses the Perspective-Three-Point (P3P) problem under affine camera models. We derive direct closed-form solvers for weak perspective and para perspective, which are representative affine camera models. The affine P3P solution reduces to a bi-quadratic equation. Unlike exact P3P solvers that require a cubic or quartic equation, it allows for the simple and stable calculation of real solutions using the quadratic formula. Since affine approximations are valid only when scene depth variation is small, we further propose an iterative correction that upgrades the affine solution to the exact P3P solution. Through extensive comparisons using synthetic data and public datasets, we demonstrate that affine P3P solvers with two upgrade iterations achieve performance substantially comparable to that of the state-of-the-art P3P solver.
753likely_noise
low
Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondenceweak or indirect keyword match
abstractIn this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods that typically predict intermediate 2D keypoints, followed by a Perspective-$n$-Point solver, have shown great performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. However, direct heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness in pose prediction. They also predict, in most cases, discontinuous pose representations that lack robustness. Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end
754likely_noise
low
Linear Fundamental Matrix Estimation from 7 or 5 Points
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; depth_correspondenceweak or indirect keyword match
abstractWe revisit the problem of estimating the fundamental matrix of a pair of perspective cameras, a cornerstone of geometric computer vision. As is well-known, linear solvers require at least 8 point correspondences, whereas nonlinear minimal solvers require just 7 in the uncalibrated case or 5 in the calibrated case. In this paper, we consider a special case of the 7-point problem where 5 of the points are configured to lie on two lines, which has previously been shown to have a unique solution. As a theoretical contribution, we offer an analysis of how this uniqueness manifests in the standard 7-point algorithm. On a practical level, we provide the first practical linear solver for the minimal problem associated to this special configuration. Additionally, we evaluate a heuristic 5-point fundamental matrix solver based on the construction of virtual midpoints. When combined with early non-minim
755likely_noise
low
GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global–Local Feature Fusion
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; data_benchmarkweak or indirect keyword match
abstractWe present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations. Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture th
756likely_noise
low
Through the Frequency Lens: Cross-Domain Generalisable Gaze Estimation with Adaptive Modulation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; data_benchmarkweak or indirect keyword match
abstractDeep learning-based gaze estimation methods often exhibit significant performance degradation on unseen target domains. Through systematic frequency-domain analysis, we reveal that face images contain frequency components with distinct contributions: some facilitate cross-domain generalization while others introduce domain-specific interference that impedes it, with both components varying across datasets and constituting a key source of domain gap. Based on these observations, we propose the Frequency-Guided Adaptive Learning framework (FGAL), a novel framework enhancing domain generalization without accessing target domain data. The FGAL consists of two complementary modules: the Adaptive Interference Suppression Module (AISM) and the Spectrum Diversification Module (SDM). AISM adaptively suppresses sample-specific interfering frequency components through learnable modulation maps, whi
757likely_noise
low
FusionRegister: Every Infrared and Visible Image Fusion Deserves Registration
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; data_benchmarkweak or indirect keyword match
abstractSpatial registration across different visual modalities is a critical but formidable step in multi-modality image fusion for real-world perception. Although several methods have been proposed to address this issue, existing joint registration and fusion methods typically require extensive pre-registration operations, limiting their efficiency. To overcome these limitations, a general cross-modality registration method guided by visual priors, termed FusionRegister, is proposed for the multi-modality image fusion task. Firstly, FusionRegister achieves robustness by learning cross-modality misregistration representations rather than forcing alignment of all differences, ensuring stable outputs even under challenging input conditions. Moreover, FusionRegister demonstrates strong generality by operating directly on fused results, where misregistration is explicitly represented and effectivel
758likely_noise
low
Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localization; data_benchmarkweak or indirect keyword match
abstractEstimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Gröbner-basis or resultant-based methods due to matrix inversion needed in the online solver. Here we propose a sampling-based, matrix inversion-free method that constructs the solvers using sparse hidden-variable resultants. The determinant polynomial in the hidden variable is efficiently reconstructed via inverse fast Fourier transform interpolation from sampled evaluations, avoiding symbolic expansion. Solving this polynomial yields the hidden variable, and the remaining unknowns are recovered by identifying rank-1 deficient submatrices and applying Cramer's rule. A greatest common divisor-based criterion ensures robust submatrix identification under noise. Experiments on diverse minimal problem
759likely_noise
low
WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering
3D Vision & Geometry / 3D Gaussian Splatting
C. cluster representativegaussian_radianceweak or indirect keyword match
abstractExisting methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local p
760likely_noise
low
Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning
3D Vision & Geometry / Point Cloud
C. cluster representativegaussian_radiance; surface_occupancyweak or indirect keyword match
abstractLarge multimodal 3D vision-language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable. To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual param
761likely_noise
low
PointCNN++: Performant Convolution on Native Points
3D Vision & Geometry / Point Cloud
C. cluster representativepose_calibration_localization; surface_occupancyweak or indirect keyword match
abstractExisting convolutional learning methods for 3D point cloud data are divided into two paradigms: point-based methods that preserve geometric precision but often face performance challenges, and voxel-based methods that achieve high efficiency through quantization at the cost of geometric fidelity. This loss of precision is a critical bottleneck for tasks such as point cloud registration. We propose PointCNN++, a novel architectural design that fundamentally mitigates this precision-performance trade-off. It generalizes sparse convolution from voxels to points, treating voxel-based convolution as a specialized, degraded case of our more general point-based convolution. First, we introduce a point-centric convolution where the receptive field is centered on the original, high-precision point coordinates. Second, to make this high-fidelity operation performant, we design a computational stra
762likely_noise
low
PointTPA: Test-Time Parameter Adaptation for 3D Scene Understanding
3D Vision & Geometry / Point Cloud
C. cluster representativedynamic_4d; surface_occupancyweak or indirect keyword match
abstractScene-level point cloud understanding remains challenging due to diverse geometries, imbalanced categories, and highly varied spatial layouts. Existing methods improve object-level performance but rely on static parameters during inference, limiting their adaptability to dynamic scene data. We propose Test-time Parameter Adaptation for Point Cloud Scene Perception (PointTPA), a test-time dynamic adaptation framework that constructs input-aware parameters for scene-level point clouds. PointTPA uses a Serialization-based Neighborhood Grouping (SNG) to form locally coherent patches and a Dynamic Parameter Projector (DPP) to produce patch-wise adaptive weights, enabling the backbone to adjust its behavior according to scene-specific variations while keeping parameter cost low. Integrated into PTv3, PointTPA reduces trainable parameters by over 95% and achieves competitive or superior perform
763likely_noise
low
Streamlined Open-Vocabulary Human-Object Interaction Detection
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localizationkeyword noise pattern without direct reconstruction signal
abstractOpen-vocabulary human-object interaction (HOI) detection aims to localize and recognize all human-object interactions in an image, including those unseen during training. Existing approaches usually rely on the collaboration between a conventional HOI detector and a Vision-Language Model (VLM) to recognize unseen HOI categories. However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. Our design leverages the complementary strengths of DINOv3's components: its backbone for fine-grained localization and its text-aligned vision head for open-vocabulary interaction classification. Moreover, to facilitate smooth cross-attention between the interaction queries and the vision head's output,
764likely_noise
low
ViTPrompt: Training-Free Prompt Refinement with Visual Tokens for Open-Vocabulary Detection
3D Vision & Geometry / Pose Estimation
C. cluster representativepose_calibration_localizationkeyword noise pattern without direct reconstruction signal
abstractTest-Time Adaptive Object Detection (TTAOD) aims to maintain detection performance under distribution shifts without retraining. While recent vision-language models enable open-vocabulary detection, existing TTAOD methods—whether closed-set or open-vocabulary—focus exclusively on improving classification confidence and largely overlook the degradation of bounding box localization. To address this critical gap, we propose ViTPrompt (Visual Token-Prompting), a training-free framework that jointly refines both bounding boxes and class scores at test time. Our key insight is to augment the original text prompt with instance-aware visual tokens extracted from high-confidence detections in an initial forward pass; this enriched prompt is then used in a second inference stage, where the cross-modal decoder leverages the enhanced semantic context to produce more accurate box coordinates and clas
765 | likely_noise
low
ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
3D Vision & Geometry / Point Cloud
C. cluster representative | depth_correspondence; surface_occupancy | keyword noise pattern without direct reconstruction signal
abstract: Hierarchical LiDAR geometry compression encodes voxel occupancies from low to high bit-depths, yet prior methods treat each depth independently and re-estimate local context from coordinates at every level, limiting compression efficiency. We present ELiC, a real-time framework that combines cross-bit-depth feature propagation, a Bag-of-Encoders (BoE) selection scheme, and a Morton-order-preserving hierarchy. Cross-bit-depth propagation reuses features extracted at denser, lower depths to support prediction at sparser, higher depths. BoE selects, per depth, the most suitable coding network from a small pool, adapting capacity to observed occupancy statistics without training a separate model for each level. The Morton hierarchy maintains global Z-order across depth transitions, eliminating per-level sorting and reducing latency. Together these components improve entropy modeling and comp
766 | likely_noise
low
Towards Generalized Multimodal Homography Estimation
3D Vision & Geometry / Pose Estimation
C. cluster representative | pose_calibration_localization | weak or indirect keyword match
abstract: Supervised and unsupervised homography estimation methods depend on image pairs tailored to specific modalities to achieve high accuracy. However, their performance deteriorates substantially when applied to unseen modalities. To address this issue, we propose a training data synthesis method that generates unaligned image pairs with ground-truth offsets from a single input image. Our approach renders the image pairs with diverse textures and colors while preserving their structural information. These synthetic data empower the trained model to achieve greater robustness and improved generalization across various domains. Additionally, we design a network to fully leverage cross-scale information and decouple color information from feature representations, thus improving estimation accuracy. Extensive experiments show that our training data synthesis method improves generalization perfor
767 | likely_noise
low
AnyPcc: Compressing Any Point Cloud with a Single Universal Model
3D Vision & Geometry / Point Cloud
C. cluster representative | surface_occupancy; data_benchmark | keyword noise pattern without direct reconstruction signal
abstract: Generalization remains a critical challenge in deep learning-based point cloud geometry compression. While existing methods perform well on standard benchmarks, their performance collapses in real-world scenarios due to two fundamental limitations: the lack of context models that are robust across diverse data densities, and the inability to efficiently adapt to out-of-distribution (OOD) data. To overcome both challenges, we introduce AnyPcc, a universal point cloud compression framework. AnyPcc first employs a Universal Context Model that leverages coarse-grained spatial priors with fine-grained channel priors to ensure robust context modeling across the entire density spectrum. Second, our novel Instance-Adaptive Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and implicit compression paradigms. For each instance, it fine-tunes a small subset of network weights and
768 | likely_noise
low
Image-to-Point Cloud Feature Back-projection for Multimodal Training of 3D Semantic Segmentation
3D Vision & Geometry / Point Cloud
C. cluster representative | depth_correspondence; surface_occupancy | weak or indirect keyword match
abstract: The effective integration and utilization of multimodal data acquired from image cameras and LiDAR is of paramount importance for perception systems. This paper proposes **I**mage-to-**P**oint Cloud **F**eature Back-**P**rojection (**IPFP**), a novel method for training multimodal fusion networks that back-projects aggregated image-feature centers (from non-projection-aligned image pixels) into the point-cloud feature set via the estimated depth map. Consequently, image features and point cloud features reside within the same three-dimensional space, enabling the natural enrichment of image information into the point cloud during the network forward pass. This process can be selectively enabled when desired -- for instance, at training time -- and turned off in the absence of multimodal data -- for example, at testing time if only LiDAR sensors are available. Experimental results demonst
769 | likely_noise
low
PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
3D Vision & Geometry / Point Cloud
C. cluster representative | surface_occupancy; data_benchmark | weak or indirect keyword match
abstract: Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias reinforces short-range continuity but limits the model’s ability to capture long-range dependencies, thereby weakening its capacity to enforce global structural properties such as symmetry, geometric consistency, and large-scale spatial regularities. Inspired by the level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization al
770 | likely_noise
low
PointCSP: Cross-Sample Semantic Propagation and Stability Preservation in Self-Supervised Point Cloud Learning
3D Vision & Geometry / Point Cloud
C. cluster representative | surface_occupancy; data_benchmark | weak or indirect keyword match
abstract: Scene-level point cloud self-supervised learning (PC-SSL) has demonstrated potential in enhancing the generalization capability of 3D vision models. Despite the advances achieved in the field through existing methods, the sample-independent modelling paradigm still poses significant limitations in terms of maintaining consistent semantic representations across different scenes. This challenge hinders the construction of a unified and transferable semantic space. To address this issue, we propose a PC-SSL framework based on cross-sample semantic propagation (CSP), in which samples within a batch are serialized into a continuous input and processed by a state-space model to enable semantic state propagation. This mechanism explicitly models the dynamic dependencies across samples in the state space, allowing the network to establish cross-sample semantic consistency in the latent space, an
771 | likely_noise
low
Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation
3D Vision & Geometry / Point Cloud
C. cluster representative | surface_occupancy; data_benchmark | weak or indirect keyword match
abstract: Adverse-weather LiDAR point cloud generation is challenged by complex weather-induced degradations. These degradations affect geometry and reflectance in fundamentally different ways, making joint modeling difficult and ambiguous, especially when diverse real-world training data is limited. To address this, we propose $\textit{Structure-to-Intensity Diffusion}$ (SiD), a diffusion-based framework that explicitly factorizes the denoising process at each time step: it first reconstructs the geometric structure, then conditions reflectance intensity denoising on the estimated structure. This structure-conditioned design decomposes the joint distribution, reduces modeling ambiguity, and leads to point clouds that are both geometrically coherent and radiometrically realistic. To mitigate data scarcity, we introduce $\textit{Real-Prior Weather Simulation}$ (RPWS), a degradation module that leve
772 | likely_noise
low
Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination
3D Vision & Geometry / Point Cloud
C. cluster representative | surface_occupancy; data_benchmark | weak or indirect keyword match
abstract: LiDAR semantic segmentation must remain robust under various sensor and environmental corruptions to be reliable in safety-critical applications. Existing test-time adaptation methods, including approaches based on pseudo-labels and normalization statistics, have shown promising results but can still struggle under severe distribution shifts. To complement these approaches, we propose a geometry-aware test-time training framework that leverages an auxiliary self-supervised objective. Our method is based on geometric inlier discrimination (GeoID), which injects synthetic off-manifold points into the input and trains the model to distinguish geometry-consistent inliers from synthetically displaced outliers, enabling adaptation on unlabeled test data. To further stabilize this process under real corruptions, we introduce bidirectional unreliable point filtering (BiUPF), which uses inlier scores
773 | likely_noise
low
LitePT: Lighter Yet Stronger Point Transformer
3D Vision & Geometry / Point Cloud
C. cluster representative | surface_occupancy | weak or indirect keyword match
abstract: Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has 3.6
774 | likely_noise
low
Low-Rank Test-Time Training for Pre-Trained Point Cloud Models
3D Vision & Geometry / Point Cloud
C. cluster representative | surface_occupancy | keyword noise pattern without direct reconstruction signal
abstract: Test-time training (TTT) enhances the robustness of pretrained models to out-of-distribution (OOD) data through auxiliary self-supervised tasks, without requiring labeled samples. However, existing TTT methods predominantly rely on decoder-based auxiliary objectives, which suffer from inefficient adaptation and weak coupling with the primary task. To solve these limitations, we revisit the mechanism of test-time training by analyzing masking-based pretrained models to uncover the fundamental source of their OOD robustness. Our investigation reveals that their generalization capability stems from a latent feature-level structural invariance, the consistency of encoded representations under masked perturbations. Building on this insight, we introduce LoTT-PC, a lightweight LoRA-based framework that operationalizes this invariance-preserving principle for 3D point cloud classification. LoTT
775 | likely_noise
low
Point Cloud as a Foreign Language for Multi-modal Large Language Model
3D Vision & Geometry / Point Cloud
C. cluster representative | surface_occupancy | weak or indirect keyword match
abstract: Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens—treating 3D data as a foreign language that naturally extends the LLM’s vocabulary. Furthermore, to enhance t
776 | likely_noise
low
PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Model
3D Vision & Geometry / Point Cloud
C. cluster representative | surface_occupancy | weak or indirect keyword match
abstract: This paper explores parallel thinking for Multi-modal Large Language Models (MLLMs), aiming to improve Chain-of-Thought (CoT) through multiple diverse reasoning paths. We guide the model to list multiple visual key points and develop an independent reasoning path for each. Therefore, we term this method PointThinker, which is characterized by starting each thinking path with a point. PointThinker offers two key advantages. (1) It amplifies the benefits of parallel thinking. While parallel thinking naturally benefits from multiple reasoning paths, explicitly listing key points further amplifies these benefits by eliminating redundancy and promoting path diversity, enabling the model to explore problems from more varied perspectives. (2) It uses a novel dense (point-wise) reward for reinforcement learning. We observe that during parallel thinking, some points are helpful while others are i
777 | likely_noise
low
Routing on Demand: DSNet for Efficient Progressive Point Cloud Denoising
3D Vision & Geometry / Point Cloud
C. cluster representative | surface_occupancy | weak or indirect keyword match
abstract: Point cloud denoising is a critical preprocessing step for enhancing the reliability and accuracy of 3D perception systems. Most existing progressive denoising methods rely on fixed iterative pipelines that process all regions uniformly, resulting in redundant computation and over-smoothing of geometric details when handling point clouds with non-uniform noise distributions. To overcome these limitations, we introduce Dynamic Skip Net (DSNet), a novel progressive denoising framework that adaptively determines the optimal denoising path for each local patch based on its noise characteristics. DSNet incorporates a noise discriminator that quantifies local noise intensity by analyzing normal similarity, and a reverse monotonic decision function that maps this measure to an appropriate denoising module. Furthermore, we propose a Path-Selective Iteration mechanism that dynamically re-evaluate
778 | likely_noise
low
A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful context | general_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark | weak or indirect keyword match
abstract: In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enable cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration modul
779 | likely_noise
low
Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer
Multimodal & Language / Agentic AI
D. adjacent but useful context | general_reconstruction; gaussian_radiance; dynamic_4d; generation_editing; data_benchmark | weak or indirect keyword match
abstract: Recent progress in 4D representations, such as Dynamic NeRF and 4D Gaussian Splatting (4DGS), has enabled dynamic 4D scene reconstruction. However, text-driven 4D scene editing remains under-explored due to the challenge of ensuring both multi-view and temporal consistency across space and time during editing. Existing studies rely on 2D diffusion models that edit frames independently, often causing motion distortion, geometric drift, and incomplete editing. We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. This mechanism consists of Spatio-Temporal Sub-Grid Attention (STGA) for locally consistent cross-view and temporal fusion, and Context Token Propagation (CTP) for global propagation via token inheritance and optical-flow-guided token replacement. Together, these components allow Dynamic-eDiTor t
780 | likely_noise
low
GaussianDWM: Driving World Model using Language-aligned 3D Gaussians for Scene Understanding and Multi-modal Generation
Autonomous Driving / Autonomous Driving
D. adjacent but useful context | gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing | weak or indirect keyword match
abstract: Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality
781 | likely_noise
low
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
Autonomous Driving / Autonomous Driving
D. adjacent but useful context | general_reconstruction; pose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmark | weak or indirect keyword match
abstract: We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world–scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded in dashcam. Existing VO methods are trained on fixed observation frequency (e.g., 10Hz or 12Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These significantly limit their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage. To address these challenges, OpenVO (1) expl
782 | likely_noise
low
PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful context | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | weak or indirect keyword match
abstract: Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots, crucial for contact reasoning, while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advance
783 | likely_noise
low
SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful context | pose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mapping; generation_editing; data_benchmark | weak or indirect keyword match
abstract: Controlling both camera motion and object dynamics is essential for coherent and expressive video generation, yet current methods typically handle only one motion type or rely on ambiguous 2D cues that entangle camera-induced parallax with true object movement. We present SymphoMotion, a unified motion-control framework that jointly governs camera trajectories and object dynamics within a single model. SymphoMotion features a Camera Trajectory Control mechanism that integrates explicit camera paths with geometry-aware cues to ensure stable, structurally consistent viewpoint transitions, and an Object Dynamics Control mechanism that combines 2D visual guidance with 3D trajectory embeddings to enable depth-aware, spatially coherent object manipulation. To support large-scale training and evaluation, we further construct RealCOD-25K, a comprehensive real-world dataset containing paired came
784 | likely_noise
low
TACO: Task-Aware Contrastive Learning for Joint LiDAR Localization and 3D Object Detection
Robotics & Embodied AI / Embodied AI
D. adjacent but useful context | general_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark | weak or indirect keyword match
abstract: Reliable navigation and decision-making of autonomous vehicles require both accurate localization and object detection. Traditionally, these two tasks are handled separately, leading to redundant computation and limited cross-task knowledge transfer. This paper proposes TACO, the first Task-Aware COntrastive learning framework, which performs joint LiDAR localization and 3D object detection within a single, unified network. TACO leverages contrastive learning to explicitly decouple and align static geographic features for localization and object-centric features for detection. This bidirectional mutual supervision not only enhances localization robustness in dynamic environments by filtering dynamic noise but also boosts detection accuracy via effective spatial context. Additionally, we propose OxfoLD, the first dataset that provides multi-traversal LiDAR localization ground truth with r
785 | likely_noise
low
Test-Time 3D Occupancy Prediction
Autonomous Driving / Autonomous Driving
D. adjacent but useful context | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping | weak or indirect keyword match
abstract: Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. Our method incrementally constructs, optimizes and voxelizes time-aware 3D Gaussians from raw sensor streams by integrating vision foundation models (VFMs) at runtime. The flexible nature of 3D Gaussians allows voxelization at arbitrary user-specified resolutions, while the generalization ability of VFMs enables accurate perception and open-vocabulary recognition,
786 | likely_noise
low
UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling
Autonomous Driving / Autonomous Driving
D. adjacent but useful context | general_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | weak or indirect keyword match
abstract: Dynamic driving scene reconstruction is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods have shown promise for 3D reconstruction, they struggle with long-range driving sequences due to quadratic complexity in sequence length and challenges in modeling dynamic objects over extended durations. We propose UFO, a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction. Our approach maintains a 4D scene representation that is iteratively refined as new observations arrive, using a visibility-based filtering mechanism to select informative scene tokens and enable efficient processing of long sequences. For dynamic objects, we introduce an object pose-guided modeling approach that supports accurate long-range motion capture. Experiments on the Waymo Open D
787 | likely_noise
low
UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
Robotics & Embodied AI / Embodied AI
D. adjacent but useful context | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark | weak or indirect keyword match
abstract: Perceiving and reconstructing objects from images are critical for real-to-sim transfer tasks, which are widely used in the robotics community. Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline. However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context. To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework. Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve the scale ambiguity. We introduce Pose-Aware Shape Representation to eliminate the need for per-category canonical definitions and to bridge the gap between reconstruction and pose estimation tasks. F
788 | likely_noise
low
UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation
Autonomous Driving / Autonomous Driving
D. adjacent but useful context | general_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping | weak or indirect keyword match
abstract: Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods leverage the discrepancy between input images and rendered outputs, or multiview consistency constraints, as the driving force to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and overlook the importance of fully exploiting fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates uncertainty distribution of hand pose and constructs a probabilistic point cloud feature space, which enables the complex spatiotemporal relationship modeling. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and samples diver
789 | likely_noise
low
SyncMos: Scalable Motion Synchronisation for Multi-Agent Scene Interaction
Robotics & Embodied AI / Embodied AI
D. adjacent but useful context | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | weak or indirect keyword match
abstract: Text-guided motion generation in 3D scenes has advanced the synthesis of human–scene interactions, contributing to embodied AI, scene understanding, and virtual agent simulation. While recent studies have begun exploring multi-agent scenarios, achieving temporally synchronised interactions among multiple agents remains an open challenge. Existing methods are often limited in flexibility and scalability when handling diverse interaction contexts. We present a method that enables synchronised multi-agent interaction using a single-agent motion synthesis model through two key components: a text-guided dependency-aware story planner and a temporal synchronisation module. The story planner interprets natural language instructions into structured event sequences with temporal dependencies. Our synchronisation module, built upon time-warping control and diffusion posterior sampling, aligns inter
790 | likely_noise
low
PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
Segmentation & Dense Prediction / Depth / Optical Flow
D. adjacent but useful context | general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmark | weak or indirect keyword match
abstract: Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes jitter but can also lead to estimation failures when the depth range changes abruptly. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we estimate camera pose and sparse depth from triangulation using optical flow between consecutive frames. The sparse depth estimates are used to update a recursive Bayesian estimate of the metric scale, which is then applied to rescale the relative depth predicted by a pre-trained depth estimation
791 | likely_noise
low
DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
Autonomous Driving / Autonomous Driving
D. adjacent but useful context | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing | weak or indirect keyword match
abstract: Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Fi
792 | likely_noise
low
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful context | general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark | weak or indirect keyword match
abstract: Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV) —a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)–based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-sc
793likely_noise
low
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextdepth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractCurrent Vision-Language-Action (VLA) models primarily focus on mapping 2D observations to actions but exhibit notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, introducing substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction, lacking alignment with the instruction-grounded scene and thus compromising spatiotemporal consistency. To address these challenges, we propose **ConsisVLA-4D**, a unified and efficient framework that enhances spatiotemporal consistency in 3D-Perception and 4D-Reasoning. Specifically, we design: **1) CV-Aligner**, which ensures **C**ross-**V**iew object semantic consistency via filtering instruction-relevant regions and aligning object identities across multiple viewpoints; **2) CO-Fuser**, which guarantees **C**ross-**O**bject spatial g
794likely_noise
low
Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmarkweak or indirect keyword match
abstractWe introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations—such as translating, rotating, or resizing objects—due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampli
795likely_noise
low
Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
Segmentation & Dense Prediction / Depth / Optical Flow
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancyweak or indirect keyword match
abstractSoft boundaries, like thin hairs, are commonly observed in natural and computer-generated imagery, but they remain challenging for 3D vision due to the ambiguous mixing of foreground and background cues. This paper introduces Guardians of the Hair (HairGuard), a framework designed to recover fine-grained soft boundary details in 3D vision tasks. Specifically, we first propose a novel data curation pipeline that leverages image matting datasets for training and design a depth fixer network to automatically identify soft boundary regions. With a gated residual module, the depth fixer refines depth precisely around soft boundaries while maintaining global depth quality, allowing plug-and-play integration with state-of-the-art depth models. For view synthesis, we perform depth-based forward warping to retain high-fidelity textures, followed by a generative scene painter that fills disocclude
796likely_noise
low
Structural Action Transformer for 3D Dexterous Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextdepth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractAchieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce
797likely_noise
low
Spe-BEVHead: Rethinking the Detection Head Design for Bird’s-Eye-View Object Detection
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiance; surface_occupancy; robotics_mappingkeyword noise pattern without direct reconstruction signal
abstractBird’s-Eye-View (BEV) detection has become a dominant paradigm for 3D object detection in autonomous driving, due to its strong perception capability. However, most existing methods mainly focus on constructing high-quality BEV feature representations, while neglecting the design of task-specific detection heads. In practice, they directly adopt the center-based head originally developed for 2D detection, without any specific optimization. This leads to three inherent limitations: (i) a geometric mismatch between the Gaussian kernel used for classification and the real BEV object, (ii) degraded end-to-end performance without Non-Maximum Suppression (NMS), and (iii) sparse supervisory signals. To address these issues, we propose Spe-BEVHead, a detection head specifically tailored for BEV 3D object detection. Spe-BEVHead introduces three BEV-specific adaptations: (1) a Rotated Box Kernel th
798likely_noise
low
LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World
Detection & Tracking / Tracking
D. adjacent but useful contextgeneral_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4dweak or indirect keyword match
abstractTracking 3D human motion from egocentric, multi-camera devices is challenged by severe egomotion and partial visibility or occlusions. Existing methods are designed for monocular video often recorded from static or slowly-moving cameras and cannot easily leverage multi-view, calibrated and localized input. This makes them brittle and prone to fail on dynamic egocentric captures. We propose LAMP ($\textbf{L}$ocalization $\textbf{A}$ware $\textbf{M}$ulti-camera $\textbf{P}$eople Tracking): a novel, simple framework to solve this via early disentanglement of observer and target motion. LAMP introduces a two-step process: First, we leverage the device's known 6-DoF pose and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained Transformer model fits 3D human motion directly to this spa
799likely_noise
low
SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
Video & Motion / Video Understanding
D. adjacent but useful contextgaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancyweak or indirect keyword match
abstractWeakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, existing methods focus merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Fur
800likely_noise
low
Thinking in 360°: Humanoid Visual Search in the Wild
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractHumans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, u
801likely_noise
low
Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors
Robustness & Safety / Safety
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractAdversarial robustness of BEV 3D object detectors is critical for autonomous driving (AD). Existing invasive attacks require altering the target vehicle itself (*e.g.* attaching patches), making them unrealistic and impractical for real-world evaluation. While non-invasive attacks that place adversarial objects in the environment are more practical, current methods still lack the multi-view and temporal consistency needed for physically plausible threats. In this paper, we present the first framework for generating universal, non-invasive, and 3D consistent adversarial objects that expose fundamental vulnerabilities for BEV 3D object detectors. Instead of modifying target vehicles, our method inserts rendered objects into scenes with an occlusion-aware module that enforces physical plausibility across views and time. To maintain attack effectiveness across views and frames, we optimize a
802likely_noise
low
VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; robotics_mapping; generation_editingweak or indirect keyword match
abstractRecent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a sh
803likely_noise
low
Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractAffordance segmentation aims to decompose 3D objects into parts that serve distinct functional roles, enabling models to reason about object interactions rather than mere recognition. Existing methods, mostly following the paradigm of 3D semantic segmentation or prompt-based frameworks, struggle when geometric cues are weak or ambiguous, as sparse point clouds provide limited functional information. To overcome this limitation, we leverage the rich semantic knowledge embedded in large-scale 2D Vision Foundation Models (VFMs) to guide 3D representation learning through a cross-modal alignment mechanism. Specifically, we propose Cross-Modal Affinity Transfer (CMAT), a pretraining strategy that compels the 3D encoder to align with the semantic structures induced by lifted 2D features. CMAT is driven by a core affinity alignment objective, supported by two auxiliary losses, geometric reconst
804likely_noise
low
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgaussian_radiance; dynamic_4d; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractHierarchical Vision–Language–Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. They typically comprise a Vision–Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency. A pronounced distributional gap between isotropic noise priors and target action distributions increases denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraint of history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM repla
805likely_noise
low
ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractRecent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradig
806likely_noise
low
Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractVision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that enables models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware
807likely_noise
low
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractRobotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce
808likely_noise
low
Structural–Semantic Perception for Diffusion-Guided Temporal Forgery Localization
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextpose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractTemporal Forgery Localization (TFL) is crucial for enhancing the interpretability and accountability of deepfake forensics by precisely pinpointing the manipulated segments. However, existing methods face two limitations: (1) localization precision, where one-shot boundary prediction models fail to rectify inherent initial prediction biases, and temporal emphasis overlooks modality-internal semantic forgery cues, resulting in noise-sensitive localization, and (2) cross-dataset generalization, where fixed-scale temporal receptive fields struggle to accommodate varying manipulation durations across real-world scenarios. To address these challenges, we propose a unified framework based on structural–semantic perception and diffusion-guided refinement. The structural–semantic perception comprises two complementary components: (1) structural perception, which adaptively models manipulation dur
809likely_noise
low
TopoMA: Topology-Guided Multi-Agent Dense RGB 3D Reconstruction via Distributed Inference
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractMulti-agent 3D reconstruction, as a key technology for large-scale VR/AR, robot swarms, and digital twins, has attracted growing attention. Recent end-to-end 3D reconstruction methods achieve strong performance in single-agent scenarios, but they are difficult to directly extend to multi-agent collaborative settings, where they often suffer from unstable tracking, excessive memory consumption, and frequent loop-closure failures, thus failing to meet real-time and large-scale deployment requirements. To address these issues, we propose TOPOMA, a real-time end-to-end 3D reconstruction framework tailored for multi-agent collaboration. TOPOMA explicitly models the spatial topological structure of the scene and tightly couples it with end-to-end representation learning, thereby jointly solving core challenges such as inter-agent spatial alignment and submap fusion. Concretely, we introduce to
810likely_noise
low
Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractWe present Wave-Former, a novel method capable of high-accuracy 3D shape reconstruction for completely occluded, diverse, everyday objects. This capability can open new applications spanning robotics, augmented reality, and logistics. Our approach leverages millimeter-wave (mmWave) wireless signals, which can penetrate common occlusions and reflect off hidden objects. In contrast to past mmWave reconstruction methods, which suffer from limited coverage and high noise, Wave-Former introduces a physics-aware shape completion model capable of inferring full 3D geometry. At the heart of Wave-Former's design is a novel three-stage pipeline which bridges raw wireless signals with recent advancements in vision-based shape completion by incorporating physical properties of mmWave signals. The pipeline proposes candidate geometric surfaces, employs a transformer-based shape completion model desig
811likely_noise
low
ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgaussian_radiance; depth_correspondence; surface_occupancy; robotics_mappingweak or indirect keyword match
abstractMonocular 3D lane detection remains challenging due to depth ambiguity and weak geometric constraints. Mainstream methods rely on depth guidance, BEV projection, and anchor- or curve-based heads with simplified physical assumptions, remapping high-dimensional image features while only weakly encoding road geometry. Lacking an invariant geometric–topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in $\mathbb{R}^3$, lanes are embedded 1D submanifolds, and sampled lane points are dense observations, coupling metric and topology across surfaces, curves, and samples. Building on this, we propose ReManNet: it first produces initial lane predictions with an image backbone and detection head
812likely_noise
low
From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgaussian_radiance; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractVision–Language–Action (VLA) models have recently emerged, demonstrating strong generalization in robotic scene understanding and manipulation. However, when confronted with long-horizon tasks that require defined goal states, such as LEGO assembly or object rearrangement, existing VLA models still face challenges in coordinating long-horizon planning with precise manipulation. Therefore, we aim to endow a VLA model with the capability to infer the “how” process from the “what” outcomes, transforming goal states into executable procedures. In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. Unlike prior VLA models that directly map sensory inputs to actions, we first equip ManualVLA with a planning expert that generates intermediate
813likely_noise
low
Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextpose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractAutonomous trucking poses unique challenges due to articulated tractor–trailer geometry, and time-varying sensor poses caused by the fifth-wheel joint and trailer flex. Existing perception and calibration methods assume static baselines or rely on high-parallax and texture-rich scenes, limiting their reliability under real-world settings. We propose dCAP (dynamic Calibration and Articulated Perception), a vision-based framework that continuously estimates the 6-DoF (degree of freedom) relative pose between tractor and trailer cameras. dCAP employs a transformer with cross-view and temporal attention to robustly aggregate spatial cues while maintaining temporal consistency, enabling accurate perception under rapid articulation and occlusion. Integrated with BEVFormer, dCAP improves 3D object detection by replacing static calibration with dynamically predicted extrinsics. To facilitate eva
814likely_noise
low
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; depth_correspondence; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractEnd-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from the large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive employs a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D v
815likely_noise
low
UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextpose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractDexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision–language–action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. First, we construct UniDex-Dataset, a robot-centric dataset of 10M paired image–pointcloud–action frames and over 50K trajectories across eight dexterous hands (6–24 DoFs), derived from egocentric human video datasets. To transform human data into robot-executable trajectories, we employ a human-in-the-loop retargeting procedure to align fingertip trajectories while preserving plausible hand–object contacts, and we operate on explicit 3D pointclouds with human hands masked to narrow kinema
816likely_noise
low
GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgaussian_radiance; depth_correspondence; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractVision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation
817likely_noise
low
Rethinking Visual Rearrangement from A Diffusion Perspective
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgaussian_radiance; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractRearranging disarrayed objects to their intended goal states requires the agent to comprehend the changes that have occurred in the scene and to reason about the process of these changes. To address this, we propose a novel perspective on the visual rearrangement task, drawing inspiration from the diffusion processes in molecular thermodynamics. We model the room shuffle and unshuffle stages as the forward and reverse processes of diffusion. In contrast to conventional methods that rely on scene modeling and differential comparisons, our approach provides insight into the intrinsic evolution process between the goal and initial states of the scene, which allows for a more reasonable rearrangement of objects through fine-grained and progressive denoising steps with high confidence. By analyzing the task objectives, we represent the scene via spatial distributions of objects and model the
818likely_noise
low
SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images
Remote Sensing & Earth / Remote Sensing
D. adjacent but useful contextpose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractEffectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands, providing a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimpl
819likely_noise
low
Affostruction: 3D Affordance Grounding with Generative Reconstruction
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mappingweak or indirect keyword match
abstractThis paper addresses the problem of affordance grounding from RGBD images of an object, which aims to localize surface regions corresponding to a text query that describes an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose a unified framework for affordance grounding and reconstruction, dubbed Affostruction, where affordance grounding actively combines with shape generation. In our approach, reconstructing complete geometry from partial observations enables affordance prediction on unobserved regions, while affordance heatmaps guide active view selection to improve reconstruction quality of functional regions. We make three core contributions: generative multi-view reconstruction via sparse voxel fusion that extrapolates unseen geometry while maintaining constant token complexity, flow-based affordance grounding that captures i
820likely_noise
low
AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextdepth_correspondence; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractAffordance learning is a complex challenge in many applications, where existing approaches primarily focus on the geometric structures, visual knowledge, and affordance labels of objects to determine interactable regions. However, extending this learning capability to a scene is significantly more complicated, as incorporating object- and scene-level semantics is not straightforward; for example, 3D instance identification often struggles with small, interactable, functional parts (i.e., knobs, handles, etc.). In this work, we introduce AffordBridge, a large-scale dataset with 291,637 functional interaction annotations across 685 high-resolution indoor scenes in the form of point clouds. Our affordance annotations are complemented by RGB images that are linked to the same instances within scenes. Building upon our dataset, we propose AffordMatcher, an affordance learning method that esta
821likely_noise
low
Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation
Remote Sensing & Earth / Remote Sensing
D. adjacent but useful contextpose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractRecent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV’s heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting their generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitl
822likely_noise
low
MV-TAP: Tracking Any Point in Multi-View Videos
Detection & Tracking / Tracking
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; data_benchmarkweak or indirect keyword match
abstractMulti-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to many applications. Point tracking serves as a key mechanism for capturing dynamic motion; however, conventional single-view approaches often fail due to the limited geometric information available in monocular video, which becomes a critical bottleneck for multi-view scenarios. In this work, we present MV-TAP, a robust point tracker that tracks query points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and cross-view attention to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tai
823likely_noise
low
Unsupervised Multi-agent and Single-agent Perception from Cooperative Views
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mappingkeyword noise pattern without direct reconstruction signal
abstractLiDAR-based multi-agent and single-agent perception has shown promising performance in environmental understanding for robots and automated vehicles. However, there is no existing method that simultaneously solves both multi-agent and single-agent perception in an unsupervised way. By sharing sensor data between multiple agents via communication, this paper discovers two key insights: 1) Improved point cloud density after data sharing from cooperative views could benefit unsupervised object classification, 2) The cooperative view of multiple agents can be used as unsupervised guidance for 3D object detection in the single view. Based on these two insights, we propose an Unsupervised Multi-agent and Single-agent (UMS) perception framework that leverages multi-agent cooperation without human annotations to simultaneously solve multi-agent and single-agent perc
824likely_noise
low
InternVideo-Next: Towards World-Understanding Video Models
Video & Motion / Video Understanding
D. adjacent but useful contextdynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractLarge-scale video–text pretraining achieves strong performance but depends on noisy, synthetic captions with limited semantic coverage, often overlooking implicit world knowledge such as object motion, 3D geometry, and physical cues. In contrast, masked video modeling (MVM) directly exploits spatiotemporal structures but trails text-supervised methods on general tasks. We find this gap arises from overlooked architectural issues: pixel-level reconstruction struggles with convergence and its low-level requirement often conflicts with semantics, while latent prediction often encourages shortcut learning. To address these, we disentangle the traditional encoder–decoder design into an Encoder–Predictor–Decoder (EPD) framework, where the predictor acts as a latent world model, and propose InternVideo-Next, a two-stage pretraining scheme that builds a semantically consistent yet detail-preservin
825likely_noise
low
Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextdynamic_4d; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractCollaborative perception (CP) enables data sharing among connected and autonomous vehicles (CAVs) to enhance driving safety. However, CP systems are vulnerable to adversarial attacks where malicious agents forge false objects via feature-level perturbations. Current defensive systems use threshold-based consensus verification by comparing collaborative and ego detection results. Yet, these defenses remain vulnerable to more sophisticated attack strategies that could exploit two critical weaknesses: (i) lack of robustness against attacks with systematic timing and target region optimization, and (ii) inadvertent disclosure of vulnerability knowledge through implicit confidence information in shared collaboration data. In this paper, we propose MVIG attack, a novel adaptive adversarial CP framework learning to capture vulnerability knowledge disclosed by different defensive CP systems from
826likely_noise
low
Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mappingweak or indirect keyword match
abstractWe propose a probabilistic discrepancy learning approach for roadside LiDAR scene completion (PDL). Conventional methods focus on object-level completion and scene completion from the ego-vehicle viewpoint. These methods struggle to cope with long-term or total occlusions caused by roadside sensors with fixed viewpoints. To address this issue, we compensate for occluded roadside point clouds by introducing external visual information. Specifically, our PDL is mainly divided into probabilistic pose discrepancy minimization and scene discrepancy learning. We employ probabilistic pose discrepancy minimization to correct noisy poses from vision-based detectors, while utilizing a diffusion model within scene discrepancy learning for robust full-scene completion. Furthermore, we introduce regional and global sampling discrepancy learning losses to achieve robust and efficient training. We conducte
827likely_noise
low
VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; surface_occupancy; robotics_mappingweak or indirect keyword match
abstractDespite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in the 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-pron
828likely_noise
low
Foca-VLA: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractEmbodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose Foca-VLA, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. Foca-VLA introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a cross-scale routing Mixture-of-Experts (MoE) with impedance control in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force-position regulation. To support learning and evaluation, we construct Foca-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling
829likely_noise
low
ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextdepth_correspondence; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractRecent progress in self- and weakly supervised occupancy estimation has largely relied on 2D projection or rendering-based supervision, which suffers from geometric inconsistencies and severe depth bleeding. We thus introduce ShelfOcc, a vision-only method that overcomes these limitations without relying on LiDAR. ShelfOcc brings supervision into native 3D space by generating metrically consistent semantic voxel labels from video, enabling true 3D supervision without any additional sensors or manual 3D annotations. While recent vision-based 3D geometry foundation models provide a promising source of prior knowledge, they do not work out of the box as a prediction due to sparse or noisy and inconsistent geometry, especially in dynamic driving scenes. Our method introduces a dedicated framework that mitigates these issues by filtering and accumulating static geometry consistently across frames
830likely_noise
low
Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractWe contend that embodied learning is fundamentally a lifecycle problem rather than a single-stage optimization. Systems that optimize only one link (data collection, simulation, learning, or deployment) rarely sustain improvement or generalize beyond narrow settings. We introduce Arcadia, a closed-loop framework that operationalizes embodied lifelong learning by tightly coupling four stages: (1) Self-evolving exploration and grounding for autonomous data acquisition in physical environments, (2) Generative scene reconstruction and augmentation for realistic and extensible scene creation, (3) a Shared embodied representation architecture that unifies navigation and manipulation within a single multimodal backbone, and (4) Sim-from-real evaluation and evolution that closes the feedback loop through simulation-based adaptation. This coupling is non-decomposable: removing any stage breaks th
831likely_noise
low
General Process Reward Modeling for Robotic Reinforcement Learning
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractThe primary obstacle for applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recent learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these, we introduce Robo-Dopamine, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a vast 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Pe
832likely_noise
low
GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextdepth_correspondence; surface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractDriving planning is a critical component of end-to-end (E2E) autonomous driving. However, prevailing Imitative E2E Planners often suffer from multimodal trajectory mode collapse, failing to produce diverse trajectory proposals. Meanwhile, Generative E2E Planners struggle to incorporate crucial safety and physical constraints directly into the generative process, necessitating an additional optimization stage to refine their outputs. In this paper, we propose GuideFlow, a novel planning framework that leverages Constrained Flow Matching. Concretely, GuideFlow explicitly models the flow matching process, which inherently mitigates mode collapse and allows for flexible guidance from various conditioning signals. Our core contribution lies in directly enforcing explicit constraints within the flow matching generation process, rather than relying on implicit constraint encoding. Crucially, Gu
833likely_noise
low
MER-Tracker: Towards High-Speed 3D Point Tracking via Multi-View Event-RGB Hybrid Cameras
Detection & Tracking / Tracking
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; data_benchmarkweak or indirect keyword match
abstractThis paper proposes the first task for high-speed 3D point tracking using multi-view Event-RGB hybrid cameras. We design a cuboid observation device comprising 4 RGB cameras (30fps) and 2 Event cameras to synchronously capture high-speed motions, and propose MER-Tracker, a high-frame-rate 3D point-tracking network that fuses the complementary strengths of dual modalities. We first respectively extract 2D motion-change features from the RGB and Event modalities, then apply linear interpolation and anchor sampling to fuse the discrete RGB 3D features and continuous Event 3D features after 3D lifting, and finally employ a LoRA-tuned Transformer based on temporal correlation to predict the high-frame-rate 3D point trajectories over fast motions, accomplishing high-speed 3D point tracking. To verify the effectiveness of our method, we construct both real-world and simulated high-speed mot
834likely_noise
low
Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextgeneral_reconstruction; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractVision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their real-world deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images—a standard setup in AD systems that utilize six or even more synchronized cameras to perceive the environment comprehensively. This overhead stems from the large number of visual tokens generated during encoding, which significantly increases inference latency and memory consumption when passed to large language models, owing to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework specifically designed for multi-vi
835likely_noise
low
SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition
Video & Motion / Video Understanding
D. adjacent but useful contextgeneral_reconstruction; dynamic_4d; data_benchmarkweak or indirect keyword match
abstractEvent camera action recognition (EAR) offers compelling privacy-protecting and efficiency advantages, where temporal motion dynamics are of great importance. Existing spatiotemporal multi-view representation learning (SMVRL) methods for event-based object recognition (EOR) offer promising solutions by projecting $H$-$W$-$T$ events along the spatial axes $H$ and $W$, yet are limited by their translation-variant spatial binning representation and naive early concatenation fusion architecture. This paper reexamines the key SMVRL design stages for EAR and proposes: (i) a principled spatiotemporal multi-view representation through translation-invariant dense conversion of sparse events, (ii) a dual-branch, dynamic fusion architecture that models sample-wise complementarity between motion features from different views, and (iii) a bio-inspired temporal warping augmentation that mimics speed variabili
836likely_noise
low
AeroAgent: A Vision–Physics–Decision Framework for Aerodynamic Vehicle Design
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextsurface_occupancy; robotics_mapping; generation_editing; data_benchmarkweak or indirect keyword match
abstractModern generative models can propose striking 3D vehicle shapes from text and images, but turning these sketches into aerodynamically efficient, regulation-compliant designs still requires weeks of high-fidelity computational fluid dynamics (CFD) and manual iteration. As a result, fast 3D generation without trustworthy physics in the loop does little to reduce end-to-end design time. We study how an AI agent can close this loop under a strict CFD budget. We introduce AeroAgent, a vision–physics–decision framework built around a single 3D, editable surface representation for vehicle shapes. A vision module turns text and 2D references into diverse, standardized 3D candidates and supports image-level edits. A physics module, AeroFormer, is a geometry-guided Transformer surrogate trained on a large-scale vehicle aerodynamics dataset of roughly 50k CFD simulations; three task-specific heads predict
837likely_noise
low
Instance-level Visual Active Tracking with Occlusion-Aware Planning
Detection & Tracking / Tracking
D. adjacent but useful contextgeneral_reconstruction; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractVisual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-
838likely_noise
low
Learning to Act Robustly with View-Invariant Latent Actions
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractVision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences. Experiments in both simulation and the real world show tha
839likely_noise
low
Convexity-Aware Noise Calibration: A Self-Supervised Framework for Noise-Level-Unknown Image Denoising
Low-level Vision / Restoration
D. adjacent but useful contextgaussian_radiance; pose_calibration_localization; data_benchmarkweak or indirect keyword match
abstractImage denoising is a fundamental task in computer vision aimed at recovering clean images from noise-corrupted observations. While supervised deep learning methods achieve remarkable performance when trained on paired data with known noise levels, their real-world applicability is limited as noise characteristics are often unknown. Existing unsupervised techniques, such as blind-spot networks or methods based on statistical estimation, either compromise performance due to information loss or suffer from inaccuracies in noise level estimation. To address these challenges, we propose a novel two-stage self-supervised denoising framework that first accurately estimates the noise level directly from noisy images, without requiring clean references or prior noise knowledge. Building upon theoretical insights from Noisier2Noise, we rigorously derive a relationship between the noise level and t
840likely_noise
low
Multi-modal Test-time adaptation via Adaptive Probabilistic Gaussian Calibration
Robustness & Safety / Robustness
D. adjacent but useful contextgaussian_radiance; pose_calibration_localization; data_benchmarkweak or indirect keyword match
abstractMulti-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in the multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category-conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-mo
841likely_noise
low
Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgaussian_radiance; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractIn real-world robotic manipulation, states typically admit a neighborhood of near-equivalent actions. That is, for each state, there exists a feasible action neighborhood (FAN) rather than a single correct action, within which motions yield indistinguishable progress. However, prevalent VLA training methodologies are directly inherited from linguistic settings and do not exploit the FAN property, thus leading to poor generalization and low sample efficiency. To address this limitation, we introduce a FAN-guided regularizer that shapes the model's output distribution to align with the geometry of FAN. Concretely, we introduce a Gaussian prior that promotes locally smooth and unimodal predictions around the preferred direction and magnitude. In extensive experiments across both reinforced finetuning (RFT) and supervised finetuning (SFT), our method achieves significant improvement in sample
842likely_noise
low
EgoRoC: Towards Egocentric Robotic Control via Task-Agnostic Visual Alignment
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextpose_calibration_localization; surface_occupancy; robotics_mappingweak or indirect keyword match
abstractRecent Vision-Language-Action (VLA) models map visual-textual inputs to robotic actions via end-to-end architectures, yet this approach entangles visual understanding with task-specific actions. This leads to an exhaustive collection of full operational sequences and parameter redundancy across tasks, while generic third-person camera setups require fine-tuning for different hardware due to implicit hand-eye assumptions. We argue that decoupling how robots see from how robots act is a missing primitive in VLA systems. We present EgoRoC, a plug-and-play egocentric alignment head that precedes any task policy and exposes only a thin 6-DoF pose interface. EgoRoC establishes task-agnostic viewpoint consistency from a wrist-mounted (first-person) camera and then alternates alignment with manipulation, while a diffusion-based online hand–eye module corrects the actio
843likely_noise
low
OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextdepth_correspondence; surface_occupancy; robotics_mappingweak or indirect keyword match
abstractRobust 3D semantic occupancy is essential for legged and humanoid robots, yet most Semantic Scene Completion (SSC) systems are built for wheeled platforms with forward-facing sensors. We present OneOcc, a vision-only panoramic SSC framework tailored to severe body jitter and 360° continuity. OneOcc integrates four complementary modules: (i) Dual-Projection fusion (DP-ER), which jointly exploits the raw annular panorama and its equirectangular unfolding to preserve true 360° continuity while enabling grid-aligned feature extraction and seam-aware context; (ii) Bi-Grid Voxelization (BGV), which reasons in Cartesian and polar/cylindrical voxel spaces to reduce discretization bias and better align with panoramic geometry, yielding sharper free/occupied boundaries; (iii) a lightweight decoder with Hierarchical AMoE-3D fusion that d
844likely_noise
low
Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgaussian_radiance; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractRapidly-exploring random trees (RRTs) have been widely adopted for robot motion planning due to their robustness and theoretical guarantees. However, existing RRT-based planners require explicit goal configurations specified as numerical joint angles, while many practical applications provide goal specifications through visual observations such as images or demonstration videos where precise goal configurations are unavailable. In this paper, we propose visual-RRT (vRRT), a motion planner that enables visual-goal planning by unifying gradient-based exploitation from differentiable robot rendering with sampling-based exploration from RRTs. We further introduce (1) a frontier-based exploration-exploitation strategy that adaptively prioritizes visually promising search regions, and (2) inertial gradient tree expansion that inherits optimization states across tree branches for momentum-consi
845likely_noise
low
LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextdynamic_4d; surface_occupancy; robotics_mappingweak or indirect keyword match
abstractMillimeter-wave radar’s all-weather capability makes it increasingly vital for autonomous perception. However, the high cost of radar data collection drives the need for data generation to augment radar datasets. Existing works mainly target partial radar representations, e.g., 2D or 3D slices, leading to information loss and limited downstream performance. To overcome these issues, we introduce the novel task of LiDAR-to-4DRadar translation, which generates complete 4D radar tensors, with three spatial and one Doppler axes, guided by LiDAR data that preserve spatial and semantic consistency. We propose a novel diffusion bridge model in an aligned LiDAR-4DRadar latent space, namely L2RLDB, to tackle this task. Specifically, first, a key-voxel-aware VAE compresses high-dimensional, noisy radar tensors into a compact latent space, while enabling precise numerical reconstruction an
846likely_noise
low
Towards Human-Like Robot Handwriting via Contour-Aware Generation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextpose_calibration_localization; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractEmpowering machines to simulate human handwriting is a promising research direction. Most existing methods, however, primarily focus on reproducing the writing trajectory to capture the overall character structure, while neglecting the critical aspect of stroke contour modeling. Consequently, these methods struggle to generate visually realistic, human-like handwriting, limiting their applicability in scenarios such as calligraphy robots. To address this issue, we propose a new task, called Contour-aware Handwriting Trajectory Reconstruction (CHTR). This task presents two major challenges: 1) Existing handwriting datasets lack stroke contour annotations, making supervised learning difficult; 2) Previous methods are unable to recover stroke contour and preserve the overall character structure jointly. To address the dataset limitation, we present CHTR-110K, a large-scale character datase
847likely_noise
low
Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextpose_calibration_localization; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractAerial object-goal navigation (Aerial ObjectNav) requires an Unmanned Aerial Vehicle (UAV) to navigate to target objects in large-scale outdoor environments using only visual observations and high-level object descriptions, without detailed step-by-step instructions. Existing approaches rely on local observations or short-term history, lacking comprehensive scene understanding and efficient spatial exploration strategies, which constrains their navigation capability in complex aerial scenarios. To address these challenges, we propose OctMem-Agent, an octree memory-augmented framework for aerial object-goal navigation. Specifically, we introduce an Adaptive Octree Memory that incrementally aggregates RGB-D observations into a hierarchical 3D representation, capturing both explored regions and unexplored frontiers across large-scale aerial environments. We further propose an Instruction-Gu
848likely_noise
low
Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextpose_calibration_localization; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractThe adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient envi
849likely_noise
low
Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextsurface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractBimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap
850likely_noise
low
Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextsurface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractVision-Language-Action (VLA) models have enabled notable progress in general-purpose robotic manipulation, yet their learned policies often exhibit variable execution quality. We attribute this variability to the mixed-quality nature of human demonstrations, where the implicit principles that govern how actions should be carried out are only partially satisfied. To address this challenge, we introduce the LIBERO-Elegant benchmark with explicit criteria for evaluating execution quality. Using these criteria, we develop a decoupled refinement framework that improves execution quality without modifying or retraining the base VLA policy. We formalize Elegant Execution as the satisfaction of Implicit Task Constraints (ITCs) and train an Elegance Critic via offline Calibrated Q-Learning to estimate the expected quality of candidate actions. At inference time, a Just-in-Time Intervention (JITI)
851likely_noise
low
Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextsurface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractWhile existing equivariant methods enhance data efficiency, they suffer from high computational intensity, reliance on single-modality inputs, and instability when combined with fast-sampling methods. In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies. E3Flow overcomes these challenges, successfully unifying efficient rectified flow with stable, multi-modal equivariant learning for the first time. Our framework is built upon spherical harmonic representations to ensure rigorous SO(3) equivariance. We introduce a novel invariant Feature Enhancement Module (FEM) that dynamically fuses hybrid visual modalities (point clouds and images), injecting rich visual cues into the spherical harmonic features. We evaluate E3Flow on 8 manipulation tasks from the MimicGen benchmark and further conduct 4 real-world experiments to
852likely_noise
low
Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextpose_calibration_localization; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractRecent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-aware Virtual View Exploration (TVVE), a framework designed to overcome these challenges by integrating virtual view exploration with task-specific representation learning. TVVE employs an efficient exploration policy, accelerated by a novel pseudo-environment, to acquire informative views. Furthermore, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TVVE generates more complete and discriminative visual representations, demonstrating significan
853likely_noise
low
ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextsurface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractVision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings. Recent advancements have introduced explicit intermediary reasoning—such as subtask prediction (language) or goal image synthesis (vision)—to guide action generation. However, these intermediate reasoning steps are often indirect and inherently limited in their capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm where the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we p
854likely_noise
low
Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextsurface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractDespite strong results on recognition and segmentation, current 3D visual pre-training methods often underperform on robotic manipulation. We attribute this gap to two factors: the lack of state–action–state dynamics modeling and the unnecessary redundancy of explicit geometric reconstruction. We introduce AFRO, a scalable self-supervised framework that learns dynamics-aware 3D representations directly from point clouds without action or label supervision. AFRO casts state prediction as a generative diffusion process and jointly models forward and inverse dynamics in a shared latent space to capture causal transition structure. To prevent feature leakage in action learning, we employ feature differencing and inverse-consistency supervision, improving the quality and stability of visual features. When combined with Diffusion Policy for control, AFRO substantially increases manipulation su
855likely_noise
low
Contact-Aware Neural Dynamics
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextsurface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractHigh-fidelity physics simulation is essential for scalable robotic learning, but the sim-to-real gap persists, especially for tasks involving complex, dynamic, and discontinuous interactions like physical contacts. Explicit system identification, which tunes explicit simulator parameters, is often insufficient to align the intricate, high-dimensional, and state-dependent dynamics of the real world. To overcome this, we propose an implicit sim-to-real alignment framework that learns to directly align the simulator's dynamics with contact information. Our method treats the off-the-shelf simulator as a base prior and learns a contact-aware neural dynamics model to refine simulated states using real-world observations. We show that using tactile contact information from robotic hands can effectively model the non-smooth discontinuities inherent in contact-rich tasks, resulting in a neural dy
856likely_noise
low
Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextsurface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractVision-Language-Action (VLA) models built upon Chain-of-Thought (CoT) have achieved remarkable success in advancing general-purpose robotic agents, owing to their significant perceptual comprehension. Recently, since text-only CoT struggles to adequately capture scene details in complex spatial environments, a highly promising strategy involves leveraging visual priors to guide robotic action generation. Nevertheless, these strategies face two inherent challenges: (i) a modality gap between visual observations and low-level actions, and (ii) unstable training due to competing objectives between visual prediction and action generation. To address these challenges, we propose a Vision-Integrated Trajectory Alignment (VITA) framework that learns a shared discrete latent space for vision and action, enabling joint modeling of perception and motor control. VITA introduces an implicit visual CoT:
857likely_noise
low
Self-Attention Driven Tensor Representation for High-Order Data Recovery
Low-level Vision / Restoration
D. adjacent but useful contextsurface_occupancy; robotics_mapping; data_benchmarkweak or indirect keyword match
abstractLow-rank tensor representation (LRTR) is an effective tool for compactly modeling high-order data. While nonlinear LRTR models can better capture real-world nonlinear dependencies, most existing methods rely on fixed mappings of multilayer perceptrons (MLPs) or convolutional neural networks (CNNs), limiting their ability to model complex global dependencies. To overcome this limitation, we construct a novel paradigm called Self-Attention Driven Tensor Representation (SADTR), which is the first framework that models nonlinearity from the perspective of self-attention. Specifically, we design a factor self-representation mechanism to establish dynamic global mapping, thereby adaptively capturing both local and non-local nonlinear dependencies. Moreover, we introduce an implicit sparse representation to impose sparsity constraint while avoiding additional optimization problems. As a result,
858likely_noise
low
GaussianMatch: Semi-Supervised Regression with Pseudo-Label Filtering via Multi-View Gaussian Consistency
Learning Algorithms / Optimization
D. adjacent but useful contextgeneral_reconstruction; gaussian_radiancekeyword noise pattern without direct reconstruction signal
abstractSemi-Supervised Regression (SSR) is essential in domains like sentiment analysis, healthcare, etc., where labeled data is limited but unlabeled data is plentiful. Despite its practical importance, SSR remains underexplored due to the lack of effective pseudo-labeling strategies for continuous outputs. Unlike classification, regression lacks inherent confidence measures, making it harder to filter and trust pseudo-labels. This limitation permits low-quality pseudo-labels to propagate during training without proper validation, significantly amplifying prediction errors in semi-supervised regression frameworks. In this work, we propose GaussianMatch, a novel SSR framework enabling high-quality pseudo-label filtering, which selects reliable pseudo-labels through multi-view prediction consistency under feature-space smoothness assumptions. Our framework introduces two key innovations: 1) Gau
859likely_noise
low
SpatialStack: Layered Geometry-Semantic Fusion for 3D VLM Spatial Reasoning
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextgeneral_reconstruction; robotics_mappingweak or indirect keyword match
abstractLarge vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local
860likely_noise
low
Tracking by Predicting 3-D Gaussians Over Time
Detection & Tracking / Tracking
D. adjacent but useful contextgaussian_radiance; robotics_mappingweak or indirect keyword match
abstractWe propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pre-training a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches.
861likely_noise
low
Taming Noise-Induced Prototype Degradation for Privacy-Preserving Personalized Federated Fine-Tuning
Robustness & Safety / Safety
D. adjacent but useful contextgaussian_radiance; pose_calibration_localizationweak or indirect keyword match
abstractPrototype-based Personalized Federated Learning (ProtoPFL) enables efficient cross-domain adaptation by communicating compact class prototypes, but directly sharing prototypes raises privacy risks. A common defense involves per-example $\ell_2$ clipping before prototype computation to limit sensitivity, followed by the addition of isotropic Gaussian noise during upload to enforce Local Differential Privacy (LDP). However, this Isotropic Gaussian Prototype Perturbation (IGPP) often over-perturbs key discriminative dimensions and struggles to balance the clipping threshold with representation fidelity. We propose VPDR, a client-side privacy plug-in that can be seamlessly integrated into existing ProtoPFL frameworks. Motivated by the statistical prior that dimension-wise class variance reflects discriminability, we introduce Variance-adaptive Prototype Perturbation (VPP), which uses groupwi
862likely_noise
low
BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird’s-Eye View Images
Autonomous Driving / Autonomous Driving
D. adjacent but useful contextpose_calibration_localization; robotics_mappingweak or indirect keyword match
abstractWe present BEV-SLD, a LiDAR global localization method building on the Scene Landmark Detection (SLD) concept. Unlike scene-agnostic pipelines, our new self-supervised approach leverages bird’s-eye-view (BEV) images to discover scene-specific patterns at a prescribed spatial density and treat them as landmarks. A consistency loss aligns a learnable set of global landmark coordinates with per-frame heatmaps, yielding consistent detection and reliable occurrence across the scene. Across campus, industrial, and forest environments, BEV-SLD delivers robust localization and outperforms state-of-the-art methods. Code and trained models will be released after publication.
863likely_noise
low
HQC-NBV: A Hybrid Quantum-Classical View Planning Approach
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextpose_calibration_localization; robotics_mappingweak or indirect keyword match
abstractEfficient view planning is a fundamental challenge in computer vision and robotic perception, critical for tasks ranging from search and rescue operations to autonomous navigation. While classical approaches, including sampling-based and deterministic methods, have shown promise in planning camera viewpoints for scene exploration, they often struggle with computational scalability and solution optimality in complex settings. This study introduces HQC-NBV, a hybrid quantum-classical framework for view planning that leverages quantum properties to efficiently explore the parameter space while maintaining robustness and scalability. We propose a specific Hamiltonian formulation with multi-component cost terms and a parameter-centric variational ansatz with bidirectional alternating entanglement patterns that capture the hierarchical dependencies between viewpoint parameters. Comprehensive e
864likely_noise
low
Test-Time Perturbation Tuning with Delayed Feedback for Vision-Language-Action Models
Robotics & Embodied AI / Embodied AI
D. adjacent but useful contextpose_calibration_localization; robotics_mappingweak or indirect keyword match
abstractVision-Language-Action models (VLAs) achieve strong performance in sequential decision-making but remain fragile to subtle environment shifts, such as small changes in object pose. We attribute this brittleness to trajectory overfitting, where VLAs over-attend to spurious cues and replicate memorized actions. We propose Perturbation learning with Delayed Feedback (PDF), a verifier-free test-time adaptation framework that improves decision performance without fine-tuning the base model. PDF mitigates spurious correlations through uncertainty-based data augmentation and action voting, while an adaptive scheduler allocates augmentation budgets to balance performance and efficiency. To further improve stability, PDF learns a lightweight perturbation module that retrospectively adjusts action logits guided by delayed feedback, correcting high-confidence errors. Experiments on LIBERO (+7.4% s