| 1 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation Generative Models / Video Generation | vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | video / temporal; panorama; multimodal / language | mesh / surface; editable / generative 3D | foundation/prior; robustness; dynamic; benchmark/data; editing/generation | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring a dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: by encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional con |
| 2 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation Autonomous Driving / Autonomous Driving | vggt_lineage; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | video / temporal; LiDAR / driving | Gaussian map; occupancy / voxel; radiance field / NVS; 4D scene | foundation/prior; dynamic; benchmark/data | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing self-supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consis |
| 3 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Dynamic Visual SLAM using a General 3D Prior Robotics & Embodied AI / Embodied AI | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping | single image; video / temporal; RGB-D / depth; multimodal / language | camera pose; depth / normals; mesh / surface; 4D scene | foundation/prior; unified pipeline; robustness; dynamic | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Reliable incremental estimation of camera poses and 3D reconstruction is key to enabling various applications including robotics, interactive visualization, and augmented reality. However, this task is particularly challenging in dynamic natural environments, where scene dynamics can severely deteriorate camera pose estimation accuracy. In this work, we propose a novel monocular visual SLAM system that can robustly estimate camera poses in dynamic scenes. To this end, we leverage the complementary strengths of geometric patch-based online bundle adjustment and recent feed-forward reconstruction models. Specifically, we propose a feed-forward reconstruction model to precisely filter out dynamic regions, while also utilizing its depth prediction to enhance the robustness of the patch-based visual SLAM. By aligning depth prediction with estimated patches from bundle adjustment, we robustly ha |
| 4 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving Autonomous Driving / Autonomous Driving | vggt_lineage; general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping | video / temporal; LiDAR / driving | point map / point cloud; Gaussian map; mesh / surface; 4D scene | unified pipeline; efficiency; robustness; dynamic | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Dynamic scene reconstruction in autonomous driving remains a fundamental challenge due to significant temporal variations, moving objects, and complex scene dynamics. Existing feed-forward 3D models have demonstrated strong performance in static reconstruction but still struggle to capture dynamic motion. To address these limitations, we propose DynamicVGGT, a unified feed-forward framework that extends VGGT from static 3D perception to dynamic 4D reconstruction. Our goal is to model point motion within feed-forward 3D models in a dynamic and temporally coherent manner. To this end, we jointly predict the current and future point maps within a shared reference coordinate system, allowing the model to implicitly learn dynamic point representations through temporal correspondence. To efficiently capture temporal dependencies, we introduce a Motion-aware Temporal Attention (MTA) module that |
| 5 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training Learning Algorithms / Self-supervised | vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy | multi-view images; multimodal / language | camera pose; mesh / surface; radiance field / NVS | foundation/prior | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Self-supervised pre-training has revolutionized foundation models for language, 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Exp |
| 6 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Emergent Extreme-View Geometry in 3D Foundation Models 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | RGB-D / depth | camera pose; depth / normals; point map / point cloud; mesh / surface | foundation/prior; benchmark/data | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: 3D foundation models (3DFMs) have recently transformed 3D vision, enabling joint prediction of depths, poses, and point maps directly from images. Yet their ability to reason under extreme, non-overlapping views remains largely unexplored. In this work, we study their internal representations and find that 3DFMs exhibit an emergent understanding of extreme-view geometry, despite never being trained for such conditions. To further enhance these capabilities, we introduce a lightweight alignment scheme that refines their internal 3D representation by tuning only a small subset of backbone bias terms, leaving all decoder heads frozen. This targeted adaptation substantially improves relative pose estimation under extreme viewpoints without degrading per-image depth or point quality. Additionally, we contribute MegaUnScene, a new benchmark of Internet scenes unseen by existing 3DFMs, with ded |
| 7 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Flow3r: Factored Flow Prediction for Visual Geometry Learning 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark | single image; RGB-D / depth | depth / normals; mesh / surface; 4D scene | scale; dynamic | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: We propose Flow3r, a scalable framework for visual geometry learning that leverages flow prediction to guide learning using unlabeled monocular videos. Current 3D/4D reconstruction systems primarily rely on dense geometry and pose supervision, and cannot easily generalize to diverse dynamic real-world scenes. In this work, we propose a mechanism to augment training directly from unlabeled videos, leveraging dense 2D correspondences (or ‘flow’) between arbitrary image pairs as supervision. Our key insight is that a factored flow prediction module that computes from two images using ‘geometry latents’ from one image and the ‘pose latent’ from the other can guide visual geometry learning. We first highlight the benefits and scalability of flow supervision in controlled settings and then leverage large-scale unlabeled data to improve off-the-shelf visual geometry models. We evaluate Flow3r ac |
| 8 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | FRM: Linear-Time 3D Reconstruction via Test-Time Training 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy | | point map / point cloud; mesh / surface; radiance field / NVS | unified pipeline; efficiency | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Feed-forward transformer models such as VGGT and $\pi^3$ are highly accurate, but their computational cost grows quadratically with the number of input images, making them slow to evaluate on large collections. More efficient approaches ameliorate this cost at the expense of reconstruction quality. We introduce Fast Reconstruction Model, a stateful feed-forward reconstruction model that uses a bidirectional architecture that scales linearly in the number of input views, while matching or surpassing the reconstruction quality of quadratic-time methods. FRM employs test-time training layers to compress images into a compact hidden scene state during a single forward pass, enabling our model to reconstruct 3D scenes at speeds up to 75 FPS on a single H100 GPU, over 20 times faster than SOTA methods such as VGGT. This hidden state also serves as an implicit scene representation which can be |
| 9 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance 3D Vision & Geometry / 3D Gaussian Splatting | vggt_lineage; general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy | multi-view images; multimodal / language | camera pose; point map / point cloud; Gaussian map; radiance field / NVS; editable / generative 3D | foundation/prior; editing/generation | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: 3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians remains a challenge without proper geometric priors. Existing methods have explored predicting point maps as geometric references for inferring Gaussian primitives, but unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain on novel views generated at non-preset camera poses iden |
| 10 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing | video / temporal; RGB-D / depth | camera pose; depth / normals; point map / point cloud; mesh / surface; editable / generative 3D | foundation/prior; unified pipeline; robustness; dynamic; editing/generation | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models. |
| 11 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Generalizable Sparse-View 3D Reconstruction from Unconstrained Images 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark | sparse multi-view; multi-view images; RGB-D / depth | depth / normals; mesh / surface | foundation/prior; unified pipeline; efficiency; dynamic; benchmark/data | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Reconstructing 3D scenes from sparse, unposed images remains challenging under real-world conditions with varying illumination and transient occlusions. Existing methods rely on scene-specific optimization with appearance embeddings or dynamic masks, requiring extensive per-scene training and failing under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed-forward framework for sparse-view outdoor reconstruction that requires no per-scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across div |
| 12 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction Autonomous Driving / Autonomous Driving | vggt_lineage; gaussian_radiance; depth_correspondence; surface_occupancy; robotics_mapping | single image; video / temporal; LiDAR / driving; RGB-D / depth | depth / normals; Gaussian map; mesh / surface; occupancy / voxel; radiance field / NVS | foundation/prior; unified pipeline; efficiency; dynamic | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric s |
| 13 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | GGPT: Geometry-Grounded Point Transformer 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy | multi-view images; multimodal / language | camera pose; point map / point cloud; mesh / surface | foundation/prior; unified pipeline; efficiency | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit sparse-geometry supervision using an optimised guidance encoding. Extensive experiments dem |
| 14 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Group Editing: Edit Multiple Images in One Go Generative Models / Image Editing | vggt_lineage; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | video / temporal | editable / generative 3D | foundation/prior; unified pipeline; scale; dynamic; benchmark/data; editing/generation | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. This task is particularly challenging because these images may vary significantly in pose, viewpoint, and spatial layout. Achieving coherent edits requires establishing reliable correspondences across the images, so that modifications can be applied accurately to semantically aligned regions. To address this, we propose GroupEditing, a novel framework that builds both explicit and implicit relationships among images within a group. On the explicit side, we extract geometric correspondences using VGGT, which provides spatial alignment based on visual features. On the implicit side, we reformulate the image group as a pseudo-video and leverage the temporal coherence priors learned by pre-trained video models to capture latent relationships. To effectively fuse these two t |
| 15 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | HTTM: Head-wise Temporal Token Merging for Faster VGGT 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy | video / temporal | camera pose; mesh / surface | efficiency; dynamic | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers' output, which hinders the model's representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the un |
| 16 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy | single image; video / temporal; RGB-D / depth | camera pose; depth / normals; point map / point cloud; mesh / surface; 4D scene | foundation/prior; unified pipeline; dynamic | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Recent feed-forward reconstruction models like VGGT and $\pi^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($Sim(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignmen |
| 17 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Learning 3D Reconstruction with Priors in Test Time 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy | multi-view images; RGB-D / depth | camera pose; depth / normals; point map / point cloud; mesh / surface | foundation/prior; unified pipeline | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks, without retraining or modifying the pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference. The optimization loss is composed of a self-supervised objective and prior penalty terms. The self-supervised objective is defined as the compatibility among multi-view predictions, implemented by the photometric or geometric loss between the renderings from other views and each view itself. Any available priors are turned into the penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method cons |
| 18 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos Autonomous Driving / Autonomous Driving | vggt_lineage; pose_calibration_localization; dynamic_4d; robotics_mapping; data_benchmark | single image; video / temporal; LiDAR / driving; multimodal / language | camera pose; point map / point cloud; 4D scene | foundation/prior; unified pipeline; scale; robustness; dynamic; benchmark/data | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals |
| 19 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark | RGB-D / depth; multimodal / language | depth / normals; mesh / surface | foundation/prior; scale; robustness; dynamic | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability. Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense sem |
| 20 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | MotionCrafter: Repurposing Video Generators for Dense Geometry and Motion Reconstruction 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; generation_editing | single image; video / temporal | point map / point cloud; mesh / surface; 4D scene; editable / generative 3D | foundation/prior; unified pipeline; dynamic; editing/generation | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: We introduce MotionCrafter, the first video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. To represent them effectively in latent space, we propose a 4D VAE that encodes point maps and scene flows as a unified latent compatible with pretrained video generators. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents, despite their fundamentally different distributions, we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in joint 4D geometry reconstruction and dense scene flow estimation, d |
| 21 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy | single image; multi-view images; RGB-D / depth; multimodal / language | camera pose; depth / normals; mesh / surface | foundation/prior; robustness | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, |
| 22 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; surface_occupancy; generation_editing; data_benchmark | multi-view images; panorama; multimodal / language | mesh / surface; editable / generative 3D | unified pipeline; efficiency; editing/generation | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete $360^\circ$ environments. To address these limitations, we design Pano3DComposer, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to Alignment-VGGT by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geom |
| 23 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | video / temporal; LiDAR / driving | point map / point cloud; mesh / surface; 4D scene | foundation/prior; unified pipeline; dynamic; benchmark/data | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem. Abstract: Understanding how the 3D world evolves over time is a fundamental task in computer vision, essential for embodied settings, autonomous driving, etc. It requires not only the reconstruction of the observed scene but also the anticipation of how the scene dynamics will unfold in the future. While the area of 3D reconstruction has progressed rapidly with the advent of recent feed-forward neural networks, forecasting future dynamics in 3D, given the 2D frames of a video, remains unexplored. We present Point4Cast, a unified framework that processes streaming 2D frame sequences of a video to estimate the past, present, and future of the underlying dynamic scene, in 3D. At the core of our approach lies a persistently evolving latent spacetime representation that models the environment’s evolution across time. Upon receiving a new 2D frame, an update operation integrates the incoming evide |
| 24 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; surface_occupancy; generation_editing; data_benchmark | single image; LiDAR / driving; multimodal / language | point map / point cloud; mesh / surface; editable / generative 3D | foundation/prior; unified pipeline; editing/generation | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem.abstractRecent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain, from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to learn global structural inpainting, is then used for inference with a st |
| 25 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | RGB-D / depth | camera pose; depth / normals; point map / point cloud; mesh / surface | unified pipeline; efficiency; robustness | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem.abstractEstimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camer |
| 26 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy | single image; multi-view images; RGB-D / depth | camera pose; depth / normals; point map / point cloud; mesh / surface | unified pipeline; scale | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem.abstractWith recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g. 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up. In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and sparse image correspondences given by zero-shot predictions of pretrained models. At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilita |
| 27 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Scal3R: Scalable Test-Time Training for Feed-forward Large-Scale 3D Reconstruction 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark | video / temporal; multimodal / language | camera pose; mesh / surface | foundation/prior; unified pipeline; efficiency; scale | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem.abstractThis paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lig |
| 28 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation Autonomous Driving / Autonomous Driving | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping | video / temporal; LiDAR / driving; RGB-D / depth | camera pose; depth / normals; point map / point cloud; mesh / surface | unified pipeline; efficiency | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem.abstractRecent advancements in feed-forward architectures for visual geometry estimation have achieved significant progress. Interestingly, per-frame visual geometry estimation approaches typically exhibit weaker multi-frame consistency but demonstrate superior per-frame accuracy compared to multi-frame algorithms. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals three key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furtherm |
| 29 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-view Indoor 3D Object Detection 3D Vision & Geometry / 3D Reconstruction | vggt_lineage; general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | multi-view images; RGB-D / depth; multimodal / language | camera pose; depth / normals; mesh / surface; editable / generative 3D | foundation/prior; dynamic; editing/generation | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem.abstractCurrent multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth) are available. The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key c |
| 30 | 100 | core trend paper A. thesis anchor: VGGT/feed-forward geometry | VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation Segmentation & Dense Prediction / Segmentation | vggt_lineage; general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark | | mesh / surface | foundation/prior; robustness; benchmark/data | Use as evidence that multi-view geometry is becoming a feed-forward foundation-model problem.abstractInstance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in t |
| 31 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy | single image; video / temporal | mesh / surface; 4D scene | dynamic | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractWe present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system e |
| 32 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark | single image; multi-view images; video / temporal; multimodal / language | mesh / surface; 4D scene | foundation/prior; robustness; dynamic; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractExisting hand-object interactions (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods of articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize object's metric scale and pose for grounding its normalized mesh i |
| 33 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Captain Safari: A Real-time World Engine 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | video / temporal; multimodal / language | mesh / surface; editable / generative 3D | efficiency; robustness; dynamic; benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractWorld engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dyn |
| 34 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark | single image; video / temporal; RGB-D / depth | depth / normals; mesh / surface; 4D scene | foundation/prior; robustness; dynamic; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractAccurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming a ground-truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refines them through a learned render-and |
| 35 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Catch Me if You Can: Active Mapping of Moving 3D Objects 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | video / temporal | mesh / surface; 4D scene | robustness; dynamic; benchmark/data | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractCurrent 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object's motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target's trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding. |
| 36 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Clone Deterministic 3D Worlds 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping | video / temporal | mesh / surface; editable / generative 3D | foundation/prior; dynamic; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractA world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future physical state of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. However, existing world models often focus on random generation of open worlds, but neglect the need for high-fidelity modeling of deterministic scenarios (such as fixed-map mazes and static space robot navigation). In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. 1) Through diagnostic experiment, we quantitatively demonstrate that high-fidelity cloning is feasible and the primary bottleneck for long-horizon fidelity is the geometric st |
| 37 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Complet4R: Geometric Complete 4D Reconstruction 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark | video / temporal; multimodal / language | mesh / surface; 4D scene | unified pipeline; dynamic; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractWe introduce Complet4R, a novel end-to-end framework for Geometric Complete 4D Reconstruction, which aims to recover temporally coherent and geometrically complete reconstruction for dynamic scenes. Our method formalizes the task of Geometric Complete 4D Reconstruction as a unified framework of reconstruction and completion, by directly accumulating full contexts onto each frame. Unlike previous approaches that rely on pairwise reconstruction or local motion estimation, Complet4R utilizes a decoder-only transformer to operate on all context globally, directly from sequential video input, reconstructing a complete geometry for every single time step, including occluded regions visible in other frames. Our method demonstrates state-of-the-art performance on our proposed benchmark for Geometric Complete 4D Reconstruction and the 3D point tracking task. Code will be released to support futur |
| 38 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Efficiently Reconstructing Dynamic Scenes one D4RT at a Time 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy | video / temporal; RGB-D / depth | depth / normals; mesh / surface; 4D scene | unified pipeline; efficiency; scale; dynamic | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractUnderstanding and reconstructing the complex geometry and motion of dynamic 4D scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward network designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our unified decoding interface allows the model to independently and efficiently probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state-of-the-art, outper |
| 39 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark | single image; multi-view images; video / temporal | point map / point cloud; mesh / surface; occupancy / voxel | efficiency; scale; robustness; dynamic | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractStrand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability. To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that |
| 40 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | ESAM++: Efficient Online 3D Perception on the Edge 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | video / temporal | point map / point cloud; mesh / surface | efficiency; scale; dynamic | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractOnline 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently capture |
| 41 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; generation_editing | video / temporal | mesh / surface; occupancy / voxel; editable / generative 3D | efficiency; dynamic; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractDiffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization pattern |
| 42 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy | video / temporal; RGB-D / depth; multimodal / language | depth / normals; mesh / surface; 4D scene | foundation/prior; dynamic | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractOne of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present a framework that learns pixel-accurate feature descriptors from videos, LILA. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps -- depth and motion -- estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We dem |
| 43 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; generation_editing | single image; multi-view images; video / temporal; multimodal / language | mesh / surface; 4D scene; editable / generative 3D | foundation/prior; unified pipeline; dynamic; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractSingle-image 3D human reconstruction holds significant promise due to its convenience and high demand in various applications. Previous methods have garnered tremendous progress by employing 2D multi-view diffusion models to generate auxiliary views as reconstruction priors, but they struggle with 3D inconsistencies and limited generalization capabilities. In this paper, we present FISHuman, which aims to generate fine-grained, high-fidelity, and content-wise diverse 3D humans from a single-view input, providing production-ready 3D assets. We propose an elaborately designed workflow that reconstructs dynamic 3D meshes from multi-view inconsistent guidance. Specifically, we adapt a dual-stream transformer-based video diffusion model to generate cross-modally aligned multi-view RGB and normal sequences. We find that naively employing static 3D reconstruction can lead to geometric distortio |
| 44 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | GeoWorld: Geometric World Models 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing | video / temporal | mesh / surface; editable / generative 3D | dynamic; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractEnergy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive expe |
| 45 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Inferring Compositional 4D Scenes without Ever Seeing One 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark | single image; video / temporal | mesh / surface; 4D scene | robustness; dynamic; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractScenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, a |
| 46 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; data_benchmark | single image; video / temporal | camera pose; mesh / surface; 4D scene | unified pipeline; efficiency; scale; robustness; dynamic | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractReconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruc |
| 47 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy | video / temporal | point map / point cloud; mesh / surface; 4D scene | foundation/prior; efficiency; dynamic | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractemporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast optimization method based on a novel preconditioned surface encoding that estimates coherent non-rigid deformations without sacrificing temporal stability or accuracy. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multi-layer percep |
| 48 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | NeuROK: Generative 4D Neural Object Kinematics 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | video / temporal | mesh / surface; 4D scene; editable / generative 3D | foundation/prior; scale; dynamic; benchmark/data; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractData-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics---realistic temporal deformations of static objects under various physical conditions---remains challenging and often ad hoc despite being critical for building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space of all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameteriza |
| 49 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Order Matters: 3D Shape Generation from Sequential VR Sketches 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; data_benchmark | video / temporal | mesh / surface; editable / generative 3D | foundation/prior; efficiency; dynamic; benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractVR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD software. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for 3D shape generation from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates ordered VR sketches from arbitrary shapes, (ii) a dataset comprising over 20k synthetic and 900 hand-drawn sketch–shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work and generalizes effectively from synthetic to real sketches with minimal supervision. All data and models will |
| 50 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | PatchScene: Patch-based Voxel Diffusion Model for Large-Scale Scene Completion 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | video / temporal; LiDAR / driving | mesh / surface; occupancy / voxel; editable / generative 3D | foundation/prior; unified pipeline; scale; dynamic; benchmark/data; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractWe propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-th |
| 51 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy | single image; multi-view images; video / temporal | mesh / surface; 4D scene | unified pipeline; robustness; dynamic | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractWe present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, c |
| 52 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | ReGenHOI: Unifying Reconstruction and Generation for 3D Human–Object Interaction Understanding 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; generation_editing | video / temporal | mesh / surface; editable / generative 3D | unified pipeline; dynamic; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractUnderstanding 3D human–object interaction (HOI) involves two highly-related abilities: reconstruction, which perceives observed geometry, and generation, which imagines plausible future interactions. However, most existing methods treat these abilities as separate tasks, limiting their capacity to capture the unified nature of human spatial reasoning. To address this, we propose a unified framework that bridges reconstruction and generation through a shared semantic–geometric reasoning space. Specifically, a 3D Contact Reasoning mechanism enables direct reasoning in 3D space, jointly modeling geometric structure and semantic relationships, while a Reasoning Trace Refinement module iteratively refines contact predictions by integrating geometric and semantic cues. The framework builds a unified latent representation via explicit reasoning on human–object contact regions. To further enhanc |
| 53 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy | single image; video / temporal | camera pose; mesh / surface | unified pipeline; efficiency; dynamic | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractVisual SLAM is one of the most fundamental problems in computer vision, with direct applications to real-time localization tasks such as AR/VR, robotics, and 3D scene reconstruction. Although significant progress has been made in both sparse and dense approaches, real-time monocular SLAM remains challenging—particularly in the uncalibrated setting, where existing methods are often inefficient and lack modularity. In this paper, we present a new visual SLAM pipeline implemented from scratch in C++ that explicitly leverages the spatio-temporal structure of the scene for improved localization, and is designed to be modular so that off-the-shelf components can be easily integrated. We introduce a temporal representation based on a buffer of recent keyframes that preserves short-term scene continuity. To complement this, we incorporate a spatial representation based on a 3D cell-based scene m |
| 54 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Spatia: Video Generation with Updatable Spatial Memory 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; dynamic_4d; surface_occupancy; generation_editing | video / temporal | point map / point cloud; mesh / surface; editable / generative 3D | scale; dynamic; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractExisting video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory–aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic–static disentanglement design enhances spatial consistency throughout the generation process while preserving the model’s ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation. |
| 55 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | single image; video / temporal | mesh / surface; editable / generative 3D | foundation/prior; unified pipeline; efficiency; scale; dynamic; benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractThe growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present **StereoWorld**, an **end-to-end framework** that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a **geometry-aware regularization** to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a **high-definition stereo video dataset** containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generat |
| 56 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Tokenizing Vector Animation for Autoregressive Generation 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | video / temporal; multimodal / language | mesh / surface; editable / generative 3D | foundation/prior; scale; dynamic; benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractDespite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometr |
| 57 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Towards Visual Query Localization in the 3D World 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark | video / temporal; RGB-D / depth; multimodal / language | camera pose; depth / normals; point map / point cloud; mesh / surface | dynamic; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractVisual query localization (VQL) aims to predict a spatial-temporal response of the most recent occurrence from a sequence given a query. Currently, most research focuses on visual query localization from 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities including point clouds (PC), RGB and depth images to support flexible research. To ensure high-quality annotation, each sequence is manually annotated with multiple rounds of verification and refinement. To our best knowledge, 3DVQL is the first benchmark towards 3D multimodal visual query loc |
| 58 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | VABench: A Comprehensive Benchmark for Audio-Video Generation 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | video / temporal; multimodal / language | mesh / surface; editable / generative 3D | dynamic; benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractRecent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, |
| 59 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | Vista4D: Video Reshooting with 4D Point Clouds 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark | multi-view images; video / temporal; RGB-D / depth | depth / normals; point map / point cloud; mesh / surface; 4D scene | robustness; dynamic | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractWe present **Vista4D**, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quali |
| 60 | 100 | core trend paper A. thesis anchor: dynamic/4D recon | WorldStereo: Bridging Controllable Video Generation and Scene Reconstruction via 3D Geometric Memories 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing | multi-view images; video / temporal; panorama | point map / point cloud; mesh / surface; editable / generative 3D | foundation/prior; dynamic; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractRecent advances in foundational Video Diffusion Models (VDMs) have yielded significant progress. Yet, despite the remarkable visual quality of generated videos, reconstructing consistent 3D scenes from these outputs remains challenging, due to limited camera controllability and inconsistent generated content when viewed from distinct camera trajectories. In this paper, we propose WorldStereo, a novel framework that bridges camera-guided video generation and 3D reconstruction via two dedicated geometric memory modules. Formally, the global-geometric memory enables precise camera control while injecting coarse structural priors through incrementally updated point clouds. Moreover, the spatial-stereo memory constrains the model's attention receptive fields with 3D correspondence to focus on fine-grained details from the memory bank. These components enable WorldStereo to generate multi-view-co |
| 61 | 100 | core trend paper A. thesis anchor: representation shift | 3D Gaussian Splatting with Self-Constrained Prior for High Fidelity Surface Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark | RGB-D / depth | depth / normals; Gaussian map; mesh / surface; radiance field / NVS | foundation/prior; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractRendering 3D surfaces has been revolutionized within the modeling of radiance fields through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constraining the movement of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is a TSDF grid fused by the rendered depth during the learning of 3D Gaussians. The prior measures a band on both sides of the estimated surface for imposing more specific constraints on the right 3D Gaussians, such as removing 3D Gaussians outside the band, encouraging larger opacity for Gaussians near the center of the band or smaller opacity for Gaussians near the boundary of the band. We regularly update the prior by fusing more rec |
| 62 | 100 | core trend paper A. thesis anchor: representation shift | 4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark | single image; multi-view images; video / temporal | Gaussian map; mesh / surface; 4D scene | unified pipeline; scale; dynamic; benchmark/data | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstract4D reconstruction of equine family (e.g. horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPose |
| 63 | 100 | core trend paper A. thesis anchor: representation shift | ActivePolicy: Active Gaussian Reconstruction and Optimization Strategy Based on Global-Local Information Gain 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy | sparse multi-view; RGB-D / depth; multimodal / language | depth / normals; Gaussian map; mesh / surface; radiance field / NVS; 4D scene | foundation/prior; dynamic | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractActive 3D Gaussian reconstruction achieves superior completeness and rendering quality by intelligently selecting viewpoints. However, existing methods suffer from two critical limitations: information gain metrics that prioritize geometric coverage while ignoring rendering quality, and overfitting to sparse view configurations that degrades novel view synthesis. We introduce ActivePolicy, a novel framework addressing both challenges through principled NBV selection and regularization. We propose **GL-Graph**, a graph-theoretic strategy that unifies geometric consistency, rendering quality, and observation redundancy into a single stability criterion. To counteract overfitting, we introduce **4D-Reg**, which identifies floaters through manifold discrepancies among three depth types (R-Depth, $\alpha$-Depth, C-Depth) and suppresses them via adaptive dropout. Extensive experiment |
| 64 | 100 | core trend paper A. thesis anchor: representation shift | AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy | single image; video / temporal; multimodal / language | camera pose; Gaussian map; mesh / surface; radiance field / NVS; 4D scene | unified pipeline; dynamic | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractMonocular UAV videos pose a fundamental challenge for 3D reconstruction: dynamic scene modeling requires accurate camera poses, yet recovering poses from long UAV trajectories often fails under texture-sparse regions and moving objects. Existing approaches typically handle either pose-free static reconstruction or dynamic reconstruction with known poses, but jointly solving both from casual aerial footage remains difficult due to motion coupling and severe scale variation. We introduce AeroGS, a scale-aware Gaussian splatting framework that jointly recovers camera trajectories and reconstructs dynamic scenes from pose-free monocular videos. Central to our method are scale-aware spatio-temporal anchors (S$^2$A-Anchors), which enable a unified optimization via three key decoupling mechanisms: (i) separating ego-motion from object motion, (ii) isolating static geometry from temporal deformat |
| 65 | 100 | core trend paper A. thesis anchor: representation shift | AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark | multi-view images; RGB-D / depth | depth / normals; point map / point cloud; Gaussian map; radiance field / NVS | foundation/prior; unified pipeline; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractScene-level 3D reconstruction has attracted increasing attention, and feed-forward 3D Gaussian Splatting (3DGS) has emerged as a promising paradigm for novel view synthesis. However, most existing methods adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, making the number of Gaussians tightly coupled with the input images. This leads to several limitations: (i) reconstruction quality is sensitive to the quantity and viewpoint coverage of input images, often causing Gaussians to accumulate more densely in regions with frequent viewpoints; (ii) alignment errors become more pronounced under sparse-view conditions; and (iii) the lack of explicit geometric consistency can degrade depth estimation and downstream 3D tasks. In this paper, we propose AnchorSplat, a novel multi-view feed-forward 3DGS framework for scene-level reconstruction that departs from pixel-aligned |
| 66 | 100 | core trend paper A. thesis anchor: representation shift | BA-GS: Bayesian Adaptive Gaussian Splatting for SFM-Free 3D Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; data_benchmark | | Gaussian map; mesh / surface; radiance field / NVS | foundation/prior; dynamic; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstract3D Gaussian Splatting (3DGS) has demonstrated exceptional performance in reconstruction and novel view synthesis tasks. However, its reliance on Structure-from-Motion preprocessing may lead to degraded performance under sparse-view scenarios. Recent works attempt to address this limitation by leveraging pre-trained image matching models to generate Gaussian primitives but overlook the probabilistic uncertainty embedded in both the initial primitive distribution and iterative position updates. This uncertainty can accumulate and degrade reconstruction fidelity. Hence, we propose BA-GS, a Bayesian framework that models both the global distribution and local uncertainty of Gaussian primitives. At global initialization, a Variational Bayesian Gaussian Mixture Model (VB-GMM) models the latent distribution of primitives, capturing region-wise density and gradient patterns. At local refinement, |
| 67 | 100 | core trend paper A. thesis anchor: representation shift | ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark | multi-view images; video / temporal; multimodal / language | Gaussian map; mesh / surface; radiance field / NVS; 4D scene | efficiency; scale; dynamic | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractDynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with hig |
| 68 | 100 | core trend paper A. thesis anchor: representation shift | D-Prism: Differentiable Primitives for Structured Dynamic Modeling 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy | | Gaussian map; mesh / surface; 4D scene | dynamic | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractCapturing both geometry and rigid motion for structured dynamic objects, like multi-part assemblies or jointed mechanisms, remains a key challenge. Existing dynamic methods, such as deformable meshes or 3DGS, rely on unstructured representations and fail to jointly model suitable geometry and articulated motion. Primitive-based methods excel at structured static scenes, but their dynamic potential is still unexplored. We propose D-Prism, the first framework to achieve high-fidelity structured dynamic modeling by extending differentiable primitives to the dynamic domain. Specifically, we bind 3DGS to primitive surfaces, leveraging their respective strengths in appearance and geometry. We introduce a deformation network to control primitive motion, ensuring it accurately matches the object's movement. Furthermore, we design a novel adaptive control strategy to dynamically adjust primitive c |
| 69 | 100 | core trend paper A. thesis anchor: representation shift | Dark3R: Learning Structure from Motion in the Dark 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | multi-view images | camera pose; Gaussian map; mesh / surface; radiance field / NVS | foundation/prior; scale; robustness; benchmark/data | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstractWe introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below $-4$ dB—a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher–student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy–clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson–Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes $\sim$42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structur |
| 70 | 100 | core trend paper A. thesis anchor: representation shift | Disco-GS: Gaussian Splatting in Dynamic Color Lighting 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; data_benchmark | video / temporal | Gaussian map; mesh / surface; radiance field / NVS | foundation/prior; unified pipeline; robustness; dynamic; benchmark/data | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: Recent advances in Gaussian Splatting (GS) have significantly improved 3D scene reconstruction and novel view synthesis. However, most existing methods typically assume that training inputs are captured under stable lighting conditions and achromatic light. In contrast, scenes recorded under temporally varying color light, as in “disco lights” commonly seen in events, performances, and decorative settings, introduce severe ambiguities in both scene photometry and geometry. We propose Disco-GS, a framework that leverages GS for reconstructing the 3D scene while simultaneously recovering the underlying canonical appearance from videos captured under dynamic lighting conditions. Disco-GS estimates the effective per-pixel transient light, which, when applied to the canonical image, results in the observed color image of the scene, thereby enabling self-supervised learning. Disco-GS is an end |
| 71 | 100 | core trend paper A. thesis anchor: representation shift | E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; data_benchmark | video / temporal; RGB-D / depth | depth / normals; Gaussian map; radiance field / NVS; 4D scene | efficiency; robustness; dynamic | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: The emergence of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS) has advanced novel view synthesis (NVS). These methods, however, require high-quality RGB inputs and accurate corresponding poses, limiting robustness under real-world conditions such as fast camera motion or adverse lighting. Event cameras, which capture brightness changes at each pixel with high temporal resolution and wide dynamic range, enable precise sensing of dynamic scenes and offer a promising solution. However, existing event-based NVS methods still rely on known poses or depend on depth estimation models and auxiliary modalities such as RGB-D. We present E2EGS, a pose-free framework operating solely on event streams. Our key insight is that edge information provides rich structural cues essential for accurate trajectory estimation and high-quality NVS. To extract edges from noisy event streams, we |
| 72 | 100 | core trend paper A. thesis anchor: representation shift | Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow 3D Vision & Geometry / Pose Estimation | general_reconstruction; gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark | | camera pose; Gaussian map; radiance field / NVS | foundation/prior | Use as a bridge between reconstruction representation and metric pose/calibration reliability. Abstract: High-quality 3D scene representation in radiance fields relies on accurate camera poses which are often difficult to acquire in real-world scenarios. An effective solution is to use RGB images for the joint optimization of radiance fields and camera poses, an approach that has been well explored in NeRF series methods. However, unlike NeRF, joint optimization in 3D Gaussian Splatting (3DGS) often requires additional regularization or prior spatial knowledge to reach comparable performance. To eliminate these dependencies, we introduce Energy-GS, a pose-aware Gaussian splatting framework that jointly optimizes scene representation and camera poses using only RGB images. We observe that pose gradients in joint optimization are unstable due to the point-based rendering mechanism. Furthermore, unlike NeRF’s spatial sampling framework that enables coarse-to-fine pose alignment, rasterization- |
| 73 | 100 | core trend paper A. thesis anchor: representation shift | FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark | video / temporal; RGB-D / depth | depth / normals; Gaussian map; mesh / surface; radiance field / NVS; 4D scene; editable / generative 3D | efficiency; robustness; dynamic; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: The demand for dynamic 3D assets in AR/VR has recently popularized Deformable Gaussian Splatting. However, traditional RGB cameras are limited in their ability to reconstruct high-speed scenes due to motion blur and low temporal resolution. While event cameras offer a promising alternative, reconstructing a complete scene from their sparse and noisy output is a significant challenge. Existing event-based methods rely on an auxiliary sensor, such as a frame camera, thereby inducing tedious hardware and calibration challenges. We introduce FastEventDGS, a novel Deformable Gaussian Splatting-based framework that leverages a single event camera for high-fidelity 4D reconstruction. Our method utilizes a continuous camera trajectory parametrization and integrates two event generation models to provide both photometric and geometric constraints. We further propose a local patch event motion loss |
| 74 | 100 | core trend paper A. thesis anchor: representation shift | FastGS: Training 3D Gaussian Splatting in 100 Seconds 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | multi-view images; video / temporal; panorama | Gaussian map; mesh / surface; radiance field / NVS; 4D scene | efficiency; scale; dynamic; benchmark/data | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.29× training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45× acceleration |
| 75 | 100 | core trend paper A. thesis anchor: representation shift | Feed-forward Gaussian Registration for Head Avatar Creation and Editing 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; generation_editing | multi-view images; multimodal / language | Gaussian map; mesh / surface; radiance field / NVS; editable / generative 3D | unified pipeline; efficiency; editing/generation | Use as a bridge between reconstruction representation and metric pose/calibration reliability. Abstract: We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. State-of-the-art multi-view head avatars require time-consuming head tracking, which is followed by an expensive avatar optimization, often resulting in a total creation time that exceeds one day. MATCH instead directly predicts Gaussian splat textures in correspondence from calibrated multi-view images in 0.5 seconds per frame. While the learned intra-subject correspondence across frames allows us to quickly build personalized head avatars, correspondence across subjects enables various applications such as expression transfer, optimization-free tracking, semantic editing, and identity interpolation. We learn to establish such correspondences end-to-end, with a transformer-based model that predicts textures of Gaussian |
| 76 | 100 | core trend paper A. thesis anchor: representation shift | GaussFusion: Improving 3D Reconstruction in the Wild with Geometry-Informed Video Generator 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing | video / temporal; RGB-D / depth | camera pose; depth / normals; Gaussian map; mesh / surface; radiance field / NVS; editable / generative 3D | foundation/prior; unified pipeline; efficiency; robustness; dynamic; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: We present GaussFusion, a novel approach for improving 3D Gaussian splatting (3DGS) reconstructions in the wild through geometry-informed video generation. GaussFusion mitigates common 3DGS artifacts, including floaters, flickering, and blur caused by camera pose errors, incomplete coverage, and noisy geometry initialization. Unlike prior RGB-based approaches limited to a single reconstruction pipeline, our method introduces a geometry-informed video-to-video generator that refines 3DGS renderings across both optimization-based and feed-forward methods. Given an existing reconstruction, we render a Gaussian primitive video buffer encoding depth, normals, opacity, and covariance, which the generator refines to produce temporally coherent, artifact-free frames. We further introduce an artifact synthesis pipeline that simulates diverse degradation patterns, ensuring robustness and generaliz |
| 77 | 100 | core trend paper A. thesis anchor: representation shift | Gaussian Mapping for Evolving Scenes 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | video / temporal; LiDAR / driving; RGB-D / depth | depth / normals; Gaussian map; mesh / surface; radiance field / NVS; 4D scene | scale; dynamic | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: Mapping systems with novel view synthesis (NVS) capabilities are widely used in computer vision, as well as in various applications, including augmented reality, robotics, and autonomous driving. Most notably, 3D Gaussian Splatting-based systems show high NVS performance; however, many current approaches are limited to static scenes. While recent works have begun addressing short-term dynamics (motion within the camera's view), long-term dynamics (the scene evolving through changes out of view) remain less explored. To overcome this limitation, we introduce a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest changes. In addition, since maintaining geometric and semantic consistency remains challenging due to stale observations disrupting the reconstruction process, we propose a novel keyframe management mechanism that discards outdate |
| 78 | 100 | core trend paper A. thesis anchor: representation shift | GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing | video / temporal | depth / normals; Gaussian map; mesh / surface; radiance field / NVS; 4D scene; editable / generative 3D | foundation/prior; efficiency; dynamic; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to distill strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are distilled into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of- |
| 79 | 100 | core trend paper A. thesis anchor: representation shift | Geometric-Photometric Event-based 3D Gaussian Ray Tracing 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; data_benchmark | video / temporal; RGB-D / depth | depth / normals; Gaussian map; radiance field / NVS | foundation/prior; dynamic; benchmark/data | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: Event cameras offer a high temporal resolution over traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in the event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves the state-of-the-art performance on the real-world datasets and competitive performance on the synthetic datasets. Also, the proposed method works without prior information (e.g., pretrained image reconstruction mod |
| 80 | 100 | core trend paper A. thesis anchor: representation shift | Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mapping | multi-view images; multimodal / language | Gaussian map; mesh / surface; radiance field / NVS | foundation/prior; unified pipeline; robustness | Use as a bridge between reconstruction representation and metric pose/calibration reliability. Abstract: Robust 3D representation learning forms the perceptual foundation of spatial intelligence, enabling downstream tasks in scene understanding and embodied AI. However, learning such representations directly from unposed multi-view images remains challenging. Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce UniSplat, a feed-forward framework designed to address these limitations through three complementary components. First, we propose a dual-masking strategy that strengthens geometry induction in the encoder. By masking both encoder and decoder tokens, and targeting decoder masks toward geometry-rich regions, the model is forced to infer structural informatio |
| 81 | 100 | core trend paper A. thesis anchor: representation shift | No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | RGB-D / depth | depth / normals; Gaussian map; mesh / surface; radiance field / NVS | foundation/prior; scale | Use as a bridge between reconstruction representation and metric pose/calibration reliability. Abstract: We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: getting aligned RGB-X data, where most RGB-X prior work assumes such pairs exist and focuses on modality fusion, but it empirically requires huge engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate them in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for X-sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning by a scalable solution that breaks through the bottlene |
| 82 | 100 | core trend paper A. thesis anchor: representation shift | Perceptual 3D Simulation With Physical World Modeling 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | video / temporal; multimodal / language | mesh / surface; radiance field / NVS; 4D scene | dynamic | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: Predicting how a scene will evolve after a desired 3D transformation from images is a central goal in vision, graphics, and robotics. Yet unlike ideal simulators with full access to 3D geometry and dynamics, real world systems must rely on perceptual inputs and local actions that are inherently partial and incomplete. In this work, we present P3Sim, a physical world modeling system that simulates future scene states under both partial observations and incomplete 3D transformation signals. P3Sim is composed of three interacting components: a learned physical world model, a geometric conditioning module, and a persistent scene memory. The world model interprets perception as probabilistic inference over multimodal scene variables, providing predictions of the distributions of any scene variable conditioned on any combination of others. The geometric conditioning module provides a partial 3 |
| 83 | 100 | core trend paper A. thesis anchor: representation shift | Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; gaussian_radiance; depth_correspondence; surface_occupancy; generation_editing; data_benchmark | multimodal / language | mesh / surface; radiance field / NVS; editable / generative 3D | foundation/prior; benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster. Abstract: Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich surface details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non-rigid motions of objects, and the limited precision of scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT-4o-Image model. Considering that the generated images can distort 3D structures due to their lack of multi-view consistency, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages percep |
| 84 | 100 | core trend paper A. thesis anchor: representation shift | Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | single image; multi-view images; panorama; RGB-D / depth | camera pose; depth / normals; Gaussian map; mesh / surface; radiance field / NVS | foundation/prior; efficiency; robustness | Use as a bridge between reconstruction representation and metric pose/calibration reliability. Abstract: Omnidirectional 3D Gaussian Splatting with panoramas is a key technique for 3D scene representation, and existing methods typically rely on slow SfM to provide camera poses and sparse points priors. In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. To achieve accurate camera pose estimation, we first construct a spherical consistency-aware pose estimation module, which recovers poses by establishing consistent 2D–3D correspondences between the reconstructed Gaussians and the unposed images using Gaussians' internal depth priors. Besides, to enhance the fidelity of novel view synthesis, we introduce a depth-inlier-aware densification module to extract depth inliers and Gaussian outliers with consistent monocular depth priors, enabling efficient Gaussian densification and achieving photoreali |
| 85 | 100 | core trend paper A. thesis anchor: representation shift | REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark | multi-view images; video / temporal; RGB-D / depth | depth / normals; Gaussian map; mesh / surface; radiance field / NVS | foundation/prior; dynamic | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consisten |
| 86 | 100 | core trend paper A. thesis anchor: representation shift | RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | video / temporal; LiDAR / driving; multimodal / language | Gaussian map; occupancy / voxel; radiance field / NVS; 4D scene | robustness; dynamic | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: Neural fields (NFs) have achieved remarkable success in scene reconstruction and novel view synthesis. However, existing NF approaches that rely on RGB or LiDAR inputs often struggle under adverse weather conditions, limiting their robustness in real-world outdoor environments such as autonomous driving. In contrast, millimeter-wave radar is inherently resilient to environmental variations, yet its integration with NFs remains largely underexplored. Moreover, outdoor driving scenes frequently involve dynamic objects, making spatiotemporal modeling crucial for temporally consistent novel view synthesis. To address these challenges, we present RF4D, a radar-based neural field framework tailored for novel view synthesis in outdoor dynamic scenes. RF4D explicitly incorporates temporal information into its representation, enabling more accurate modeling of object motion. A dedicated \textbf{s |
| 87 | 100 | core trend paper A. thesis anchor: representation shift | RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark | single image; video / temporal; RGB-D / depth; multimodal / language | depth / normals; mesh / surface; radiance field / NVS; 4D scene | foundation/prior; robustness; dynamic; benchmark/data; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: Reconstructing people, objects, and their interactions in 3D is a long-standing and fundamental goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a novel three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and |
| 88 | 100 | core trend paper A. thesis anchor: representation shift | Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | single image; multi-view images; video / temporal | mesh / surface; radiance field / NVS; 4D scene; editable / generative 3D | foundation/prior; scale; dynamic; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: Video Novel View Synthesis (VNVS) aims to render arbitrary novel viewpoints of dynamic scenes from a single-view video, but its algorithmic training faces a major challenge: the lack of large-scale multi-view video datasets. Prior methods often train on monocular data by framing it as an inpainting task, which typically leads to a train-inference gap and visual artifacts. While synthetic multi-view data can partially alleviate the data scarcity issue, its high acquisition costs and limited diversity restrict scalability. To address these problems, we propose Scaling4D, a novel strategy that theoretically avoids the train-inference gap while leveraging large-scale monocular videos for training. Specifically, we take a higher-level perspective on the problem, reformulating VNVS into a general correspondence-guided generation task. Furthermore, in conjunction with extensive real-world data, |
| 89 | 100 | core trend paper A. thesis anchor: representation shift | Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing | single image; multi-view images; video / temporal; RGB-D / depth | depth / normals; Gaussian map; mesh / surface; radiance field / NVS; editable / generative 3D | foundation/prior; robustness; dynamic; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised globa |
| 90 | 100 | core trend paper A. thesis anchor: representation shift | Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; generation_editing; data_benchmark | multi-view images; video / temporal; multimodal / language | Gaussian map; radiance field / NVS; 4D scene; editable / generative 3D | foundation/prior; dynamic; benchmark/data; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models. Abstract: Generating dynamic 4D objects from sparse inputs is difficult because it demands joint preservation of appearance and motion coherence across views and time while suppressing artifacts and temporal drift. We hypothesize that the view discrepancy arises from supervision limited to pixel- or latent-space video-diffusion losses, which lack explicitly temporally aware, feature-level tracking guidance. We present Track4DGen, a two-stage framework that couples a multi-view video diffusion model with a foundation point tracker and a hybrid 4D Gaussian Splatting (4D-GS) reconstructor. The central idea is to explicitly inject tracker-derived motion priors into intermediate feature representations for both multi-view video generation and 4D-GS. In Stage One, we enforce dense, feature-level point correspondences inside the diffusion generator, producing temporally consistent features that cur |
| 91 | 100 | core trend paper A. thesis anchor: representation shift | Unblur-SLAM: Dense Neural SLAM for Blurry Inputs 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark | multi-view images; RGB-D / depth; multimodal / language | camera pose; depth / normals; Gaussian map; mesh / surface; radiance field / NVS | unified pipeline | Use as a bridge between reconstruction representation and metric pose/calibration reliability. Abstract: We propose Unblur-SLAM, an RGB SLAM pipeline for sharp 3D reconstruction from blurred image inputs. In contrast to previous work, our approach is able to handle different types of blur and demonstrates state-of-the-art performance in the presence of both motion blur and defocus blur. Moreover, we adjust the computation effort with the amount of blur in the input image. As a first stage, our method uses a feed-forward image deblurring model for which we propose a suitable training scheme that can improve both tracking and mapping modules. Frames that are successfully deblurred by the feed-forward network obtain refined poses and depth through local-global multi-view optimization and loop closure. Frames that fail the first stage deblurring are directly modeled through the global 3DGS representation and an additional blur network to model multiple blurred sub-frames and simulate the blur for |
| 92 | 100 | core trend paper A. thesis anchor: representation shift | VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark | RGB-D / depth; multimodal / language | camera pose; Gaussian map; mesh / surface; radiance field / NVS | efficiency; robustness | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstractSimultaneous Localization and Mapping (SLAM) with 3D Gaussian Splatting (3DGS) enables fast, differentiable rendering and high-fidelity reconstruction across diverse real-world scenes. However, existing 3DGS-SLAM approaches handle measurement reliability implicitly, making pose estimation and global alignment susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties. To this end, we introduce VarSplat, an uncertainty-aware 3DGS-SLAM system that explicitly learns per-splat appearance variance. By using the law of total variance with alpha compositing, we then compute corresponding differentiable per-pixel uncertainty map. This variance map guides tracking, submap registration, and loop detection toward focusing on reliable regions and contributes to more stable optimization. Experimental results on Replica (synthetic) and TUM-RGBD, Sca |
| 93 | 100 | core trend paper A. thesis anchor: representation shift | VDFE: Difference-Aware 3D Scene Editing with Non-Intrusive Video Diffusion Priors for Multi-View Consistency and Efficiency 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; pose_calibration_localization; surface_occupancy; generation_editing; data_benchmark | multi-view images; video / temporal; multimodal / language | camera pose; Gaussian map; mesh / surface; radiance field / NVS; editable / generative 3D | foundation/prior; efficiency; benchmark/data; editing/generation | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstractText-driven 3D editing, enabled by advancements in 3D reconstruction techniques such as NeRF and 3D Gaussian Splatting, aims to provide intuitive scene customization. However, existing methods frequently exhibit limitations in controllability and consistency. To address these shortcomings, we propose VDFE, a difference-aware 3D scene editing method based on non-intrusive utilization of pre-trained video diffusion priors, which integrates Optimal Control Guided Flow Editing (FlowOCE), Decoupled Flow Difference (DFD), and Difference-Aware Gaussians Editing (DAGE). Specifically, FlowOCE treats the editing process as an optimal control problem, optimizing a noise-free editing trajectory to minimize unintended modifications in non-target regions; DFD precisely locates the editing region by analyzing flow differences, which supplies priors for the subsequent optimization process; and DAGE |
| 94 | 100 | core trend paper A. thesis anchor: representation shift | VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | video / temporal | point map / point cloud; Gaussian map; mesh / surface; occupancy / voxel; radiance field / NVS; 4D scene; editable / generative 3D | unified pipeline; scale; robustness; dynamic; benchmark/data; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractVideo world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretr |
| 95 | 100 | core trend paper A. thesis anchor: representation shift | Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image 3D Vision & Geometry / 3D Gaussian Splatting | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; data_benchmark | single image; RGB-D / depth | depth / normals; Gaussian map; mesh / surface; radiance field / NVS; 4D scene | foundation/prior; unified pipeline; efficiency; scale; dynamic | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractExisting single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow |
| 96 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; dynamic_4d; robotics_mapping; data_benchmark | video / temporal | Gaussian map; radiance field / NVS | unified pipeline; scale; robustness; dynamic; benchmark/data | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractMulti-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It is also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we re |
| 97 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | DROID-SLAM in the Wild 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmark | multi-view images | camera pose | foundation/prior; efficiency; robustness; dynamic | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractWe present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 8 FPS. The source code will be publicly |
| 98 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual–Inertial Odometry 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmark | | camera pose | foundation/prior; unified pipeline; efficiency; robustness; benchmark/data | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractVisual-Inertial Odometry (VIO) is a critical component for robust ego-motion estimation, enabling foundational capabilities such as autonomous navigation in robotics and real-time 6-DoF tracking for augmented reality. Existing methods face a well-known trade-off: filter-based approaches are efficient but prone to drift, while optimization-based methods, though accurate, rely on computationally prohibitive Visual-Inertial Bundle Adjustment (VIBA) that is difficult to run on resource-constrained platforms. Rather than removing VIBA altogether, we aim to reduce how often and how heavily it must be invoked. To this end, we cast two key design choices in modern VIO, when to run the visual frontend and how strongly to trust its output, as sequential decision problems, and solve them with lightweight reinforcement learning (RL) agents. Our framework introduces a lightweight, dual-pronged RL polic |
| 99 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark | single image | mesh / surface | unified pipeline; scale; robustness; benchmark/data | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractThe ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that, the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby achieving results that outperform the performance of optimization-based |
| 100 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | Haptic Neural Fields: Bringing Tactile Interactions to 3D Rendered Scenes 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping | | mesh / surface; editable / generative 3D | dynamic; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractWe address the problem of making 3D scenes interactive by asking: what would objects feel like if touched in a virtual environment? State-of-the-art 3D rendering methods provide compelling visual realism, but they fall short in modeling physical interactions, such as haptic feedback. We propose a framework that learns the correspondence between user actions and tactile responses, enabling the generation of touch-based signals directly from simulated interactions in 3D scenes. Our approach leverages a neural field representation conditioned on geometry and action to synthesize material-specific tactile signals. Experiments show that the generated signals reliably convey material properties and interaction dynamics. This paves the way toward interactive, touch-aware virtual environments with realistic haptic feedback. |
| 101 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; robotics_mapping; data_benchmark | multi-view images; LiDAR / driving | camera pose | foundation/prior; scale; benchmark/data | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractVisual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusi |
| 102 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | multimodal / language | mesh / surface; editable / generative 3D | benchmark/data; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractGenerating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must satisfy both stylistic diversity, fine-grained, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language–driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorea |
| 103 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | single image; multimodal / language | mesh / surface; editable / generative 3D | foundation/prior; efficiency; robustness; benchmark/data; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstract3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193×, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning a |
| 104 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | PromptDepth: Efficient and Promptable Geometric 3D Vision Model for Embodied Intelligence 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark | RGB-D / depth | depth / normals; mesh / surface | foundation/prior; unified pipeline; efficiency; benchmark/data | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractVision models for embodied intelligence require efficient 3D comprehension and interaction with objects within the scene. Existing 3D reconstruction models either overlook instance-level perception or rely on time-consuming offline reasoning, showing limited adaptability in real-time embodied scenarios. In this paper, we present PromptDepth, the first promptable vision model that features both geometric 3D understanding and instance-level interaction especially designed for embodied intelligence. PromptDepth is a feed-forward network that quickly yields panoptic, instanced, or tracked depth maps from two corresponding frames, enabling real-time inference on sequences from embodied agents. Specifically, following the minimal prediction problem, we design a promptable Dense Prediction Transformer, making it flexible to interact with unified dense prediction according to a specific prompt. Consid |
| 105 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | Representing 3D Faces with Learnable B-Spline Volumes 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping | single image; multi-view images | point map / point cloud; mesh / surface | unified pipeline | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractWe present CUBE (Control-based Unified B-Spline Encoding), a new geometric representation for digital humans that combines B-Spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-Spline representations that use 3D control points, CUBE is parametrized by a lattice (e.g., $8 \times 8 \times 8$) of high-dimensional control features, increasing the model's expressivity. These control features define a continuous mapping from a 3D parametric domain to 3D Euclidean space through an intermediate feature space, which is evaluated in two stages. First, high-dimensional control features are locally blended using the B-Spline bases, yielding a high-dimensional feature vector, where the first three values are the 3D coordinates of a coarse base mesh. This feature vector is input to a small MLP to pre |
| 106 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | SAGE: Scalable Agentic 3D Scene Generation for Embodied AI 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | multimodal / language | mesh / surface; editable / generative 3D | scale; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractReal-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simula |
| 107 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark | single image; video / temporal | camera pose; mesh / surface | foundation/prior; unified pipeline; efficiency; scale | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractMonocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustm |
| 108 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; surface_occupancy; robotics_mapping; data_benchmark | | camera pose; point map / point cloud | foundation/prior; benchmark/data | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractObject pose estimation is a key task for embodied robots, enabling them to interact with objects effectively. Category-level object pose estimation provides a way for robots to estimate the pose of unknown objects. However, estimating object pose from point clouds alone remains challenging. In this paper, we introduce SEGPose, a novel category-level object pose estimation method based on point clouds. Unlike previous methods, SEGPose leverages geometric, topological information, and SE(3)-equivariance, enhancing the network's accuracy in pose prediction. To utilize geometric and topological features, we propose a constraint-based feature extraction and 3D reconstruction method, enabling effective object shape reconstruction. We also design an SE(3)-equivariance feature prediction network to handle pose transformations consistently across viewpoints, improving pose accuracy. Experimental |
| 109 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | SMVRT: Implicit Human 3D Modeling Using Sparse Multi-view Volumetric Reconstruction with Transformer Fusion 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark | sparse multi-view; multi-view images; video / temporal | mesh / surface; occupancy / voxel | unified pipeline; robustness; benchmark/data | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractRecently, the community has witnessed significant progress in human modeling from a single view or multi-views, which often involves "guessing" the occluded parts using either generative models or template fitting. In this work, we address these challenges by exploring optimal fusion strategies from sparse views only. We propose an end-to-end implicit 3D reconstruction framework using a sparse multi-view setup. Specifically, we achieve this by exploring fusion blocks at three stages of the network. First, 2D feature encoders carrying out locally and globally, which produce enhanced features. Second, 3D feature grid, formed by attentional fusion of warped multi-view and multi-level 2D features, which follows 3D regularization of feature grids to aggregate spatially coherent multi-view features. Third, attentional 2D3D feature aggregation associated to query point generate enhanced latent |
| 110 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | StreamVLO: Streaming Visual–LiDAR Odometry with Cumulative Drift Compensation 3D Vision & Geometry / Pose Estimation | pose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mapping | video / temporal; LiDAR / driving; multimodal / language | camera pose; 4D scene | foundation/prior; unified pipeline; efficiency; dynamic | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractWe propose StreamVLO, a streaming visual–LiDAR odometry framework that performs unified spatio-temporal correlation with Mamba models and tackles the long-standing cumulative drift problem via an online Cumulative Drift Compensation scheme for localization in 4D dynamic environments. Specifically, StreamVLO introduces a unified spatio-temporal correlation module built on Mamba to fuse heterogeneous visual and LiDAR cues across multi-frame clips, overcoming the limited temporal exploration of prior pairwise methods. Furthermore, a Cumulative Drift Compensation module minimizes cumulative drift by iteratively learning residual corrections from multiple historical frames in a causal manner. To strengthen spatial feature representation on salient regions, we adopt a Keypoint-Aware Auxiliary Loss with a winner-takes-all strategy. StreamVLO achieves state-of-the-art performance on two commonly |
| 111 | 100 | important bridge paper B. bridge: reconstruction becomes mapping/world model | VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models 3D Vision & Geometry / Pose Estimation | pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; data_benchmark | LiDAR / driving; multimodal / language | camera pose; point map / point cloud | robustness; benchmark/data | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractText-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird’s-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mecha |
| 112 | 100 | core trend paper B. bridge: reconstruction becomes mapping/world model | Volumetric Functional Maps 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; surface_occupancy; robotics_mapping | RGB-D / depth; multimodal / language | depth / normals; mesh / surface | | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractThe computation of volumetric correspondences between 3D shapes is a prominent tool for medical and industrial applications. In this work, we pave the way for spectral volume mapping, extending for the first time the functional maps framework from the surface to the volumetric setting. We show that the eigenfunctions of the volumetric Laplace operator define a functional space that is suitable for high-quality signal transfer. We also experiment with various techniques that edit this functional space, porting them to volume domains. We validate our method on novel volumetric datasets and on tetrahedralizations of well-established surface datasets, also showcasing practical applications involving both discrete and continuous signal mapping, for segmentation transfer, mesh connectivity transfer, and solid texturing. Last but not least, we show that considering the volumetric spectrum great |
| 113 | 99 | core trend paper B. bridge: reconstruction becomes mapping/world model | Deep Feature Deformation Weights 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; surface_occupancy; robotics_mapping; generation_editing | | mesh / surface; editable / generative 3D | foundation/prior; efficiency; robustness; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractHandle-based mesh deformation has been a long-standing paradigm in computer graphics, enabling intuitive shape edits from sparse controls. Classic techniques offer precise and rapid deformation control. However, they solve an optimization problem with constraints defined by the choice of control handles, requiring a user to know a priori the ideal distribution of handles on the shape to accomplish the desired edit. The mapping from handle set to deformation behavior is often unintuitive and, importantly, non-semantic. Modern data-driven methods, on the other hand, leverage the data prior to obtain semantic edits, at the cost of fine-grained control and speed. We propose a technique that achieves the best of both worlds by leveraging the semantic prior of data and the precise control and speed of traditional frameworks. Our approach is surprisingly simple yet effective: deep feature proxim |
| 114 | 99 | core trend paper B. bridge: reconstruction becomes mapping/world model | HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; surface_occupancy; robotics_mapping; generation_editing | multimodal / language | mesh / surface; editable / generative 3D | efficiency; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractThe 3D layout generation and editing play a crucial role in Embodied AI and immersive VR interaction. However, manual creation requires extensive and tedious labor, while data-driven generation often lacks diversity. The emergence of large models introduces new possibilities for automatic 3D scene synthesis. We present HOG-Layout that enables text-driven hierarchical scene generation, optimization and real-time scene editing with large language models (LLMs) and vision-language models (VLMs). HOG-Layout improves scene semantic consistency and plausibility through retrieval-augmented generation (RAG) technology, incorporates an optimization module to enhance physical consistency, and adopts a hierarchical representation to enhance inference and optimization, achieving real-time editing. Experimental results demonstrate that HOG-Layout produces more reasonable environments compared with ex |
| 115 | 99 | core trend paper B. bridge: reconstruction becomes mapping/world model | OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; depth_correspondence; robotics_mapping; data_benchmark | | Gaussian map; radiance field / NVS | efficiency; robustness | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractOpen-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build a local consistency map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within the sliding window into complete instances. Subsequently, to update the global map, we construct explicit spatial attribute grids for the |
| 116 | 99 | core trend paper B. bridge: reconstruction becomes mapping/world model | SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark | multimodal / language | mesh / surface; editable / generative 3D | unified pipeline; efficiency; scale; dynamic; benchmark/data; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractRealistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer t |
| 117 | 99 | core trend paper B. bridge: reconstruction becomes mapping/world model | UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; surface_occupancy; robotics_mapping; data_benchmark | multi-view images; multimodal / language | mesh / surface; editable / generative 3D | foundation/prior; unified pipeline; efficiency; scale; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractWe present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based models in the second stage to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we bypass the limitations of UV mapping by introducing a Large Texturing Model (LTM) that directly regresses textures in a unified 3D functional space. Moreover, to enable more effective and complete supervision of LTM, we propose to extend surface-defined textures into a continuous volumetric field to serve as an advanced training objective, which we refer to as Texture Functions (TF). Finally, we develop an advanced LoRA-based strategy for efficiently adapting large-scale 2D Diffusion Transformers (DiTs) for high-qual |
| 118 | 93 | important bridge paper B. bridge: reconstruction becomes mapping/world model | Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs 3D Vision & Geometry / Pose Estimation | pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping | | camera pose; point map / point cloud | foundation/prior; unified pipeline | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractImage-to-point-cloud (I2P) registration aims to align 2D images with 3D point clouds by establishing reliable 2D-3D correspondences. The drastic modality gap between images and point clouds makes it challenging to learn features that are both discriminative and generalizable, leading to severe performance drops in unseen scenarios. We address this challenge by introducing a heterogeneous graph framework that jointly refines cross-modal features and correspondences within a unified architecture. The proposed graph represents a mapping between segmented 2D and 3D regions, which enhances cross-modal feature interaction and thus improves feature discriminability. In addition, modeling the consistency among vertices and edges within the graph enables pruning of unreliable correspondences. Building on these insights, we propose a heterogeneous graph embedded I2P registration method, termed Hg- |
| 119 | 92 | core trend paper B. bridge: reconstruction becomes mapping/world model | Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; dynamic_4d; robotics_mapping | video / temporal | Gaussian map; radiance field / NVS | unified pipeline; robustness; dynamic | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractNovel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We utilize NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above NeRF-rendered HDR pixel values with the sensor-recorded LD |
| 120 | 91 | important bridge paper B. bridge: reconstruction becomes mapping/world model | Reconstructing Functional 3D Scenes from Egocentric Interaction Videos 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; surface_occupancy; robotics_mapping | RGB-D / depth | mesh / surface | foundation/prior; robustness | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractWe present FunREC, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunREC operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunREC surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5$-$10$\times$ lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation |
| 121 | 100 | core trend paper B. bridge: representation meets metric pose | AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark | RGB-D / depth | camera pose; depth / normals; Gaussian map; radiance field / NVS | robustness; benchmark/data | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstract3D Gaussian splatting (3DGS) has emerged as a revolutionary scene representation in simultaneous localization and mapping (SLAM) research. However, existing research on 3DGS-based SLAM fails to accurately address the appearance variations induced by camera auto-exposure in prevalent real-world scenarios, resulting in reduced localization and photorealistic mapping accuracy. To address this issue, we propose a stereo auto-exposure-robust Gaussian splatting SLAM (AERGS-SLAM), a framework that is robust to such variations and enables both reliable localization and exposure-controlled photorealistic mapping. Our key contributions are twofold. Firstly, we propose a camera exposure network to model the camera exposure process, which we integrate with Gaussian splatting to achieve exposure-controlled novel view synthesis. Secondly, we exploit an illumination-robust geometric feature for localization a |
| 122 | 100 | core trend paper B. bridge: representation meets metric pose | DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis 3D Vision & Geometry / Pose Estimation | gaussian_radiance; pose_calibration_localization; dynamic_4d; data_benchmark | video / temporal | camera pose; radiance field / NVS; 4D scene; editable / generative 3D | dynamic; benchmark/data; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractImage alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aw |
| 123 | 100 | core trend paper B. bridge: representation meets metric pose | Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; pose_calibration_localization; depth_correspondence; dynamic_4d; robotics_mapping | video / temporal; RGB-D / depth | camera pose; depth / normals; Gaussian map; radiance field / NVS; 4D scene; editable / generative 3D | foundation/prior; efficiency; robustness; dynamic; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractHandling the dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling their temporal centers at keyframes. These centers are propagated usin |
| 124 | 100 | core trend paper B. bridge: representation meets metric pose | ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; pose_calibration_localization; robotics_mapping; data_benchmark | panorama | camera pose; Gaussian map; radiance field / NVS | unified pipeline; benchmark/data | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstractThis work presents ODGS-SLAM, an omnidirectional simultaneous localization and mapping (SLAM) system utilizing 3D Gaussian Splatting (3DGS) as the unified representation for tracking and mapping. Thus, it reconstructs scene geometry from panoramic image sequences (RGB or RGBD) via splats while also detecting the camera poses. Such a framework is important to understand the full surrounding, *e.g.*, for augmented reality applications or autonomous systems. We extended existing 3DGS-SLAM methods to handle omnidirectional input by including closed-form gradients for mapping and camera pose estimation, utilizing an equirectangular projection model. To lower memory footprint, a key frame removal procedure based on graph analysis is proposed, enabling the application to handle larger input sizes. For evaluation, we provide a data set of controlled real-world and synthetic test scenes (indoor and ou |
| 125 | 100 | core trend paper B. bridge: representation meets metric pose | SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark | RGB-D / depth | camera pose; depth / normals; Gaussian map; radiance field / NVS | benchmark/data | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstract3D Gaussian Splatting (3DGS) has made huge progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified for improving scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian function, and then use these points to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity. |
| 126 | 100 | core trend paper B. bridge: representation meets metric pose | ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting 3D Vision & Geometry / Pose Estimation | gaussian_radiance; pose_calibration_localization; depth_correspondence; robotics_mapping; data_benchmark | | camera pose; Gaussian map; radiance field / NVS | efficiency; robustness; benchmark/data | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstractVisual localization is a core technology for augmented reality and autonomous navigation. Recent methods combine the efficient rendering of 3D Gaussian Splatting (3DGS) with feature-based localization. These methods rely on direct matching between 2D query features and the 3D Gaussian feature field, but this often results in mismatches due to an inherent bias in the learned Gaussian feature. We theoretically analyze the feature learning process in 3DGS, revealing that the widely adopted $\alpha$-blending optimization inherently introduces bias into 3D point features. This bias stems from the entanglement between individual Gaussians and their neighboring Gaussians, making the learned features unsuitable for precise matching tasks. Motivated by these findings, we propose ULF-Loc, an unbiased landmark feature framework that replaces biased feature optimization with geometry-weighted featur |
| 127 | 96 | core trend paper B. bridge: representation meets metric pose | Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; pose_calibration_localization; depth_correspondence | RGB-D / depth | camera pose; depth / normals; Gaussian map; radiance field / NVS | foundation/prior; robustness | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstract3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address these uncertainties, we introduce a relocaliza |
| 128 | 94 | core trend paper B. bridge: representation meets metric pose | Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; pose_calibration_localization; depth_correspondence | | camera pose; Gaussian map; radiance field / NVS | efficiency; robustness | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstractVisual relocalization is a fundamental task in the field of 3D computer vision, estimating a camera’s pose when it revisits a previously known scene. While point-based hierarchical localization methods have shown strong scalability and efficiency, they are often limited by sparse image observations and weak feature matching. In this work, we propose SplatHLoc, a novel hierarchical visual relocalization framework that uses Feature Gaussian Splatting as the scene representation. For feature matching, we observe that Gaussian-rendered features and those extracted directly from images exhibit different strengths across the two-stage matching process: the former performs better in the coarse stage, while the latter proves more effective in the fine stage. Therefore, we introduce a hybrid feature matching strategy, enabling more accurate and efficient pose estimation. Extensive experiments on |
| 129 | 88 | important bridge paper B. bridge: representation meets metric pose | FMPose: 3D Pose Estimation via Flow Matching 3D Vision & Geometry / Pose Estimation | gaussian_radiance; pose_calibration_localization; depth_correspondence | single image; RGB-D / depth | camera pose; depth / normals; Gaussian map; editable / generative 3D | foundation/prior; efficiency; editing/generation | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstractMonocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have demonstrated strong performance, but their iterative denoising process typically requires many time steps for each prediction, making inference computationally expensive. In contrast, Flow Matching (FM) learns an ODE-based velocity field, enabling efficient generation of 3D pose samples with only a few integration steps. Inspired by this capability, we propose a novel generative pose estimation framework, FMPose, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned on 2D inputs. While the ODE trajectories are dete |
| 130 | 88 | important bridge paper B. bridge: representation meets metric pose | Uncertainty-Aware Modality Fusion for Unaligned RGB-T Salient Object Detection 3D Vision & Geometry / Pose Estimation | gaussian_radiance; pose_calibration_localization; surface_occupancy | | camera pose; Gaussian map | | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstractUnaligned RGB-T salient object detection (SOD) remains challenging due to severe cross-modal spatial discrepancies and unreliable feature fusion. Existing methods often assume perfect alignment or rely on geometric registration, which is computationally demanding and sensitive to cross-modal inconsistencies. To address these limitations, we propose an uncertainty-aware modality fusion network (UMFNet) that reformulates RGB-T SOD as an uncertainty-aware representation learning problem. Specifically, the proposed uncertainty alignment module (UAM) models pixel-wise features as Gaussian latent distributions to estimate local uncertainty and identify cross-modal consistency regions within the feature space, thereby achieving implicit alignment without explicit registration. Furthermore, the confidence-guided global modulation (CGM) mechanism leverages confidence maps derived from uncertainty |
| 131 | 87 | important bridge paper B. bridge: representation meets metric pose | GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; pose_calibration_localization; data_benchmark | multimodal / language | Gaussian map; radiance field / NVS | foundation/prior; robustness; benchmark/data | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstractIn this paper, we propose GenSplat, a novel approach for language comprehension in 3D Gaussian Splatting (3DGS). Unlike previous methods that either achieve cross-scene generalization by being bounded to a predefined vocabulary or handle free-form language by overfitting to individual scenes, GenSplat is robust to free-form language queries and generalizable across 3DGS scene representations. Our key insight for this problem is to formulate a structured learning process to progressively align linguistic concepts with 3D Gaussians. It contains two novel technical contributions. First, we propose a Progressive Language Grounding Curriculum that structurally guides the model through learning category-level semantics to instance-level concepts and free-form language, preventing overfitting by building a generalizable language feature space. Second, we design a Multi-modal Large Language Mode |
| 132 | 69 | important bridge paper B. bridge: representation meets metric pose | GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; pose_calibration_localization | multimodal / language | Gaussian map; radiance field / NVS | efficiency; dynamic | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstract3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS^2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing |
| 133 | 65 | specialized geometry paper B. bridge: representation meets metric pose | Landscape-Awareness for Geometric View Diffusion Model 3D Vision & Geometry / Pose Estimation | gaussian_radiance; pose_calibration_localization | | camera pose; radiance field / NVS | | Use as a bridge between reconstruction representation and metric pose/calibration reliability.abstractAccurate camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123, which synthesize novel views conditioned on relative viewpoint, and have demonstrated promising performance when repurposed for viewpoint estimation via optimization with MSE loss. However, existing methods often suffer from non-convex loss landscape with numerous local minima, which makes them sensitive to initialization and reliant on naïve multi-start strategies to achieve reasonable results. We analyze these optimization challenges and visualize failure cases, showing that ambiguities in object geometry, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization lands |
| 134 | 100 | core trend paper C. cluster representative | 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; surface_occupancy; generation_editing; data_benchmark | single image; multimodal / language | camera pose; mesh / surface; editable / generative 3D | foundation/prior; unified pipeline; efficiency; scale; robustness; benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractWe introduce 3D-Fixer, a novel generalizable and efficient scheme for single-image to compositional 3D scene generation. Unlike existing feed-forward frameworks that lack generalization ability in open-set scenarios due to the limited dataset, or divide-and-conquer frameworks that suffer from slow inference or accumulated registration errors during layout alignment, 3D-Fixer extends pre-trained object-level 3D generation priors to perform in-place completion on the single-view estimated geometry, eliminating the need for pose alignment while preserving feed-forward efficiency. At its core, 3D-Fixer introduces a coarse-to-fine scheme to accurately determine the completion boundary and generate high quality completion 3D asset based on the single-view estimated fragmented geometry. Also, we design a dual-branch conditioning network that integrates 2D and 3D contextual information to guide |
| 135 | 100 | core trend paper C. cluster representative | 3D-Object Perception Transformer (3PT) 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | multi-view images; RGB-D / depth | camera pose; depth / normals; mesh / surface | foundation/prior; scale; robustness | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractCurrent approaches to zero-shot 3D-object perception typically rely on ensembles of frozen foundation models. This limits deep object understanding and cross-domain generalization, making performance inadequate for real-world deployment. The 3D-Object Perception Transformer (3PT) addresses this limitation by unifying detection, segmentation, and 6DoF pose estimation in a single framework, directly trained for 3D-object perception. Based on two large-scale trained Transformers that specialize in 2D and 3D object-centric scene understanding respectively, 3PT continuously refines its object representations without depth input, enhancing 3D understanding by incorporating multi-view information. 3PT surpasses task-specialized models for detection and pose estimation, often achieving double-digit percentage improvements on the diverse BOP-benchmarks. Achieving high accuracy and robustness, \alg |
| 136 | 100 | core trend paper C. cluster representative | AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmark | single image; multi-view images; RGB-D / depth | camera pose; depth / normals | foundation/prior; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractSingle-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-fly rendered object features and obser |
| 137 | 100 | core trend paper C. cluster representative | AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | multi-view images; RGB-D / depth | camera pose; depth / normals; mesh / surface | foundation/prior; unified pipeline; scale | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractWe present AMB3R, a multi-view feed-forward model for dense 3D reconstruction on a metric-scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation, 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks. |
| 138 | 100 | core trend paper C. cluster representative | ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark | single image | mesh / surface | foundation/prior; scale; robustness; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractSymmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane's orientation. In this paper, we address these limitations by presenting the first framework for detecting *3D-grounded reflectional symmetries* from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchS |
| 139 | 100 | core trend paper C. cluster representative | CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmark | single image; multi-view images | camera pose | efficiency; scale | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractScene Coordinate Regression (SCR) has emerged as a memory-efficient paradigm for visual localization. While SCR has demonstrated performance comparable to classic feature matching based approaches in small-scale scenes, it has consistently underperformed in large-scale environments. Large-scale localization is hampered by two challenges: sparse co-visibility and local appearance ambiguity. In this work, we propose **CoLoR**, a novel training framework tailored for large-scale SCR. First, we explicitly and efficiently partition scene points into multi-view and single-view sets and introduce a two-stage bootstrapping paradigm to provide complete and strong supervision for all points. Second, we propose a multi-granularity retrieval feature, which unifies the conventional global and local features as retrieval-oriented representations at the image and pixel levels, respectively, to enforce featu |
| 140 | 100 | core trend paper C. cluster representative | Cupid: Generative 3D Reconstruction via Joint Object and Pose Modeling 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy | multi-view images | camera pose; mesh / surface | foundation/prior; unified pipeline; robustness | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractWe introduce Cupid, a generative 3D reconstruction framework that jointly models the full distribution over both canonical objects and camera poses. Our two-stage flow-based model first generates a coarse 3D structure and 2D-3D correspondences to estimate the camera pose robustly. Conditioned on this pose, a refinement stage injects pixel-aligned image features directly into the generative process, marrying the rich prior of a generative model with the geometric fidelity of reconstruction. This strategy achieves exceptional faithfulness, outperforming state-of-the-art reconstruction methods by over 3 dB PSNR and 10\% in Chamfer Distance. As a unified generative model that decouples the object and camera pose, Cupid naturally extends to multi-view and scene-level reconstruction tasks without requiring post-hoc optimization or fine-tuning. |
| 141 | 100 | core trend paper C. cluster representative | Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; dynamic_4d; generation_editing; data_benchmark | single image; video / temporal; multimodal / language | Gaussian map; radiance field / NVS; 4D scene; editable / generative 3D | foundation/prior; unified pipeline; efficiency; scale; dynamic; benchmark/data; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractWe introduce Diff4Splat, a feed-forward framework for dynamic scene generation from a single image. Our method synergizes the powerful generative priors of video diffusion models with geometric and motion constraints learned from a large-scale 4D dataset. Given a single image, a camera trajectory, and an optional text prompt, our model directly predicts a dynamic scene represented by a deformable 3D Gaussian field. This approach captures appearance, geometry, and motion in a single pass, eliminating the need for test-time optimization or post-hoc processing. At the core of our framework is a video latent transformer that enhances existing video diffusion models, enabling them to jointly model spatio-temporal dependencies and predict 3D Gaussian Primitives over time. Supervised by objectives targeting appearance fidelity, geometric accuracy, and motion consistency, Diff4Splat generates hi |
| 142 | 100 | core trend paper C. cluster representative | Edges Compete for Trust: Group Relative Edge Optimization for Building Reconstruction from Point Clouds 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark | | point map / point cloud; mesh / surface | scale; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractBuilding reconstruction aims to extract compact wireframes from point clouds. Recent edge-based methods achieve impressive results but suffer from sparse supervision from one-to-one matching, which leaves most edge proposals under-optimized. In this paper, we present Group Relative Edge Optimization (GREO), the first attempt to incentivize dense supervision across edge proposals through reinforcement learning-style optimization in wireframe reconstruction. Specifically, GREO computes edge-level rewards based on geometric alignment quality and transforms them into target confidence distributions via group-wise normalization. In addition, we incorporate entropy regularization to maintain distributional stability and prevent confidence collapse. This joint optimization enables dense and discriminative supervision across all edge proposals through cross-entropy minimization. Experiments |
| 143 | 100 | core trend paper C. cluster representative | Event6D: Event-based Novel Object 6D Pose Tracking 3D Vision & Geometry / Pose Estimation | pose_calibration_localization; depth_correspondence; dynamic_4d; data_benchmark | video / temporal; RGB-D / depth | camera pose; depth / normals; 4D scene | efficiency; scale; dynamic; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractEvent cameras provide microsecond latency, making them suitable for 6D object pose tracking in fast, dynamic scenes where conventional RGB and depth pipelines suffer from motion blur and large pixel displacements. We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. Conditioned on the most recent depth measurement, our dual reconstruction recovers dense photometric and geometric cues from sparse event streams. Our EventTrack6D operates at over 120 FPS and maintains temporal consistency under rapid motion. To support training and evaluation, we introduce a comprehensive benchmark suite: a large-scale synthetic dataset for training and two complementary evaluation sets, including real and simulated event datasets. Trained exclus |
| 144 | 100 | core trend paper C. cluster representative | Extend3D: Town-scale 3D Generation 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; surface_occupancy; generation_editing | single image; RGB-D / depth | depth / normals; point map / point cloud; mesh / surface; editable / generative 3D | foundation/prior; dynamic; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractIn this paper, we propose Extend3D, a novel training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces of object-centric models in representing wide scenes, we extend the latent space in $x$ and $y$ directions. Then, by dividing the extended latent into overlapping patches, we use the object-centric 3D generative model on each patch and couple them at each time step. Since object-centric models are sub-optimal for sub-scene generation, we use the input image and point cloud extracted from a depth estimator as priors to enable this process. Using the point cloud prior, we initialize the scene structure and refine the occluded region iteratively with under-noised SDEdit. Also, both priors are used to optimize the extended latent during the denoising process so that the denoisi |
| 145 | 100 | core trend paper C. cluster representative | Fast Markov Random Field Optimisation for Topologically Noisy 3D Shape Matching 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | | camera pose; mesh / surface | efficiency | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractIn many real world applications of non-rigid shape matching, the shapes are subject to topological noise (i.e. varying genus). In this paper, we propose a novel formulation based on Markov Random Fields (MRF) that can handle these cases with topological noise. The solutions to our optimisation problem can be approximated efficiently using the alpha expansion algorithm, which gives rise to theoretical approximation guarantees. In particular, we cast non-rigid 3D shape matching as a multi-labelling problem in which each triangle of the source shape is assigned a label that represents the matching to a specific surface element on the target shape. We propose a novel pairwise term that imposes that our matching prefers solutions in which neighbouring triangles on the source shape remain close on the target shape. Further, by exploiting the specific structure of our label space, we show that |
| 146 | 100 | core trend paper C. cluster representative | From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy | | camera pose; mesh / surface | foundation/prior; unified pipeline; efficiency; robustness | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractShape matching is a fundamental task in computer graphics and vision, with deep functional map methods emerging as a preferred solution. However, existing approaches primarily focus on learning informative feature representations by constraining both pointwise and functional maps, while overlooking the optimization of a crucial component: the spectral basis, which plays a key role in the (deep) functional maps pipeline. This oversight leads to suboptimal matching performance. Furthermore, these approaches mostly rely on conventional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational overhead. To address those, we introduce Advanced Functional Maps, which generalizes standard functional maps from fixed basis functions to learnable basis functions, supported by rigorous theoretical guarantees. In this framework, the spectral basi |
| 147 | 100 | core trend paper C. cluster representative | FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark | RGB-D / depth; multimodal / language | mesh / surface | foundation/prior; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractRecent work in 3D scene understanding has begun to shift from purely spatial analysis to the more complex challenge of functional scene understanding. However, existing methods often consider functional relationships between object pairs in isolation, failing to capture the scene-wide interdependencies that humans use to resolve ambiguity. We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. FunFact first builds an object- and part-centric 3D map and uses foundation models to propose semantically plausible functional relations. These candidates are converted into factor graph variables and constrained by both LLM-derived common-sense priors and geometric priors. This formulation enables joint probabilistic inference over all functional edges and their uncertainties, yielding substantially better-calibrated con |
| 148 | 100 | core trend paper C. cluster representative | FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy | multi-view images | camera pose; point map / point cloud; mesh / surface | foundation/prior; unified pipeline; efficiency | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractRegistration of multiview point clouds typically depends on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and ill-posed without holistic geometric constraints. In this paper, we propose FUSER, the first feed-forward multi-view registration transformer that processes all scans jointly in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER employs a sparse 3D CNN to encode each scan into low-resolution superpoint features preserving absolute translation cues, followed by a Geometric Alternating Attention module for efficient intra- and inter-scan reasoning. Particularly, we transfer 2D attention priors from off-the-shelf foundation models (i.e., $\pi^3$) to enhance 3D feature attention. Building upon FUSER and its estimates, we further introduce |
| 149 | 100 | core trend paper C. cluster representative | JRM: Joint Reconstruction Model for Multiple Objects without Alignment 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark | | mesh / surface; editable / generative 3D | foundation/prior; robustness; benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractObject-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned |
| 150 | 100 | core trend paper C. cluster representative | Long-Tail Internet Photo Reconstruction 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; surface_occupancy; data_benchmark | RGB-D / depth | depth / normals; mesh / surface | foundation/prior; scale; robustness; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractInternet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed, while most real-world sites contain only sparse, noisy, and uneven imagery that defeats classical and learned 3D methods. Existing 3D foundation models generalize well to curated datasets but collapse under the sparsity, ambiguity, and irregularity of Internet photos. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large-scale, clean, and depth-refined dataset, together with a sparse-aware sampling strategy that mimics camera distributions in long-tai |
| 151 | 100 | core trend paper C. cluster representative | ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark | multi-view images; multimodal / language | camera pose; mesh / surface | scale; robustness; dynamic; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractJointly optimizing camera poses and object geometry from unposed images is a challenging task in neural surface reconstruction. Existing methods often suffer from pose drift and geometric distortion, stemming from the easy-view bias --- uniform view optimization favors easy-to-optimize views with abundant texture and good overlap that dominate gradient updates, while hard-to-optimize counterparts with weak texture or limited overlap yet critical for geometric completeness are progressively marginalized. To address this, we propose ManifoldNeuS, a novel framework that explicitly models and leverages per-view optimizability to guide pose-free neural surface reconstruction. Specifically, we introduce the manifold-aware view optimizability score (MaVOS), which jointly assesses immediate fitness (the ease of optimizing each view) and long-term coverage gain (the value of optimizing each view) |
| 152 | 100 | core trend paper C. cluster representative | Native and Compact Structured Latents for 3D Generation 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; surface_occupancy; generation_editing; data_benchmark | multimodal / language | mesh / surface; occupancy / voxel; editable / generative 3D | efficiency; scale; robustness; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractRecent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models compris |
| 153 | 100 | core trend paper C. cluster representative | Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy | multi-view images; multimodal / language | mesh / surface | foundation/prior; robustness | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractThe 3D characterization of microstructures is crucial for understanding and designing functional materials. However, the scanning electron microscope (SEM), widely used in scientific research, captures only 2D electron intensity distributions. Existing SEM 3D reconstruction methods struggle with textureless regions, shadowing artifacts, and calibration dependencies, whereas advanced learning-based approaches fail to generalize to microscopic SEM domains due to the lack of physical priors and domain-specific data. To address these challenges, we introduce NFH-SEM, a neural field-based hybrid reconstruction framework that recovers high-fidelity 3D surfaces from multi-view, multi-detector SEM images. NFH-SEM integrates coarse multi-view geometry with photometric stereo cues from detector signals through a continuous neural field, incorporating a learnable forward model that embeds SEM imagi |
| 154 | 100 | core trend paper C. cluster representative | Pano360: Perspective to Panoramic Vision with Geometric Consistency 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | multi-view images; panorama; multimodal / language | camera pose; mesh / surface | foundation/prior; scale; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractPrior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we collected a large-scale data |
| 155 | 100 | core trend paper C. cluster representative | PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | panorama; RGB-D / depth | camera pose; depth / normals; point map / point cloud; mesh / surface | foundation/prior; unified pipeline; scale; robustness; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractPanoramic imagery offers a full $360^\circ$ field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth a |
| 156 | 100 | core trend paper C. cluster representative | PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmark | video / temporal; RGB-D / depth | camera pose; depth / normals | unified pipeline; efficiency; scale; robustness; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractWe present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. Conventional approaches rely on decoupled pipelines that fuse GNSS and Visual-Inertial Odometry (VIO) for ego-pose estimation, and active sensors like laser rangefinders for target localization. However, these methods are susceptible to failure in GNSS-denied environments and incur substantial hardware costs and complexity. PiLoT breaks this paradigm by directly registering a live video stream against a geo-referenced 3D map. To achieve robust, accurate, and real-time performance, we introduce three key contributions: 1) a Dual-Thread Engine that decouples map rendering from the core localization thread, ensuring low latency while maintaining drift-free accuracy; 2) a large-scale synthetic dataset with precise geometric annotations (camera pose, depth maps). This dataset enables the training of a |
| 157 | 100 | core trend paper C. cluster representative | PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; depth_correspondence; data_benchmark | multi-view images | camera pose | foundation/prior; scale; robustness; benchmark/data | Read early; it likely changes the framing of the 3D reconstruction cluster.abstract6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and ge |
| 158 | 100 | core trend paper C. cluster representative | Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy | multi-view images | mesh / surface | | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractRecent learning-based face reconstruction and registration frameworks such as ToFu and TEMPEH have shown that dense correspondence between facial scans and a common topology can be learned directly from images. However, these approaches still depend on precomputed registrations obtained through iterative optimization pipelines that often require manual verification and correction by human annotators. We introduce MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a fully differentiable and registration-free alternative. Instead of relying on optimization-based registrations, we employ a pseudo-linear inverse kinematic solver in conjunction with dense 2D keypoints produced by a tracker trained only on synthetic data to directly enforce a common face topology at the vertex level. We further find that the commonly used point-to-surface distance can lead to unstable training |
| 159 | 100 | core trend paper C. cluster representative | Rethinking 2D-3D Registration: A Novel Network for High-Value Zone Selection and Representation Consistency Alignment 3D Vision & Geometry / Pose Estimation | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy | RGB-D / depth; multimodal / language | camera pose; point map / point cloud | | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractBoth detection-then-match and detection-free methods have been extensively studied for image-to-point cloud registration, yet they still face significant challenges. The detection-then-match approach emphasizes high-quality correspondences but is limited by the availability of repeatable keypoints, making it susceptible to errors from incorrect matches. In contrast, detection-free methods aim for dense correspondences using a coarse-to-fine strategy to mitigate matching errors. However, non-overlapping regions and low-quality matches still introduce inaccuracies, and the differences between image texture and point cloud structure cause inconsistent region representations, increasing the likelihood of incorrect matches. To address these challenges, we propose two innovative modules: the High-Value Zone Reinforced Selection Module (HZRS) and the Zone Representation Consistency Alignment Mod |
| 160 | 100 | core trend paper C. cluster representative | SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; surface_occupancy; generation_editing; data_benchmark | | camera pose; mesh / surface; editable / generative 3D | foundation/prior; unified pipeline; benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractWe propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor |
| 161 | 100 | core trend paper C. cluster representative | ShapeR: Robust Conditional 3D Shape Generation from Casual Captures 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; surface_occupancy; data_benchmark | multi-view images | mesh / surface; editable / generative 3D | robustness; benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractRecent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given a image sequence, we leverage off-the-shelf visual-inertial SLAM,3D detection algorithms and VLMs to extract for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strateg |
| 162 | 100 | core trend paper C. cluster representative | SpatialVID: A Large-Scale Video Dataset with Spatial Annotations 3D Vision & Geometry / Pose Estimation | pose_calibration_localization; depth_correspondence; dynamic_4d; generation_editing; data_benchmark | video / temporal; RGB-D / depth | camera pose; depth / normals; 4D scene; editable / generative 3D | foundation/prior; scale; robustness; dynamic; benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractSignificant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect **SpatialVID**, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subseq |
| 163 | 100 | core trend paper C. cluster representative | TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking 3D Vision & Geometry / 3D Gaussian Splatting | gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing | video / temporal | Gaussian map; mesh / surface; radiance field / NVS; 4D scene; editable / generative 3D | dynamic; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractTopology-consistent dynamic model sequences are essential for applications such as animation and model editing. However, existing 4D reconstruction methods face challenges in generating high-quality topology-consistent meshes. To address this, we propose a topology-aware dynamic reconstruction framework based on Gaussian Splatting. We introduce a Gaussian topological structure that explicitly encodes spatial connectivity. This structure enables topology-aware densification and pruning, preserving the manifold consistency of the Gaussian representation. Temporal regularization terms further ensure topological coherence over time, while differentiable mesh rasterization improves mesh quality. Experimental results demonstrate that our method reconstructs topology-consistent mesh sequences with significantly higher accuracy than existing approaches. Moreover, the resulting meshes enable prec |
| 164 | 100 | core trend paper C. cluster representative | TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; surface_occupancy; generation_editing | multimodal / language | point map / point cloud; mesh / surface; occupancy / voxel; editable / generative 3D | foundation/prior; unified pipeline; efficiency; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractThe dominant paradigm for high-fidelity 3D generation relies on a VAE-Diffusion pipeline, where the VAE's reconstruction capability sets a firm upper bound on generation quality. A fundamental challenge limiting existing VAEs is the *representation mismatch* between ground-truth meshes and network predictions: GT meshes have arbitrary, variable topology, while VAEs typically predict fixed-structure implicit fields (e.g., SDF on regular grids). This inherent misalignment prevents establishing explicit mesh-level correspondences, forcing prior work to rely on indirect supervision signals such as SDF or rendering losses. Consequently, fine geometric details, particularly sharp features, are poorly preserved during reconstruction. To address this, we introduce TopoMesh, a sparse voxel-based VAE that unifies both GT and predicted meshes under a shared Dual Marching Cubes (DMC) topologic |
| 165 | 100 | core trend paper C. cluster representative | UniCorn: Unified Correspondence Transformer Across 2D and 3D 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy | RGB-D / depth | camera pose; depth / normals; point map / point cloud; mesh / surface | foundation/prior; unified pipeline | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractVisual correspondence across image-to-image (2D-2D), image-to-point cloud (2D-3D), and point cloud-to-point cloud (3D-3D) geometric matching forms the foundation for numerous 3D vision tasks. Despite sharing a similar problem structure, current methods use task-specific designs with separate models for each modality combination. We present UniCorn, the first correspondence model with shared weights that unifies geometric matching across all three tasks. Our key insight is that Transformer attention naturally captures cross-modal feature similarity. We propose a dual-stream decoder that maintains separate appearance and positional feature streams. This design enables end-to-end learning through stackable layers while supporting flexible query-based correspondence estimation across heterogeneous modalities. Our architecture employs modality-specific backbones followed by shared encoder an |
| 166 | 100 | core trend paper C. cluster representative | UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; data_benchmark | RGB-D / depth | camera pose; depth / normals; point map / point cloud; mesh / surface | foundation/prior; unified pipeline; scale; robustness | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractWe present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real |
| 167 | 100 | core trend paper C. cluster representative | ViLearn: Accelerating Training Convergence of Image-to-3D Generation via Visibility Learning 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; depth_correspondence; surface_occupancy; generation_editing | RGB-D / depth; multimodal / language | depth / normals; mesh / surface; editable / generative 3D | efficiency; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractSingle-image-to-3D shape generation has seen remarkable progress, driven by latent diffusion models trained on the compressed latent space of 3D VAEs. However, the task remains intrinsically ill-posed: recovering complete 3D geometry, especially occluded surfaces, from a single view is inherently ambiguous. Existing VecSet-based approaches further exacerbate this challenge by treating shape tokens as an unordered set without explicit positional encoding. This design forces diffusion models to simultaneously learn visible correspondences from the input image and hallucinate invisible geometry within a large, permutation-invariant token space, where the lack of structure significantly hinders training efficiency and convergence stability. To address this, we propose *Visibility Learning*, a training paradigm that injects visibility structure and positional inductive bias into the image |
| 168 | 100 | core trend paper C. cluster representative | Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion 3D Vision & Geometry / 3D Reconstruction | general_reconstruction; surface_occupancy; generation_editing; data_benchmark | multimodal / language | mesh / surface; editable / generative 3D | benchmark/data; editing/generation | Read early; it likely changes the framing of the 3D reconstruction cluster.abstractRealistic 3D city generation is fundamental to a wide range of applications, including virtual reality and digital twins. However, most existing methods rely on training a single diffusion model, which limits their ability to generate personalized and boundless city-scale scenes. In this paper, we present Yo'City, a novel agentic framework that enables user-customized and infinitely expandable 3D city generation by leveraging the reasoning and compositional capabilities of off-the-shelf large models. Specifically, Yo'City first conceptualizes the city through a top-down planning strategy that defines a hierarchical “City–District–Grid” structure. The Global Planner determines the overall layout and potential functional districts, while the Local Designer further refines each district with detailed grid-level descriptions. Subsequently, the grid-level 3D generation is achieved through a pr |
| 169 | 100 | core trend paper D. adjacent but useful context | 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation Generative Models / Video Generation | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; surface_occupancy; generation_editing; data_benchmark | single image; multi-view images; video / temporal; RGB-D / depth; multimodal / language | depth / normals; radiance field / NVS; editable / generative 3D | foundation/prior; scale; dynamic; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractExisting methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, v |
| 170 | 100 | core trend paper D. adjacent but useful context | 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models Data & Evaluation / Benchmark | general_reconstruction; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | video / temporal; LiDAR / driving; multimodal / language | mesh / surface; 4D scene; editable / generative 3D | foundation/prior; unified pipeline; dynamic; benchmark/data; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractWorld Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, Embodied Intelligence, and content creation. However, prior benchmarks each emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition–4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as |
| 171 | 100 | core trend paper D. adjacent but useful context | Breaking the 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch for Category 6D Pose Estimation and Robotic Grasping Data & Evaluation / Benchmark | general_reconstruction; pose_calibration_localization; depth_correspondence; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | RGB-D / depth; multimodal / language | camera pose; depth / normals; mesh / surface; editable / generative 3D | foundation/prior; efficiency; scale; benchmark/data; editing/generation | Use for the robotics/SLAM angle: reconstruction becomes a map or world model, not just a visual asset.abstractWhile 2D vision has been revolutionized by large-scale datasets like ImageNet, 3D vision remains constrained by the scarcity of high-quality, canonically aligned data. We introduce the first scalable, automated framework that generates complete category-level 6D pose datasets directly from text prompts, bypassing the need for existing 3D assets. Our method overcomes key challenges by: (1) ensuring reliable, scalable asset generation via a controlled text-to-image-to-3D pipeline; (2) enforcing built-in canonical alignment through depth-conditioned generation, achieving a 96% pose consistency rate; and (3) enabling large-scale 6D annotation via mixed reality rendering. The pipeline produces high-quality, aligned 3D meshes in under 3 minutes per object, a 5–20× speedup over traditional scanning. We generate over 1,000 instances for each of the 153 categories in the Omni6Dpose benchma |
| 172 | 100 | core trend paper D. adjacent but useful context | Dexterous World Models Robotics & Embodied AI / Embodied AI | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | video / temporal | mesh / surface; radiance field / NVS; editable / generative 3D | dynamic; benchmark/data; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractRecent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static, limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion model enabling embodied interaction within static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human–scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues in the egocentric view to model action-conditioned dynamics directly. We train our model on a synthetic h |
| 173 | 100 | core trend paper D. adjacent but useful context | DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images Autonomous Driving / Autonomous Driving | general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | multi-view images; video / temporal; LiDAR / driving | camera pose; Gaussian map; mesh / surface; 4D scene | foundation/prior; unified pipeline; efficiency; scale; dynamic; benchmark/data | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractAutonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves tem |
| 174 | 100 | core trend paper D. adjacent but useful context | Ego-1K – A Large-Scale Multiview Video Dataset for Egocentric Vision Data & Evaluation / Benchmark | general_reconstruction; gaussian_radiance; pose_calibration_localization; dynamic_4d; surface_occupancy; robotics_mapping; data_benchmark | multi-view images; video / temporal | mesh / surface; radiance field / NVS; 4D scene | scale; dynamic; benchmark/data | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractWe present Ego-1K, a large-scale, time-synchronized collection of egocentric multiview videos designed to advance neural 3D video synthesis, dynamic scene understanding, and embodied perception. The dataset contains nearly 1,000 short egocentric videos taken with a custom rig with 12 synchronous cameras surrounding a VR headset worn by the user. Scene content focuses on hand motions and hand-object interactions in different settings. We describe rig design, data processing, and calibration. Our dataset enables new ways to benchmark egocentric scene reconstruction methods. We believe this is an important area of research as smart glasses with multiple cameras become omnipresent. Our experiments demonstrate that our dataset presents unique challenges for existing 3D and 4D novel view synthesis methods due to high disparities and image motion caused by close dynamic objects and rig ego |
| 175 | 100 | core trend paper D. adjacent but useful context | FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain Autonomous Driving / Autonomous Driving | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | video / temporal; LiDAR / driving | Gaussian map; mesh / surface; editable / generative 3D | foundation/prior; unified pipeline; dynamic; benchmark/data; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractIn controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce **FaithFusion**, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications. Extensive experiments on |
| 176 | 100 | core trend paper D. adjacent but useful context | RAYNOVA: Geometry-Free Auto-Regressive 4D World Modeling with Unified Spatio-Temporal Representation Generative Models / Video Generation | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | multi-view images; video / temporal | mesh / surface; radiance field / NVS; 4D scene; editable / generative 3D | foundation/prior; unified pipeline; robustness; dynamic; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractWorld foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RayNova, a geometry-free world model that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RayNova constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RayNova achieves state-of-the-art multi-vi |
| 177 | 100 | core trend paper D. adjacent but useful context | UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching Robotics & Embodied AI / Embodied AI | general_reconstruction; gaussian_radiance; depth_correspondence; dynamic_4d; robotics_mapping; generation_editing; data_benchmark | video / temporal | Gaussian map; editable / generative 3D | foundation/prior; unified pipeline; efficiency; dynamic; benchmark/data; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractRecent progress in 3D reconstruction, such as NeRFs and 3D Gaussian Splatting, has made it easy to recover geometry and appearance from images. However, these static representations remain blind to the physics that govern how objects deform and respond to forces. Building interactive 3D worlds therefore requires predicting not only shape but the underlying material properties. Prior approaches either rely on slow test-time optimization or, more recently, a fast feed-forward predictor such as Pixie. However, these models produce only a single point estimate of physical parameters and are limited to a single simulation backend, restricting both expressiveness and portability. We introduce UniPixie, a generative physics-from-pixels framework that overcomes both limitations. UniPixie predicts a controllable, continuous soft-to-stiff distribution of plausible material properties from a single |
| 178 | 100 | core trend paper D. adjacent but useful context | Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning Robotics & Embodied AI / Embodied AI | general_reconstruction; gaussian_radiance; dynamic_4d; surface_occupancy; robotics_mapping; generation_editing; data_benchmark | video / temporal; multimodal / language | Gaussian map; mesh / surface; radiance field / NVS; editable / generative 3D | foundation/prior; scale; dynamic; benchmark/data; editing/generation | Use as evidence that Gaussian/radiance representations are moving from static NVS toward dynamic scene models.abstractScalable robot learning is hindered by the high cost of acquiring diverse, high-quality embodied data. Existing data generation approaches partially mitigate this issue but typically depend on hard-to-access hardware and labor-intensive manual effort, with limited generalization to diverse scene configurations. To overcome these limitations, we propose Video2Robo, a framework that generates high-quality and diverse robot data directly from a single human demonstration video, enabling seamless deployment on physical robots. At its core, Video2Robo leverages 3D Gaussian Splatting (3DGS) as a powerful scene representation, enabling high-fidelity rendering and explicit 3D scene editing. The framework tracks temporally consistent motion trajectories of task-relevant objects from raw video footage and identifies key task skills, guiding robots to execute tasks kinematically plausibly under nov |