Computer Vision and Pattern Recognition 151
★ Whole-Body Conditioned Egocentric Video Prediction
We train models to Predict Ego-centric Video from human Actions (PEVA), given
the past video and an action represented by the relative 3D body pose. By
conditioning on kinematic pose trajectories, structured by the joint hierarchy
of the body, our model learns to simulate how physical human actions shape the
environment from a first-person point of view. We train an auto-regressive
conditional diffusion transformer on Nymeria, a large-scale dataset of
real-world egocentric video and body pose capture. We further design a
hierarchical evaluation protocol with increasingly challenging tasks, enabling
a comprehensive analysis of the model's embodied prediction and control
abilities. Our work represents an initial attempt to tackle the challenges of
modeling complex real-world environments and embodied agent behaviors with
video prediction from the perspective of a human.
comment: Project Page: https://dannytran123.github.io/PEVA
☆ SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
Alex Costanzino, Pierluigi Zama Ramirez, Luigi Lella, Matteo Ragaglia, Alessandro Oliva, Giuseppe Lisanti, Luigi Di Stefano
We propose SiM3D, the first benchmark considering the integration of
multiview and multimodal information for comprehensive 3D anomaly detection and
segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume.
Moreover, SiM3D focuses on a scenario of high interest in manufacturing:
single-instance anomaly detection, where only one object, either real or
synthetic, is available for training. In this respect, SiM3D stands out as the
first ADS benchmark that addresses the challenge of generalising from synthetic
training data to real test data. SiM3D includes a novel multimodal multiview
dataset acquired using top-tier industrial sensors and robots. The dataset
features multiview high-resolution images (12 Mpx) and point clouds (7M points)
for 333 instances of eight types of objects, alongside a CAD model for each
type. We also provide manually annotated 3D segmentation GTs for anomalous test
samples. To establish reference baselines for the proposed multiview 3D ADS
task, we adapt prominent singleview methods and assess their performance using
novel metrics that operate on Anomaly Volumes.
☆ SAM4D: Segment Anything in Camera and LiDAR Streams ICCV2025
We present SAM4D, a multi-modal and temporal foundation model designed for
promptable segmentation across camera and LiDAR streams. Unified Multi-modal
Positional Encoding (UMPE) is introduced to align camera and LiDAR features in
a shared 3D space, enabling seamless cross-modal prompting and interaction.
Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA),
which leverages ego-motion compensation to enhance temporal consistency and
long-horizon feature retrieval, ensuring robust segmentation across dynamically
changing autonomous driving scenes. To avoid annotation bottlenecks, we develop
a multi-modal automated data engine that synergizes VFM-driven video masklets,
spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This
framework generates camera-LiDAR aligned pseudo-labels at a speed orders of
magnitude faster than human annotation while preserving VFM-derived semantic
fidelity in point cloud representations. We conduct extensive experiments on
the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal
segmentation ability and great potential in data annotation of proposed SAM4D.
comment: Accepted by ICCV2025, Project Page: https://SAM4D-Project.github.io
☆ HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
Recent progress in vision-language segmentation has significantly advanced
grounded visual understanding. However, these models often exhibit
hallucinations by producing segmentation masks for objects not grounded in the
image content or by incorrectly labeling irrelevant regions. Existing
evaluation protocols for segmentation hallucination primarily focus on label or
textual hallucinations without manipulating the visual context, limiting their
capacity to diagnose critical failures. In response, we introduce
HalluSegBench, the first benchmark specifically designed to evaluate
hallucinations in visual grounding through the lens of counterfactual visual
reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual
instance pairs spanning 281 unique object classes, and a set of newly
introduced metrics that quantify hallucination sensitivity under visually
coherent scene edits. Experiments on HalluSegBench with state-of-the-art
vision-language segmentation models reveal that vision-driven hallucinations
are significantly more prevalent than label-driven ones, with models often
persisting in false segmentation, highlighting the need for counterfactual
reasoning to diagnose grounding fidelity.
comment: Project webpage: https://plan-lab.github.io/hallusegbench/
☆ DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion
Reconstructing 3D objects from a single image is a long-standing challenge,
especially under real-world occlusions. While recent diffusion-based view
synthesis models can generate consistent novel views from a single RGB image,
they generally assume fully visible inputs and fail when parts of the object
are occluded. This leads to inconsistent views and degraded 3D reconstruction
quality. To overcome this limitation, we propose an end-to-end framework for
occlusion-aware multi-view generation. Our method directly synthesizes six
structurally consistent novel views from a single partially occluded image,
enabling downstream 3D reconstruction without requiring prior inpainting or
manual annotations. We construct a self-supervised training pipeline using the
Pix2Gestalt dataset, leveraging occluded-unoccluded image pairs and
pseudo-ground-truth views to teach the model structure-aware completion and
view consistency. Without modifying the original architecture, we fully
fine-tune the view synthesis model to jointly learn completion and multi-view
generation. Additionally, we introduce the first benchmark for occlusion-aware
reconstruction, encompassing diverse occlusion levels, object categories, and
mask patterns. This benchmark provides a standardized protocol for evaluating
future methods under partial occlusions. Our code is available at
https://github.com/Quyans/DeOcc123.
☆ StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning ICCV 2025
Recently, Mamba-based methods have demonstrated impressive performance in
point cloud representation learning by leveraging the State Space Model (SSM), with
its efficient context modeling ability and linear complexity. However, these
methods still face two key issues that limit the potential of SSM: Destroying
the adjacency of 3D points during SSM processing and failing to retain
long-sequence memory as the input length increases in downstream tasks. To
address these issues, we propose StruMamba3D, a novel paradigm for
self-supervised point cloud representation learning. It enjoys several merits.
First, we design spatial states and use them as proxies to preserve spatial
dependencies among points. Second, we enhance the SSM with a state-wise update
strategy and incorporate a lightweight convolution to facilitate interactions
between spatial states for efficient structure modeling. Third, our method
reduces the sensitivity of pre-trained Mamba-based models to varying input
lengths by introducing a sequence length-adaptive strategy. Experimental
results across four downstream tasks showcase the superior performance of our
method. In addition, our method attains the SOTA 95.1% accuracy on ModelNet40
and 92.75% accuracy on the most challenging split of ScanObjectNN without
a voting strategy.
comment: Accepted by ICCV 2025
☆ Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval ACL 2025
Cross-modal image-text retrieval is challenging because of the diverse
possible associations between content from different modalities. Traditional
methods learn a single-vector embedding to represent semantics of each sample,
but struggle to capture nuanced and diverse relationships that can exist across
modalities. Set-based approaches, which represent each sample with multiple
embeddings, offer a promising alternative, as they can capture richer and more
diverse relationships. In this paper, we show that, despite their promise,
these set-based representations continue to face issues, including sparse
supervision and set collapse, which limit their effectiveness. To address
these challenges, we propose Maximal Pair Assignment Similarity to optimize
one-to-one matching between embedding sets while preserving semantic diversity
within the set. We also introduce two loss functions to further enhance the
representations: Global Discriminative Loss to enhance distinction among
embeddings, and Intra-Set Divergence Loss to prevent collapse within each set.
Our method achieves state-of-the-art performance on MS-COCO and Flickr30k
without relying on external data.
comment: Accepted at the 63rd Annual Meeting of the Association for
Computational Linguistics (ACL 2025 Main)
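A minimal sketch (not the paper's implementation) of scoring two embedding sets via a maximal one-to-one assignment; the Hungarian solver, cosine similarity, and mean aggregation below are illustrative assumptions:

```python
# Sketch: set-to-set similarity via maximal one-to-one matching (assumed details).
import numpy as np
from scipy.optimize import linear_sum_assignment


def set_similarity(img_set: np.ndarray, txt_set: np.ndarray) -> float:
    """Score two embedding sets (k x d each) via maximal one-to-one matching."""
    # Cosine similarity between every image/text embedding pair.
    img = img_set / np.linalg.norm(img_set, axis=1, keepdims=True)
    txt = txt_set / np.linalg.norm(txt_set, axis=1, keepdims=True)
    sim = img @ txt.T                              # (k, k)

    # The Hungarian algorithm maximizes total similarity under a one-to-one
    # constraint, so every embedding contributes and no single pair dominates.
    rows, cols = linear_sum_assignment(-sim)       # negate: solver minimizes cost
    return float(sim[rows, cols].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(set_similarity(rng.normal(size=(4, 512)), rng.normal(size=(4, 512))))
```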
☆ ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers ICCV
Research in quantum machine learning has recently proliferated due to the
potential of quantum computing to accelerate machine learning. An area of
machine learning that has not yet been explored is neural ordinary differential
equation (neural ODE) based residual neural networks (ResNets), which aim to
improve the effectiveness of neural networks using the principles of ordinary
differential equations. In this work, we present our insights about why analog
Rydberg atom quantum computers are especially well-suited for ResNets. We also
introduce ResQ, a novel framework to optimize the dynamics of Rydberg atom
quantum computers to solve classification problems in machine learning using
analog quantum neural ODEs.
comment: ResQ will appear in the Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2025
☆ Exploring the Design Space of 3D MLLMs for CT Report Generation
Multimodal Large Language Models (MLLMs) have emerged as a promising way to
automate Radiology Report Generation (RRG). In this work, we systematically
investigate the design space of 3D MLLMs, including visual input
representation, projectors, Large Language Models (LLMs), and fine-tuning
techniques for 3D CT report generation. We also introduce two knowledge-based
report augmentation methods that improve performance on the GREEN score by up
to 10%, achieving 2nd place in the MICCAI 2024 AMOS-MM challenge. Our
results on the 1,687 cases from the AMOS-MM dataset show that RRG is largely
independent of the size of LLM under the same training protocol. We also show
that larger volume size does not always improve performance if the original ViT
was pre-trained on a smaller volume size. Lastly, we show that using a
segmentation mask along with the CT volume improves performance. The code is
publicly available at https://github.com/bowang-lab/AMOS-MM-Solution
☆ WAFT: Warping-Alone Field Transforms for Optical Flow
We introduce Warping-Alone Field Transforms (WAFT), a simple and effective
method for optical flow. WAFT is similar to RAFT but replaces cost volume with
high-resolution warping, achieving better accuracy with lower memory cost. This
design challenges the conventional wisdom that constructing cost volumes is
necessary for strong performance. WAFT is a simple and flexible
meta-architecture with minimal inductive biases and reliance on custom designs.
Compared with existing methods, WAFT ranks 1st on the Spring and KITTI benchmarks
and achieves the best zero-shot generalization on KITTI, while being up to 4.1x
faster than methods with similar performance. Code and model weights are
available at https://github.com/princeton-vl/WAFT.
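A minimal sketch of the flow-based backward warping that WAFT uses in place of a cost volume; the bilinear interpolation mode and coordinate normalization are generic choices, not taken from the released code:

```python
# Sketch: backward-warp frame-2 features to frame 1 with a flow field.
import torch
import torch.nn.functional as F


def warp(feat2: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """feat2: (B, C, H, W) features, flow: (B, 2, H, W) flow from frame 1 to 2, in pixels."""
    b, _, h, w = feat2.shape
    # Pixel grid of frame 1, shifted by the flow to sample frame 2.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat2.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                              # (B, 2, H, W)

    # grid_sample expects coordinates normalized to [-1, 1], ordered (x, y).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(feat2, grid_norm, mode="bilinear", align_corners=True)
```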
☆ MADrive: Memory-Augmented Driving Scene Modeling
Polina Karpikova, Daniil Selikhanovych, Kirill Struminsky, Ruslan Musaev, Maria Golitsyna, Dmitry Baranchuk
Recent advances in scene reconstruction have pushed toward highly realistic
modeling of autonomous driving (AD) environments using 3D Gaussian splatting.
However, the resulting reconstructions remain closely tied to the original
observations and struggle to support photorealistic synthesis of significantly
altered or novel driving scenarios. This work introduces MADrive, a
memory-augmented reconstruction framework designed to extend the capabilities
of existing scene reconstruction methods by replacing observed vehicles with
visually similar 3D assets retrieved from a large-scale external memory bank.
Specifically, we release MAD-Cars, a curated dataset of ~70K 360°
car videos captured in the wild and present a retrieval module that finds the
most similar car instances in the memory bank, reconstructs the corresponding
3D assets from video, and integrates them into the target scene through
orientation alignment and relighting. The resulting replacements provide
complete multi-view representations of vehicles in the scene, enabling
photorealistic synthesis of substantially altered configurations, as
demonstrated in our experiments. Project page:
https://yandex-research.github.io/madrive/
☆ G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation ICCV 2025
Multimodal learning aims to leverage information from diverse data modalities
to achieve more comprehensive performance. However, conventional multimodal
models often suffer from modality imbalance, where one or a few modalities
dominate model optimization, leading to suboptimal feature representation and
underutilization of weak modalities. To address this challenge, we introduce
Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework
that optimizes the multimodal model with a custom-built loss function that
fuses both unimodal and multimodal objectives. G$^{2}$D further incorporates a
dynamic sequential modality prioritization (SMP) technique to ensure that each
modality takes the lead at some point during training, avoiding the
pitfall of stronger modalities overshadowing weaker ones. We validate G$^{2}$D
on multiple real-world datasets and show that G$^{2}$D amplifies the
significance of weak modalities during training and outperforms state-of-the-art
methods in classification and regression tasks. Our code is available at
https://github.com/rAIson-Lab/G2D.
comment: Accepted at ICCV 2025
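A rough sketch of a loss that fuses unimodal and multimodal objectives with a distillation term, in the spirit of G$^{2}$D; the weights, temperature, and KL-based distillation form are assumptions, not the paper's exact formulation:

```python
# Sketch: fused unimodal + multimodal objective with a distillation term.
import torch
import torch.nn.functional as F


def fused_loss(logits_multi, logits_a, logits_b, teacher_logits, labels,
               w_uni=0.5, w_distill=1.0, temperature=2.0):
    # Task losses on the fused prediction and on each unimodal head, so a weak
    # modality still receives a direct gradient signal.
    loss = F.cross_entropy(logits_multi, labels)
    loss = loss + w_uni * (F.cross_entropy(logits_a, labels) +
                           F.cross_entropy(logits_b, labels))

    # Distillation from a (frozen) teacher's logits.
    t = temperature
    distill = F.kl_div(F.log_softmax(logits_multi / t, dim=-1),
                       F.softmax(teacher_logits / t, dim=-1),
                       reduction="batchmean") * (t * t)
    return loss + w_distill * distill
```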
☆ GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation ICCV 2025
Wentao Hu, Shunkai Li, Ziqiao Peng, Haoxian Zhang, Fan Shi, Xiaoqiang Liu, Pengfei Wan, Di Zhang, Hui Tian
Creating high-quality, generalizable speech-driven 3D talking heads remains a
persistent challenge. Previous methods achieve satisfactory results for fixed
viewpoints and small-scale audio variations, but they struggle with large head
rotations and out-of-distribution (OOD) audio. Moreover, they are constrained
by the need for time-consuming, identity-specific training. We believe the core
issue lies in the lack of sufficient 3D priors, which limits the extrapolation
capabilities of synthesized talking heads. To address this, we propose
GGTalker, which synthesizes talking heads through a combination of
generalizable priors and identity-specific adaptation. We introduce a two-stage
Prior-Adaptation training strategy to learn Gaussian head priors and adapt to
individual characteristics. We train Audio-Expression and Expression-Visual
priors to capture the universal patterns of lip movements and the general
distribution of head textures. During the Customized Adaptation, individual
speaking styles and texture details are precisely modeled. Additionally, we
introduce a color MLP to generate fine-grained, motion-aligned textures and a
Body Inpainter to blend rendered results with the background, producing
indistinguishable, photorealistic video frames. Comprehensive experiments show
that GGTalker achieves state-of-the-art performance in rendering quality, 3D
consistency, lip-sync accuracy, and training efficiency.
comment: ICCV 2025, Project page: https://vincenthu19.github.io/GGTalker/
☆ Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration
Jiahe Chen, Jiaying He, Qian Shao, Qiyuan Chen, Jiahe Ying, Hongxia Xu, Jintai Chen, Jianwei Zheng, Jian Wu
Large Vision-Language Models (LVLMs) have demonstrated significant
advancements in multimodal understanding, yet they are frequently hampered by
hallucination: the generation of text that contradicts the visual input. Existing
training-free decoding strategies exhibit critical limitations, including the
use of static constraints that do not adapt to semantic drift during
generation, inefficiency stemming from the need for multiple forward passes,
and degradation of detail due to overly rigid intervention rules. To overcome
these challenges, this paper introduces Dynamic Logits Calibration (DLC), a
novel training-free decoding framework designed to dynamically align text
generation with visual evidence at inference time. At the decoding phase, DLC
step-wise employs CLIP to assess the semantic alignment between the input image
and the generated text sequence. Then, the Relative Visual Advantage (RVA) of
candidate tokens is evaluated against a dynamically updated contextual
baseline, adaptively adjusting output logits to favor tokens that are visually
grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time
context alignment score, carefully balances the visual guidance while ensuring
the overall quality of the textual output. Extensive experiments conducted
across diverse benchmarks and various LVLM architectures (such as LLaVA,
InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces
hallucinations, outperforming current methods while maintaining high inference
efficiency by avoiding multiple forward passes. Overall, we present an
effective and efficient decoding-time solution to mitigate hallucinations,
thereby enhancing the reliability of LVLMs for broader practical use. Code will be
released on GitHub.
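A simplified sketch of decoding-time logit calibration along the lines described above; `visual_alignment` stands in for a CLIP-style scorer, `tokenizer` for any tokenizer with a `decode` method, and the baseline update rule and weighting are illustrative assumptions:

```python
# Sketch: re-rank top-k next-token logits by their relative visual advantage.
import torch


def calibrate_logits(logits, image, context_text, tokenizer, visual_alignment,
                     baseline, momentum=0.9, alpha=1.0, top_k=10):
    """logits: 1-D next-token logits; baseline: running contextual alignment score."""
    top_logits, top_idx = logits.topk(top_k)

    # Visual alignment of the running text extended by each candidate token;
    # visual_alignment(image, text) is assumed to return a scalar score.
    scores = torch.tensor([
        visual_alignment(image, context_text + tokenizer.decode([int(i)]))
        for i in top_idx
    ])

    # Dynamically updated contextual baseline; candidates above it get a bonus.
    new_baseline = momentum * baseline + (1 - momentum) * float(scores.mean())
    advantage = scores - new_baseline

    calibrated = logits.clone()
    calibrated[top_idx] = top_logits + alpha * advantage
    return calibrated, new_baseline
```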
☆ Lightweight Physics-Informed Zero-Shot Ultrasound Plane Wave Denoising
Ultrasound Coherent Plane Wave Compounding (CPWC) enhances image contrast by
combining echoes from multiple steered transmissions. While increasing the
number of angles generally improves image quality, it drastically reduces the
frame rate and can introduce blurring artifacts in fast-moving targets.
Moreover, compounded images remain susceptible to noise, particularly when
acquired with a limited number of transmissions. We propose a zero-shot
denoising framework tailored for low-angle CPWC acquisitions, which enhances
contrast without relying on a separate training dataset. The method divides the
available transmission angles into two disjoint subsets, each used to form
compounded images with higher noise levels. These compounded images are
then used to train a deep model via a self-supervised residual learning scheme,
enabling it to suppress incoherent noise while preserving anatomical
structures. Because angle-dependent artifacts vary between the subsets while
the underlying tissue response is similar, this physics-informed pairing allows
the network to learn to disentangle the inconsistent artifacts from the
consistent tissue signal. Unlike supervised methods, our model requires no
domain-specific fine-tuning or paired data, making it adaptable across
anatomical regions and acquisition setups. The entire pipeline supports
efficient training with low computational cost due to the use of a lightweight
architecture, which comprises only two convolutional layers. Evaluations on
simulation, phantom, and in vivo data demonstrate superior contrast enhancement
and structure preservation compared to both classical and deep learning-based
denoising methods.
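A minimal sketch of the angle-split, self-supervised residual training idea, where compounded images from two disjoint angle subsets supervise each other; the two-convolution network mirrors the lightweight design mentioned above, while kernel sizes, optimizer, and iteration count are assumptions:

```python
# Sketch: Noise2Noise-style zero-shot denoising of two angle-subset compounds.
import torch
import torch.nn as nn


class TinyDenoiser(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, x):                 # residual learning: predict the noise
        return x - self.net(x)


def zero_shot_denoise(compound_a, compound_b, iters=500):
    """compound_a/b: (1, 1, H, W) images compounded from disjoint angle subsets."""
    model = TinyDenoiser()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(iters):
        opt.zero_grad()
        # Each noisy compound is the target for the other: the angle-dependent
        # incoherent noise differs, while the tissue signal is shared.
        loss = ((model(compound_a) - compound_b) ** 2).mean() + \
               ((model(compound_b) - compound_a) ** 2).mean()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(0.5 * (compound_a + compound_b))
```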
☆ Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection
Deep neural networks have set the state-of-the-art in computer vision tasks
such as bounding box detection and semantic segmentation. Object detectors and
segmentation models assign confidence scores to predictions, reflecting the
model's uncertainty in object detection or pixel-wise classification. However,
these confidence estimates are often miscalibrated, as their architectures and
loss functions are tailored to task performance rather than to a probabilistic
foundation. Even with well-calibrated predictions, object detectors fail to
quantify uncertainty outside detected bounding boxes, i.e., the model does not
make a probability assessment of whether an area without detected objects is
truly free of obstacles. This poses a safety risk in applications such as
automated driving, where uncertainty in empty areas remains unexplored. In this
work, we propose an object detection model grounded in spatial statistics.
We model bounding box data as realizations of a marked point process, which
describes the probabilistic occurrence of spatial point events, here the
bounding box centers, with marks capturing the spatial extent and class of each
box. Our statistical framework enables a
likelihood-based training and provides well-defined confidence estimates for
whether a region is drivable, i.e., free of objects. We demonstrate the
effectiveness of our method through calibration assessments and evaluation of
performance.
comment: 15 pages, 4 figures, 3 tables
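A schematic sketch of a discretized Poisson point-process negative log-likelihood for box centers, one way to realize the likelihood-based training described above; the grid approximation of the intensity integral is an assumption, and the marks (box extent and class) would add conditional density terms at each center:

```python
# Sketch: Poisson point-process NLL for predicted box-center intensity.
import torch


def point_process_nll(log_intensity, centers, cell_area=1.0):
    """log_intensity: (H, W) predicted log-intensity map.
    centers: (N, 2) integer (row, col) locations of ground-truth box centers.
    """
    intensity = log_intensity.exp()
    # Expected number of points: integral of the intensity over the image,
    # approximated by a sum over grid cells.
    expected = intensity.sum() * cell_area
    # Log-intensity evaluated at the observed box centers.
    observed = log_intensity[centers[:, 0], centers[:, 1]].sum()
    return expected - observed   # negative log-likelihood up to a constant
```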
☆ TITAN: Query-Token based Domain Adaptive Adversarial Learning ICCV 2025
We focus on the source-free domain adaptive object detection (SF-DAOD)
problem when source data is unavailable during adaptation and the model must
adapt to an unlabeled target domain. The majority of approaches for the problem
employ a self-supervised approach using a student-teacher (ST) framework where
pseudo-labels are generated via a source-pretrained model for further
fine-tuning. We observe that the performance of a student model often degrades
drastically, due to the collapse of the teacher model, primarily caused by high
noise in pseudo-labels, resulting from domain bias, discrepancies, and a
significant domain shift across domains. To obtain reliable pseudo-labels, we
propose a Target-based Iterative Query-Token Adversarial Network (TITAN), which
separates the target images into two subsets: those similar to the source
(easy) and those dissimilar (hard). We propose a strategy to estimate variance
to partition the target domain. This approach leverages the insight that higher
detection variances correspond to higher recall and greater similarity to the
source domain. Also, we incorporate query-token-based adversarial modules into
a student-teacher baseline framework to reduce the domain gaps between two
feature representations. Experiments conducted on four natural imaging datasets
and two challenging medical datasets have substantiated the superior
performance of TITAN compared to existing state-of-the-art (SOTA)
methodologies. We report an mAP improvement of +22.7, +22.2, +21.1, and +3.7
percent over the current SOTA on C2F, C2B, S2C, and K2C benchmarks,
respectively.
comment: ICCV 2025
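A toy sketch of the variance-based easy/hard partition of the target domain; using the variance of per-image detection confidences and a median threshold is an assumption for illustration, not the paper's estimator:

```python
# Sketch: split target images into source-similar (easy) and dissimilar (hard).
import numpy as np


def partition_target(detections_per_image):
    """detections_per_image: list of 1-D arrays of detection confidences per image."""
    variances = np.array([np.var(d) if len(d) else 0.0 for d in detections_per_image])
    threshold = np.median(variances)
    # Higher detection variance is taken as a proxy for higher recall and
    # greater similarity to the source domain (the "easy" subset).
    easy = [i for i, v in enumerate(variances) if v >= threshold]
    hard = [i for i, v in enumerate(variances) if v < threshold]
    return easy, hard
```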
☆ Global and Local Entailment Learning for Natural World Imagery ICCV 2025
Learning the hierarchical structure of data in vision-language models is a
significant challenge. Previous works have attempted to address this challenge
by employing entailment learning. However, these approaches fail to model the
transitive nature of entailment explicitly, which establishes the relationship
between order and semantics within a representation space. In this work, we
introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the
explicit modeling of transitivity-enforced entailment. Our proposed framework
optimizes for the partial order of concepts within vision-language models. By
leveraging our framework, we develop a hierarchical vision-language foundation
model capable of representing the hierarchy in the Tree of Life. Our
experiments on hierarchical species classification and hierarchical retrieval
tasks demonstrate the enhanced performance of our models compared to the
existing state-of-the-art models. Our code and models are open-sourced at
https://vishu26.github.io/RCME/index.html.
comment: Accepted at ICCV 2025
☆ Logios: An open-source Greek Polytonic Optical Character Recognition system
In this paper, we present an Optical Character Recognition (OCR) system
specifically designed for the accurate recognition and digitization of Greek
polytonic texts. By leveraging the combined strengths of convolutional layers
for feature extraction and recurrent layers for sequence learning, our system
addresses the unique challenges posed by Greek polytonic scripts. This approach
aims to overcome the limitations of traditional OCR methods, offering
significant improvements in accuracy and efficiency. We release the underlying
model as an open-source library and make our OCR platform available for
academic use.
☆ Evaluation of Traffic Signals for Daily Traffic Pattern
Turning movement count (TMC) data is crucial for traffic signal design,
intersection geometry planning, traffic flow, and congestion analysis. This
work proposes three methods, called dynamic, static, and hybrid configurations,
for TMC-based traffic signals. A vision-based tracking system is developed to
estimate the TMC of six intersections in Las Vegas using traffic cameras. The
intersection design, route (e.g. vehicle movement directions), and signal
configuration files with compatible formats are synthesized and imported into
Simulation of Urban MObility for signal evaluation with realistic data. The
initial experimental results based on estimated waiting times indicate that the
cycle times of 90 and 120 seconds work best for all intersections. In addition,
four intersections show better performance for dynamic signal timing
configuration, and the other two with lower performance have a lower ratio of
total vehicle count to total lanes of the intersection leg. Since daily traffic
flow often exhibits a bimodal pattern, we propose a hybrid signal method that
switches between dynamic and static methods, adapting to peak and off-peak
traffic conditions for improved flow management. Accordingly, a built-in traffic
generator module creates vehicle routes for 4 hours, including peak hours, and
a signal design module produces signal schedule cycles according to static,
dynamic, and hybrid methods. Vehicle count distributions are weighted
differently for each zone (i.e., West, North, East, South) to generate diverse
traffic patterns. The extended experimental results for 6 intersections with 4
hours of simulation time imply that zone-based traffic pattern distributions
affect signal design selection. Although the static method works great for
evenly zone-based traffic distribution, the hybrid method works well for highly
weighted traffic at intersection pairs of the West-East and North-South zones.
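A toy sketch of the hybrid idea of switching between dynamic and static timing around the bimodal daily peaks; the peak windows, cycle lengths, and proportional green split are placeholder values, not the paper's configuration:

```python
# Sketch: pick a signal plan per hour, switching between dynamic and static timing.
def pick_signal_plan(hour: int, hourly_volume: dict) -> dict:
    peak_hours = {7, 8, 9, 16, 17, 18}          # assumed AM/PM peaks
    if hour in peak_hours:
        # Dynamic plan: split green time in proportion to observed TMC volumes.
        total = sum(hourly_volume.values()) or 1
        cycle = 120
        return {leg: round(cycle * v / total) for leg, v in hourly_volume.items()}
    # Static plan: fixed, evenly split 90-second cycle for off-peak traffic.
    cycle = 90
    return {leg: cycle // len(hourly_volume) for leg in hourly_volume}


print(pick_signal_plan(8, {"West": 300, "East": 280, "North": 120, "South": 100}))
```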
☆ Spatial Mental Modeling from Limited Views
Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
Can Vision Language Models (VLMs) imagine the full scene from just a few
views, like humans do? Humans form spatial mental models, internal
representations of unseen space, to reason about layout, perspective, and
motion. Our new MindCube benchmark with 21,154 questions across 3,268 images
exposes this critical gap, where existing VLMs exhibit near-random performance.
Using MindCube, we systematically evaluate how well VLMs build robust spatial
mental models through representing positions (cognitive mapping), orientations
(perspective-taking), and dynamics (mental simulation for "what-if" movements).
We then explore three approaches to help VLMs approximate spatial mental
models, including unseen intermediate views, natural language reasoning chains,
and cognitive maps. The most significant improvement comes from a synergistic
approach, "map-then-reason", that jointly trains the model to first generate a
cognitive map and then reason upon it. By training models to reason over these
internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding
reinforcement learning pushed performance even further to 70.7% (+32.9%). Our
key insight is that such scaffolding of spatial mental models, actively
constructing and utilizing internal structured spatial representations with
flexible reasoning processes, significantly improves understanding of
unobservable space.
comment: Preprint version
☆ Rethinking Oversaturation in Classifier-Free Guidance via Low Frequency
Classifier-free guidance (CFG) succeeds in conditional diffusion models by using
a guidance scale to balance the influence of the conditional and unconditional
terms; a high guidance scale is used to strengthen the effect of the conditional
term. However, a high guidance scale often results in
oversaturation and unrealistic artifacts. In this paper, we introduce a new
perspective based on low-frequency signals, identifying the accumulation of
redundant information in these signals as the key factor behind oversaturation
and unrealistic artifacts. Building on this insight, we propose low-frequency
improved classifier-free guidance (LF-CFG) to mitigate these issues.
Specifically, we introduce an adaptive threshold-based measurement to pinpoint
the locations of redundant information. We determine a reasonable threshold by
analyzing the change rate of low-frequency information between prior and
current steps. We then apply a down-weight strategy to reduce the impact of
redundant information in the low-frequency signals. Experimental results
demonstrate that LF-CFG effectively alleviates oversaturation and unrealistic
artifacts across various diffusion models, including Stable Diffusion-XL,
Stable Diffusion 2.1, 3.0, 3.5, and SiT-XL.
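A rough sketch of classifier-free guidance with a low-frequency down-weight in the spirit of LF-CFG; the FFT low-pass, the threshold on the step-to-step change rate, and the 0.5 down-weight factor are all assumptions:

```python
# Sketch: CFG step with down-weighted redundant low-frequency guidance.
import torch


def lowpass(x, keep=0.25):
    """Keep only the lowest `keep` fraction of spatial frequencies of x (..., H, W)."""
    f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    mask = torch.zeros_like(f.real)
    ch, cw = h // 2, w // 2
    rh, rw = max(1, int(h * keep / 2)), max(1, int(w * keep / 2))
    mask[..., ch - rh:ch + rh, cw - rw:cw + rw] = 1.0
    return torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real


def lf_cfg(eps_uncond, eps_cond, prev_lf, scale=7.5, change_thresh=0.05, down=0.5):
    """One guided step; prev_lf is the low-frequency guidance from the prior step."""
    guidance = eps_cond - eps_uncond
    g_lf = lowpass(guidance)
    g_hf = guidance - g_lf

    # Change rate of the low-frequency guidance between prior and current steps;
    # where it barely changes, treat the content as redundant and down-weight it.
    change = (g_lf - prev_lf).abs() / (prev_lf.abs() + 1e-6)
    weight = torch.where(change < change_thresh,
                         torch.full_like(g_lf, down), torch.ones_like(g_lf))

    guided = eps_uncond + scale * (g_hf + weight * g_lf)
    return guided, g_lf
```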
☆ A Comprehensive Dataset for Underground Miner Detection in Diverse Scenario
Underground mining operations face significant safety challenges that make
emergency response capabilities crucial. While robots have shown promise in
assisting with search and rescue operations, their effectiveness depends on
reliable miner detection capabilities. Deep learning algorithms offer potential
solutions for automated miner detection, but require comprehensive training
datasets, which are currently lacking for underground mining environments. This
paper presents a novel thermal imaging dataset specifically designed to enable
the development and validation of miner detection systems for potential
emergency applications. We systematically captured thermal imagery of various
mining activities and scenarios to create a robust foundation for detection
algorithms. To establish baseline performance metrics, we evaluated several
state-of-the-art object detection algorithms including YOLOv8, YOLOv10, YOLO11,
and RT-DETR on our dataset. While not exhaustive of all possible emergency
situations, this dataset serves as a crucial first step toward developing
reliable thermal-based miner detection systems that could eventually be
deployed in real emergency scenarios. This work demonstrates the feasibility of
using thermal imaging for miner detection and establishes a foundation for
future research in this critical safety application.
☆ ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
While end-to-end video-to-audio generation has greatly improved, producing
high-fidelity audio that authentically captures the nuances of visual content
remains challenging. As with the work of professionals in the creative
industries, such generation requires sophisticated reasoning about visual
dynamics, acoustic environments, and temporal relationships. We present
ThinkSound, a novel framework that leverages Chain-of-Thought (CoT)
reasoning to enable stepwise, interactive audio generation and editing for
videos. Our approach decomposes the process into three complementary stages:
foundational foley generation that creates semantically coherent soundscapes,
interactive object-centric refinement through precise user interactions, and
targeted editing guided by natural language instructions. At each stage, a
multimodal large language model generates contextually aligned CoT reasoning
that guides a unified audio foundation model. Furthermore, we introduce
AudioCoT, a comprehensive dataset with structured reasoning
annotations that establishes connections between visual content, textual
descriptions, and sound synthesis. Experiments demonstrate that ThinkSound
achieves state-of-the-art performance in video-to-audio generation across both
audio metrics and CoT metrics, and excels on the out-of-distribution Movie Gen Audio
benchmark. The demo page is available at https://ThinkSound-Demo.github.io.
☆ Controllable 3D Placement of Objects with Scene-Aware Diffusion Models
Image editing approaches have become more powerful and flexible with the
advent of powerful text-conditioned generative models. However, placing objects
in an environment with a precise location and orientation still remains a
challenge, as this typically requires carefully crafted inpainting masks or
prompts. In this work, we show that a carefully designed visual map, combined
with coarse object masks, is sufficient for high quality object placement. We
design a conditioning signal that resolves ambiguities, while being flexible
enough to allow for changing of shapes or object orientations. By building on
an inpainting model, we leave the background intact by design, in contrast to
methods that model objects and background jointly. We demonstrate the
effectiveness of our method in the automotive setting, where we compare
different conditioning signals in novel object placement tasks. These tasks are
designed to measure edit quality not only in terms of appearance, but also in
terms of pose and location accuracy, including cases that require non-trivial
shape changes. Lastly, we show that fine location control can be combined with
appearance control to place existing objects in precise locations in a scene.
☆ Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation
Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Rutger A. Fick, Thomas Conrad, Jonas Ammeling, Nils Porsche, Robert Klopfleisch, Christopher Kaltenecker, Katharina Breininger, Marc Aubreville, Christof A. Bertram
Atypical mitoses mark a deviation in the cell division process that can be an
independent prognostically relevant marker for tumor malignancy. However, their
identification remains challenging due to low prevalence, at times subtle
morphological differences from normal mitoses, low inter-rater agreement among
pathologists, and class imbalance in datasets. Building on the Atypical Mitosis
dataset for Breast Cancer (AMi-Br), this study presents a comprehensive
benchmark comparing deep learning approaches for automated atypical mitotic
figure (AMF) classification, including baseline models, foundation models with
linear probing, and foundation models fine-tuned with low-rank adaptation
(LoRA). For rigorous evaluation, we further introduce two new hold-out AMF
datasets - AtNorM-Br, a dataset of mitoses from the TCGA breast cancer
cohort, and AtNorM-MD, a multi-domain dataset of mitoses from the MIDOG++
training set. We found average balanced accuracy values of up to 0.8135,
0.7696, and 0.7705 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and
AtNorM-MD datasets, respectively, with the results being particularly good for
LoRA-based adaptation of the Virchow-line of foundation models. Our work shows
that atypical mitosis classification, while being a challenging problem, can be
effectively addressed through the use of recent advances in transfer learning
and model fine-tuning techniques. We make available all code and data used in
this paper in this github repository:
https://github.com/DeepMicroscopy/AMi-Br_Benchmark.
☆ HyperSORT: Self-Organising Robust Training with hyper-networks MICCAI 2025
Medical imaging datasets often contain heterogeneous biases ranging from
erroneous labels to inconsistent labeling styles. Such biases can negatively
impact deep segmentation networks performance. Yet, the identification and
characterization of such biases is a particularly tedious and challenging task.
In this paper, we introduce HyperSORT, a framework using a hyper-network
predicting UNets' parameters from latent vectors representing both the image
and annotation variability. The hyper-network parameters and the latent vector
collection corresponding to each data sample from the training set are jointly
learned. Hence, instead of optimizing a single neural network to fit a dataset,
HyperSORT learns a complex distribution of UNet parameters where low density
areas can capture noise-specific patterns while larger modes robustly segment
organs in differentiated but meaningful manners. We validate our method on two
3D abdominal CT public datasets: first a synthetically perturbed version of the
AMOS dataset, and TotalSegmentator, a large scale dataset containing real
unknown biases and errors. Our experiments show that HyperSORT creates a
structured mapping of the dataset allowing the identification of relevant
systematic biases and erroneous samples. Latent space clusters yield UNet
parameters performing the segmentation task in accordance with the underlying
learned systematic bias. The code and our analysis of the TotalSegmentator
dataset are made available: https://github.com/ImFusionGmbH/HyperSORT
comment: Accepted at MICCAI 2025
☆ EndoFlow-SLAM: Real-Time Endoscopic SLAM with Flow-Constrained Gaussian Splatting
Efficient three-dimensional reconstruction and real-time visualization are
critical in surgical scenarios such as endoscopy. In recent years, 3D Gaussian
Splatting (3DGS) has demonstrated remarkable performance in efficient 3D
reconstruction and rendering. Most 3DGS-based Simultaneous Localization and
Mapping (SLAM) methods only rely on the appearance constraints for optimizing
both 3DGS and camera poses. However, in endoscopic scenarios, challenges such as
photometric inconsistencies caused by non-Lambertian surfaces and dynamic motion
from breathing affect the performance of SLAM systems. To
address these issues, we additionally introduce optical flow loss as a
geometric constraint, which effectively constrains both the 3D structure of the
scene and the camera motion. Furthermore, we propose a depth regularisation
strategy to mitigate the problem of photometric inconsistencies and ensure the
validity of 3DGS depth rendering in endoscopic scenes. In addition, to improve
scene representation in the SLAM system, we improve the 3DGS refinement
strategy by focusing on the viewpoints of keyframes with suboptimal rendering
quality, achieving better rendering results. Extensive
experiments on the C3VD static dataset and the StereoMIS dynamic dataset
demonstrate that our method outperforms existing state-of-the-art methods in
novel view synthesis and pose estimation, exhibiting high performance in both
static and dynamic surgical scenes. The source code will be publicly available
upon paper acceptance.
☆ XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
Achieving fine-grained control over subject identity and semantic attributes
(pose, style, lighting) in text-to-image generation, particularly for multiple
subjects, often undermines the editability and coherence of Diffusion
Transformers (DiTs). Many approaches introduce artifacts or suffer from
attribute entanglement. To overcome these challenges, we propose a novel
multi-subject controlled generation model XVerse. By transforming reference
images into offsets for token-specific text-stream modulation, XVerse allows
for precise and independent control over specific subjects without disrupting
image latents or features. Consequently, XVerse offers high-fidelity, editable
multi-subject image synthesis with robust control over individual subject
characteristics and semantic attributes. This advancement significantly
improves personalized and complex scene generation capabilities.
comment: Project Page: https://bytedance.github.io/XVerse Github Link:
https://github.com/bytedance/XVerse
☆ Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction ICCV 2025
This paper presents an end-to-end framework for reconstructing 3D parametric
curves directly from multi-view edge maps. Contrasting with existing two-stage
methods that follow a sequential ``edge point cloud reconstruction and
parametric curve fitting'' pipeline, our one-stage approach optimizes 3D
parametric curves directly from 2D edge maps, eliminating error accumulation
caused by the inherent optimization gap between disconnected stages. However,
parametric curves inherently lack suitability for rendering-based multi-view
optimization, necessitating a complementary representation that preserves their
geometric properties while enabling differentiable rendering. We propose a
novel bi-directional coupling mechanism between parametric curves and
edge-oriented Gaussian components. This tight correspondence formulates a
curve-aware Gaussian representation, \textbf{CurveGaussian}, that enables
differentiable rendering of 3D curves, allowing direct optimization guided by
multi-view evidence. Furthermore, we introduce a dynamically adaptive topology
optimization framework during training to refine curve structures through
linearization, merging, splitting, and pruning operations. Comprehensive
evaluations on the ABC dataset and real-world benchmarks demonstrate our
one-stage method's superiority over two-stage alternatives, particularly in
producing cleaner and more robust reconstructions. Additionally, by directly
optimizing parametric curves, our method significantly reduces the parameter
count during training, achieving both higher efficiency and superior
performance compared to existing approaches.
comment: Code: https://github.com/zhirui-gao/Curve-Gaussian Accepted by ICCV
2025
☆ FastRef:Fast Prototype Refinement for Few-Shot Industrial Anomaly Detection
Few-shot industrial anomaly detection (FS-IAD) presents a critical challenge
for practical automated inspection systems operating in data-scarce
environments. While existing approaches predominantly focus on deriving
prototypes from limited normal samples, they typically neglect to
systematically incorporate query image statistics to enhance prototype
representativeness. To address this issue, we propose FastRef, a novel and
efficient prototype refinement framework for FS-IAD. Our method operates
through an iterative two-stage process: (1) characteristic transfer from query
features to prototypes via an optimizable transformation matrix, and (2)
anomaly suppression through prototype alignment. The characteristic transfer is
achieved through linear reconstruction of query features from prototypes, while
the anomaly suppression addresses a key observation in FS-IAD that unlike
conventional IAD with abundant normal prototypes, the limited-sample setting
makes anomaly reconstruction more probable. Therefore, we employ optimal
transport (OT) for non-Gaussian sampled features to measure and minimize the
gap between prototypes and their refined counterparts for anomaly suppression.
For comprehensive evaluation, we integrate FastRef with four competitive
prototype-based FS-IAD methods: PatchCore, FastRecon, WinCLIP, and AnomalyDINO.
Extensive experiments across four benchmark datasets of MVTec, ViSA, MPDD and
RealIAD demonstrate both the effectiveness and computational efficiency of our
approach under 1/2/4-shots.
comment: 18 pages, 7 figures, 6 tables
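A compact Sinkhorn sketch of the optimal-transport gap that the anomaly-suppression step could minimize between prototypes and their refined counterparts; the entropic regularization, uniform marginals, and squared Euclidean cost are standard choices, not details from the paper:

```python
# Sketch: entropic OT cost between prototype sets via Sinkhorn iterations.
import torch


def sinkhorn_gap(P, Q, eps=0.05, iters=100):
    """P: (n, d) prototypes, Q: (m, d) refined prototypes; returns the OT cost."""
    cost = torch.cdist(P, Q) ** 2                                  # squared Euclidean cost
    K = torch.exp(-cost / eps)
    a = torch.full((P.shape[0],), 1.0 / P.shape[0], device=P.device)  # uniform marginal
    b = torch.full((Q.shape[0],), 1.0 / Q.shape[0], device=Q.device)  # uniform marginal
    u = torch.ones_like(a)
    for _ in range(iters):                                         # Sinkhorn fixed point
        v = b / (K.t() @ u)
        u = a / (K @ v)
    transport = u.unsqueeze(1) * K * v.unsqueeze(0)                # diag(u) K diag(v)
    return (transport * cost).sum()
```

Minimizing this gap with respect to the transformation that produces the refined prototypes would implement the suppression step sketched above.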
☆ GenFlow: Interactive Modular System for Image Generation
Generative art unlocks boundless creative possibilities, yet its full
potential remains untapped due to the technical expertise required for advanced
architectural concepts and computational workflows. To bridge this gap, we
present GenFlow, a novel modular framework that empowers users of all skill
levels to generate images with precision and ease. Featuring a node-based
editor for seamless customization and an intelligent assistant powered by
natural language processing, GenFlow transforms the complexity of workflow
creation into an intuitive and accessible experience. By automating deployment
processes and minimizing technical barriers, our framework makes cutting-edge
generative art tools available to everyone. A user study demonstrated GenFlow's
ability to optimize workflows, reduce task completion times, and enhance user
understanding through its intuitive interface and adaptive features. These
results position GenFlow as a groundbreaking solution that redefines
accessibility and efficiency in the realm of generative art.
☆ CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection ICCV 2025
Zhixin Cheng, Jiacheng Deng, Xinjun Li, Xiaotian Yin, Bohao Liao, Baoqun Yin, Wenfei Yang, Tianzhu Zhang
Detection-free methods typically follow a coarse-to-fine pipeline, extracting
image and point cloud features for patch-level matching and refining dense
pixel-to-point correspondences. However, differences in feature channel
attention between images and point clouds may lead to degraded matching
results, ultimately impairing registration accuracy. Furthermore, similar
structures in the scene could lead to redundant correspondences in cross-modal
matching. To address these issues, we propose Channel Adaptive Adjustment
Module (CAA) and Global Optimal Selection Module (GOS). CAA enhances
intra-modal features and suppresses cross-modal sensitivity, while GOS replaces
local selection with global optimization. Experiments on RGB-D Scenes V2 and
7-Scenes demonstrate the superiority of our method, achieving state-of-the-art
performance in image-to-point cloud registration.
comment: ICCV 2025 accepted
☆ ToosiCubix: Monocular 3D Cuboid Labeling via Vehicle Part Annotations
Many existing methods for 3D cuboid annotation of vehicles rely on expensive
and carefully calibrated camera-LiDAR or stereo setups, limiting their
accessibility for large-scale data collection. We introduce ToosiCubix, a
simple yet powerful approach for annotating ground-truth cuboids using only
monocular images and intrinsic camera parameters. Our method requires only
about 10 user clicks per vehicle, making it highly practical for adding 3D
annotations to existing datasets originally collected without specialized
equipment. By annotating specific features (e.g., wheels, car badge,
symmetries) across different vehicle parts, we accurately estimate each
vehicle's position, orientation, and dimensions up to a scale ambiguity (8
DoF). The geometric constraints are formulated as an optimization problem,
which we solve using a coordinate descent strategy, alternating between
Perspective-n-Point (PnP) and least-squares subproblems. To handle common
ambiguities such as scale and unobserved dimensions, we incorporate
probabilistic size priors, enabling 9 DoF cuboid placements. We validate our
annotations against the KITTI and Cityscapes3D datasets, demonstrating that our
method offers a cost-effective and scalable solution for high-quality 3D cuboid
annotation.
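A schematic sketch of the alternating PnP / least-squares coordinate descent described above, written against OpenCV and SciPy; annotating the eight cuboid corners directly (rather than the paper's part features such as wheels and badges) and the fixed number of rounds are simplifying assumptions:

```python
# Sketch: alternate pose-from-PnP and dimensions-from-least-squares for a cuboid.
import cv2
import numpy as np
from scipy.optimize import least_squares


def corners_3d(dims):
    """Eight cuboid corners in the object frame for dimensions (length, width, height)."""
    l, w, h = dims
    x, y, z = l / 2.0, w / 2.0, h / 2.0
    return np.array([[sx * x, sy * y, sz * z]
                     for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                    dtype=np.float64)


def fit_cuboid(clicks_2d, K, dims_init=(4.5, 1.8, 1.5), n_rounds=5):
    """clicks_2d: (8, 2) annotated corner projections, K: 3x3 camera intrinsics."""
    dims = np.asarray(dims_init, dtype=np.float64)
    rvec, tvec = None, None
    for _ in range(n_rounds):
        # Step 1: pose from 2D-3D correspondences, holding the dimensions fixed.
        _, rvec, tvec = cv2.solvePnP(corners_3d(dims), clicks_2d.astype(np.float64),
                                     K, None, flags=cv2.SOLVEPNP_ITERATIVE)

        # Step 2: dimensions that minimize reprojection error, holding the pose fixed.
        def reproj_residual(d):
            proj, _ = cv2.projectPoints(corners_3d(d), rvec, tvec, K, None)
            return (proj.reshape(-1, 2) - clicks_2d).ravel()

        dims = least_squares(reproj_residual, dims).x
    return rvec, tvec, dims
```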
☆ CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations
Julian Lorenz, Mrunmai Phatak, Robin Schön, Katja Ludwig, Nico Hörmann, Annemarie Friedrich, Rainer Lienhart
2D scene graphs provide a structural and explainable framework for scene
understanding. However, current work still struggles with the lack of accurate
scene graph data. To overcome this data bottleneck, we present CoPa-SG, a
synthetic scene graph dataset with highly precise ground truth and exhaustive
relation annotations between all objects. Moreover, we introduce parametric and
proto-relations, two new fundamental concepts for scene graphs. The former
provides a much more fine-grained representation than its traditional
counterpart by enriching relations with additional parameters such as angles or
distances. The latter encodes hypothetical relations in a scene graph and
describes how relations would form if new objects are placed in the scene.
Using CoPa-SG, we compare the performance of various scene graph generation
models. We demonstrate how our new relation types can be integrated in
downstream applications to enhance planning and reasoning capabilities.
☆ ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu
Cinematography, the fundamental visual language of film, is essential for
conveying narrative, emotion, and aesthetic quality. While recent
Vision-Language Models (VLMs) demonstrate strong general visual understanding,
their proficiency in comprehending the nuanced cinematic grammar embedded
within individual shots remains largely unexplored and lacks robust evaluation.
This critical gap limits both fine-grained visual comprehension and the
precision of AI-assisted video generation. To address this, we introduce
ShotBench, a comprehensive benchmark specifically designed for
cinematic language understanding. It features over 3.5k expert-annotated QA
pairs from images and video clips, meticulously curated from over 200 acclaimed
(predominantly Oscar-nominated) films and spanning eight key cinematography
dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their
substantial limitations: even the top-performing model achieves less than 60%
average accuracy, particularly struggling with fine-grained visual cues and
complex spatial reasoning. To catalyze advancement in this domain, we construct
ShotQA, a large-scale multimodal dataset comprising approximately 70k
cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through
supervised fine-tuning and Group Relative Policy Optimization. ShotVL
significantly outperforms all existing open-source and proprietary models on
ShotBench, establishing new state-of-the-art performance. We
open-source our models, data, and code to foster rapid progress in this crucial
area of AI-driven cinematic understanding and generation.
☆ Generalizable Neural Electromagnetic Inverse Scattering
Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in
applications such as medical imaging, where the goal is to reconstruct the
relative permittivity from the scattered electromagnetic field. This inverse
process is inherently ill-posed and highly nonlinear, making it particularly
challenging. A recent machine learning-based approach, Img-Interiors, shows
promising results by leveraging continuous implicit functions. However, it
requires case-specific optimization, lacks generalization to unseen data, and
fails under sparse transmitter setups (e.g., with only one transmitter). To
address these limitations, we revisit EISP from a physics-informed perspective,
reformulating it as a two-stage inverse transmission-scattering process. This
formulation reveals the induced current as a generalizable intermediate
representation, effectively decoupling the nonlinear scattering process from
the ill-posed inverse problem. Built on this insight, we propose the first
generalizable physics-driven framework for EISP, comprising a current estimator
and a permittivity solver, working in an end-to-end manner. The current
estimator explicitly learns the induced current as a physical bridge between
the incident and scattered fields, while the permittivity solver computes the
relative permittivity directly from the estimated induced current. This design
enables data-driven training and generalizable feed-forward prediction of
relative permittivity on unseen data while maintaining strong robustness to
transmitter sparsity. Extensive experiments show that our method outperforms
state-of-the-art approaches in reconstruction accuracy, generalization, and
robustness. This work offers a fundamentally new perspective on electromagnetic
inverse scattering and represents a major step toward cost-effective practical
solutions for electromagnetic imaging.
☆ PanSt3R: Multi-view Consistent Panoptic Segmentation ICCV 2025
Lojze Zust, Yohann Cabon, Juliette Marrie, Leonid Antsfeld, Boris Chidlovskii, Jerome Revaud, Gabriela Csurka
Panoptic segmentation of 3D scenes, involving the segmentation and
classification of object instances in a dense 3D reconstruction of a scene, is
a challenging problem, especially when relying solely on unposed 2D images.
Existing approaches typically leverage off-the-shelf models to extract
per-frame 2D panoptic segmentations, before optimizing an implicit geometric
representation (often based on NeRF) to integrate and fuse the 2D predictions.
We argue that relying on 2D panoptic segmentation for a problem inherently 3D
and multi-view is likely suboptimal as it fails to leverage the full potential
of spatial relationships across views. In addition to requiring camera
parameters, these approaches also necessitate computationally expensive
test-time optimization for each scene. Instead, in this work, we propose a
unified and integrated approach PanSt3R, which eliminates the need for
test-time optimization by jointly predicting 3D geometry and multi-view
panoptic segmentation in a single forward pass. Our approach builds upon recent
advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view
version of DUSt3R, and enhances it with semantic awareness and multi-view
panoptic segmentation capabilities. We additionally revisit the standard
post-processing mask merging procedure and introduce a more principled approach
for multi-view segmentation. We also introduce a simple method for generating
novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS.
Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable,
and achieves state-of-the-art performance on several benchmarks, while being
orders of magnitude faster than existing methods.
comment: Accepted at ICCV 2025
☆ Automatic Reviewers Assignment to a Research Paper Based on Allied References and Publications Weight
Everyday, a vast stream of research documents is submitted to conferences,
anthologies, journals, newsletters, annual reports, daily papers, and various
periodicals. Many such publications use independent external specialists to
review submissions. This process is called peer review, and the reviewers are
called referees. However, it is not always possible to pick the best referee
for reviewing. Moreover, new research fields are emerging in every sector, and
the number of research papers is increasing dramatically. To review all these
papers, every journal assigns a small team of referees who may not be experts
in all areas. For example, a research paper in communication technology should
be reviewed by an expert from the same field. Thus, efficiently selecting the
best reviewer or referee for a research paper is a big challenge.
In this research, we propose and implement a program that uses a new strategy
to automatically select the best reviewers for a research paper. Every research
paper contains references at the end, usually from the same area. First, we
collect the references and count authors who have at least one paper in the
references. Then, we automatically browse the web to extract research topic
keywords. Next, we search for top researchers in the specific topic and count
their h-index, i10-index, and citations for the first n authors. Afterward, we
rank the top n authors based on a score and automatically browse their
homepages to retrieve email addresses. We also check their co-authors and
colleagues online and discard them from the list. The remaining top n authors,
generally professors, are likely the best referees for reviewing the research
paper.
comment: IEEE Conference Proceedings (5 Pages)
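A toy sketch of the ranking step only: scoring candidate referees gathered from the references by h-index, i10-index, and citations, then discarding co-authors; the metric weights and the input record format are assumptions, and the web-scraping and email-retrieval stages described above are omitted:

```python
# Sketch: rank candidate referees by bibliometric score, excluding co-authors.
def rank_referees(candidates, paper_authors, top_n=5,
                  w_h=1.0, w_i10=0.5, w_cit=0.001):
    """candidates: list of dicts with keys
    name, h_index, i10_index, citations, coauthors (set of names)."""
    scored = []
    for c in candidates:
        # Discard anyone who has co-authored with the paper's authors.
        if c["coauthors"] & set(paper_authors):
            continue
        score = (w_h * c["h_index"] + w_i10 * c["i10_index"]
                 + w_cit * c["citations"])
        scored.append((score, c["name"]))
    return [name for _, name in sorted(scored, reverse=True)[:top_n]]
```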
☆ Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models
Haoyang Wu, Tsun-Hsuan Wang, Mathias Lechner, Ramin Hasani, Jennifer A. Eckhoff, Paul Pak, Ozanan R. Meireles, Guy Rosman, Yutong Ban, Daniela Rus
Surgical workflow analysis is essential in robot-assisted surgeries, yet the
long duration of such procedures poses significant challenges for comprehensive
video analysis. Recent approaches have predominantly relied on transformer
models; however, their quadratic attention mechanism restricts efficient
processing of lengthy surgical videos. In this paper, we propose a novel
hierarchical input-dependent state space model that leverages the linear
scaling property of state space models to enable decision making on full-length
videos while capturing both local and global dynamics. Our framework
incorporates a temporally consistent visual feature extractor, which appends a
state space model head to a visual feature extractor to propagate temporal
information. The proposed model consists of two key modules: a
local-aggregation state space model block that effectively captures intricate
local dynamics, and a global-relation state space model block that models
temporal dependencies across the entire video. The model is trained using a
hybrid discrete-continuous supervision strategy, where both discrete phase
labels and continuous phase progress signals are propagated through the
network. Experiments have shown that our method outperforms the current
state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on
MICCAI2016, and +12.9% on Heichole datasets). Code will be publicly available
after paper acceptance.
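For readers unfamiliar with why state space models scale linearly with video length, the sketch below runs a discretized linear state space recurrence over a toy feature sequence; the matrices, dimensions, and stability constant are placeholders and do not reflect the paper's hierarchical input-dependent design.

    import numpy as np

    def ssm_scan(u, A, B, C):
        """Sequential scan of a discrete linear state space model:
            x_t = A @ x_{t-1} + B @ u_t,   y_t = C @ x_t
        Cost grows linearly with sequence length T, unlike O(T^2) attention.
        """
        T, _ = u.shape
        d_state = A.shape[0]
        x = np.zeros(d_state)
        ys = []
        for t in range(T):
            x = A @ x + B @ u[t]          # recurrent state update
            ys.append(C @ x)              # per-step readout
        return np.stack(ys)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        T, d_in, d_state, d_out = 1000, 16, 32, 8   # toy sizes, not the paper's
        u = rng.standard_normal((T, d_in))
        A = 0.9 * np.eye(d_state)                   # stable toy dynamics
        B = rng.standard_normal((d_state, d_in)) * 0.1
        C = rng.standard_normal((d_out, d_state)) * 0.1
        y = ssm_scan(u, A, B, C)
        print(y.shape)   # (1000, 8)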
☆ Multimodal LLMs for Visualization Reconstruction and Understanding
Visualizations are crucial for data communication, yet understanding them
requires comprehension of both visual elements and their underlying data
relationships. Current multimodal large models, while effective in natural
image understanding, struggle with visualization due to their inability to
decode the data-to-visual mapping rules and extract structured information. To
address these challenges, we present a novel dataset and train multimodal
visualization LLMs specifically designed for visualization understanding. Our approach
combines chart images with their corresponding vectorized representations,
encoding schemes, and data features. The proposed vector format enables compact
and accurate reconstruction of visualization content. Experimental results
demonstrate significant improvements in both data extraction accuracy and chart
reconstruction quality.
☆ LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning
Current vision-language models (VLMs) are well-adapted for general visual
understanding tasks. However, they perform inadequately when handling complex
visual tasks related to human poses and actions due to the lack of specialized
vision-language instruction-following data. We introduce a method for
generating such data by integrating human keypoints with traditional visual
features such as captions and bounding boxes, enabling more precise
understanding of human-centric scenes. Our approach constructs a dataset
comprising 200,328 samples tailored to fine-tune models for human-centric
tasks, focusing on three areas: conversation, detailed description, and complex
reasoning. We establish an Extended Human Pose and Action Understanding
Benchmark (E-HPAUB) to assess model performance on human pose and action
understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and
evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant
improvements. Experimental results show an overall improvement of 33.2%
compared to the original LLaVA-1.5-7B model. These findings highlight the
effectiveness of keypoint-integrated data in enhancing multimodal models for
human-centric visual understanding. Code is available at
https://github.com/Ody-trek/LLaVA-Pose.
comment: arXiv admin note: substantial text overlap with arXiv:2409.09306
☆ DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images
Visual grounding in text-rich document images is a critical yet underexplored
challenge for document intelligence and visual question answering (VQA)
systems. We present \drishtikon, a multi-granular visual grounding framework
designed to enhance interpretability and trust in VQA for complex, multilingual
documents. Our approach integrates robust multi-lingual OCR, large language
models, and a novel region matching algorithm to accurately localize answer
spans at block, line, word, and point levels. We curate a new benchmark from
the CircularsVQA test set, providing fine-grained, human-verified annotations
across multiple granularities. Extensive experiments demonstrate that our
method achieves state-of-the-art grounding accuracy, with line-level
granularity offering the best trade-off between precision and recall. Ablation
studies further highlight the benefits of multi-block and multi-line reasoning.
Comparative evaluations with leading vision-language models reveal the
limitations of current VLMs in precise localization, underscoring the
effectiveness of our structured, alignment-based approach. Our findings pave
the way for more robust and interpretable document understanding systems in
real-world, text-centric scenarios. Code and dataset have been made available at
https://github.com/kasuba-badri-vishal/DhrishtiKon.
comment: Work in progress
☆ Continual Self-Supervised Learning with Masked Autoencoders in Remote Sensing
The development of continual learning (CL) methods, which aim to learn new
tasks in a sequential manner from the training data acquired continuously, has
gained great attention in remote sensing (RS). The existing CL methods in RS,
while learning new tasks, enhance robustness against catastrophic forgetting.
This is achieved by using a large number of labeled training samples, which is
costly and not always feasible to gather in RS. To address this problem, we
propose a novel continual self-supervised learning method in the context of
masked autoencoders (denoted as CoSMAE). The proposed CoSMAE consists of two
components: i) data mixup; and ii) model mixup knowledge distillation. Data
mixup is associated with retaining information on previous data distributions
by interpolating images from the current task with those from the previous
tasks. Model mixup knowledge distillation is associated with distilling
knowledge from past models and the current model simultaneously by
interpolating their model weights to form a teacher for the knowledge
distillation. The two components complement each other to regularize the MAE at
the data and model levels to facilitate better generalization across tasks and
reduce the risk of catastrophic forgetting. Experimental results show that
CoSMAE achieves significant improvements of up to 4.94% over state-of-the-art
CL methods applied to MAE. Our code is publicly available at:
https://git.tu-berlin.de/rsim/CoSMAE.
comment: Accepted to IEEE Geoscience and Remote Sensing Letters. Our code is
available at https://git.tu-berlin.de/rsim/CoSMAE
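A minimal sketch of the two mixup operations described in the abstract, with NumPy arrays standing in for images and model state dictionaries; the Beta prior on the mixing coefficient and the interpolation weight are assumptions.

    import numpy as np

    def data_mixup(current_img, previous_img, lam=None):
        """Interpolate a current-task image with one from a previous task,
        retaining information about earlier data distributions."""
        if lam is None:
            lam = np.random.beta(2.0, 2.0)   # mixing coefficient (assumed prior)
        return lam * current_img + (1.0 - lam) * previous_img

    def model_mixup(past_weights, current_weights, alpha=0.5):
        """Interpolate past and current model weights to form a teacher
        for knowledge distillation (per-parameter convex combination)."""
        return {k: alpha * past_weights[k] + (1.0 - alpha) * current_weights[k]
                for k in current_weights}

    if __name__ == "__main__":
        img_now = np.random.rand(3, 64, 64)
        img_old = np.random.rand(3, 64, 64)
        mixed = data_mixup(img_now, img_old)

        w_past = {"enc.w": np.ones((4, 4)), "dec.w": np.zeros((4, 4))}
        w_curr = {"enc.w": np.zeros((4, 4)), "dec.w": np.ones((4, 4))}
        teacher = model_mixup(w_past, w_curr, alpha=0.5)
        print(mixed.shape, teacher["enc.w"][0, 0])   # (3, 64, 64) 0.5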
☆ HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation MICCAI 2025
Surgical Video Synthesis has emerged as a promising research direction
following the success of diffusion models in general-domain video generation.
Although existing approaches achieve high-quality video generation, most are
unconditional and fail to maintain consistency with surgical actions and
phases, lacking the surgical understanding and fine-grained guidance necessary
for factual simulation. We address these challenges by proposing HieraSurg, a
hierarchy-aware surgical video generation framework consisting of two
specialized diffusion models. Given a surgical phase and an initial frame,
HieraSurg first predicts future coarse-grained semantic changes through a
segmentation prediction model. The final video is then generated by a
second-stage model that augments these temporal segmentation maps with
fine-grained visual features, leading to effective texture rendering and
integration of semantic information in the video space. Our approach leverages
surgical information at multiple levels of abstraction, including surgical
phase, action triplets, and panoptic segmentation maps. The experimental
results on Cholecystectomy Surgical Video Generation demonstrate that the model
significantly outperforms prior work both quantitatively and qualitatively,
showing strong generalization capabilities and the ability to generate higher
frame-rate videos. The model exhibits particularly fine-grained adherence when
provided with existing segmentation maps, suggesting its potential for
practical surgical applications.
comment: Accepted at MICCAI 2025
☆ HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, Jingren Zhou
With the rapid evolution of multimodal large language models, the capacity to
deeply understand and interpret human intentions has emerged as a critical
capability, which demands detailed and thoughtful reasoning. In recent studies,
Reinforcement Learning (RL) has demonstrated potential in enhancing the
reasoning capabilities of Large Language Models (LLMs). Nonetheless, the
challenges associated with adapting RL to multimodal data and formats remain
largely unaddressed. In this paper, we identify two issues in existing
multimodal reasoning models: insufficient global context understanding and
shortcut problems. Insufficient context understanding can happen when a model
misinterprets multimodal context, resulting in incorrect answers. The shortcut
problem occurs when the model overlooks crucial clues in multimodal inputs,
directly addressing the query without considering the multimodal information.
To tackle these issues, we emphasize the necessity for the model to reason with
a clear understanding of the global context within multimodal inputs. This
global context understanding can effectively prevent the model from overlooking
key multimodal cues and ensure a thorough reasoning process. To ensure the
accurate interpretation of multimodal context information, we implement a
context reward judged by a large language model, alongside format and accuracy
rewards. Additionally, to improve complex reasoning capability, we employ the
LLM to assess the logical reward, determining whether the reasoning process
successfully integrates multimodal information with logical methods. We also
introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating
models in understanding complex human intentions and emotions. Our proposed
method demonstrates advanced performance across multiple omni-modal benchmarks
compared to other open-source omni-modal models.
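A toy sketch of how the reward signals named above (context, format, accuracy, and logical rewards) could be combined into a single scalar for policy optimization; the weights, format check, and judge interfaces are placeholders rather than the paper's actual reward design.

    from typing import Callable

    def combined_reward(response: str,
                        reference: str,
                        context_judge: Callable[[str], float],
                        logic_judge: Callable[[str], float],
                        weights=(0.3, 0.2, 0.3, 0.2)) -> float:
        """Weighted sum of context, format, accuracy, and logical rewards.
        The judges stand in for LLM-based scoring; weights are illustrative."""
        w_ctx, w_fmt, w_acc, w_log = weights
        r_ctx = context_judge(response)                   # LLM-judged context use
        r_fmt = 1.0 if "<answer>" in response else 0.0    # simple format check
        r_acc = 1.0 if reference in response else 0.0     # exact-match accuracy
        r_log = logic_judge(response)                     # LLM-judged reasoning
        return w_ctx * r_ctx + w_fmt * r_fmt + w_acc * r_acc + w_log * r_log

    if __name__ == "__main__":
        dummy_judge = lambda text: 0.8   # placeholder for an LLM judge call
        r = combined_reward("<answer>blue</answer>", "blue",
                            dummy_judge, dummy_judge)
        print(round(r, 3))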
☆ WordCon: Word-level Typography Control in Scene Text Rendering
Achieving precise word-level typography control within generated images
remains a persistent challenge. To address it, we construct a new word-level
controlled scene text dataset and introduce the Text-Image Alignment (TIA)
framework. This framework leverages cross-modal correspondence between text and
local image regions provided by grounding models to enhance the Text-to-Image
(T2I) model training. Furthermore, we propose WordCon, a hybrid
parameter-efficient fine-tuning (PEFT) method. WordCon reparameterizes
selective key parameters, improving both efficiency and portability. This
allows seamless integration into diverse pipelines, including artistic text
rendering, text editing, and image-conditioned text rendering. To further
enhance controllability, the masked loss at the latent level is applied to
guide the model to concentrate on learning the text region in the image, and
the joint-attention loss provides feature-level supervision to promote
disentanglement between different words. Both qualitative and quantitative
results demonstrate the superiority of our method to the state of the art. The
datasets and source code will be available for academic use.
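A minimal NumPy sketch of a masked loss restricted to the text region in latent space, illustrating how supervision can be concentrated on the rendered text area; the latent shapes and normalization are assumptions, and the joint-attention loss is omitted.

    import numpy as np

    def masked_latent_loss(pred_latent, target_latent, text_mask):
        """Mean squared error computed only inside the text-region mask,
        so gradients concentrate on learning the rendered text area.

        pred_latent, target_latent: (C, H, W) latent feature maps
        text_mask: (H, W) binary mask of the text region (1 = text)
        """
        mask = text_mask[None, :, :]                    # broadcast over channels
        diff = (pred_latent - target_latent) ** 2 * mask
        return diff.sum() / (mask.sum() * pred_latent.shape[0] + 1e-8)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        pred = rng.standard_normal((4, 32, 32))
        target = rng.standard_normal((4, 32, 32))
        mask = np.zeros((32, 32))
        mask[8:24, 8:24] = 1.0
        print(round(masked_latent_loss(pred, target, mask), 4))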
☆ FairyGen: Storied Cartoon Video from a Single Child-Drawn Character
We propose FairyGen, an automatic system for generating story-driven cartoon
videos from a single child's drawing, while faithfully preserving its unique
artistic style. Unlike previous storytelling methods that primarily focus on
character consistency and basic motion, FairyGen explicitly disentangles
character modeling from stylized background generation and incorporates
cinematic shot design to support expressive and coherent storytelling. Given a
single character sketch, we first employ an MLLM to generate a structured
storyboard with shot-level descriptions that specify environment settings,
character actions, and camera perspectives. To ensure visual consistency, we
introduce a style propagation adapter that captures the character's visual
style and applies it to the background, faithfully retaining the character's
full visual identity while synthesizing style-consistent scenes. A shot design
module further enhances visual diversity and cinematic quality through frame
cropping and multi-view synthesis based on the storyboard. To animate the
story, we reconstruct a 3D proxy of the character to derive physically
plausible motion sequences, which are then used to fine-tune an MMDiT-based
image-to-video diffusion model. We further propose a two-stage motion
customization adapter: the first stage learns appearance features from
temporally unordered frames, disentangling identity from motion; the second
stage models temporal dynamics using a timestep-shift strategy with frozen
identity weights. Once trained, FairyGen directly renders diverse and coherent
video scenes aligned with the storyboard. Extensive experiments demonstrate
that our system produces animations that are stylistically faithful and
narratively structured, with natural motion, highlighting its potential for
personalized and engaging story animation. The code will be available at
https://github.com/GVCLab/FairyGen
comment: Project Page: https://jayleejia.github.io/FairyGen/ ; Code:
https://github.com/GVCLab/FairyGen
☆ Video Virtual Try-on with Conditional Diffusion Transformer Inpainter
Video virtual try-on aims to naturally fit a garment to a target person in
consecutive video frames. It is a challenging task: on the one hand, the output
video should have good spatial-temporal consistency; on the other hand, the
details of the given garment need to be preserved well in all frames.
Naively applying image-based try-on methods frame by frame yields poor results
due to severe inconsistency. The few existing diffusion-based video try-on
methods converge on a similar solution: inserting temporal attention into an
image-based try-on model to adapt it to the video try-on task, which has shown
improvements, but inconsistency problems remain. In this paper, we propose ViTI
(Video Try-on Inpainter), which formulates and implements video virtual try-on
as a conditional video inpainting task, in contrast to previous methods. In
this way, we start from a video generation problem rather than an image-based
try-on problem, which offers better spatial-temporal consistency from the
outset. Specifically, we first build a video
inpainting framework based on Diffusion Transformer with full 3D
spatial-temporal attention, and then we progressively adapt it for video
garment inpainting, with a collection of masking strategies and multi-stage
training. After these steps, the model can inpaint the masked garment area with
appropriate garment pixels according to the prompt with good spatial-temporal
consistency. Finally, as in other try-on methods, a garment condition is added
to the model to ensure that the inpainted garment appearance and details are as
expected. Both quantitative and qualitative experimental results show that ViTI
is superior to previous works.
comment: 10 pages, 6 figures
☆ DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic ICCV 2025
Real-world object detection systems, such as those in autonomous driving and
surveillance, must continuously learn new object categories and simultaneously
adapt to changing environmental conditions. Existing approaches, Class
Incremental Object Detection (CIOD) and Domain Incremental Object Detection
(DIOD) only address one aspect of this challenge. CIOD struggles in unseen
domains, while DIOD suffers from catastrophic forgetting when learning new
classes, limiting their real-world applicability. To overcome these
limitations, we introduce Dual Incremental Object Detection (DuIOD), a more
practical setting that simultaneously handles class and domain shifts in an
exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging
framework that enables stable incremental learning while mitigating sign
conflicts through a novel Directional Consistency Loss. Unlike prior methods,
DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function
as real-time incremental object detectors. To comprehensively evaluate both
retention and adaptation, we introduce the Retention-Adaptability Index (RAI),
which combines the Average Retention Index (Avg RI) for catastrophic forgetting
and the Average Generalization Index for domain adaptability into a single
measure. Extensive experiments on the Pascal Series and Diverse Weather Series
demonstrate DuET's effectiveness, achieving a +13.12% RAI improvement while
preserving 89.3% Avg RI on the Pascal Series (4 tasks), as well as a +11.39%
RAI improvement with 88.57% Avg RI on the Diverse Weather Series (3 tasks),
outperforming existing methods.
comment: Accepted at ICCV 2025
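A toy sketch of exemplar-free merging via task arithmetic with a crude sign-agreement check, in the spirit of mitigating sign conflicts; it is not the DuET implementation, and the Directional Consistency Loss itself is not reproduced here.

    import numpy as np

    def task_vector(finetuned, pretrained):
        """Task vector = fine-tuned weights minus pretrained weights."""
        return {k: finetuned[k] - pretrained[k] for k in pretrained}

    def merge_with_sign_check(pretrained, task_vectors, scale=1.0):
        """Add task vectors to the pretrained weights, zeroing coordinates whose
        signs disagree across tasks (a crude stand-in for resolving conflicts)."""
        merged = {}
        for k in pretrained:
            stacked = np.stack([tv[k] for tv in task_vectors])      # (num_tasks, ...)
            signs = np.sign(stacked)
            agree = np.abs(signs.sum(axis=0)) == len(task_vectors)  # all same sign
            merged[k] = pretrained[k] + scale * stacked.sum(axis=0) * agree
        return merged

    if __name__ == "__main__":
        pre = {"head.w": np.zeros(4)}
        ft_class = {"head.w": np.array([0.2, -0.1, 0.3, 0.0])}
        ft_domain = {"head.w": np.array([0.1, 0.2, 0.1, 0.0])}
        tvs = [task_vector(ft_class, pre), task_vector(ft_domain, pre)]
        print(merge_with_sign_check(pre, tvs)["head.w"])   # [0.3 0.  0.4 0. ]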
☆ Temporal Rate Reduction Clustering for Human Motion Segmentation ICCV 2025
Human Motion Segmentation (HMS), which aims to partition videos into
non-overlapping human motions, has attracted increasing research attention
recently. Existing approaches for HMS are mainly dominated by subspace
clustering methods, which are grounded on the assumption that high-dimensional
temporal data align with a Union-of-Subspaces (UoS) distribution. However, the
frames in videos capturing complex human motions with cluttered backgrounds may
not align well with the UoS distribution. In this paper, we propose a novel
approach for HMS, named Temporal Rate Reduction Clustering
($\text{TR}^2\text{C}$), which jointly learns structured representations and
affinity to segment the frame sequences in video. Specifically, the structured
representations learned by $\text{TR}^2\text{C}$ are temporally consistent
and align well with a UoS structure, which is favorable for the HMS task. We
conduct extensive experiments on five benchmark HMS datasets and achieve
state-of-the-art performances with different feature extractors.
comment: The paper is accepted by ICCV 2025. The first two authors contributed
  equally
☆ GANet-Seg: Adversarial Learning for Brain Tumor Segmentation with Hybrid Generative Models
This work introduces a novel framework for brain tumor segmentation
leveraging pre-trained GANs and Unet architectures. By combining a global
anomaly detection module with a refined mask generation network, the proposed
model accurately identifies tumor-sensitive regions and iteratively enhances
segmentation precision using adversarial loss constraints. Multi-modal MRI data
and synthetic image augmentation are employed to improve robustness and address
the challenge of limited annotated datasets. Experimental results on the BraTS
dataset demonstrate the effectiveness of the approach, achieving higher
sensitivity and accuracy than the baseline in both lesion-wise Dice and HD95
metrics. This scalable method minimizes the dependency on fully annotated
data, paving the way for practical real-world applications in clinical
settings.
☆ DiMPLe -- Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation
We introduce DiMPLe (Disentangled Multi-Modal Prompt Learning), a novel
approach to disentangle invariant and spurious features across vision and
language modalities in multi-modal learning. Spurious correlations in visual
data often hinder out-of-distribution (OOD) performance. Unlike prior methods
focusing solely on image features, DiMPLe disentangles features within and
across modalities while maintaining consistent alignment, enabling better
generalization to novel classes and robustness to distribution shifts. Our
method combines three key objectives: (1) mutual information minimization
between invariant and spurious features, (2) spurious feature regularization,
and (3) contrastive learning on invariant features. Extensive experiments
demonstrate that DiMPLe achieves superior performance compared to CoOp-OOD when
averaged across 11 diverse datasets, with absolute gains of 15.27 in base class
accuracy and 44.31 in novel class accuracy.
☆ Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping
This paper presents ESFP, an end-to-end pipeline that converts monocular RGB
video into executable joint trajectories for a low-cost 4-DoF desktop arm. ESFP
comprises four sequential modules. (1) Estimating: ROMP lifts each frame to a
24-joint 3-D skeleton. (2) Smoothing: the proposed HPSTM, a sequence-to-sequence
Transformer with self-attention, combines long-range temporal context with a
differentiable forward-kinematics decoder, enforcing constant bone lengths and
anatomical plausibility while jointly predicting joint means and full
covariances. (3) Filtering: root-normalized trajectories are variance-weighted
according to HPSTM's uncertainty estimates, suppressing residual noise. (4)
Pose-Mapping: a geometric retargeting layer transforms shoulder-elbow-wrist
triples into the uArm's polar workspace, preserving wrist orientation.
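Since the abstract enumerates four sequential modules, a schematic pipeline skeleton is sketched below with placeholder functions in place of ROMP and HPSTM; the smoothing, variance weighting, joint indexing, and polar mapping are illustrative assumptions only.

    import numpy as np

    def estimate(frame):
        """Placeholder for ROMP: lift an RGB frame to a 24-joint 3-D skeleton."""
        return np.random.rand(24, 3)

    def smooth(skeleton_seq):
        """Placeholder for HPSTM: here just a moving average over time,
        returning smoothed joints plus a per-joint uncertainty proxy."""
        seq = np.stack(skeleton_seq)
        kernel = np.ones(5) / 5.0
        smoothed = np.apply_along_axis(
            lambda x: np.convolve(x, kernel, mode="same"), 0, seq)
        variance = seq.var(axis=0, keepdims=True).repeat(len(seq), axis=0)
        return smoothed, variance

    def filter_traj(smoothed, variance):
        """Variance-weighted blend: trust joints less where uncertainty is high."""
        weights = 1.0 / (1.0 + variance)
        return smoothed * weights

    def pose_map(joints):
        """Map wrist positions to a toy polar workspace (r, theta, z)."""
        wrist = joints[:, 20, :]                        # assumed wrist index
        r = np.linalg.norm(wrist[:, :2], axis=1)
        theta = np.arctan2(wrist[:, 1], wrist[:, 0])
        return np.stack([r, theta, wrist[:, 2]], axis=1)

    if __name__ == "__main__":
        frames = [None] * 30                            # stand-in for RGB frames
        skeletons = [estimate(f) for f in frames]       # (1) Estimating
        smoothed, var = smooth(skeletons)               # (2) Smoothing
        filtered = filter_traj(smoothed, var)           # (3) Filtering
        commands = pose_map(filtered)                   # (4) Pose-Mapping
        print(commands.shape)                           # (30, 3)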
☆ ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation ICCV 2025
Training-free open-vocabulary semantic segmentation (OVS) aims to segment
images given a set of arbitrary textual categories without costly model
fine-tuning. Existing solutions often explore attention mechanisms of
pre-trained models, such as CLIP, or generate synthetic data and design complex
retrieval processes to perform OVS. However, their performance is limited by
the capability of reliant models or the suboptimal quality of reference sets.
In this work, we investigate the largely overlooked data quality problem for
this challenging dense scene understanding task, and identify that a
high-quality reference set can significantly benefit training-free OVS. With
this observation, we introduce a data-quality-oriented framework, comprising a
data pipeline to construct a reference set with well-paired segment-text
embeddings and a simple similarity-based retrieval to unveil the essential
effect of data. Remarkably, extensive evaluations on ten benchmark datasets
demonstrate that our method outperforms all existing training-free OVS
approaches, highlighting the importance of data-centric design for advancing
OVS without training. Our code is available at https://github.com/xiweix/ReME .
comment: Accepted to ICCV 2025
☆ BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models
State-of-the-art text-to-image models like Infinity generate photorealistic
images at an unprecedented speed. These models operate in a bitwise
autoregressive manner over a discrete set of tokens that is practically
infinite in size. However, their impressive generative power comes with a
growing risk: as their outputs increasingly populate the Internet, they are
likely to be scraped and reused as training data-potentially by the very same
models. This phenomenon has been shown to lead to model collapse, where
repeated training on generated content, especially from the models' own
previous versions, causes a gradual degradation in performance. A promising
mitigation strategy is watermarking, which embeds human-imperceptible yet
detectable signals into generated images-enabling the identification of
generated content. In this work, we introduce BitMark, a robust bitwise
watermarking framework for Infinity. Our method embeds a watermark directly at
the bit level of the token stream across multiple scales (also referred to as
resolutions) during Infinity's image generation process. Our bitwise watermark
subtly influences the bits to preserve visual fidelity and generation speed
while remaining robust against a spectrum of removal techniques. Furthermore,
it exhibits high radioactivity, i.e., when watermarked generated images are
used to train another image generative model, this second model's outputs will
also carry the watermark. The radioactive traces remain detectable even when
only fine-tuning diffusion or image autoregressive models on images watermarked
with our BitMark. Overall, our approach provides a principled step toward
preventing model collapse in image generative models by enabling reliable
detection of generated outputs.
☆ MedPrompt: LLM-CNN Fusion with Weight Routing for Medical Image Segmentation and Classification
Current medical image analysis systems are typically task-specific, requiring
separate models for classification and segmentation, and lack the flexibility
to support user-defined workflows. To address these challenges, we introduce
MedPrompt, a unified framework that combines a few-shot prompted Large Language
Model (Llama-4-17B) for high-level task planning with a modular Convolutional
Neural Network (DeepFusionLab) for low-level image processing. The LLM
interprets user instructions and generates structured output to dynamically
route task-specific pretrained weights. This weight routing approach avoids
retraining the entire framework when adding new tasks-only task-specific
weights are required, enhancing scalability and deployment. We evaluated
MedPrompt across 19 public datasets, covering 12 tasks spanning 5 imaging
modalities. The system achieves a 97% end-to-end correctness in interpreting
and executing prompt-driven instructions, with an average inference latency of
2.5 seconds, making it suitable for near real-time applications. DeepFusionLab
achieves competitive segmentation accuracy (e.g., Dice 0.9856 on lungs) and
strong classification performance (F1 0.9744 on tuberculosis). Overall,
MedPrompt enables scalable, prompt-driven medical imaging by combining the
interpretability of LLMs with the efficiency of modular CNNs.
comment: 40 pages, 8 Tables, 9 Figures
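The weight-routing idea can be pictured with the small sketch below, where a structured plan (here hard-coded in place of the LLM's JSON output) selects a task-specific checkpoint from a registry; the schema, registry keys, and file names are hypothetical.

    import json

    # Hypothetical registry mapping (task, modality) to pretrained weight files.
    WEIGHT_REGISTRY = {
        ("segmentation", "xray"): "weights/deepfusionlab_lung_seg.pt",
        ("classification", "xray"): "weights/deepfusionlab_tb_cls.pt",
    }

    def plan_from_llm(user_prompt: str) -> dict:
        """Placeholder for the few-shot prompted LLM planner: in practice the LLM
        returns structured JSON; here we hard-code one plausible plan."""
        return json.loads('{"task": "segmentation", "modality": "xray"}')

    def route_weights(plan: dict) -> str:
        """Pick the task-specific checkpoint without retraining the framework."""
        key = (plan["task"], plan["modality"])
        if key not in WEIGHT_REGISTRY:
            raise KeyError(f"No pretrained weights registered for {key}")
        return WEIGHT_REGISTRY[key]

    if __name__ == "__main__":
        plan = plan_from_llm("Segment the lungs in this chest X-ray.")
        print(route_weights(plan))   # weights/deepfusionlab_lung_seg.pt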
☆ Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation ICCV 2025
Panoramic image processing is essential for omni-context perception, yet
faces constraints like distortions, perspective occlusions, and limited
annotations. Previous unsupervised domain adaptation methods transfer knowledge
from labeled pinhole data to unlabeled panoramic images, but they require
access to source pinhole data. To address these constraints, we introduce a more practical
task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and
propose its first solution, called UNconstrained Learning Omni-Context
Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni
Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting
without relying on source data or target labels, this framework enhances models
to achieve segmentation with 360° viewpoint coverage and occlusion-aware
reasoning. Furthermore, we benchmark the proposed SFOASS task through both
real-to-real and synthetic-to-real adaptation settings. Experimental results
show that our source-free method achieves performance comparable to
source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and
11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the
source-only method. All data and code will be made publicly available at
https://github.com/yihong-97/UNLOCK.
comment: Accepted to ICCV 2025. All data and code will be made publicly
available at https://github.com/yihong-97/UNLOCK
☆ GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
Sequential grounding in 3D point clouds (SG3D) refers to locating sequences
of objects by following text instructions for a daily activity with detailed
steps. Current 3D visual grounding (3DVG) methods treat text instructions with
multiple steps as a whole, without extracting useful temporal information from
each step. However, the instructions in SG3D often contain pronouns such as
"it", "here" and "the same" to make language expressions concise. This requires
grounding methods to understand the context and retrieve relevant information
from previous steps to correctly locate object sequences. Due to the lack of an
effective module for collecting related historical information,
state-of-the-art 3DVG methods face significant challenges in adapting to the
SG3D task. To fill this gap, we propose GroundFlow -- a plug-in module for
temporal reasoning on 3D point cloud sequential grounding. Firstly, we
demonstrate that integrating GroundFlow improves the task accuracy of 3DVG
baseline methods by a large margin (+7.5\% and +10.2\%) in the SG3D benchmark,
even outperforming a 3D large language model pre-trained on various datasets.
Furthermore, we selectively extract both short-term and long-term step
information based on its relevance to the current instruction, enabling
GroundFlow to take a comprehensive view of historical information and maintain
its temporal understanding advantage as step counts increase. Overall, our work
introduces temporal reasoning capabilities to existing 3DVG models and achieves
state-of-the-art performance in the SG3D benchmark across five datasets.
☆ Out-of-Distribution Semantic Occupancy Prediction
Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Ruiping Liu, Fei Teng, Kai Luo, Zhiyong Li, Kailun Yang
3D Semantic Occupancy Prediction is crucial for autonomous driving, providing
a dense, semantically rich environmental representation. However, existing
methods focus on in-distribution scenes, making them susceptible to
Out-of-Distribution (OoD) objects and long-tail distributions, which increases
the risk of undetected anomalies and misinterpretations, posing safety hazards.
To address these challenges, we introduce Out-of-Distribution Semantic
Occupancy Prediction, targeting OoD detection in 3D voxel space. To fill the
gaps in the dataset, we propose a Synthetic Anomaly Integration Pipeline that
injects synthetic anomalies while preserving realistic spatial and occlusion
patterns, enabling the creation of two datasets: VAA-KITTI and VAA-KITTI-360.
We introduce OccOoD, a novel framework integrating OoD detection into 3D
semantic occupancy prediction, with Voxel-BEV Progressive Fusion (VBPF)
leveraging an RWKV-based branch to enhance OoD detection via geometry-semantic
fusion. Experimental results demonstrate that OccOoD achieves state-of-the-art
OoD detection with an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m
region, while maintaining competitive occupancy prediction performance. The
established datasets and source code will be made publicly available at
https://github.com/7uHeng/OccOoD.
comment: The established datasets and source code will be made publicly
available at https://github.com/7uHeng/OccOoD
☆ Task-Aware KV Compression For Cost-Effective Long Video Understanding
Minghao Qin, Yan Shu, Peitian Zhang, Kun Lun, Huaying Yuan, Juenjie Zhou, Shitao Xiao, Bo Zhao, Zheng Liu
Long-video understanding (LVU) remains a severe challenge for existing
multimodal large language models (MLLMs), primarily due to the prohibitive
computational cost. Recent approaches have explored KV compression to mitigate
this issue, but they often suffer from significant information loss at high
compression ratios. In this paper, we introduce Video-X^2L, which flexibly
preserves critical video information for each LVU task. Video-X^2L involves two
key operations. The first one is called bi-level KV compression. During the
MLLM's pre-filling stage, Video-X^2L generates two types of compressed KVs:
low-compression KVs (L-KVs) to capture fine-grained video details and
high-compression KVs (H-KVs) to offer compact video representations. The second
one is called selective KV re-loading. During the MLLM's decoding stage,
Video-X^2L selectively re-loads L-KVs for the most critical video chunks while
using H-KVs for other less important ones. This allows the MLLM to fully
utilize task-specific information while maintaining the overall compactness.
Video-X^2L is simple yet effective: it is free from additional training and
directly compatible with existing KV-compressible MLLMs. We evaluate Video-X^2L
with a variety of popular LVU benchmarks, including VideoMME, MLVU,
LongVideoBench, and VNBench. Our experimental results show that Video-X^2L
outperforms existing KV-compression methods by a large margin while
substantially reducing the computation cost.
comment: 14 pages, 3 figures, 6 tables
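A toy sketch of the two operations described above, using arrays as stand-ins for per-chunk KV caches: bi-level compression produces detailed and compact caches, and decoding re-loads the detailed ones only for the chunks scored as most task-relevant. Compression by token subsampling, the ratios, and the relevance scores are illustrative assumptions.

    import numpy as np

    def compress_kv(kv_chunk, ratio):
        """Toy KV compression: keep every `ratio`-th token of a chunk's cache.
        Real methods are more sophisticated; this only illustrates two levels."""
        return kv_chunk[::ratio]

    def bi_level_compress(kv_chunks, low_ratio=2, high_ratio=8):
        """Pre-filling stage: produce low-compression (detailed) and
        high-compression (compact) caches for every video chunk."""
        l_kvs = [compress_kv(c, low_ratio) for c in kv_chunks]
        h_kvs = [compress_kv(c, high_ratio) for c in kv_chunks]
        return l_kvs, h_kvs

    def selective_reload(l_kvs, h_kvs, relevance, top_k=2):
        """Decoding stage: re-load detailed L-KVs only for the most task-relevant
        chunks and keep compact H-KVs for the rest."""
        critical = set(np.argsort(relevance)[-top_k:])
        return [l_kvs[i] if i in critical else h_kvs[i] for i in range(len(h_kvs))]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        chunks = [rng.standard_normal((64, 16)) for _ in range(5)]  # 5 KV chunks
        l_kvs, h_kvs = bi_level_compress(chunks)
        relevance = np.array([0.1, 0.9, 0.2, 0.8, 0.3])   # assumed task scores
        cache = selective_reload(l_kvs, h_kvs, relevance)
        print([c.shape[0] for c in cache])   # [8, 32, 8, 32, 8]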
☆ Uncover Treasures in DCT: Advancing JPEG Quality Enhancement by Exploiting Latent Correlations
Joint Photographic Experts Group (JPEG) achieves data compression by
quantizing Discrete Cosine Transform (DCT) coefficients, which inevitably
introduces compression artifacts. Most existing JPEG quality enhancement
methods operate in the pixel domain, suffering from the high computational
costs of decoding. Consequently, direct enhancement of JPEG images in the DCT
domain has gained increasing attention. However, current DCT-domain methods
often exhibit limited performance. To address this challenge, we identify two
critical types of correlations within the DCT coefficients of JPEG images.
Building on this insight, we propose an Advanced DCT-domain JPEG Quality
Enhancement (AJQE) method that fully exploits these correlations. The AJQE
method enables the adaptation of numerous well-established pixel-domain models
to the DCT domain, achieving superior performance with reduced computational
complexity. Compared to the pixel-domain counterparts, the DCT-domain models
derived by our method demonstrate a 0.35 dB improvement in PSNR and a 60.5%
increase in enhancement throughput on average.
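For context on the representation such DCT-domain methods consume, the sketch below extracts JPEG-style 8x8 block DCT coefficients from a grayscale image using SciPy; it only illustrates the input domain, not the AJQE method or the correlations it exploits.

    import numpy as np
    from scipy.fft import dctn

    def block_dct(image, block=8):
        """Split a grayscale image into 8x8 blocks and return the 2-D DCT-II
        coefficients per block, i.e. the representation JPEG quantizes and a
        DCT-domain enhancer would take as input."""
        h, w = image.shape
        h, w = h - h % block, w - w % block          # crop to multiples of 8
        image = image[:h, :w]
        blocks = image.reshape(h // block, block, w // block, block).swapaxes(1, 2)
        return dctn(blocks, type=2, norm="ortho", axes=(-2, -1))

    if __name__ == "__main__":
        img = np.random.rand(64, 64)
        coeffs = block_dct(img)
        print(coeffs.shape)        # (8, 8, 8, 8): 8x8 grid of 8x8 coefficient blocks
        print(coeffs[0, 0, 0, 0])  # DC coefficient of the first block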
☆ Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition
Learning semantic representations from point sets of 3D object shapes is
often challenged by significant geometric variations, primarily due to
differences in data acquisition methods. Typically, training data is generated
using point simulators, while testing data is collected with distinct 3D
sensors, leading to a simulation-to-reality (Sim2Real) domain gap that limits
the generalization ability of point classifiers. Current unsupervised domain
adaptation (UDA) techniques struggle with this gap, as they often lack robust,
domain-insensitive descriptors capable of capturing global topological
information, resulting in overfitting to the limited semantic patterns of the
source domain. To address this issue, we introduce a novel Topology-Aware
Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach
mitigates the domain gap by leveraging global spatial topology, characterized
by low-level, high-frequency 3D structures, and by modeling the topological
relations of local geometric features through a novel self-supervised learning
task. Additionally, we propose an advanced self-training strategy that combines
cross-domain contrastive learning with self-training, effectively reducing the
impact of noisy pseudo-labels and enhancing the robustness of the adaptation
process. Experimental results on three public Sim2Real benchmarks validate the
effectiveness of our TAM framework, showing consistent improvements over
state-of-the-art methods across all evaluated tasks. The source code of this
work will be available at https://github.com/zou-longkun/TAG.git.
☆ Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image
Generating realistic 3D objects from single-view images requires natural
appearance, 3D consistency, and the ability to capture multiple plausible
interpretations of unseen regions. Existing approaches often rely on
fine-tuning pretrained 2D diffusion models or directly generating 3D
information through fast network inference or 3D Gaussian Splatting, but their
results generally suffer from poor multiview consistency and lack geometric
detail. To tackle these issues, we present a novel method that seamlessly
integrates geometry and perception priors without requiring additional model
training to reconstruct detailed 3D objects from a single image. Specifically,
we train three different Gaussian branches initialized from the geometry prior,
perception prior and Gaussian noise, respectively. The geometry prior captures
the rough 3D shapes, while the perception prior utilizes the 2D pretrained
diffusion model to enhance multiview information. Subsequently, we refine 3D
Gaussian branches through mutual interaction between geometry and perception
priors, further enhanced by a reprojection-based strategy that enforces depth
consistency. Experiments demonstrate the higher-fidelity reconstruction results
of our method, which outperforms existing methods on novel view synthesis and 3D
reconstruction, yielding robust and consistent 3D object generation.
comment: 10 pages, 5 figures
☆ Robust Deep Learning for Myocardial Scar Segmentation in Cardiac MRI with Noisy Labels MICCAI 2025
Aida Moafi, Danial Moafi, Evgeny M. Mirkes, Gerry P. McCann, Abbas S. Alatrany, Jayanth R. Arnold, Mostafa Mehdipour Ghazi
The accurate segmentation of myocardial scars from cardiac MRI is essential
for clinical assessment and treatment planning. In this study, we propose a
robust deep-learning pipeline for fully automated myocardial scar detection and
segmentation by fine-tuning state-of-the-art models. The method explicitly
addresses challenges of label noise from semi-automatic annotations, data
heterogeneity, and class imbalance through the use of Kullback-Leibler loss and
extensive data augmentation. We evaluate the model's performance on both acute
and chronic cases and demonstrate its ability to produce accurate and smooth
segmentations despite noisy labels. In particular, our approach outperforms
state-of-the-art models like nnU-Net and shows strong generalizability in an
out-of-distribution test set, highlighting its robustness across various
imaging conditions and clinical tasks. These results establish a reliable
foundation for automated myocardial scar quantification and support the broader
clinical adoption of deep learning in cardiac imaging.
comment: MICCAI 2025
☆ Tree-based Semantic Losses: Application to Sparsely-supervised Large Multi-class Hyperspectral Segmentation
Hyperspectral imaging (HSI) shows great promise for surgical applications,
offering detailed insights into biological tissue differences beyond what the
naked eye can perceive. Refined labelling efforts are underway to train vision
systems to distinguish large numbers of subtly varying classes. However,
commonly used learning methods for biomedical segmentation tasks penalise all
errors equivalently and thus fail to exploit any inter-class semantics in the
label space. In this work, we introduce two tree-based semantic loss functions
which take advantage of a hierarchical organisation of the labels. We further
incorporate our losses in a recently proposed approach for training with
sparse, background-free annotations. Extensive experiments demonstrate that our
proposed method reaches state-of-the-art performance on a sparsely annotated
HSI dataset comprising $107$ classes organised in a clinically-defined semantic
tree structure. Furthermore, our method enables effective detection of
out-of-distribution (OOD) pixels without compromising segmentation performance
on in-distribution (ID) pixels.
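To illustrate the idea of penalizing errors according to the label hierarchy rather than equally, the sketch below computes an expected tree-distance loss over a toy three-class tree; the hierarchy, distance definition, and weighting are illustrative and not the proposed losses.

    import numpy as np

    # Toy label hierarchy: leaf class -> path from root (assumed, not the paper's tree).
    LABEL_PATHS = {
        0: ("tissue", "muscle"),
        1: ("tissue", "fat"),
        2: ("instrument", "metal"),
    }

    def tree_distance(a, b):
        """Number of edges between two leaves in the label tree (via shared prefix)."""
        pa, pb = LABEL_PATHS[a], LABEL_PATHS[b]
        shared = sum(1 for x, y in zip(pa, pb) if x == y)
        return (len(pa) - shared) + (len(pb) - shared)

    def tree_weighted_loss(probs, target):
        """Expected tree distance between the prediction and the target class:
        confusing 'muscle' with 'fat' costs less than confusing it with 'metal'."""
        dists = np.array([tree_distance(c, target) for c in range(probs.shape[0])])
        return float(np.dot(probs, dists))

    if __name__ == "__main__":
        probs = np.array([0.6, 0.3, 0.1])   # predicted distribution for one pixel
        print(tree_weighted_loss(probs, target=0))   # 1.0: most mass near the target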
☆ Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion
Federated learning (FL) enables collaborative model training across
decentralized clients without sharing local data, but is challenged by
heterogeneity in data, computation, and communication. Pretrained
vision-language models (VLMs), with their strong generalization and lightweight
tuning via prompts, offer a promising solution. However, existing federated
prompt-learning methods rely only on text prompts and overlook joint
label-domain distribution shifts. In this paper, we propose a personalized FL
framework based on dual-prompt learning and cross fusion, termed pFedDC.
Specifically, each client maintains both global and local prompts across vision
and language modalities: global prompts capture common knowledge shared across
the federation, while local prompts encode client-specific semantics and domain
characteristics. Meanwhile, a cross-fusion module is designed to adaptively
integrate prompts from different levels, enabling the model to generate
personalized representations aligned with each client's unique data
distribution. Extensive experiments across nine datasets with various types of
heterogeneity show that pFedDC consistently outperforms state-of-the-art
methods.
☆ YOLO-FDA: Integrating Hierarchical Attention and Detail Enhancement for Surface Defect Detection
Surface defect detection in industrial scenarios is both crucial and
technically demanding due to the wide variability in defect types, irregular
shapes and sizes, fine-grained requirements, and complex material textures.
Although recent advances in AI-based detectors have improved performance,
existing methods often suffer from redundant features, limited detail
sensitivity, and weak robustness under multiscale conditions. To address these
challenges, we propose YOLO-FDA, a novel YOLO-based detection framework that
integrates fine-grained detail enhancement and attention-guided feature fusion.
Specifically, we adopt a BiFPN-style architecture to strengthen bidirectional
multilevel feature aggregation within the YOLOv5 backbone. To better capture
fine structural changes, we introduce a Detail-directional Fusion Module (DDFM)
that applies a directional asymmetric convolution in the second-lowest layer
to enrich spatial details and fuses this layer with low-level features to
enhance semantic consistency. Furthermore, we propose two novel
attention-based fusion strategies, Attention-weighted Concatenation (AC) and
Cross-layer Attention Fusion (CAF) to improve contextual representation and
reduce feature noise. Extensive experiments on benchmark datasets demonstrate
that YOLO-FDA consistently outperforms existing state-of-the-art methods in
terms of both accuracy and robustness across diverse types of defects and
scales.
comment: 14 pages, 6 figures. Submitted to The 8th Chinese Conference on
Pattern Recognition and Computer Vision
☆ Learning to See in the Extremely Dark ICCV 2025
Learning-based methods have made promising advances in low-light RAW image
enhancement, but their capability in extremely dark scenes, where the
environmental illuminance drops as low as 0.0001 lux, remains underexplored due
to the lack of corresponding datasets. To this end, we propose a
paired-to-paired data synthesis pipeline capable of generating well-calibrated
extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1
lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB
references to comprise a large-scale paired dataset named
See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement
approaches. Furthermore, we propose a diffusion-based framework that leverages
the generative ability and intrinsic denoising property of diffusion models to
restore visually pleasing results from extremely low-SNR RAW inputs, in which
an Adaptive Illumination Correction Module (AICM) and a color consistency loss
are introduced to ensure accurate exposure correction and color restoration.
Extensive experiments on the proposed SIED and publicly available benchmarks
demonstrate the effectiveness of our method. The code and dataset are available
at https://github.com/JianghaiSCU/SIED.
comment: Accepted by ICCV 2025
☆ GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction ICML 2025
Trajectory prediction for surrounding agents is a challenging task in
autonomous driving due to its inherent uncertainty and underlying
multimodality. Unlike prevailing data-driven methods that primarily rely on
supervised learning, in this paper, we introduce a novel Graph-oriented Inverse
Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor
equipped with vectorized context representations. We develop a feature adaptor
to effectively aggregate lane-graph features into grid space, enabling seamless
integration with the maximum entropy IRL paradigm to infer the reward
distribution and obtain the policy that can be sampled to induce multiple
plausible plans. Furthermore, conditioned on the sampled plans, we implement a
hierarchical parameterized trajectory generator with a refinement module to
enhance prediction accuracy and a probability fusion strategy to boost
prediction confidence. Extensive experimental results showcase our approach not
only achieves state-of-the-art performance on the large-scale Argoverse &
nuScenes motion forecasting benchmarks but also exhibits superior
generalization abilities compared to existing supervised models.
comment: Accepted by ICML 2025
☆ CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization ICCV 2025
Jan Ackermann, Jonas Kulhanek, Shengqu Cai, Haofei Xu, Marc Pollefeys, Gordon Wetzstein, Leonidas Guibas, Songyou Peng
In dynamic 3D environments, accurately updating scene representations over
time is crucial for applications in robotics, mixed reality, and embodied AI.
As scenes evolve, efficient methods to incorporate changes are needed to
maintain up-to-date, high-quality reconstructions without the computational
overhead of re-optimizing the entire scene. This paper introduces CL-Splats,
which incrementally updates Gaussian splatting-based 3D representations from
sparse scene captures. CL-Splats integrates a robust change-detection module
that segments updated and static components within the scene, enabling focused,
local optimization that avoids unnecessary re-computation. Moreover, CL-Splats
supports storing and recovering previous scene states, facilitating temporal
segmentation and new scene-analysis applications. Our extensive experiments
demonstrate that CL-Splats achieves efficient updates with improved
reconstruction quality over the state-of-the-art. This establishes a robust
foundation for future real-time adaptation in 3D scene reconstruction tasks.
comment: ICCV 2025, Project Page: https://cl-splats.github.io
☆ IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes
Video Large Language Models (VideoLLMs) have demonstrated remarkable
understanding capabilities, but are found to struggle with multi-shot
scenarios, e.g., video clips with varying camera angles or scene changes. This
challenge can lead to failures such as instance identity forgetting and key
frame neglect. In this work, we first attribute the challenge to the lack of
multi-shot annotations among existing datasets and therefore we introduce a new
dataset termed MultiClip-Bench, featuring dense descriptions and
instruction-based question-answering pairs tailored for multi-shot scenarios.
We empirically find that the training set significantly boosts the multi-shot
performance, while the testing benchmark provides a reliable measure of the
model capability in multi-shot scenarios. By further analyzing and discovering
that current models only encode instance features in a discrete or lossy
manner, at the risk of missing identity information, we then contribute a new
model IPFormer-VideoLLM. Its key idea is the injection of instance-level
features as instance prompts through an efficient attention-based connector.
This allows for the aggregation of instance-specific information across scenes.
Experiments demonstrate that our proposed dataset and model not only enhance
the multi-scene video understanding significantly, but also offer distinct
advantages across various video benchmarks.
☆ Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection
Remote sensing change detection is essential for monitoring urban expansion,
disaster assessment, and resource management, offering timely, accurate, and
large-scale insights into dynamic landscape transformations. While deep
learning has revolutionized change detection, the increasing complexity and
computational demands of modern models have not necessarily translated into
significant accuracy gains. Instead of following this trend, this study
explores a more efficient approach, focusing on lightweight models that
maintain high accuracy while minimizing resource consumption, which is an
essential requirement for on-satellite processing. To this end, we propose
FlickCD, whose name conveys 'a quick flick yields great results', pushing the
boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced
Difference Module (EDM) to amplify critical feature differences between
temporal phases while suppressing irrelevant variations such as lighting and
weather changes, thereby reducing computational costs in the subsequent change
decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion
Blocks, leveraging Shifted Window Self-Attention (SWSA) and Enhanced Global
Self-Attention (EGSA) to efficiently capture semantic information at multiple
scales, preserving both coarse- and fine-grained changes. Extensive experiments
on four benchmark datasets demonstrate that FlickCD reduces computational and
storage overheads by more than an order of magnitude while achieving
state-of-the-art (SOTA) performance or incurring only a minor (<1\% F1)
accuracy trade-off. The implementation code is publicly available at
https://github.com/xulsh8/FlickCD.
comment: 12 pages
☆ OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography ICCV 2025
Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, AndyPian Wu, Chaoyang Wang, Chengjie Wang, Taisong Jin, SevenShu, Yunsheng Wu, Yongge Liu, Rongrong Ji
As one of the earliest ancient languages, Oracle Bone Script (OBS)
encapsulates the cultural records and intellectual expressions of ancient
civilizations. Despite the discovery of approximately 4,500 OBS characters,
only about 1,600 have been deciphered. The remaining undeciphered ones, with
their complex structure and abstract imagery, pose significant challenges for
interpretation. To address these challenges, this paper proposes a novel
two-stage semantic typography framework, named OracleFusion. In the first
stage, this approach leverages the Multimodal Large Language Model (MLLM) with
enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of
the OBS character and perform visual localization of key components. In the
second stage, we introduce Oracle Structural Vector Fusion (OSVF),
incorporating glyph structure constraints and glyph maintenance constraints to
ensure the accurate generation of semantically enriched vector fonts. This
approach preserves the objective integrity of the glyph structure, offering
visually enhanced representations that assist experts in deciphering OBS.
Extensive qualitative and quantitative experiments demonstrate that
OracleFusion outperforms state-of-the-art baseline models in terms of
semantics, visual appeal, and glyph maintenance, significantly enhancing both
readability and aesthetic quality. Furthermore, OracleFusion provides
expert-like insights on unseen oracle characters, making it a valuable tool for
advancing the decipherment of OBS.
comment: Accepted to ICCV 2025
☆ ESMStereo: Enhanced ShuffleMixer Disparity Upsampling for Real-Time and Accurate Stereo Matching
Stereo matching has become an increasingly important component of modern
autonomous systems. Developing deep learning-based stereo matching models that
deliver high accuracy while operating in real-time continues to be a major
challenge in computer vision. In the domain of cost-volume-based stereo
matching, accurate disparity estimation depends heavily on large-scale cost
volumes. However, such large volumes store substantial redundant information
and also require computationally intensive aggregation units for processing and
regression, making real-time performance unattainable. Conversely, small-scale
cost volumes followed by lightweight aggregation units provide a promising
route for real-time performance, but lack sufficient information to ensure
highly accurate disparity estimation. To address this challenge, we propose the
Enhanced Shuffle Mixer (ESM) to mitigate information loss associated with
small-scale cost volumes. ESM restores critical details by integrating primary
features into the disparity upsampling unit. It quickly extracts features from
the initial disparity estimation and fuses them with image features. These
features are mixed by shuffling and layer splitting, and then refined through a
compact feature-guided hourglass network to recover more detailed scene
geometry. The ESM focuses on local contextual connectivity with a large
receptive field and low computational cost, leading to the reconstruction of a
highly accurate disparity map at real-time. The compact version of ESMStereo
achieves an inference speed of 116 FPS on high-end GPUs and 91 FPS on the AGX
Orin.
comment: Under peer review
☆ EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception ICCV 2025
Sanjoy Chowdhury, Subrata Biswas, Sayan Nag, Tushar Nagarajan, Calvin Murdock, Ishwarya Ananthabhotla, Yijun Qian, Vamsi Krishna Ithapu, Dinesh Manocha, Ruohan Gao
Modern perception models, particularly those designed for multisensory
egocentric tasks, have achieved remarkable performance but often come with
substantial computational costs. These high demands pose challenges for
real-world deployment, especially in resource-constrained environments. In this
paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal
distillation and policy learning to enable efficient inference across different
egocentric perception tasks, including egocentric action recognition, active
speaker localization, and behavior anticipation. Our proposed policy module is
adaptable to task-specific action spaces, making it broadly applicable.
Experimental results on three challenging egocentric datasets, EPIC-Kitchens,
EasyCom, and Aria Everyday Activities, demonstrate that our method significantly
enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%,
and energy by up to 9.6x, while matching, and in many cases outperforming, the
performance of corresponding state-of-the-art models.
comment: Accepted at ICCV 2025
☆ PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image
3D characters play a crucial role in our daily entertainment. To improve the
efficiency of 3D character modeling, recent image-based methods use two
separate models to achieve pose standardization and 3D reconstruction of the
A-pose character. However, these methods are prone to generating distorted and
degraded images in the pose standardization stage due to self-occlusion and
viewpoints, which further affects the geometric quality of the subsequent
reconstruction process. To tackle these problems, we propose PoseMaster, an
end-to-end controllable 3D character generation framework. Specifically, we
unify pose transformation and 3D character generation into a flow-based 3D
native generation framework. To achieve accurate arbitrary-pose control, we
propose to leverage the 3D body bones existing in the skeleton of an animatable
character as the pose condition. Furthermore, considering the specificity of
multi-condition control, we randomly empty the pose condition and the image
condition during training to improve the effectiveness and generalizability of
pose control. Finally, we create a high-quality pose-control dataset derived
from realistic character animation data to enable the model to learn the implicit
relationships between the skeleton and skinning weights. Extensive experiments show
that PoseMaster outperforms current state-of-the-art techniques in both
qualitative and quantitative evaluations for A-pose character generation while
demonstrating its powerful ability to achieve precise control for arbitrary
poses.
☆ SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification
Retrieving 3D objects in complex indoor environments using only a masked 2D
image and a natural language description presents significant challenges. The
ROOMELSA challenge limits access to full 3D scene context, complicating
reasoning about object appearance, geometry, and semantics. These challenges
are intensified by distorted viewpoints, textureless masked regions, ambiguous
language prompts, and noisy segmentation masks. To address this, we propose
SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification. SAMURAI
integrates CLIP-based semantic matching with shape-guided re-ranking derived
from binary silhouettes of masked regions, alongside a robust majority voting
strategy. A dedicated preprocessing pipeline enhances mask quality by
extracting the largest connected component and removing background noise. Our
hybrid retrieval framework leverages both language and shape cues, achieving
competitive performance on the ROOMELSA private test set. These results
highlight the importance of combining shape priors with language understanding
for robust open-world 3D object retrieval.
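The mask-cleaning step (keeping the largest connected component and removing
background noise) can be sketched in a few lines; this is an assumption about
the preprocessing, not the authors' exact pipeline.

    import numpy as np
    from scipy import ndimage

    def clean_mask(mask: np.ndarray) -> np.ndarray:
        # Keep only the largest connected component of a binary mask.
        labeled, num = ndimage.label(mask > 0)
        if num == 0:
            return np.zeros_like(mask)
        sizes = np.bincount(labeled.ravel())      # sizes[0] is the background
        largest = int(sizes[1:].argmax()) + 1
        return (labeled == largest).astype(mask.dtype)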
☆ Class-Agnostic Region-of-Interest Matching in Document Images ICDAR2025
Document understanding and analysis have received a lot of attention due to
their widespread application. However, existing document analysis solutions,
such as document layout analysis and key information extraction, are restricted
to fixed category definitions and granularities and cannot support flexible,
user-customized applications. Therefore, this paper defines a new
task named ``Class-Agnostic Region-of-Interest Matching'' (``RoI-Matching'' for
short), which aims to match the customized regions in a flexible, efficient,
multi-granularity, and open-set manner. The visual prompt of the reference
document and target document images are fed into our model, while the output is
the corresponding bounding boxes in the target document images. To meet the
above requirements, we construct a benchmark, RoI-Matching-Bench, which defines
three difficulty levels reflecting real-world conditions, and we propose macro
and micro metrics for evaluation. Furthermore, we also propose a new
framework RoI-Matcher, which employs a siamese network to extract multi-level
features both in the reference and target domains, and cross-attention layers
to integrate and align similar semantics in different domains. Experiments show
that our method, despite its simple procedure, is effective on
RoI-Matching-Bench and serves as a baseline for further research. The code is available at
https://github.com/pd162/RoI-Matching.
comment: Accepted by ICDAR2025
☆ Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features ICCV 2025
The ability of deep neural networks (DNNs) comes from extracting and
interpreting features from the data provided. By exploiting intermediate
features in DNNs instead of relying on hard labels, we craft adversarial
perturbations that generalize more effectively, boosting black-box
transferability. In previous work, these features have come almost exclusively
from supervised learning. Inspired by the exceptional synergy between self-supervised
learning and the Transformer architecture, this paper explores whether
exploiting self-supervised Vision Transformer (ViT) representations can improve
adversarial transferability. We present dSVA -- a generative dual
self-supervised ViT features attack, that exploits both global structural
features from contrastive learning (CL) and local textural features from masked
image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We
design a novel generative training framework that incorporates a generator to
create black-box adversarial examples, and strategies to train the generator by
exploiting joint features and the attention mechanism of self-supervised ViTs.
Our findings show that CL and MIM enable ViTs to attend to distinct feature
tendencies, which, when exploited in tandem, yield strong adversarial
generalizability. By disrupting the dual deep features distilled by
self-supervised ViTs, we achieve remarkable black-box transferability to models
of various architectures, outperforming the state of the art. Code available at
https://github.com/spencerwooo/dSVA.
comment: 14 pages, 9 figures, to appear in ICCV 2025
☆ Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling
Text-guided diffusion models have become essential for high-quality image
synthesis, enabling dynamic image editing. In image editing, two crucial
aspects are editability, which determines the extent of modification, and
faithfulness, which reflects how well unaltered elements are preserved.
However, achieving optimal results is challenging because of the inherent
trade-off between editability and faithfulness. To address this, we propose
Faithfulness Guidance and Scheduling (FGS), which enhances faithfulness with
minimal impact on editability. FGS incorporates faithfulness guidance to
strengthen the preservation of input image information and introduces a
scheduling strategy to resolve misalignment between editability and
faithfulness. Experimental results demonstrate that FGS achieves superior
faithfulness while maintaining editability. Moreover, its compatibility with
various editing methods enables precise, high-quality image edits across
diverse tasks.
comment: preprint
☆ Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability ICCV2025
Detectors often suffer from performance drop due to domain gap between
training and testing data. Recent methods explore diffusion models applied to
domain generalization (DG) and adaptation (DA) tasks, but still struggle with
large inference costs and have not yet fully leveraged the capabilities of
diffusion models. We propose to tackle these problems by extracting
intermediate features from a single-step diffusion process, improving feature
collection and fusion to reduce inference time by 75% while enhancing
performance on source domains (i.e., Fitness). Then, we construct an
object-centered auxiliary branch by applying box-masked images with class
prompts to extract robust and domain-invariant features that focus on objects.
We also apply a consistency loss to align the auxiliary and ordinary branches,
balancing fitness and generalization while preventing overfitting and improving
performance on target domains (i.e., Generalization). Furthermore, within a
unified framework, standard detectors are guided by diffusion detectors through
feature-level and object-level alignment on source domains (for DG) and
unlabeled target domains (for DA), thereby improving cross-domain detection
performance (i.e., Transferability). Our method achieves competitive results on
3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on the COCO
generalization benchmark demonstrate that our method maintains significant
advantages and shows remarkable efficiency under large domain shifts and low-data
scenarios. Our work shows the superiority of applying diffusion models to
domain generalized and adaptive detection tasks and offers valuable insights
for visual perception tasks across diverse domains. The code is available at
\href{https://github.com/heboyong/Fitness-Generalization-Transferability}{Fitness-Generalization-Transferability}.
comment: Accepted by ICCV2025. arXiv admin note: text overlap with
arXiv:2503.02101
☆ V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling
Ensuring robust planning and decision-making under rare, diverse, and
visually degraded long-tail scenarios remains a fundamental challenge for
autonomous driving in urban environments. This issue becomes more critical in
cooperative settings, where vehicles and infrastructure jointly perceive and
reason across complex environments. To address this challenge, we propose
V2X-REALM, a vision-language model (VLM)-based framework with adaptive
multimodal learning for robust cooperative autonomous driving under long-tail
scenarios. V2X-REALM introduces three core innovations: (i) a prompt-driven
long-tail scenario generation and evaluation pipeline that leverages foundation
models to synthesize realistic long-tail conditions such as snow and fog across
vehicle- and infrastructure-side views, enriching training diversity
efficiently; (ii) a gated multi-scenario adaptive attention module that
modulates the visual stream using scenario priors to recalibrate ambiguous or
corrupted features; and (iii) a multi-task scenario-aware contrastive learning
objective that improves multimodal alignment and promotes cross-scenario
feature separability. Extensive experiments demonstrate that V2X-REALM
significantly outperforms existing baselines in robustness, semantic reasoning,
safety, and planning accuracy under complex, challenging driving conditions,
advancing the scalability of end-to-end cooperative autonomous driving.
☆ RL-Selector: Reinforcement Learning-Guided Data Selection via Redundancy Assessment ICCV 2025
Modern deep architectures often rely on large-scale datasets, but training on
these datasets incurs high computational and storage overhead. Real-world
datasets often contain substantial redundancies, prompting the need for more
data-efficient training paradigms. Data selection has shown promise to mitigate
redundancy by identifying the most representative samples, thereby reducing
training costs without compromising performance. Existing methods typically
rely on static scoring metrics or pretrained models, overlooking the combined
effect of selected samples and their evolving dynamics during training. We
introduce the concept of epsilon-sample cover, which quantifies sample
redundancy based on inter-sample relationships, capturing the intrinsic
structure of the dataset. Based on this, we reformulate data selection as a
reinforcement learning (RL) process and propose RL-Selector, where a
lightweight RL agent optimizes the selection policy by leveraging the
epsilon-sample cover derived from the evolving dataset distribution as a reward
signal. Extensive experiments across benchmark datasets and diverse
architectures demonstrate that our method consistently outperforms existing
state-of-the-art baselines. Models trained with our selected datasets show
enhanced generalization performance with improved training efficiency.
comment: ICCV 2025
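The epsilon-sample cover can be read as a covering of the dataset in embedding
space. A plain greedy sketch is shown below; the paper's exact definition and
its use as an RL reward may differ, and the names here are illustrative.

    import numpy as np

    def greedy_epsilon_cover(embeddings: np.ndarray, eps: float) -> list:
        # Greedily pick samples until every embedding lies within eps of a
        # selected one; the selected indices form an epsilon-cover.
        uncovered = np.ones(len(embeddings), dtype=bool)
        selected = []
        while uncovered.any():
            idx = int(np.flatnonzero(uncovered)[0])
            selected.append(idx)
            dist = np.linalg.norm(embeddings - embeddings[idx], axis=1)
            uncovered &= dist > eps
        return selected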
☆ DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation
Commercial RGB-D cameras often produce noisy, incomplete depth maps for
non-Lambertian objects. Traditional depth completion methods struggle to
generalize due to the limited diversity and scale of training data. Recent
advances exploit visual priors from pre-trained text-to-image diffusion models
to enhance generalization in dense prediction tasks. However, we find that
biases arising from training-inference mismatches in the vanilla diffusion
framework significantly impair depth completion performance. Additionally, the
lack of distinct visual features in non-Lambertian regions further hinders
precise prediction. To address these issues, we propose \textbf{DidSee}, a
diffusion-based framework for depth completion on non-Lambertian objects.
First, we integrate a rescaled noise scheduler enforcing a zero terminal
signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a
noise-agnostic single-step training formulation to alleviate error accumulation
caused by exposure bias and optimize the model with a task-specific loss.
Finally, we incorporate a semantic enhancer that enables joint depth completion
and semantic segmentation, distinguishing objects from backgrounds and yielding
precise, fine-grained depth maps. DidSee achieves state-of-the-art performance
on multiple benchmarks, demonstrates robust real-world generalization, and
effectively improves downstream tasks such as category-level pose estimation
and robotic grasping. Project page: https://wenzhoulyu.github.io/DidSee/
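The rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio
follows a known recipe: rescale the cumulative alphas so the final one is
exactly zero. A sketch of that rescaling, assuming a standard beta schedule as
input; DidSee's actual scheduler may add further changes.

    import torch

    def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
        # Shift and scale sqrt(alpha_bar) so the terminal value is exactly zero,
        # then convert back to a beta schedule.
        alphas_bar_sqrt = torch.cumprod(1.0 - betas, dim=0).sqrt()
        a_first, a_last = alphas_bar_sqrt[0], alphas_bar_sqrt[-1]
        alphas_bar_sqrt = (alphas_bar_sqrt - a_last) * a_first / (a_first - a_last)
        alphas_bar = alphas_bar_sqrt ** 2
        alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
        return 1.0 - alphas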
☆ Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation
Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun, Jialian Wu, Yusheng Su, Xiaodong Yu, Emad Barsoum, Zicheng Liu
Image tokenization plays a critical role in reducing the computational
demands of modeling high-resolution images, significantly improving the
efficiency of image and multimodal understanding and generation. Recent
advances in 1D latent spaces have reduced the number of tokens required by
eliminating the need for a 2D grid structure. In this paper, we further advance
compact discrete image representation by introducing 1D binary image latents.
By representing each image as a sequence of binary vectors, rather than using
traditional one-hot codebook tokens, our approach preserves high-resolution
details while maintaining the compactness of 1D latents. To the best of our
knowledge, our text-to-image models are the first to achieve competitive
performance in both diffusion and auto-regressive generation using just 128
discrete tokens for images up to 1024x1024, demonstrating up to a 32-fold
reduction in token numbers compared to standard VQ-VAEs. The proposed 1D binary
latent space, coupled with simple model architectures, achieves marked
improvements in both training and inference speed. Our text-to-image models
allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X
GPUs, and the training can be completed within 200 GPU days. Our models achieve
competitive performance compared to modern image generation models without any
in-house private training data or post-training refinements, offering a
scalable and efficient alternative to conventional tokenization methods.
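Representing each image as a sequence of binary vectors implies a
differentiable binarization step. A minimal sketch using a straight-through
estimator is given below; the paper's actual quantizer is not specified here,
so treat this as an assumption.

    import torch

    def binarize_ste(latents: torch.Tensor) -> torch.Tensor:
        # Forward pass: hard {0, 1} bits. Backward pass: identity gradient,
        # so the encoder can still be trained end to end.
        hard_bits = (latents > 0).float()
        return latents + (hard_bits - latents).detach()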
☆ LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection
Effective deep feature extraction via feature-level fusion is crucial for
multimodal object detection. However, previous studies often involve complex
training processes that integrate modality-specific features by stacking
multiple feature-level fusion units, leading to significant computational
overhead. To address this issue, we propose a new fusion detection baseline
that uses a single feature-level fusion unit to enable high-performance
detection, thereby simplifying the training process. Based on this approach, we
propose a lightweight attention-guided self-modulation feature fusion network
(LASFNet), which introduces a novel attention-guided self-modulation feature
fusion (ASFF) module that adaptively adjusts the responses of fusion features
at both global and local levels based on attention information from different
modalities, thereby promoting comprehensive and enriched feature generation.
Additionally, a lightweight feature attention transformation module (FATM) is
designed at the neck of LASFNet to enhance the focus on fused features and
minimize information loss. Extensive experiments on three representative
datasets demonstrate that, compared to state-of-the-art methods, our approach
achieves a favorable efficiency-accuracy trade-off, reducing the number of
parameters and computational cost by as much as 90% and 85%, respectively,
while improving detection accuracy (mAP) by 1%-3%. The code will be
open-sourced at https://github.com/leileilei2000/LASFNet.
☆ Multimodal Prompt Alignment for Facial Expression Recognition ICCV2025
Prompt learning has been widely adopted to efficiently adapt vision-language
models (VLMs) like CLIP for various downstream tasks. Despite their success,
current VLM-based facial expression recognition (FER) methods struggle to
capture fine-grained textual-visual relationships, which are essential for
distinguishing subtle differences between facial expressions. To address this
challenge, we propose a multimodal prompt alignment framework for FER, called
MPA-FER, that provides fine-grained semantic guidance to the learning process
of prompted visual features, resulting in more precise and interpretable
representations. Specifically, we introduce a multi-granularity hard prompt
generation strategy that utilizes a large language model (LLM) like ChatGPT to
generate detailed descriptions for each facial expression. The LLM-based
external knowledge is injected into the soft prompts by minimizing the feature
discrepancy between the soft prompts and the hard prompts. To preserve the
generalization abilities of the pretrained CLIP model, our approach
incorporates prototype-guided visual feature alignment, ensuring that the
prompted visual features from the frozen image encoder align closely with
class-specific prototypes. Additionally, we propose a cross-modal global-local
alignment module that focuses on expression-relevant facial features, further
improving the alignment between textual and visual features. Extensive
experiments demonstrate that our framework outperforms state-of-the-art methods on
three FER benchmark datasets, while retaining the benefits of the pretrained
model and minimizing computational costs.
comment: To appear in ICCV2025
☆ HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation
Machine learning-assisted diagnosis is gaining traction in skin disease
detection, but training effective models requires large amounts of high-quality
data. Skin disease datasets often suffer from class imbalance, privacy
concerns, and object bias, making data augmentation essential. While classical
generative models are widely used, they demand extensive computational
resources and lengthy training time. Quantum computing offers a promising
alternative, but existing quantum-based image generation methods can only yield
grayscale low-quality images. Through a novel classical-quantum latent space
fusion technique, our work overcomes this limitation and introduces the first
classical-quantum generative adversarial network (GAN) capable of generating
color medical images. Our model outperforms classical deep convolutional GANs
and existing hybrid classical-quantum GANs in both image generation quality and
classification performance boost when used as data augmentation. Moreover, the
performance boost is comparable with that achieved using state-of-the-art
classical generative models, yet with over 25 times fewer parameters and 10
times fewer training epochs. Such results suggest a promising future for
quantum image generation as quantum hardware advances. Finally, we demonstrate
the robust performance of our model on a real IBM quantum machine with hardware
noise.
☆ FedSC: Federated Learning with Semantic-Aware Collaboration KDD 2025
Federated learning (FL) aims to train models collaboratively across clients
without sharing data, thereby preserving privacy. However, one major challenge is
the data heterogeneity issue, which refers to the biased labeling preferences
at multiple clients. A number of existing FL methods attempt to tackle data
heterogeneity locally (e.g., regularizing local models) or globally (e.g.,
fine-tuning global model), often neglecting inherent semantic information
contained in each client. To explore the possibility of using intra-client
semantically meaningful knowledge in handling data heterogeneity, in this
paper, we propose Federated Learning with Semantic-Aware Collaboration (FedSC)
to capture client-specific and class-relevant knowledge across heterogeneous
clients. The core idea of FedSC is to construct relational prototypes and
consistent prototypes at the semantic level, aiming to provide rich underlying
class knowledge and stable convergence signals in a prototype-wise
collaborative way. On the one hand, FedSC introduces an inter-contrastive
learning strategy to bring instance-level embeddings closer to relational
prototypes with the same semantics and away from distinct classes. On the other
hand, FedSC devises consistent prototypes via a discrepancy aggregation manner,
as a regularization penalty to constrain the optimization region of the local
model. Moreover, a theoretical analysis for FedSC is provided to ensure a
convergence guarantee. Experimental results on various challenging scenarios
demonstrate the effectiveness of FedSC and the efficiency of crucial
components.
comment: 12 pages, KDD 2025
☆ Bridging Video Quality Scoring and Justification via Large Multimodal Models
Classical video quality assessment (VQA) methods generate a numerical score
to judge a video's perceived visual fidelity and clarity. Yet, a score fails to
describe the video's complex quality dimensions, restricting its applicability.
Benefiting from the linguistic output, adapting video large multimodal models
(LMMs) to VQA via instruction tuning has the potential to address this issue.
The core of the approach lies in the video quality-centric instruction data.
Previous explorations mainly focus on the image domain, and their data
generation processes heavily rely on human quality annotations and proprietary
systems, limiting data scalability and effectiveness. To address these
challenges, we propose the Score-based Instruction Generation (SIG) pipeline.
Specifically, SIG first scores multiple quality dimensions of an unlabeled
video and maps scores to text-defined levels. It then explicitly incorporates a
hierarchical Chain-of-Thought (CoT) to model the correlation between specific
dimensions and overall quality, mimicking the human visual system's reasoning
process. The automated pipeline eliminates the reliance on expert-written
quality descriptions and proprietary systems, ensuring data scalability and
generation efficiency. As a result, the Score2Instruct (S2I) dataset
contains over 320K diverse instruction-response pairs, laying the basis for
instruction tuning. Moreover, to advance video LMMs' quality scoring and
justification abilities simultaneously, we devise a progressive tuning strategy
to fully unleash the power of S2I. Built upon SIG, we further curate a
benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the
quality justification capacity of video LMMs. Experimental results on the
S2I-Bench and existing benchmarks indicate that our method consistently
improves quality scoring and justification capabilities across multiple video
LMMs.
comment: 15 pages, 4 figures, 8 tables
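The score-to-text step of SIG maps a numeric quality score onto text-defined
levels. A toy sketch is below; the number of levels and the thresholds are
illustrative assumptions, not the dataset's actual definitions.

    def score_to_level(score: float) -> str:
        # Map a score in [0, 1] to one of five text-defined quality levels.
        levels = ["bad", "poor", "fair", "good", "excellent"]
        index = min(int(score * len(levels)), len(levels) - 1)
        return levels[index]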
☆ User-in-the-Loop View Sampling with Error Peaking Visualization ICIP 2025
Augmented reality (AR) provides ways to visualize missing view samples for
novel view synthesis. Existing approaches present 3D annotations for new view
samples and task users with taking images by aligning the AR display. This data
collection task is known to be mentally demanding and limits capture areas to
pre-defined small areas due to the ideal but restrictive underlying sampling
theory. To free users from 3D annotations and limited scene exploration, we
propose using locally reconstructed light fields and visualizing errors to be
removed by inserting new views. Our results show that the error-peaking
visualization is less invasive, reduces disappointment in final results, and is
satisfactory with fewer view samples in our mobile view synthesis system. We
also show that our approach can contribute to recent radiance field
reconstruction for larger scenes, such as 3D Gaussian splatting.
comment: Accepted at IEEE ICIP 2025, Project Page:
https://mediated-reality.github.io/projects/yasunaga_icip25/
☆ The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion
Bang Gong, Luchao Qi, Jiaye Wu, Zhicheng Fu, Chunbo Song, David W. Jacobs, John Nicholson, Roni Sengupta
We introduce the Aging Multiverse, a framework for generating multiple
plausible facial aging trajectories from a single image, each conditioned on
external factors such as environment, health, and lifestyle. Unlike prior
methods that model aging as a single deterministic path, our approach creates
an aging tree that visualizes diverse futures. To enable this, we propose a
training-free diffusion-based method that balances identity preservation, age
accuracy, and condition control. Our key contributions include attention mixing
to modulate editing strength and a Simulated Aging Regularization strategy to
stabilize edits. Extensive experiments and user studies demonstrate
state-of-the-art performance across identity preservation, aging realism, and
conditional alignment, outperforming existing editing and age-progression
models, which often fail to account for one or more of the editing criteria. By
transforming aging into a multi-dimensional, controllable, and interpretable
process, our approach opens up new creative and practical avenues in digital
storytelling, health education, and personalized visualization.
☆ Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning
Tyler Ward, Xiaoqin Wang, Braxton McFarland, Md Atik Ahamed, Sahar Nozad, Talal Arshad, Hafsa Nebbache, Jin Chen, Abdullah Imran
Complete removal of cancer tumors with a negative specimen margin during
lumpectomy is essential in reducing breast cancer recurrence. However, 2D
specimen radiography (SR), the current method used to assess intraoperative
specimen margin status, has limited accuracy, resulting in nearly a quarter of
patients requiring additional surgery. To address this, we propose a novel deep
learning framework combining the Segment Anything Model (SAM) with
Forward-Forward Contrastive Learning (FFCL), a pre-training strategy leveraging
both local and global contrastive learning for patch-level classification of SR
images. After annotating SR images with regions of known malignancy,
non-malignant tissue, and pathology-confirmed margins, we pre-train a ResNet-18
backbone with FFCL to classify margin status, then reconstruct coarse binary
masks to prompt SAM for refined tumor margin segmentation. Our approach
achieved an AUC of 0.8455 for margin classification and segmented margins with
a 27.4% improvement in Dice similarity over baseline models, while reducing
inference time to 47 milliseconds per image. These results demonstrate that
FFCL-SAM significantly enhances both the speed and accuracy of intraoperative
margin assessment, with strong potential to reduce re-excision rates and
improve surgical outcomes in breast cancer treatment. Our code is available at
https://github.com/tbwa233/FFCL-SAM/.
comment: 19 pages, 7 figures, 3 tables
☆ VisionGuard: Synergistic Framework for Helmet Violation Detection
Enforcing helmet regulations among motorcyclists is essential for enhancing
road safety and ensuring the effectiveness of traffic management systems.
However, automatic detection of helmet violations faces significant challenges
due to environmental variability, camera angles, and inconsistencies in the
data. These factors hinder reliable detection of motorcycles and riders and
disrupt consistent object classification. To address these challenges, we
propose VisionGuard, a synergistic multi-stage framework designed to overcome
the limitations of frame-wise detectors, especially in scenarios with class
imbalance and inconsistent annotations. VisionGuard integrates two key
components: Adaptive Labeling and Contextual Expander modules. The Adaptive
Labeling module is a tracking-based refinement technique that enhances
classification consistency by leveraging a tracking algorithm to assign
persistent labels across frames and correct misclassifications. The Contextual
Expander module improves recall for underrepresented classes by generating
virtual bounding boxes with appropriate confidence scores, effectively
addressing the impact of data imbalance. Experimental results show that
VisionGuard improves overall mAP by 3.1% compared to baseline detectors,
demonstrating its effectiveness and potential for real-world deployment in
traffic surveillance systems, ultimately promoting safety and regulatory
compliance.
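The Adaptive Labeling module assigns persistent labels along tracks to correct
per-frame misclassifications. A simple majority-vote sketch over hypothetical
detection records (dicts with 'track_id' and 'label') is shown below; the real
module is tracker-specific.

    from collections import Counter, defaultdict

    def refine_labels_by_track(detections: list) -> list:
        # Vote on the class within each track, then overwrite every detection
        # in that track with the majority class.
        votes = defaultdict(Counter)
        for det in detections:
            votes[det["track_id"]][det["label"]] += 1
        majority = {tid: counts.most_common(1)[0][0] for tid, counts in votes.items()}
        for det in detections:
            det["label"] = majority[det["track_id"]]
        return detections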
☆ Inverse Scene Text Removal
Scene text removal (STR) aims to erase textual elements from images. It was
originally intended for removing privacy-sensitive or undesired texts from
natural scene images, but is now also applied to typographic images. STR
typically detects text regions and then inpaints them. Although STR has advanced
through neural networks and synthetic data, misuse risks have increased. This
paper investigates Inverse STR (ISTR), which analyzes STR-processed images and
focuses on binary classification (detecting whether an image has undergone STR)
and localizing removed text regions. We demonstrate in experiments that these
tasks are achievable with high accuracies, enabling detection of potential
misuse and improving STR. We also attempt to recover the removed text content by
training a text recognizer to understand its difficulty.
comment: 17 pages
☆ Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology
Challenges such as the lack of high-quality annotations, long-tailed data
distributions, and inconsistent staining styles pose significant obstacles to
training neural networks to detect abnormal cells in cytopathology robustly.
This paper proposes a style-aligned image composition (SAIC) method that
composes high-fidelity and style-preserved pathological images to enhance the
effectiveness and robustness of detection models. Without additional training,
SAIC first selects an appropriate candidate from the abnormal cell bank based
on attribute guidance. Then, it employs a high-frequency feature reconstruction
to achieve a style-aligned and high-fidelity composition of abnormal cells and
pathological backgrounds. Finally, it introduces a large vision-language model
to filter for high-quality synthesized images. Experimental results demonstrate that
incorporating SAIC-synthesized images effectively enhances the performance and
robustness of abnormal cell detection for tail categories and styles, thereby
improving overall detection performance. The comprehensive quality evaluation
further confirms the generalizability and practicality of SAIC in clinical
application scenarios. Our code will be released at
https://github.com/Joey-Qi/SAIC.
comment: MIDL 2025 Oral
☆ DBMovi-GS: Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting CVPR
Novel view synthesis is a task of generating scenes from unseen perspectives;
however, synthesizing dynamic scenes from blurry monocular videos remains an
unresolved challenge that has yet to be effectively addressed. Existing novel
view synthesis methods are often constrained by their reliance on
high-resolution images or strong assumptions about static geometry and rigid
scene priors. Consequently, their approaches lack robustness in real-world
environments with dynamic object and camera motion, leading to instability and
degraded visual fidelity. To address this, we propose Motion-aware Dynamic View
Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting
(DBMovi-GS), a method designed for dynamic view synthesis from blurry monocular
videos. Our model generates dense 3D Gaussians, restoring sharpness from blurry
videos and reconstructing detailed 3D geometry of the scene affected by dynamic
motion variations. Our model achieves robust performance in novel view
synthesis under dynamic blurry scenes and sets a new benchmark in realistic
novel view synthesis for blurry monocular video inputs.
comment: CVPRW 2025, Neural Fields Beyond Conventional Cameras
♻ ☆ TCDiff++: An End-to-end Trajectory-Controllable Diffusion Model for Harmonious Music-Driven Group Choreography
Music-driven dance generation has garnered significant attention due to its
wide range of industrial applications, particularly in the creation of group
choreography. During the group dance generation process, however, most existing
methods still face three primary issues: multi-dancer collisions, single-dancer
foot sliding, and abrupt swapping in the generation of long group dances. In this
paper, we propose TCDiff++, a music-driven end-to-end framework designed to
generate harmonious group dance. Specifically, to mitigate multi-dancer
collisions, we utilize a dancer positioning embedding to better maintain the
relative positioning among dancers. Additionally, we incorporate a
distance-consistency loss to ensure that inter-dancer distances remain within
plausible ranges. To address the issue of single-dancer foot sliding, we
introduce a swap mode embedding to indicate dancer swapping patterns and design
a Footwork Adaptor to refine raw motion, thereby minimizing foot sliding. For
long group dance generation, we present a long group diffusion sampling
strategy that reduces abrupt position shifts by injecting positional
information into the noisy input. Furthermore, we integrate a Sequence Decoder
layer to enhance the model's ability to selectively process long sequences.
Extensive experiments demonstrate that our TCDiff++ achieves state-of-the-art
performance, particularly in long-duration scenarios, ensuring high-quality and
coherent group dance generation.
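The distance-consistency loss keeps inter-dancer distances within a plausible
range. A sketch of one possible formulation is below; the distance bounds and
the exact penalty are assumptions, not the paper's definition.

    import torch

    def distance_consistency_loss(positions: torch.Tensor,
                                  d_min: float = 0.5, d_max: float = 4.0) -> torch.Tensor:
        # positions: (num_dancers, 3). Penalize pairwise distances that are
        # closer than d_min (collisions) or farther than d_max (drifting apart).
        diff = positions[:, None, :] - positions[None, :, :]
        dist = diff.norm(dim=-1)
        off_diag = ~torch.eye(positions.shape[0], dtype=torch.bool, device=positions.device)
        d = dist[off_diag]
        return (torch.relu(d_min - d) + torch.relu(d - d_max)).mean()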
♻ ☆ Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition
Foundation models are rapidly transforming Earth Observation data mining by
enabling generalizable and scalable solutions for key tasks such as scene
classification and semantic segmentation. While most efforts in the geospatial
domain have focused on developing large models trained from scratch using
massive Earth Observation datasets, an alternative strategy that remains
underexplored is the reuse and combination of existing pretrained models. In
this study, we investigate whether foundation models pretrained on remote
sensing and general vision datasets can be effectively combined to improve
performance across a diverse set of key Earth Observation tasks. Using the
GEO-Bench benchmark, we evaluate several prominent models, including Prithvi,
Hiera, and DOFA, on eleven datasets covering a range of spatial resolutions,
sensor modalities, and task types. The results show that feature-level
ensembling of smaller pretrained models can match or exceed the performance of
much larger models, while requiring less training time and computational
resources. Moreover, the study highlights the potential of applying knowledge
distillation to transfer the strengths of ensembles into more compact models,
offering a practical path for deploying foundation models in real-world Earth
Observation applications.
♻ ☆ Consensus-Driven Uncertainty for Robotic Grasping based on RGB Perception IROS 2025
Deep object pose estimators are notoriously overconfident. A grasping agent
that both estimates the 6-DoF pose of a target object and predicts the
uncertainty of its own estimate could avoid task failure by choosing not to act
under high uncertainty. Even though object pose estimation improves and
uncertainty quantification research continues to make strides, few studies have
connected them to the downstream task of robotic grasping. We propose a method
for training lightweight, deep networks to predict whether a grasp guided by an
image-based pose estimate will succeed before that grasp is attempted. We
generate training data for our networks via object pose estimation on real
images and simulated grasping. We also find that, despite high object
variability in grasping trials, networks benefit from training on all objects
jointly, suggesting that a diverse variety of objects can nevertheless
contribute to the same goal.
comment: Accepted to IROS 2025
♻ ☆ Learning to Be a Transformer to Pinpoint Anomalies
To efficiently deploy strong, often pre-trained feature extractors, recent
Industrial Anomaly Detection and Segmentation (IADS) methods process
low-resolution images, e.g., 224x224 pixels, obtained by downsampling the
original input images. However, while numerous industrial applications demand
the identification of both large and small defects, downsampling the input
image to a low resolution may hinder a method's ability to pinpoint tiny
anomalies. We propose a novel Teacher--Student paradigm to leverage strong
pre-trained features while processing high-resolution input images very
efficiently. The core idea concerns training two shallow MLPs (the Students) on
nominal images so as to mimic the mappings between the patch embeddings induced
by the self-attention layers of a frozen vision Transformer (the Teacher).
Indeed, learning these mappings sets forth a challenging pretext task that
small-capacity models are unlikely to accomplish on out-of-distribution data
such as anomalous images. Our method can spot anomalies from high-resolution
images and runs way faster than competitors, achieving state-of-the-art
performance on MVTec AD and the best segmentation results on VisA. We also
propose novel evaluation metrics to capture robustness to defect size, i.e.,
the ability to preserve good localisation from large anomalies to tiny ones.
Evaluating our method also by these metrics reveals its neatly superior
performance.
comment: Accepted at IEEE Access
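The Teacher--Student pretext task amounts to regressing the token-to-token
mapping of a frozen Transformer block with a shallow MLP. A minimal sketch under
that reading is below; the dimensions and the single-block framing are
assumptions.

    import torch
    import torch.nn as nn

    class StudentMLP(nn.Module):
        # Shallow MLP that maps patch embeddings to patch embeddings.
        def __init__(self, dim: int = 768, hidden: int = 1024):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, N, dim)
            return self.net(tokens)

    def distillation_step(student, frozen_teacher_block, tokens, optimizer):
        # Regress the frozen Teacher mapping on nominal images with an MSE loss.
        with torch.no_grad():
            target = frozen_teacher_block(tokens)
        loss = nn.functional.mse_loss(student(tokens), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()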
♻ ☆ CanFields: Consolidating Diffeomorphic Flows for Non-Rigid 4D Interpolation from Arbitrary-Length Sequences ICCV2025
We introduce Canonical Consolidation Fields (CanFields). This novel method
interpolates arbitrary-length sequences of independently sampled 3D point
clouds into a unified, continuous, and coherent deforming shape. Unlike prior
methods that oversmooth geometry or produce topological and geometric
artifacts, CanFields optimizes fine-detailed geometry and deformation jointly
in an unsupervised fitting with two novel bespoke modules. First, we introduce
a dynamic consolidator module that adjusts the input and assigns confidence
scores, balancing the optimization of the canonical shape and its motion.
Second, we represent the motion as a diffeomorphic flow parameterized by a
smooth velocity field. We have validated the robustness and accuracy of our
method on more than 50 diverse sequences, demonstrating its superior performance
even with
missing regions, noisy raw scans, and sparse data. Our project page is at:
https://wangmiaowei.github.io/CanFields.github.io/.
comment: ICCV2025 Accepted
♻ ☆ SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model
With the rapid advancement of autonomous driving technology, a lack of data
has become a major obstacle to enhancing perception model accuracy. Researchers
are now exploring controllable data generation using world models to diversify
datasets. However, previous work has been limited to studying image generation
quality on specific public datasets. There is still relatively little research
on how to build data generation engines for real-world application scenes to
achieve large-scale data generation for challenging scenes. In this paper, a
simulator-conditioned scene generation engine based on a world model is
proposed. By constructing a simulation system consistent with real-world
scenes, simulation data and labels for arbitrary scenes, which serve as the
conditions for data generation in the world model, can be collected. This
yields a novel data generation pipeline that combines the powerful scene
simulation capabilities of the simulation engine with the robust data
generation capabilities of the world model. In addition, a benchmark with
proportionally constructed virtual and real data is provided for exploring the
capabilities of world models in real-world scenes. Quantitative results show
that the generated images significantly improve the performance of downstream
perception models. Finally, we
explored the generative performance of the world model in urban autonomous
driving scenarios. All the data and code will be available at
https://github.com/Li-Zn-H/SimWorld.
comment: 8 pages, 4 figures
♻ ☆ Chain-of-Sketch: Enabling Global Visual Reasoning
Modern vision models have achieved remarkable success in benchmarks where
local features provide critical information about the target. There is now a
growing interest in tackling tasks requiring more global reasoning, where local
features do not provide significant information. Minsky and Papert put forward
such tasks in 1969 with their connectivity study, exposing the limitations of
the perceptron model. In this paper, we introduce an expanded set of global
visual datasets involving graphs, strings, mazes, and image grids. We show that
large vision models still struggle to learn these tasks efficiently. Similarly,
state-of-the-art multi-modal LLMs perform poorly on these datasets. We explain
this learning inefficiency by means of the 'globality degree' measure. To
mitigate this, we propose a method called chain-of-sketch (CoS). Similar to the
chain-of-thought and scratchpad techniques used in language models, CoS breaks
the original task into intermediate visual steps to help learn a complex task.
In addition, we show that not all CoS strategies perform equally well. Our key
insight is to impose a Markovian structure on the CoS frames. This leads to the
introduction of 'inductive CoS' which achieves better out-of-distribution
generalization and performs well even with smaller models compared to
non-inductive variants.
comment: additional experiments added, title changed from "Visual Scratchpads:
Enabling Global Reasoning in Vision" to "Chain-of-Sketch: Enabling Global
Visual Reasoning"
♻ ☆ QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning ICCV 2025
The practical deployment of diffusion models is still hindered by the high
memory and computational overhead. Although quantization paves a way for model
compression and acceleration, existing methods face challenges in achieving
low-bit quantization efficiently. In this paper, we identify imbalanced
activation distributions as a primary source of quantization difficulty, and
propose to adjust these distributions through weight finetuning to be more
quantization-friendly. We provide both theoretical and empirical evidence
supporting finetuning as a practical and reliable solution. Building on this
approach, we further distinguish two critical types of quantized layers: those
responsible for retaining essential temporal information and those particularly
sensitive to bit-width reduction. By selectively finetuning these layers under
both local and global supervision, we mitigate performance degradation while
enhancing quantization efficiency. Our method demonstrates its efficacy across
three high-resolution image generation tasks, obtaining state-of-the-art
performance across multiple bit-width settings.
comment: ICCV 2025. Code is available at
https://github.com/hatchetProject/QuEST
♻ ☆ AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration ICCV 2025
We present AnyCalib, a method for calibrating the intrinsic parameters of a
camera from a single in-the-wild image, that is agnostic to the camera model.
Current methods are predominantly tailored to specific camera models and/or
require extrinsic cues, such as the direction of gravity, to be visible in the
image. In contrast, we argue that the perspective and distortion cues inherent
in images are sufficient for model-agnostic camera calibration. To demonstrate
this, we frame the calibration process as the regression of the rays
corresponding to each pixel. We show, for the first time, that this
intermediate representation allows for a closed-form recovery of the intrinsics
for a wide range of camera models, including but not limited to: pinhole,
Brown-Conrady and Kannala-Brandt. Our approach also applies to edited --
cropped and stretched -- images. Experimentally, we demonstrate that AnyCalib
consistently outperforms alternative methods, including 3D foundation models,
despite being trained on orders of magnitude less data. Code is available at
https://github.com/javrtg/AnyCalib.
comment: Accepted to ICCV 2025
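For the pinhole case, the closed-form recovery of intrinsics from regressed
rays reduces to two small least-squares problems, since u = fx * x/z + cx and
v = fy * y/z + cy are linear in the unknowns. The sketch below illustrates only
this special case; AnyCalib's general recovery covers many more camera models.

    import numpy as np

    def pinhole_intrinsics_from_rays(uv: np.ndarray, rays: np.ndarray):
        # uv: (N, 2) pixel coordinates; rays: (N, 3) ray directions per pixel.
        x_over_z = rays[:, 0] / rays[:, 2]
        y_over_z = rays[:, 1] / rays[:, 2]
        A_u = np.stack([x_over_z, np.ones_like(x_over_z)], axis=1)
        A_v = np.stack([y_over_z, np.ones_like(y_over_z)], axis=1)
        (fx, cx), *_ = np.linalg.lstsq(A_u, uv[:, 0], rcond=None)
        (fy, cy), *_ = np.linalg.lstsq(A_v, uv[:, 1], rcond=None)
        return fx, fy, cx, cy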
♻ ☆ EgoM2P: Egocentric Multimodal Multitask Pretraining ICCV 2025
Understanding multimodal signals in egocentric vision, such as RGB video,
depth, camera poses, and gaze, is essential for applications in augmented
reality, robotics, and human-computer interaction, enabling systems to better
interpret the camera wearer's actions, intentions, and surrounding environment.
However, building large-scale egocentric multimodal and multitask models
presents unique challenges. Egocentric data are inherently heterogeneous, with
large variations in modality coverage across devices and settings. Generating
pseudo-labels for missing modalities, such as gaze or head-mounted camera
trajectories, is often infeasible, making standard supervised learning
approaches difficult to scale. Furthermore, dynamic camera motion and the
complex temporal and spatial structure of first-person video pose additional
challenges for the direct application of existing multimodal foundation models.
To address these challenges, we introduce a set of efficient temporal
tokenizers and propose EgoM2P, a masked modeling framework that learns from
temporally-aware multimodal tokens to train a large, general-purpose model for
egocentric 4D understanding. This unified design supports multitasking across
diverse egocentric perception and synthesis tasks, including gaze prediction,
egocentric camera tracking, and monocular depth estimation from egocentric
video, and also serves as a generative model for conditional egocentric video
synthesis. Across these tasks, EgoM2P matches or outperforms specialist models
while being an order of magnitude faster. We will fully open-source EgoM2P to
support the community and advance egocentric vision research. Project page:
https://egom2p.github.io/.
comment: Accepted by ICCV 2025
♻ ☆ Fake it till You Make it: Reward Modeling as Discriminative Prediction
An effective reward model plays a pivotal role in reinforcement learning for
post-training enhancement of visual generative models. However, current
approaches of reward modeling suffer from implementation complexity due to
their reliance on extensive human-annotated preference data or meticulously
engineered quality dimensions that are often incomplete and
engineering-intensive. Inspired by adversarial training in generative
adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward
modeling framework that eliminates manual preference annotation and explicit
quality dimension engineering. Our method trains the reward model through
discrimination between a small set of representative, unpaired target
samples (denoted as Preference Proxy Data) and model-generated ordinary outputs,
requiring only a few hundred target samples. Comprehensive experiments
demonstrate our GAN-RM's effectiveness across multiple key applications
including test-time scaling implemented as Best-of-N sample filtering,
post-training approaches like Supervised Fine-Tuning (SFT) and Direct
Preference Optimization (DPO). Code and data will be released at
https://github.com/Visualignment/GAN-RM.
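Training the reward model as a discriminator between a few hundred Preference
Proxy samples and ordinary model outputs can be sketched with a binary
cross-entropy objective; the feature inputs and scoring head below are
assumptions, and the released code may differ.

    import torch
    import torch.nn as nn

    def gan_rm_step(reward_model, target_feats, generated_feats, optimizer):
        # Push scores up for Preference Proxy Data and down for generated outputs.
        logits_target = reward_model(target_feats)
        logits_generated = reward_model(generated_feats)
        loss = nn.functional.binary_cross_entropy_with_logits(
            logits_target, torch.ones_like(logits_target)
        ) + nn.functional.binary_cross_entropy_with_logits(
            logits_generated, torch.zeros_like(logits_generated)
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()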
♻ ☆ Materialist: Physically Based Editing Using Single-Image Inverse Rendering
Lezhong Wang, Duc Minh Tran, Ruiqi Cui, Thomson TG, Anders Bjorholm Dahl, Siavash Arjomand Bigdeli, Jeppe Revall Frisvad, Manmohan Chandraker
Achieving physically consistent image editing remains a significant challenge
in computer vision. Existing image editing methods typically rely on neural
networks, which struggle to accurately handle shadows and refractions.
Conversely, physics-based inverse rendering often requires multi-view
optimization, limiting its practicality in single-image scenarios. In this
paper, we propose Materialist, a method combining a learning-based approach
with physically based progressive differentiable rendering. Given an image, our
method leverages neural networks to predict initial material properties.
Progressive differentiable rendering is then used to optimize the environment
map and refine the material properties with the goal of closely matching the
rendered result to the input image. Our approach enables a range of
applications, including material editing, object insertion, and relighting,
while also introducing an effective method for editing material transparency
without requiring full scene geometry. Furthermore, our environment map
estimation method also achieves state-of-the-art performance, further enhancing
the accuracy of image editing tasks. Experiments demonstrate strong performance
across synthetic and real-world datasets, excelling even on challenging
out-of-domain images. Project website:
https://lez-s.github.io/materialist_project/
comment: Add acknowledgements, more authors and more results. Project website:
https://lez-s.github.io/materialist_project/
♻ ☆ DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection ICCV 2025
Francisco Caetano, Christiaan Viviers, Luis A. Zavala-Mondragón, Peter H. N. de With, Fons van der Sommen
Out-of-distribution (OOD) detection holds significant importance across many
applications. While semantic and domain-shift OOD problems are well-studied,
this work focuses on covariate shifts - subtle variations in the data
distribution that can degrade machine learning performance. We hypothesize that
detecting these subtle shifts can improve our understanding of in-distribution
boundaries, ultimately improving OOD detection. In adversarial discriminators
trained with Batch Normalization (BN), real and adversarial samples form
distinct domains with unique batch statistics - a property we exploit for OOD
detection. We introduce DisCoPatch, an unsupervised Adversarial Variational
Autoencoder (VAE) framework that harnesses this mechanism. During inference,
batches consist of patches from the same image, ensuring a consistent data
distribution that allows the model to rely on batch statistics. DisCoPatch uses
the VAE's suboptimal outputs (generated and reconstructed) as negative samples
to train the discriminator, thereby improving its ability to delineate the
boundary between in-distribution samples and covariate shifts. By tightening
this boundary, DisCoPatch achieves state-of-the-art results in public OOD
detection benchmarks. The proposed model not only excels in detecting covariate
shifts, achieving 95.5% AUROC on ImageNet-1K(-C) but also outperforms all prior
methods on public Near-OOD (95.0%) benchmarks. With a compact model size of
25MB, it achieves high OOD detection performance at notably lower latency than
existing methods, making it an efficient and practical solution for real-world
OOD detection applications. The code is publicly available.
comment: ICCV 2025
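At inference, DisCoPatch builds each batch from patches of the same image so
that BatchNorm statistics reflect a single, consistent distribution. A sketch
of that batching step, with an illustrative patch size:

    import torch

    def image_to_patch_batch(image: torch.Tensor, patch: int = 64) -> torch.Tensor:
        # image: (C, H, W) -> batch of non-overlapping patches (B, C, patch, patch).
        c, h, w = image.shape
        image = image[:, : h - h % patch, : w - w % patch]
        tiles = image.unfold(1, patch, patch).unfold(2, patch, patch)
        return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)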
♻ ☆ Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling ICCV 2025
Fengxiang Wang, Hongzhen Wang, Di Wang, Zonghao Guo, Zhenyu Zhong, Long Lan, Wenjing Yang, Jing Zhang
Masked Image Modeling (MIM) has become an essential method for building
foundational visual models in remote sensing (RS). However, the limitations in
size and diversity of existing RS datasets restrict the ability of MIM methods
to learn generalizable representations. Additionally, conventional MIM
techniques, which require reconstructing all tokens, introduce unnecessary
computational overhead. To address these issues, we present a new pre-training
pipeline for RS models, featuring the creation of a large-scale RS dataset and
an efficient MIM approach. We curated a high-quality dataset named
\textbf{OpticalRS-13M} by collecting publicly available RS datasets and
processing them through exclusion, slicing, and deduplication. OpticalRS-13M
comprises 13 million optical images covering various RS tasks, such as object
detection and pixel segmentation. To enhance efficiency, we propose
\textbf{SelectiveMAE}, a pre-training method that dynamically encodes and
reconstructs semantically rich patch tokens, thereby reducing the
inefficiencies of traditional MIM models caused by redundant background pixels
in RS images. Extensive experiments show that OpticalRS-13M significantly
improves classification, detection, and segmentation performance, while
SelectiveMAE increases training efficiency by more than 2$\times$. This
highlights the effectiveness and scalability of our pipeline in developing RS
foundational models. The dataset, source code, and trained models will be
released at https://github.com/MiliLab/SelectiveMAE.
comment: ICCV 2025
♻ ☆ OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, Hai-Bao Chen
Text-to-image (T2I) models have garnered significant attention for generating
high-quality images aligned with text prompts. However, rapid T2I model
advancements reveal limitations in early benchmarks, which lack comprehensive
evaluations of, for example, reasoning, text rendering, and style. Notably,
recent state-of-the-art models, with their rich knowledge
modeling capabilities, show promising results on the image generation problems
requiring strong reasoning ability, yet existing evaluation systems have not
adequately addressed this frontier. To systematically address these gaps, we
introduce OneIG-Bench, a meticulously designed comprehensive benchmark
framework for fine-grained evaluation of T2I models across multiple dimensions,
including prompt-image alignment, text rendering precision, reasoning-generated
content, stylization, and diversity. By structuring the evaluation, this
benchmark enables in-depth analysis of model performance, helping researchers
and practitioners pinpoint strengths and bottlenecks in the full pipeline of
image generation. Specifically, OneIG-Bench enables flexible evaluation by
allowing users to focus on a particular evaluation subset. Instead of
generating images for the entire set of prompts, users can generate images only
for the prompts associated with the selected dimension and complete the
corresponding evaluation accordingly. Our codebase and dataset are now publicly
available to facilitate reproducible evaluation studies and cross-model
comparisons within the T2I research community.
♻ ☆ Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
We introduce a diffusion-based framework that performs aligned novel view
image and geometry generation via a warping-and-inpainting methodology. Unlike
prior methods that require dense posed images or pose-embedded generative
models limited to in-domain views, our method leverages off-the-shelf geometry
predictors to predict partial geometries viewed from reference images, and
formulates novel-view synthesis as an inpainting task for both image and
geometry. To ensure accurate alignment between generated images and geometry,
we propose cross-modal attention distillation, where attention maps from the
image diffusion branch are injected into a parallel geometry diffusion branch
during both training and inference. This multi-task approach achieves
synergistic effects, facilitating geometrically robust image synthesis as well
as well-defined geometry prediction. We further introduce proximity-based mesh
conditioning to integrate depth and normal cues, interpolating between point
clouds and filtering out erroneously predicted geometry so that it does not
influence the generation process. Empirically, our method achieves high-fidelity
extrapolative view synthesis on both image and geometry across a range of
unseen scenes, delivers competitive reconstruction quality under interpolation
settings, and produces geometrically aligned colored point clouds for
comprehensive 3D completion. Project page is available at
https://cvlab-kaist.github.io/MoAI.
comment: Project page at https://cvlab-kaist.github.io/MoAI
♻ ☆ STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution
for Embodied AI and Autonomous Driving has become a prevailing trend. While
MLLMs have been extensively studied for visual semantic understanding tasks,
their ability to perform precise and quantitative spatial-temporal
understanding in real-world applications remains largely unexamined, leading to
uncertain prospects. To evaluate models' Spatial-Temporal Intelligence, we
introduce STI-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal
understanding through challenging tasks such as estimating and predicting the
appearance, pose, displacement, and motion of objects. Our benchmark
encompasses a wide range of robot and vehicle operations across desktop,
indoor, and outdoor scenarios. Extensive experiments reveal that the
state-of-the-art MLLMs still struggle in real-world spatial-temporal
understanding, especially in tasks requiring precise distance estimation and
motion analysis.
♻ ☆ Tackling fluffy clouds: robust field boundary delineation across global agricultural landscapes with Sentinel-1 and Sentinel-2 Time Series
Foivos I. Diakogiannis, Zheng-Shu Zhou, Jeff Wang, Gonzalo Mata, Dave Henry, Roger Lawes, Amy Parker, Peter Caccetta, Rodrigo Ibata, Ondrej Hlinka, Jonathan Richetti, Kathryn Batchelor, Chris Herrmann, Andrew Toovey, John Taylor
Accurate delineation of agricultural field boundaries is essential for
effective crop monitoring and resource management. However, competing
methodologies often face significant challenges, particularly in their reliance
on extensive manual efforts for cloud-free data curation and limited
adaptability to diverse global conditions. In this paper, we introduce
PTAViT3D, a deep learning architecture specifically designed for processing
three-dimensional time series of satellite imagery from either Sentinel-1 (S1)
or Sentinel-2 (S2). Additionally, we present PTAViT3D-CA, an extension of the
PTAViT3D model incorporating cross-attention mechanisms to fuse S1 and S2
datasets, enhancing robustness in cloud-contaminated scenarios. The proposed
methods leverage spatio-temporal correlations through a memory-efficient 3D
Vision Transformer architecture, facilitating accurate boundary delineation
directly from raw, cloud-contaminated imagery. We comprehensively validate our
models through extensive testing on various datasets, including Australia's
ePaddocks - CSIRO's national agricultural field boundary product - alongside
public benchmarks Fields-of-the-World, PASTIS, and AI4SmallFarms. Our results
consistently demonstrate state-of-the-art performance, highlighting excellent
global transferability and robustness. Crucially, our approach significantly
simplifies data preparation workflows by reliably processing cloud-affected
imagery, thereby offering strong adaptability across diverse agricultural
environments. Our code and models are publicly available at
https://github.com/feevos/tfcl.
comment: revision 1, under review
♻ ☆ Mr. DETR++: Instructive Multi-Route Training for Detection Transformers with Mixture-of-Experts CVPR 2025
Existing methods enhance the training of detection transformers by
incorporating an auxiliary one-to-many assignment. In this work, we treat the
model as a multi-task framework, simultaneously performing one-to-one and
one-to-many predictions. We investigate the roles of each component in the
transformer decoder across these two training targets, including
self-attention, cross-attention, and feed-forward network. Our empirical
results demonstrate that any independent component in the decoder can
effectively learn both targets simultaneously, even when other components are
shared. This finding leads us to propose a multi-route training mechanism,
featuring a primary route for one-to-one prediction and two auxiliary training
routes for one-to-many prediction. We propose a novel instructive
self-attention mechanism, integrated into the first auxiliary route, which
dynamically and flexibly guides object queries for one-to-many prediction. For
the second auxiliary route, we introduce a route-aware Mixture-of-Experts (MoE)
to facilitate knowledge sharing while mitigating potential conflicts between
routes. Additionally, we apply an MoE to low-scale features in the encoder,
optimizing the balance between efficiency and effectiveness. The auxiliary
routes are discarded during inference. We conduct extensive experiments across
various object detection baselines, achieving consistent improvements as
demonstrated in Fig. 1. Our method is highly flexible and can be readily
adapted to other tasks. To demonstrate its versatility, we conduct experiments
on both instance segmentation and panoptic segmentation, further validating its
effectiveness. Project page: https://visual-ai.github.io/mrdetr/
comment: Under review. Extended version of our CVPR 2025 paper, see
arXiv:2412.10028v3
♻ ☆ PuriDefense: Randomized Local Implicit Adversarial Purification for Defending Black-box Query-based Attacks
Black-box query-based attacks constitute significant threats to Machine
Learning as a Service (MLaaS) systems since they can generate adversarial
examples without accessing the target model's architecture and parameters.
Traditional defense mechanisms, such as adversarial training, gradient masking,
and input transformations, either impose substantial computational costs or
compromise the test accuracy of non-adversarial inputs. To address these
challenges, we propose an efficient defense mechanism, PuriDefense, that
employs random patch-wise purifications with an ensemble of lightweight
purification models at low inference cost. These models leverage the
local implicit function and rebuild the natural image manifold. Our theoretical
analysis suggests that this approach slows down the convergence of query-based
attacks by incorporating randomness into purifications. Extensive experiments
on CIFAR-10 and ImageNet validate the effectiveness of our proposed
purifier-based defense mechanism, demonstrating significant improvements in
robustness against query-based attacks.
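A toy sketch of the randomized patch-wise purification described above: each
image patch is processed by a purifier drawn at random from a small ensemble.
The convolutional purifiers, patch size, and tensor shapes are placeholders
chosen for illustration, not the purification models used in the paper.

import random
import torch
import torch.nn as nn

# placeholder ensemble of lightweight purifiers
purifiers = nn.ModuleList([nn.Conv2d(3, 3, 3, padding=1) for _ in range(4)])

def purify(image, patch=8):
    # image: (B, 3, H, W) with H and W divisible by `patch`.
    out = image.clone()
    _, _, h, w = image.shape
    with torch.no_grad():
        for y in range(0, h, patch):
            for x in range(0, w, patch):
                model = random.choice(list(purifiers))  # random purifier per patch
                out[:, :, y:y + patch, x:x + patch] = model(
                    image[:, :, y:y + patch, x:x + patch])
    return out

print(purify(torch.rand(1, 3, 32, 32)).shape)  # torch.Size([1, 3, 32, 32])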
♻ ☆ Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes
While the human visual system employs distinct mechanisms to perceive salient
and camouflaged objects, existing models struggle to disentangle these tasks.
Specifically, salient object detection (SOD) models frequently misclassify
camouflaged objects as salient, while camouflaged object detection (COD) models
conversely misinterpret salient objects as camouflaged. We hypothesize that
this can be attributed to two factors: (i) the specific annotation paradigm of
current SOD and COD datasets, and (ii) the lack of explicit attribute
relationship modeling in current models. Prevalent SOD/COD datasets enforce a
mutual exclusivity constraint, assuming scenes contain either salient or
camouflaged objects, which poorly aligns with the real world. Furthermore,
current SOD/COD methods are primarily designed for these highly constrained
datasets and lack explicit modeling of the relationship between salient and
camouflaged objects. In this paper, to promote the development of unconstrained
salient and camouflaged object detection, we construct a large-scale dataset,
USC12K, which features comprehensive labels and four different scenes that
cover all possible logical existence scenarios of both salient and camouflaged
objects. To explicitly model the relationship between salient and camouflaged
objects, we propose a model called USCNet, which introduces two distinct prompt
query mechanisms for modeling inter-sample and intra-sample attribute
relationships. Additionally, to assess the model's ability to distinguish
between salient and camouflaged objects, we design an evaluation metric called
CSCS. The proposed method achieves state-of-the-art performance across all
scenes in various metrics. The code and dataset will be available at
https://github.com/ssecv/USCNet.
comment: 18 pages, 11 figures
♻ ☆ Recall and Refine: A Simple but Effective Source-free Open-set Domain Adaptation Framework
Open-set Domain Adaptation (OSDA) aims to adapt a model from a labeled source
domain to an unlabeled target domain, where novel classes - also referred to as
target-private unknown classes - are present. Source-free Open-set Domain
Adaptation (SF-OSDA) methods address OSDA without accessing labeled source
data, making them particularly relevant under privacy constraints. However,
SF-OSDA presents significant challenges due to distribution shifts and the
introduction of novel classes. Existing SF-OSDA methods typically rely on
thresholding the prediction entropy of a sample to identify it as either a
known or unknown class, but fail to explicitly learn discriminative features
for the target-private unknown classes. We propose Recall and Refine (RRDA), a
novel SF-OSDA framework designed to address these limitations by explicitly
learning features for target-private unknown classes. RRDA employs a two-stage
process. First, we enhance the model's capacity to recognize unknown classes by
training a target classifier with an additional decision boundary, guided by
synthetic samples generated from target domain features. This enables the
classifier to effectively separate known and unknown classes. Second, we adapt
the entire model to the target domain, addressing both domain shift and the
distinguishability of unknown classes. Any off-the-shelf source-free domain
adaptation method (e.g. SHOT, AaD) can be seamlessly integrated into our
framework at this stage. Extensive experiments on three benchmark datasets
demonstrate that RRDA significantly outperforms existing SF-OSDA and OSDA
methods.
comment: Accepted at TMLR 2025
♻ ☆ Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels SC
Finding correspondences between semantically similar points across images and
object instances is one of the everlasting challenges in computer vision. While
large pre-trained vision models have recently been demonstrated as effective
priors for semantic matching, they still suffer from ambiguities for symmetric
objects or repeated object parts. We propose to improve semantic correspondence
estimation via 3D-aware pseudo-labeling. Specifically, we train an adapter to
refine off-the-shelf features using pseudo-labels obtained via 3D-aware
chaining, filtering wrong labels through relaxed cyclic consistency, and 3D
spherical prototype mapping constraints. While reducing the need for
dataset-specific annotations compared to prior work, we set a new state of the
art on SPair-71k with an absolute gain of over 4%, and of over 7% against
methods with similar supervision requirements. The generality of our proposed
approach simplifies
extension of training to other data sources, which we demonstrate in our
experiments.
comment: Project page: https://genintel.github.io/DIY-SC
♻ ☆ Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance
Understanding medical ultrasound imaging remains a long-standing challenge
due to significant visual variability caused by differences in imaging and
acquisition parameters. Recent advancements in large language models (LLMs)
have been used to automatically generate terminology-rich summaries oriented
toward clinicians with sufficient physiological knowledge. Nevertheless, the
increasing demand for improved ultrasound interpretability and basic scanning
guidance among non-expert users, e.g., in point-of-care settings, has not yet
been explored. In this study, we first introduce the scene graph (SG) for
ultrasound images to explain image content to ordinary users and provide
guidance for ultrasound scanning. The ultrasound SG is first computed using a
transformer-based one-stage method, eliminating the need for explicit object
detection. To generate a comprehensible image explanation for ordinary users,
the user query is then used to further refine the abstract SG representation
through
LLMs. Additionally, the predicted SG is explored for its potential in guiding
ultrasound scanning toward missing anatomies within the current imaging view,
assisting ordinary users in achieving more standardized and complete anatomical
exploration. The effectiveness of this SG-based image explanation and scanning
guidance has been validated on images from the left and right neck regions,
including the carotid and thyroid, across five volunteers. The results
demonstrate the potential of the method to democratize ultrasound by enhancing
its interpretability and usability for ordinary users.
♻ ☆ Enhancing Dynamic CT Image Reconstruction with Neural Fields and Optical Flow
In this paper, we investigate image reconstruction for dynamic Computed
Tomography. The motion of the target relative to the measurement acquisition
rate yields measurements that are highly resolved in time but highly
undersampled in space. Such problems pose a major challenge: not accounting for
the dynamics of the process leads to a poor reconstruction with non-realistic
motion. Variational approaches that penalize time evolution have been proposed
to relate subsequent frames and improve image quality based on classical
grid-based discretizations. Neural fields have emerged as a novel way to
parameterize the quantity of interest using a neural network with a
low-dimensional input, benefiting from being lightweight, continuous, and
biased towards smooth representations. The latter property has been exploited
when solving dynamic inverse problems with neural fields by minimizing a
data-fidelity term only. We investigate and show the benefits of introducing
explicit motion regularizers for dynamic inverse problems based on partial
differential equations, namely, the optical flow equation, for the optimization
of neural fields. We compare it against its unregularized counterpart and show
the improvements in the reconstruction. We also compare neural fields against a
grid-based solver and show that the former outperforms the latter in terms of
PSNR in this task.
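As a concrete picture of such a motion regularizer, the toy PyTorch sketch
below penalizes the optical-flow residual dI/dt + v . grad(I) of a space-time
neural field via automatic differentiation. The two small MLPs and the random
coordinate sampling are illustrative assumptions, not the architecture used in
the paper.

import torch
import torch.nn as nn

field = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))  # (x, y, t) -> intensity
flow = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 2))   # (x, y, t) -> (vx, vy)

def optical_flow_residual(coords):
    # coords: (N, 3) sample points (x, y, t); returns the mean squared residual.
    coords = coords.requires_grad_(True)
    intensity = field(coords)
    grads = torch.autograd.grad(intensity.sum(), coords, create_graph=True)[0]
    ix, iy, it = grads[:, 0], grads[:, 1], grads[:, 2]
    v = flow(coords)
    residual = it + v[:, 0] * ix + v[:, 1] * iy  # dI/dt + v . grad(I) = 0
    return (residual ** 2).mean()

loss = optical_flow_residual(torch.rand(128, 3))
loss.backward()  # the penalty is differentiable w.r.t. both networks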
♻ ☆ 3D Hierarchical Panoptic Segmentation in Real Orchard Environments Across Different Sensors IROS 2025
Matteo Sodano, Federico Magistri, Elias Marks, Fares Hosn, Aibek Zurbayev, Rodrigo Marcuzzi, Meher V. R. Malladi, Jens Behley, Cyrill Stachniss
Crop yield estimation is a relevant problem in agriculture, because an
accurate yield estimate can support farmers' decisions on harvesting or
precision intervention. Robots can help to automate this process. To do so,
they need to be able to perceive the surrounding environment to identify target
objects such as trees and plants. In this paper, we introduce a novel approach
to address the problem of hierarchical panoptic segmentation of apple orchards
on 3D data from different sensors. Our approach is able to simultaneously
provide semantic segmentation, instance segmentation of trunks and fruits, and
instance segmentation of trees (a trunk with its fruits). This allows us to
identify relevant information such as individual plants, fruits, and trunks,
and capture the relationships among them, such as precisely estimating the
number of fruits associated with each tree in an orchard. To efficiently
evaluate our
approach for hierarchical panoptic segmentation, we provide a dataset designed
specifically for this task. Our dataset is recorded in Bonn, Germany, in a real
apple orchard with a variety of sensors, spanning from a terrestrial laser
scanner to an RGB-D camera mounted on different robot platforms. The
experiments show that our approach surpasses state-of-the-art approaches in 3D
panoptic segmentation in the agricultural domain, while also providing full
hierarchical panoptic segmentation. Our dataset is publicly available at
https://www.ipb.uni-bonn.de/data/hops/. The open-source implementation of our
approach is available at https://github.com/PRBonn/hapt3D.
comment: Accepted to IROS 2025
♻ ☆ Cell Tracking according to Biological Needs -- Strong Mitosis-aware Multi-Hypothesis Tracker with Aleatoric Uncertainty
Cell tracking and segmentation assist biologists in extracting insights from
large-scale microscopy time-lapse data. Driven by local accuracy metrics,
current tracking approaches often lack long-term consistency and the ability
to reconstruct lineage trees correctly. To address this issue,
we introduce an uncertainty estimation technique for motion estimation
frameworks and extend the multi-hypothesis tracking framework. Our uncertainty
estimation lifts motion representations into probabilistic spatial densities
using problem-specific test-time augmentations. Moreover, we introduce a novel
mitosis-aware assignment problem formulation that allows multi-hypothesis
trackers to model cell splits and to resolve false associations and mitosis
detections based on long-term conflicts. In our framework, explicit biological
knowledge is modeled in assignment costs. We evaluate our approach on nine
competitive datasets and demonstrate that we substantially outperform the
current state of the art on biologically inspired metrics, achieving
improvements by a factor of approximately 6, and uncover new insights into the
behavior of motion estimation uncertainty.
comment: 13 pages, 4 figures, 4 tables. This work has been accepted to the
IEEE for publication
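The general idea of lifting a point estimate into a spatial density with
test-time augmentation can be illustrated by the NumPy sketch below, which runs
a placeholder motion estimator on perturbed inputs and fits a Gaussian to the
resulting predictions. The estimator, the noise augmentation, and the shapes
are hypothetical stand-ins for the problem-specific augmentations in the paper.

import numpy as np

def predict_motion(image):
    # Placeholder motion estimator returning a 2D displacement.
    return np.array([image.mean(), image.std()])

def tta_motion_density(image, n_aug=8, sigma=0.05):
    rng = np.random.default_rng(0)
    preds = []
    for _ in range(n_aug):
        noisy = image + rng.normal(0.0, sigma, image.shape)  # simple noise augmentation
        preds.append(predict_motion(noisy))
    preds = np.stack(preds)
    mean = preds.mean(axis=0)
    cov = np.cov(preds.T) + 1e-6 * np.eye(2)  # Gaussian density over displacements
    return mean, cov

mu, cov = tta_motion_density(np.random.rand(64, 64))
print(mu.shape, cov.shape)  # (2,) (2, 2)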
♻ ☆ SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking
Text-based person retrieval aims to identify a target individual from a
gallery of images based on a natural language description. It presents a
significant challenge due to the complexity of real-world scenes and the
ambiguity of appearance-related descriptions. Existing methods primarily
emphasize appearance-based cross-modal retrieval, often neglecting the
contextual information embedded within the scene, which can offer valuable
complementary insights for retrieval. To address this, we introduce
SCENEPERSON-13W, a large-scale dataset featuring over 100,000 scenes with rich
annotations covering both pedestrian appearance and environmental cues. Based
on this, we propose SA-Person, a two-stage retrieval framework. In the first
stage, it performs discriminative appearance grounding by aligning textual cues
with pedestrian-specific regions. In the second stage, it introduces
SceneRanker, a training-free, scene-aware re-ranking method leveraging
multimodal large language models to jointly reason over pedestrian appearance
and the global scene context. Experiments on SCENEPERSON-13W validate the
effectiveness of our framework in challenging scene-level retrieval scenarios.
The code and dataset will be made publicly available.
comment: 22 pages, 7 figures. Under review
♻ ☆ Variational Supervised Contrastive Learning
Contrastive learning has proven to be highly efficient and adaptable in
shaping representation spaces across diverse modalities by pulling similar
samples together and pushing dissimilar ones apart. However, two key
limitations persist: (1) Without explicit regulation of the embedding
distribution, semantically related instances can inadvertently be pushed apart
unless complementary signals guide pair selection, and (2) excessive reliance
on large in-batch negatives and tailored augmentations hinders generalization.
To address these limitations, we propose Variational Supervised Contrastive
Learning (VarCon), which reformulates supervised contrastive learning as
variational inference over latent class variables and maximizes a
posterior-weighted evidence lower bound (ELBO) that replaces exhaustive
pair-wise comparisons for efficient class-aware matching and grants
fine-grained control over intra-class dispersion in the embedding space.
Trained exclusively on image data, our experiments on CIFAR-10, CIFAR-100,
ImageNet-100, and ImageNet-1K show that VarCon (1) achieves state-of-the-art
performance for contrastive learning frameworks, reaching 79.36% Top-1 accuracy
on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while
converging in just 200 epochs; (2) yields substantially clearer decision
boundaries and semantic organization in the embedding space, as evidenced by
KNN classification, hierarchical clustering results, and transfer-learning
assessments; and (3) demonstrates superior few-shot learning performance
compared to the supervised baseline and superior robustness across various
augmentation strategies.
♻ ☆ Structure-Preserving Patch Decoding for Efficient Neural Video Representation
Implicit neural representations (INRs) are the subject of extensive research,
particularly in their application to modeling complex signals by mapping
spatial and temporal coordinates to corresponding values. When handling videos,
mapping compact inputs to entire frames or spatially partitioned patch images
is an effective approach. This strategy better preserves spatial relationships,
reduces computational overhead, and improves reconstruction quality compared to
coordinate-based mapping. However, predicting entire frames often limits the
reconstruction of high-frequency visual details. Additionally, conventional
patch-based approaches based on uniform spatial partitioning tend to introduce
boundary discontinuities that degrade spatial coherence. We propose a neural
video representation method based on Structure-Preserving Patches (SPPs) to
address such limitations. Our method separates each video frame into spatially
aligned patch images through a deterministic pixel-based splitting similar to
PixelUnshuffle. This operation preserves the global
spatial structure while allowing patch-level decoding. We train the decoder to
reconstruct these structured patches, enabling a global-to-local decoding
strategy that captures the global layout first and refines local details. This
effectively reduces boundary artifacts and mitigates distortions from naive
upsampling. Experiments on standard video datasets demonstrate that our method
achieves higher reconstruction quality and better compression performance than
existing INR-based baselines.
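The deterministic pixel-based splitting mentioned above can be reproduced in a
few lines with PyTorch's pixel_unshuffle, as in the sketch below; the frame
size and downscale factor are arbitrary, and this shows only the splitting
step, not the full SPP decoder.

import torch
import torch.nn.functional as F

frame = torch.rand(1, 3, 128, 256)       # (B, C, H, W)
r = 2
patches = F.pixel_unshuffle(frame, r)    # (1, 3*r*r, 64, 128): r*r aligned sub-images
recon = F.pixel_shuffle(patches, r)      # exact inverse, no boundary seams
print(patches.shape, torch.allclose(recon, frame))  # torch.Size([1, 12, 64, 128]) True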
♻ ☆ StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
World models have recently become promising tools for predicting realistic
visuals based on actions in complex environments. However, their reliance on
only a few recent observations leads them to lose track of the long-term
context. Consequently, in just a few steps the generated scenes drift from what
was previously observed, undermining the temporal coherence of the sequence.
This limitation of the state-of-the-art world models, most of which rely on
diffusion, comes from their lack of a lasting environment state. To address
this problem, we introduce StateSpaceDiffuser, where a diffusion model is
enabled to perform long-context tasks by integrating features from a
state-space model, representing the entire interaction history. This design
restores long-term memory while preserving the high-fidelity synthesis of
diffusion models. To rigorously measure temporal consistency, we develop an
evaluation protocol that probes a model's ability to reinstantiate seen content
in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser
significantly outperforms a strong diffusion-only baseline, maintaining a
coherent visual context for an order of magnitude more steps. It delivers
consistent views in both a 2D maze navigation and a complex 3D environment.
These results establish that bringing state-space representations into
diffusion models is highly effective at preserving both visual detail and
long-term memory.
♻ ☆ Moderating the Generalization of Score-based Generative Model
Score-based Generative Models (SGMs) have demonstrated remarkable
generalization abilities, e.g. generating unseen, but natural data. However,
the greater the generalization power, the more likely the unintended
generalization, and the more dangerous the abuse. Research on moderated
generalization in SGMs remains limited. To fill this gap, we first examine the
current 'gold standard' in Machine Unlearning (MU), i.e., re-training the model
after removing the undesirable training data, and find it does not work in
SGMs. Further analysis of score functions reveals that the MU 'gold standard'
does not alter the original score function, which explains its ineffectiveness.
Based on this insight, we propose the first Moderated Score-based Generative
Model (MSGM), which introduces a novel score adjustment strategy that redirects
the score function away from undesirable data during the continuous-time
stochastic differential equation process. Extensive experimental results
demonstrate that MSGM significantly reduces the likelihood of generating
undesirable content while preserving high visual quality for normal image
generation. Albeit designed for SGMs, MSGM is a general and flexible MU
framework that is compatible with diverse diffusion architectures (SGM and
DDPM) and training strategies (re-training and fine-tuning), and enables
zero-shot transfer of the pre-trained models to downstream tasks, e.g. image
inpainting and reconstruction. The code will be shared upon acceptance.
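As a heavily simplified illustration of redirecting a score function away from
undesirable data, the sketch below adds a repulsive term to a toy score model.
The form of the adjustment, the notion of an 'undesirable center', and the
strength parameter are hypothetical; they only convey the general mechanism of
modifying the reverse-time drift, not MSGM's actual strategy.

import torch

def adjusted_score(score_model, x, t, undesirable_center, strength=1.0):
    # score_model(x, t) approximates grad log p_t(x); returns a redirected score.
    s = score_model(x, t)
    away = x - undesirable_center  # direction pointing away from the undesirable region
    away = away / (away.norm(dim=-1, keepdim=True) + 1e-8)
    return s + strength * away     # bias sampling away from undesirable content

toy_score = lambda x, t: -x        # exact score of a standard Gaussian
x = torch.randn(4, 2)
print(adjusted_score(toy_score, x, t=0.5, undesirable_center=torch.zeros(2)).shape)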
♻ ☆ Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning
Recent advancements in large language models (LLMs) have witnessed a surge in
the development of advanced reasoning paradigms, which are now being integrated
into multimodal large language models (MLLMs). However, existing approaches
often fall short: methods solely employing reinforcement learning (RL) can
struggle with sample inefficiency and with activating entirely absent reasoning
capabilities, while conventional pipelines that initiate with a cold-start
supervised fine-tuning (SFT) phase before RL may restrict the model's
exploratory capacity and face suboptimal convergence. In this work, we
introduce \textbf{Metis-RISE} (\textbf{R}L \textbf{I}ncentivizes and
\textbf{S}FT \textbf{E}nhances) for multimodal reasoning model learning. Unlike
conventional approaches, Metis-RISE distinctively omits an initial SFT stage,
beginning instead with an RL phase (e.g., using a Group Relative Policy
Optimization variant) to incentivize and activate the model's latent reasoning
capacity. Subsequently, the targeted SFT stage addresses two key challenges
identified during RL: (1) \textit{inefficient trajectory sampling} for tasks
where the model possesses but inconsistently applies correct reasoning, which
we tackle using self-distilled reasoning trajectories from the RL model itself;
and (2) \textit{fundamental capability absence}, which we address by injecting
expert-augmented knowledge for prompts where the model entirely fails. This
strategic application of RL for incentivization followed by SFT for enhancement
forms the core of Metis-RISE, leading to two versions of our MLLMs (7B and 72B
parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard
demonstrate that both models achieve state-of-the-art performance among
similar-sized models, with the 72B version ranking fourth overall. Please refer
to our project page for open-source information.
comment: Project Page: https://github.com/MM-Thinking/Metis-RISE
♻ ☆ Self-Regulated Neurogenesis for Online Data-Incremental Learning
Neural networks often struggle with catastrophic forgetting when learning
sequences of tasks or data streams, unlike humans who can continuously learn
and consolidate new concepts even in the absence of explicit cues. Online
data-incremental learning seeks to emulate this capability by processing each
sample only once, without having access to task or stream cues at any point in
time since this is more realistic compared to offline setups, where all data
from novel class(es) is assumed to be readily available. However, existing
methods typically rely on storing the subsets of data in memory or expanding
the initial model architecture, resulting in significant computational
overhead. Drawing inspiration from 'self-regulated neurogenesis', the brain's
mechanism for creating specialized regions or circuits for distinct functions,
we propose SERENA, a novel approach that encodes each concept in a specialized
network path called a 'concept cell', integrated into a single
over-parameterized network. Once a concept is learned, its corresponding
concept cell is frozen, effectively preventing the forgetting of previously
acquired information. Furthermore, we introduce two new continual learning
scenarios that more closely reflect real-world conditions, characterized by
gradually changing sample sizes. Experimental results show that our method not
only establishes new state-of-the-art results across ten benchmarks but also
remarkably surpasses offline supervised batch learning performance. The code is
available at https://github.com/muratonuryildirim/serena.
comment: Published at Conference on Lifelong Learning Agents (CoLLAs) 2025
♻ ☆ Referring Expression Instance Retrieval and A Strong End-to-End Baseline
Using natural language to query visual information is a fundamental need in
real-world applications. Text-Image Retrieval (TIR) retrieves a target image
from a gallery based on an image-level description, while Referring Expression
Comprehension (REC) localizes a target object within a given image using an
instance-level description. However, real-world applications often present more
complex demands. Users typically query with an instance-level description
across a large gallery and expect to receive both the relevant image and the
corresponding
instance location. In such scenarios, TIR struggles with fine-grained
descriptions and object-level localization, while REC is limited in its ability
to efficiently search large galleries and lacks an effective ranking mechanism.
In this paper, we introduce a new task called \textbf{Referring Expression
Instance Retrieval (REIR)}, which supports both instance-level retrieval and
localization based on fine-grained referring expressions. First, we propose a
large-scale benchmark for REIR, named REIRCOCO, constructed by prompting
advanced vision-language models to generate high-quality referring expressions
for instances in the MSCOCO and RefCOCO datasets. Second, we present a baseline
method, Contrastive Language-Instance Alignment with Relation Experts (CLARE),
which employs a dual-stream architecture to address REIR in an end-to-end
manner. Given a referring expression, the textual branch encodes it into a
query embedding. The visual branch detects candidate objects and extracts their
instance-level visual features. The most similar candidate to the query is
selected for bounding box prediction. CLARE is first trained on object
detection and REC datasets to establish initial grounding capabilities, then
optimized via Contrastive Language-Instance Alignment (CLIA) for improved
retrieval across images. We will release our code and benchmark publicly.
♻ ☆ ROA-BEV: 2D Region-Oriented Attention for BEV-based 3D Object Detection IROS 2025
Vision-based Bird's-Eye-View (BEV) 3D object detection has recently become
popular in autonomous driving. However, objects with a high similarity to the
background from a camera perspective cannot be detected well by existing
methods. In this paper, we propose a BEV-based 3D Object Detection Network with
2D Region-Oriented Attention (ROA-BEV), which enables the backbone to focus
more on feature learning of the regions where objects exist. Moreover, our
method further enhances the feature learning ability of ROA through
multi-scale structures. Each block of ROA utilizes a large kernel to ensure
that the receptive field is large enough to capture information about large
objects. Experiments on nuScenes show that ROA-BEV improves the performance
based on BEVDepth. The source codes of this work will be available at
https://github.com/DFLyan/ROA-BEV.
comment: accepted by IROS 2025
♻ ☆ Is my Data in your AI Model? Membership Inference Test with Application to Face Images
Daniel DeAlcala, Aythami Morales, Julian Fierrez, Gonzalo Mancera, Ruben Tolosana, Javier Ortega-Garcia
This article introduces the Membership Inference Test (MINT), a novel
approach that aims to empirically assess if given data was used during the
training of AI/ML models. Specifically, we propose two MINT architectures
designed to learn the distinct activation patterns that emerge when an Audited
Model is exposed to data used during its training process. These architectures
are based on Multilayer Perceptrons (MLPs) and Convolutional Neural Networks
(CNNs). The experimental framework focuses on the challenging task of Face
Recognition, considering three state-of-the-art Face Recognition systems.
Experiments are carried out using six publicly available databases, comprising
over 22 million face images in total. Different experimental scenarios are
considered depending on the context of the AI model to test. Our proposed MINT
approach achieves promising results, with up to 90\% accuracy, indicating the
potential to recognize if an AI model has been trained with specific data. The
proposed MINT approach can serve to enforce privacy and fairness in several AI
applications, e.g., revealing if sensitive or private data was used for
training or tuning Large Language Models (LLMs).
comment: 26 pages main text and 2 pages appendix
♻ ☆ HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis
Pathology is essential for cancer diagnosis, with multiple instance learning
(MIL) widely used for whole slide image (WSI) analysis. WSIs exhibit a natural
hierarchy -- patches, regions, and slides -- with distinct semantic
associations. While some methods attempt to leverage this hierarchy for
improved representation, they predominantly rely on Euclidean embeddings, which
struggle to fully capture semantic hierarchies. To address this limitation, we
propose HyperPath, a novel method that integrates knowledge from textual
descriptions to guide the modeling of semantic hierarchies of WSIs in
hyperbolic space, thereby enhancing WSI classification. Our approach adapts
both visual and textual features extracted by pathology vision-language
foundation models to the hyperbolic space. We design an Angular Modality
Alignment Loss to ensure robust cross-modal alignment, while a Semantic
Hierarchy Consistency Loss further refines feature hierarchies through
entailment and contradiction relationships, thus enhancing semantic coherence.
The classification is performed with geodesic distance, which measures the
similarity between entities in the hyperbolic semantic hierarchy. This
eliminates the need for linear classifiers and enables a geometry-aware
approach to WSI analysis. Extensive experiments show that our method achieves
superior performance across tasks compared to existing methods, highlighting
the potential of hyperbolic embeddings for WSI analysis.
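Classification by geodesic distance in hyperbolic space, as referred to above,
can be illustrated with the Poincare-ball nearest-prototype sketch below; the
embeddings and class prototypes are random placeholders and the curvature is
fixed to the unit ball for simplicity.

import torch
import torch.nn.functional as F

def poincare_distance(u, v):
    # u: (N, D), v: (M, D), points with norm < 1; returns (N, M) geodesic distances.
    diff = (u[:, None, :] - v[None, :, :]).pow(2).sum(-1)
    denom = (1 - u.pow(2).sum(-1, keepdim=True)) * (1 - v.pow(2).sum(-1))[None, :]
    return torch.acosh(1 + 2 * diff / denom.clamp_min(1e-9))

embeddings = 0.5 * F.normalize(torch.randn(8, 16), dim=-1)  # slide embeddings in the ball
prototypes = 0.5 * F.normalize(torch.randn(4, 16), dim=-1)  # one prototype per class
pred = poincare_distance(embeddings, prototypes).argmin(dim=1)
print(pred)  # index of the nearest class prototype for each embedding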
♻ ☆ HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics ICCV 2025
Long-form video understanding presents unique challenges that extend beyond
traditional short-video analysis approaches, particularly in capturing
long-range dependencies, processing redundant information efficiently, and
extracting high-level semantic concepts. To address these challenges, we
propose a novel approach that more accurately reflects human cognition. This
paper introduces HERMES: temporal-coHERent long-forM understanding with
Episodes and Semantics, featuring two versatile modules that can enhance
existing video-language models or operate as a standalone system. Our Episodic
COmpressor (ECO) efficiently aggregates representations from micro to
semi-macro levels, reducing computational overhead while preserving temporal
dependencies. Our Semantics ReTRiever (SeTR) enriches these representations
with semantic information by focusing on broader context, dramatically reducing
feature dimensionality while preserving relevant macro-level information. We
demonstrate that these modules can be seamlessly integrated into existing SOTA
models, consistently improving their performance while reducing inference
latency by up to 43% and memory usage by 46%. As a standalone system, HERMES
achieves state-of-the-art performance across multiple long-video understanding
benchmarks in both zero-shot and fully-supervised settings.
comment: Accepted for ICCV 2025. Project page:
https://joslefaure.github.io/assets/html/hermes.html
♻ ☆ ClearSight: Human Vision-Inspired Solutions for Event-Based Motion Deblurring ICCV 2025
Motion deblurring addresses the challenge of image blur caused by camera or
scene movement. Event cameras provide motion information that is encoded in the
asynchronous event streams. To efficiently leverage the temporal information of
event streams, we employ Spiking Neural Networks (SNNs) for motion feature
extraction and Artificial Neural Networks (ANNs) for color information
processing. Due to the non-uniform distribution and inherent redundancy of
event data, existing cross-modal feature fusion methods exhibit certain
limitations. Inspired by the visual attention mechanism in the human visual
system, this study introduces a bioinspired dual-drive hybrid network (BDHNet).
Specifically, the Neuron Configurator Module (NCM) is designed to dynamically
adjust neuron configurations based on cross-modal features, thereby focusing
the spikes on blurry regions and adapting to varying blur scenarios.
Additionally, the Region of Blurry Attention Module (RBAM) is
introduced to generate a blurry mask in an unsupervised manner, effectively
extracting motion clues from the event features and guiding more accurate
cross-modal feature fusion. Extensive subjective and objective evaluations
demonstrate that our method outperforms current state-of-the-art methods on
both synthetic and real-world datasets.
comment: Accepted by ICCV 2025
♻ ☆ ToMiE: Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars
Yifan Zhan, Qingtian Zhu, Muyao Niu, Mingze Ma, Jiancheng Zhao, Zhihang Zhong, Xiao Sun, Yu Qiao, Yinqiang Zheng
In this paper, we highlight a critical yet often overlooked factor in most 3D
human tasks, namely modeling complicated 3D humans with hand-held objects or
loose-fitting clothing. It is known that the parameterized formulation of SMPL
is able to fit human skin, while hand-held objects and loose-fitting clothing
are difficult to model within the unified framework, since their movements are
usually decoupled from the human body. To enhance the
capability of SMPL skeleton in response to this situation, we propose a growth
strategy that enables the joint tree of the skeleton to expand adaptively.
Specifically, our method, called ToMiE, consists of parent joints localization
and external joints optimization. For parent joints localization, we employ a
gradient-based approach guided by both LBS blending weights and motion kernels.
Once the external joints are obtained, we proceed to optimize their
transformations in SE(3) across different frames, enabling rendering and
explicit animation. ToMiE manages to outperform other methods across various
cases with hand-held objects and loose-fitting clothing, not only in rendering
quality but also by offering free animation of grown joints, thereby enhancing
the expressive ability of SMPL skeleton for a broader range of applications.
♻ ☆ RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS ICCV 2025
Chuanyu Fu, Yuqi Zhang, Kunbin Yao, Guanying Chen, Yuan Xiong, Chuan Huang, Shuguang Cui, Xiaochun Cao
3D Gaussian Splatting (3DGS) has gained significant attention for its
real-time, photo-realistic rendering in novel-view synthesis and 3D modeling.
However, existing methods struggle with accurately modeling scenes affected by
transient objects, leading to artifacts in the rendered images. We identify
that the Gaussian densification process, while enhancing scene detail capture,
unintentionally contributes to these artifacts by growing additional Gaussians
that model transient disturbances. To address this, we propose RobustSplat, a
robust solution based on two critical designs. First, we introduce a delayed
Gaussian growth strategy that prioritizes optimizing static scene structure
before allowing Gaussian splitting/cloning, mitigating overfitting to transient
objects in early optimization. Second, we design a scale-cascaded mask
bootstrapping approach that first leverages lower-resolution feature similarity
supervision for reliable initial transient mask estimation, taking advantage of
its stronger semantic consistency and robustness to noise, and then progresses
to high-resolution supervision to achieve more precise mask prediction.
Extensive experiments on multiple challenging datasets show that our method
outperforms existing methods, clearly demonstrating the robustness and
effectiveness of our method. Our project page is
https://fcyycf.github.io/RobustSplat/.
comment: ICCV 2025. Project page: https://fcyycf.github.io/RobustSplat/
♻ ☆ 2D Triangle Splatting for Direct Differentiable Mesh Training
Differentiable rendering with 3D Gaussian primitives has emerged as a
powerful method for reconstructing high-fidelity 3D scenes from multi-view
images. While it offers improvements over NeRF-based methods, this
representation still encounters challenges with rendering speed and advanced
rendering effects, such as relighting and shadow rendering, compared to
mesh-based models. In this paper, we propose 2D Triangle Splatting (2DTS), a
novel method that replaces 3D Gaussian primitives with 2D triangle facelets.
This representation naturally forms a discrete mesh-like structure while
retaining the benefits of continuous volumetric modeling. By incorporating a
compactness parameter into the triangle primitives, we enable direct training
of photorealistic meshes. Our experimental results demonstrate that our
triangle-based method, in its vanilla version (without compactness tuning),
achieves higher fidelity compared to state-of-the-art Gaussian-based methods.
Furthermore, our approach produces reconstructed meshes with superior visual
quality compared to existing mesh reconstruction methods. Please visit our
project page at https://gaoderender.github.io/triangle-splatting.
comment: 13 pages, 8 figures
♻ ☆ High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight CVPR2025
Semantic segmentation from RGB cameras is essential to the perception of
autonomous flying vehicles. The stability of predictions through the captured
videos is paramount to their reliability and, by extension, to the
trustworthiness of the agents. In this paper, we propose a lightweight video
semantic segmentation approach, suited to onboard real-time inference, that
achieves high temporal consistency on aerial data through Semantic Similarity
Propagation (SSP) across frames. SSP temporally propagates the predictions of
an efficient image segmentation model with global registration alignment to
compensate for camera movements. It combines the current estimation and the
prior prediction through linear interpolation, using weights computed from the
feature similarities of the two frames. Because data availability is a
challenge in this domain, we propose a consistency-aware Knowledge Distillation
training procedure for sparsely labeled datasets with few annotations. Using a
large image segmentation model as a teacher to train the efficient SSP, we
leverage the strong correlations between labeled and unlabeled frames in the
same training videos to obtain high-quality supervision on all frames. KD-SSP
obtains a significant temporal consistency increase over the base image
segmentation model of 12.5% and 6.7% TC on UAVid and RuralScapes respectively,
with higher accuracy and comparable inference speed. On these aerial datasets,
KD-SSP provides a superior segmentation quality and inference speed trade-off
than other video methods proposed for general applications and shows
considerably higher consistency. Project page:
https://github.com/FraunhoferIVI/SSP.
comment: Accepted by CVPR2025
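The similarity-weighted interpolation at the heart of SSP can be sketched as
below: the registered previous prediction and the current prediction are
blended per pixel using the cosine similarity of their feature maps. The
tensors, the clamping of the weight, and the omission of the registration step
are simplifying assumptions for illustration only.

import torch
import torch.nn.functional as F

def propagate(prev_pred, cur_pred, prev_feat, cur_feat):
    # prev_pred, cur_pred: (B, C, H, W) class scores; *_feat: (B, D, H, W) features.
    sim = F.cosine_similarity(prev_feat, cur_feat, dim=1, eps=1e-6)  # (B, H, W)
    w = sim.clamp(0, 1).unsqueeze(1)  # higher similarity -> trust the prior more
    return w * prev_pred + (1 - w) * cur_pred

prev_pred, cur_pred = torch.rand(1, 19, 64, 64), torch.rand(1, 19, 64, 64)
prev_feat, cur_feat = torch.rand(1, 32, 64, 64), torch.rand(1, 32, 64, 64)
print(propagate(prev_pred, cur_pred, prev_feat, cur_feat).shape)  # (1, 19, 64, 64)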
♻ ☆ CREStE: Scalable Mapless Navigation with Internet Scale Priors and Counterfactual Guidance
We introduce CREStE, a scalable learning-based mapless navigation framework
to address the open-world generalization and robustness challenges of outdoor
urban navigation. Key to achieving this is learning perceptual representations
that generalize to open-set factors (e.g. novel semantic classes, terrains,
dynamic entities) and inferring expert-aligned navigation costs from limited
demonstrations. CREStE addresses both these issues, introducing 1) a visual
foundation model (VFM) distillation objective for learning open-set structured
bird's-eye-view perceptual representations, and 2) counterfactual inverse
reinforcement learning (IRL), a novel active learning formulation that uses
counterfactual trajectory demonstrations to reason about the most important
cues when inferring navigation costs. We evaluate CREStE on the task of
kilometer-scale mapless navigation in a variety of city, offroad, and
residential environments and find that it outperforms all state-of-the-art
approaches with 70% fewer human interventions, including a 2-kilometer mission
in an unseen environment with just 1 intervention; showcasing its robustness
and effectiveness for long-horizon mapless navigation. Videos and additional
materials can be found on the project page: https://amrl.cs.utexas.edu/creste
comment: 18 pages, 10 figures, 5 tables
♻ ☆ Generate the Forest before the Trees -- A Hierarchical Diffusion model for Climate Downscaling
Downscaling is essential for generating the high-resolution climate data
needed for local planning, but traditional methods remain computationally
demanding. Recent years have seen impressive results from AI downscaling
models, particularly diffusion models, which have attracted attention due to
their ability to generate ensembles and overcome the smoothing problem common
in other AI methods. However, these models typically remain computationally
intensive. We introduce a Hierarchical Diffusion Downscaling (HDD) model,
which adds an easily extensible hierarchical sampling process to the diffusion
framework. A coarse-to-fine hierarchy is imposed via a simple downsampling
scheme. HDD achieves competitive accuracy on ERA5 reanalysis datasets and CMIP6
models while significantly reducing computational load by running on up to
half as many pixels. Additionally, a single model trained at
0.25{\deg} resolution transfers seamlessly across multiple CMIP6 models with
much coarser resolution. HDD thus offers a lightweight alternative for
probabilistic climate downscaling, facilitating affordable large-ensemble
high-resolution climate projections. See a full code implementation at:
https://github.com/HDD-Hierarchical-Diffusion-Downscaling/HDD-Hierarchical-Diffusion-Downscaling.
comment: 8 pages
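A compact sketch of a coarse-to-fine hierarchy built with plain downsampling,
in the spirit of the hierarchical sampling process above: a field is produced
at coarse resolution, then upsampled and refined at the finer level. Both
'models' below are random placeholders that only indicate where the coarse and
fine diffusion stages would plug in.

import torch
import torch.nn.functional as F

def coarse_model(shape):
    return torch.randn(shape)  # stand-in for a coarse-resolution diffusion sample

def fine_model(coarse_up):
    return coarse_up + 0.1 * torch.randn_like(coarse_up)  # stand-in refinement stage

coarse = coarse_model((1, 1, 32, 32))  # coarse grid
up = F.interpolate(coarse, scale_factor=4, mode="bilinear", align_corners=False)
fine = fine_model(up)                  # 4x finer grid conditioned on the coarse field
print(coarse.shape, fine.shape)  # torch.Size([1, 1, 32, 32]) torch.Size([1, 1, 128, 128])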
♻ ☆ A Multi-Source Data Fusion-based Semantic Segmentation Model for Relic Landslide Detection
As natural disasters, landslides often bring tremendous losses to human
lives, which urgently demands reliable detection of landslide risks. When
detecting relic landslides, which provide important information for landslide
risk warning, problems such as visual blur and small-sized datasets pose great
challenges for the use of remote sensing images. To extract accurate semantic
features, a hyper-pixel-wise contrastive learning augmented segmentation
network (HPCL-Net) is proposed, which augments the local salient feature
extraction from boundaries of landslides through HPCL and fuses heterogeneous
information in the semantic space from high-resolution remote sensing images
and digital elevation model data. For full utilization of precious samples, a
global hyper-pixel-wise sample pair queues-based contrastive learning method is
developed, which includes the construction of global queues that store
hyper-pixel-wise samples and the updating scheme of a momentum encoder,
reliably enhancing the extraction ability of semantic features. The proposed
HPCL-Net is evaluated on the Loess Plateau relic landslide dataset and
experimental results verify that the proposed HPCL-Net greatly outperforms
existing models, where the mIoU is increased from 0.620 to 0.651, the Landslide
IoU is improved from 0.334 to 0.394, and the F1 score is enhanced from 0.501
to 0.565.
♻ ☆ Decouple to Reconstruct: High Quality UHD Restoration via Active Feature Disentanglement and Reversible Fusion ICCV 2025
Ultra-high-definition (UHD) image restoration often faces computational
bottlenecks and information loss due to its extremely high resolution. Existing
studies based on Variational Autoencoders (VAE) improve efficiency by
transferring the image restoration process from pixel space to latent space.
However, because degraded components are inherently coupled with background
elements in degraded images, both the information loss during compression and
the information gain during compensation remain uncontrollable. As a result,
restored images often exhibit detail loss and incomplete degradation removal.
To address
this issue, we propose a Controlled Differential Disentangled VAE, which
utilizes Hierarchical Contrastive Disentanglement Learning and an Orthogonal
Gated Projection Module to guide the VAE to actively discard easily recoverable
background information while encoding more difficult-to-recover degraded
information into the latent space. Additionally, we design a Complex Invertible
Multiscale Fusion Network to handle background features, ensuring their
consistency, and utilize a latent space restoration network to transform the
degraded latent features, leading to more accurate restoration results.
Extensive experimental results demonstrate that our method effectively
alleviates the information loss problem in VAE models while ensuring
computational efficiency, significantly improving the quality of UHD image
restoration, and achieves state-of-the-art results in six UHD restoration tasks
with only 1M parameters.
comment: Accepted by ICCV 2025
♻ ☆ JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers ICCV
We present JointDiT, a diffusion transformer that models the joint
distribution of RGB and depth. By leveraging the architectural benefit and
outstanding image prior of the state-of-the-art diffusion transformer, JointDiT
not only generates high-fidelity images but also produces geometrically
plausible and accurate depth maps. This solid joint distribution modeling is
achieved through two simple yet effective techniques that we propose, i.e.,
adaptive scheduling weights, which depend on the noise levels of each modality,
and the unbalanced timestep sampling strategy. With these techniques, we train
our model across all noise levels for each modality, enabling JointDiT to
naturally handle various combinatorial generation tasks, including joint
generation, depth estimation, and depth-conditioned image generation by simply
controlling the timestep of each branch. JointDiT demonstrates outstanding
joint generation performance. Furthermore, it achieves comparable results in
depth estimation and depth-conditioned image generation, suggesting that joint
distribution modeling can serve as a viable replacement for conditional
generation. The project page is available at
https://byungki-k.github.io/JointDiT/.
comment: Accepted to IEEE/CVF International Conference on Computer Vision
(ICCV) 2025. Project page: https://byungki-k.github.io/JointDiT/ Code:
https://github.com/ByungKi-K/JointDiT-code
♻ ☆ HUG: Hierarchical Urban Gaussian Splatting with Block-Based Reconstruction for Large-Scale Aerial Scenes ICCV
3DGS is an emerging and increasingly popular technology in the field of novel
view synthesis. Its highly realistic rendering quality and real-time rendering
capabilities make it promising for various applications. However, when applied
to large-scale aerial urban scenes, 3DGS methods suffer from issues such as
excessive memory consumption, slow training times, prolonged partitioning
processes, and significant degradation in rendering quality due to the
increased data volume. To tackle these challenges, we introduce \textbf{HUG}, a
novel approach that enhances data partitioning and reconstruction quality by
leveraging a hierarchical neural Gaussian representation. We first propose a
visibility-based data partitioning method that is simple yet highly efficient,
significantly outperforming existing methods in speed. Then, we introduce a
novel hierarchical weighted training approach, combined with other optimization
strategies, to substantially improve reconstruction quality. Our method
achieves state-of-the-art results on one synthetic dataset and four real-world
datasets.
comment: An improved version has recently been accepted to ICCV, manuscript,
not camera-ready
♻ ☆ ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model
Speech-driven 3D facial animation aims to generate realistic lip movements
and facial expressions for 3D head models from arbitrary audio clips. Although
existing diffusion-based methods are capable of producing natural motions,
their slow generation speed limits their application potential. In this paper,
we introduce a novel autoregressive model that achieves real-time generation of
highly synchronized lip movements and realistic head poses and eye blinks by
learning a mapping from speech to a multi-scale motion codebook. Furthermore,
our model can adapt to unseen speaking styles, enabling the creation of 3D
talking avatars with unique personal styles beyond the identities seen during
training. Extensive evaluations and user studies demonstrate that our method
outperforms existing approaches in lip synchronization accuracy and perceived
quality.
comment: More video demonstrations, code, models and data can be found on our
project website: http://xg-chu.site/project_artalk/
♻ ☆ Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model MICCAI25
Wei Li, Ming Hu, Guoan Wang, Lihao Liu, Kaijin Zhou, Junzhi Ning, Xin Guo, Zongyuan Ge, Lixu Gu, Junjun He
In ophthalmic surgery, developing an AI system capable of interpreting
surgical videos and predicting subsequent operations requires numerous
ophthalmic surgical videos with high-quality annotations, which are difficult
to collect due to privacy concerns and labor consumption. Text-guided video
generation (T2V) emerges as a promising solution to overcome this issue by
generating ophthalmic surgical videos based on surgeon instructions. In this
paper, we present Ophora, a pioneering model that can generate ophthalmic
surgical videos following natural language instructions. To construct Ophora,
we first propose a Comprehensive Data Curation pipeline to convert narrative
ophthalmic surgical videos into a large-scale, high-quality dataset comprising
over 160K video-instruction pairs, Ophora-160K. Then, we propose a Progressive
Video-Instruction Tuning scheme to transfer rich spatial-temporal knowledge
from a T2V model pre-trained on natural video-text datasets for
privacy-preserved ophthalmic surgical video generation based on Ophora-160K.
Experiments on video quality evaluation via quantitative analysis and
ophthalmologist feedback demonstrate that Ophora can generate realistic and
reliable ophthalmic surgical videos based on surgeon instructions. We also
validate the capability of Ophora for empowering downstream tasks of ophthalmic
surgical workflow understanding. Code is available at
https://github.com/mar-cry/Ophora.
comment: Early accepted in MICCAI25
♻ ☆ Efficient Image Generation with Variadic Attention Heads CVPR
While the integration of transformers into vision models has yielded
significant improvements on vision tasks, they still require substantial
amounts of computation for both training and inference. Restricted attention
mechanisms
significantly reduce these computational burdens but come at the cost of losing
either global or local coherence. We propose a simple, yet powerful method to
reduce these trade-offs: allow the attention heads of a single transformer to
attend to multiple receptive fields.
We demonstrate our method utilizing Neighborhood Attention (NA) and integrate
it into a StyleGAN based architecture for image generation. With this work,
dubbed StyleNAT, we are able to achieve a FID of 2.05 on FFHQ, a 6% improvement
over StyleGAN-XL, while utilizing 28% fewer parameters and with 4$\times$ the
throughput capacity. StyleNAT achieves the Pareto Frontier on FFHQ-256 and
demonstrates powerful and efficient image generation on other datasets. Our
code and model checkpoints are publicly available at:
https://github.com/SHI-Labs/StyleNAT
comment: Published in eLVM @ CVPR
(https://openaccess.thecvf.com/content/CVPR2025W/eLVM/html/Walton_Efficient_Image_Generation_with_Variadic_Attention_Heads_CVPRW_2025_paper)
Formerly named "StyleNAT: Giving Each Head a New Perspective"