Efficient Odd-One-Out Anomaly Detection
- URL: http://arxiv.org/abs/2509.04326v1
- Date: Thu, 04 Sep 2025 15:44:37 GMT
- Title: Efficient Odd-One-Out Anomaly Detection
- Authors: Silvio Chito, Paolo Rabino, Tatiana Tommasi
- Abstract summary: The odd-one-out anomaly detection task involves identifying odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models. We propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of three.
- Score: 7.456608146535316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently introduced odd-one-out anomaly detection task involves identifying the odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models, demanding spatial reasoning across multiple views and relational reasoning to understand context and generalize across varying object categories and layouts. We argue that these challenges must be addressed with efficiency in mind. To this end, we propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of three compared to the current state-of-the-art, while maintaining competitive performance. Our experimental evaluation also introduces a Multimodal Large Language Model baseline, providing insights into its current limitations in structured visual reasoning tasks. The project page can be found at https://silviochito.github.io/EfficientOddOneOut/
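The abstract describes scoring instances within a scene to find the odd-looking ones. As an illustrative sketch only (the paper's DINO-based model is more involved, with multi-view spatial and relational reasoning), the core odd-one-out scoring idea can be shown with per-object feature vectors: flag the instance whose embedding lies farthest, in cosine distance, from the group mean. The function and toy data below are hypothetical, not the authors' implementation.

```python
import numpy as np

def odd_one_out(features: np.ndarray) -> int:
    """Return the index of the instance farthest (in cosine distance)
    from the mean embedding of all instances in the scene."""
    # L2-normalize each per-object feature vector
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    mean = feats.mean(axis=0)
    mean /= np.linalg.norm(mean)
    # Cosine distance of each instance to the group mean
    scores = 1.0 - feats @ mean
    return int(np.argmax(scores))

# Toy example: four similar instances plus one clear outlier
rng = np.random.default_rng(0)
group = rng.normal(loc=1.0, scale=0.05, size=(4, 8))
outlier = rng.normal(loc=-1.0, scale=0.05, size=(1, 8))
feats = np.vstack([group, outlier])
print(odd_one_out(feats))  # index 4: the anomalous instance
```

In practice the features would come from a pretrained backbone such as DINO, and a learned relational module would replace the simple distance-to-mean score.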
Related papers
- Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search [85.201906907271]
Mini-o3 is a system that executes deep, multi-turn reasoning spanning tens of steps. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths.
arXiv Detail & Related papers (2025-09-09T17:54:21Z) - Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse [9.542503507653494]
Chain-of-thought (CoT) prompting has become a widely used strategy for improving large language and multimodal model performance. This paper focuses on six representative tasks from the psychological literature where deliberation hurts performance in humans. In three of these tasks, state-of-the-art models exhibit significant performance drop-offs with CoT. While models and humans do not exhibit perfectly parallel cognitive processes, considering cases where thinking has negative consequences for humans helps identify settings where it negatively impacts models.
arXiv Detail & Related papers (2024-10-27T18:30:41Z) - RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks.
RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model.
Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z) - Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model [86.9619638550683]
Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data. However, these models display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of "decision shortcuts".
arXiv Detail & Related papers (2024-03-01T09:01:53Z) - Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion [86.6191592951269]
Merging models that are fine-tuned from a common, extensively pretrained large model but specialized for different tasks has been demonstrated as a cheap and scalable strategy to construct a multitask model that performs well across diverse tasks.
We propose the CONtinuous relaxation of disCrete (Concrete) subspace learning method to identify a common low-dimensional subspace and utilize its shared information to tackle the interference problem without sacrificing performance.
arXiv Detail & Related papers (2023-12-11T07:24:54Z) - ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos [53.92440577914417]
ACQUIRED consists of 3.9K annotated videos, encompassing a wide range of event types and incorporating both first and third-person viewpoints.
Each video is annotated with questions that span three distinct dimensions of reasoning, including physical, social, and temporal.
We benchmark our dataset against several state-of-the-art language-only and multimodal models and experimental results demonstrate a significant performance gap.
arXiv Detail & Related papers (2023-11-02T22:17:03Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, the network must distinguish the positive query from other highly similar queries that are not the best match, which poses a challenge.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - Deep Non-Monotonic Reasoning for Visual Abstract Reasoning Tasks [3.486683381782259]
This paper proposes a non-monotonic computational approach to solve visual abstract reasoning tasks.
We implement a deep learning model using this approach and test it on the RAVEN dataset, a dataset inspired by the Raven's Progressive Matrices test.
arXiv Detail & Related papers (2023-02-08T16:35:05Z) - Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning [98.78136504619539]
Causal Triplet is a causal representation learning benchmark featuring visually more complex scenes.
We show that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts.
arXiv Detail & Related papers (2023-01-12T17:43:38Z) - Assisting Scene Graph Generation with Self-Supervision [21.89909688056478]
We propose a set of three novel yet simple self-supervision tasks and train them as auxiliary multi-tasks to the main model.
When we train the base model from scratch with these self-supervision tasks, we achieve state-of-the-art results across all metrics and recall settings.
arXiv Detail & Related papers (2020-08-08T16:38:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.