Large Models in Dialogue for Active Perception and Anomaly Detection
- URL: http://arxiv.org/abs/2501.16300v1
- Date: Mon, 27 Jan 2025 18:38:36 GMT
- Title: Large Models in Dialogue for Active Perception and Anomaly Detection
- Authors: Tzoulio Chamiti, Nikolaos Passalis, Anastasios Tefas
- Abstract summary: We propose a framework to actively collect information and perform anomaly detection in novel scenes. Two deep learning models engage in a dialogue to actively control a drone to increase perception and anomaly detection accuracy. In addition to information gathering, our approach is utilized for anomaly detection, and our results demonstrate the proposed method's effectiveness.
- Score: 35.16837804526144
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Autonomous aerial monitoring is an important task aimed at gathering information from areas that may not be easily accessible by humans. At the same time, this task often requires recognizing anomalies that are seen from a significant distance or that have not been encountered before. In this paper, we propose a novel framework that leverages the advanced capabilities provided by Large Language Models (LLMs) to actively collect information and perform anomaly detection in novel scenes. To this end, we propose an LLM-based model dialogue approach, in which two deep learning models engage in a dialogue to actively control a drone to increase perception and anomaly detection accuracy. We conduct our experiments in a high-fidelity simulation environment where an LLM is provided with a predetermined set of natural language movement commands mapped into executable code functions. Additionally, we deploy a multimodal Visual Question Answering (VQA) model tasked with visual question answering and captioning. By engaging the two models in conversation, the LLM asks exploratory questions while simultaneously flying a drone into different parts of the scene, providing a novel way to implement active perception. By leveraging the LLM's reasoning ability, we output an improved, detailed description of the scene, going beyond existing static perception approaches. In addition to information gathering, our approach is utilized for anomaly detection, and our results demonstrate the proposed method's effectiveness in informing and alerting about potential hazards.
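The dialogue loop described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the Drone class, the command names, stub_llm, stub_vqa, and the "answer starts with yes" anomaly rule are all hypothetical stand-ins for the simulator, the LLM, and the VQA model. The sketch only shows the control flow: natural language commands are looked up in a table of executable functions, the drone moves, and the VQA stub answers the LLM stub's question from the new viewpoint.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# One dialogue turn: (movement command, question asked, answer received).
Turn = Tuple[str, str, str]


@dataclass
class Drone:
    """Hypothetical stand-in for a simulated drone; tracks position only."""
    x: float = 0.0
    y: float = 0.0
    z: float = 10.0

    def move(self, dx: float, dy: float, dz: float) -> None:
        self.x += dx
        self.y += dy
        self.z += dz


def make_command_table(drone: Drone, step: float = 5.0) -> Dict[str, Callable[[], None]]:
    """Map natural language commands to executable functions (names are assumed)."""
    return {
        "move forward": lambda: drone.move(step, 0.0, 0.0),
        "move backward": lambda: drone.move(-step, 0.0, 0.0),
        "ascend": lambda: drone.move(0.0, 0.0, step),
        "descend": lambda: drone.move(0.0, 0.0, -step),
    }


def stub_llm(history: List[Turn]) -> Tuple[str, str]:
    """Hypothetical LLM: picks the next command and question from a fixed script."""
    script = [
        ("descend", "What objects are visible?"),
        ("move forward", "Is there smoke or fire?"),
    ]
    return script[len(history) % len(script)]


def stub_vqa(drone: Drone, question: str) -> str:
    """Hypothetical VQA model: only 'sees' smoke when the drone is low enough."""
    if "smoke" in question.lower() and drone.z <= 5.0:
        return "Yes, smoke is visible near the building."
    return "Nothing unusual is visible."


def run_dialogue(drone: Drone, llm, vqa, max_turns: int = 2) -> List[Turn]:
    """Alternate between moving the drone and questioning the VQA model."""
    commands = make_command_table(drone)
    history: List[Turn] = []
    for _ in range(max_turns):
        command, question = llm(history)
        if command in commands:          # ignore commands outside the fixed set
            commands[command]()
        answer = vqa(drone, question)
        history.append((command, question, answer))
    return history


def detect_anomaly(history: List[Turn]) -> bool:
    """Toy rule (an assumption): any affirmative answer counts as an anomaly."""
    return any(answer.lower().startswith("yes") for _, _, answer in history)
```

Calling run_dialogue(Drone(), stub_llm, stub_vqa) and passing the result to detect_anomaly exercises the whole loop. In a real system the two stubs would be replaced by calls to an actual LLM and VQA model, and the command table would wrap the simulator's flight API.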
Related papers
- Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition [71.5328300638085]
Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions.
Existing methods, including two-stage methods, tightly couple interaction recognition with a specific detector.
We propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR.
arXiv Detail & Related papers (2026-02-16T19:01:31Z)
- Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection [6.491407316650203]
Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes.
We aim to explore the potential of MLLMs for audio deepfake detection.
arXiv Detail & Related papers (2026-01-02T18:17:22Z)
- AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios [64.51320327698231]
We introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios.
We develop an innovative semi-automated collaborative agent-based labeling assistant framework.
We also propose HawkEyeTrack, a novel method that collaboratively enhances vision-language representation learning.
arXiv Detail & Related papers (2025-11-26T04:44:27Z)
- Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection [33.77002721234086]
We propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs).
Our method focuses on extracting and interpreting object activity and interactions over time.
Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability.
arXiv Detail & Related papers (2025-10-16T17:13:33Z)
- IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection [70.02774285130238]
This paper explores the combination of rich text semantics with both image-level and pixel-level information from images.
We propose IAD-GPT, a novel paradigm based on MLLMs for Industrial Anomaly Detection.
Experiments on MVTec-AD and VisA datasets demonstrate our state-of-the-art performance.
arXiv Detail & Related papers (2025-10-16T02:48:05Z)
- Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method [10.748210940033484]
Large language models (LLMs) and vision-language models (VLMs) have achieved significant success.
Due to the substantial differences between remote sensing images and conventional optical images, these models face challenges in comprehension.
This letter explores the application of VLMs for object detection in remote sensing images.
arXiv Detail & Related papers (2025-03-11T08:02:54Z)
- ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges.
One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning.
We propose visual-temporal context, a novel communication protocol between VLMs and policy models.
arXiv Detail & Related papers (2024-10-23T13:26:59Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability.
To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT.
This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- Hawk: Learning to Understand Open-World Video Anomalies [76.9631436818573]
Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs.
We introduce Hawk, a novel framework that leverages interactive large Visual Language Models (VLM) to interpret video anomalies precisely.
We have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions.
arXiv Detail & Related papers (2024-05-27T07:08:58Z)
- Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods tend to be domain-specific, which makes them costly for practical deployment.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
- HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving [44.06475712570428]
HiLM-D is a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP.
Our method is motivated by the fact that the primary variations in autonomous driving scenarios are the motion trajectories.
Experiments show HiLM-D's significant improvements over current MLLMs, with a 3.7% improvement in BLEU-4 for captioning and 8.7% in mIoU for detection.
arXiv Detail & Related papers (2023-09-11T01:24:13Z)
- Look, Remember and Reason: Grounded reasoning in videos with language models [5.3445140425713245]
Multi-temporal language models (LM) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z)
- H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
arXiv Detail & Related papers (2022-10-22T18:39:33Z)
- Multi-Agent Active Search using Realistic Depth-Aware Noise Model [8.520962086877548]
Active search for objects of interest in an unknown environment has many robotics applications including search and rescue, detecting gas leaks or locating animal poachers.
Existing algorithms often prioritize the location accuracy of objects of interest while other practical issues such as the reliability of object detection as a function of distance and lines of sight remain largely ignored.
We present an algorithm called Noise-Aware Thompson Sampling (NATS) that addresses these issues for multiple ground-based robots performing active search considering two sources of sensory information from monocular optical imagery and depth maps.
arXiv Detail & Related papers (2020-11-09T23:20:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.