Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems
- URL: http://arxiv.org/abs/2506.14096v2
- Date: Fri, 05 Sep 2025 22:03:39 GMT
- Title: Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems
- Authors: Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
- Abstract summary: This survey systematically reviews the emerging field of LLM-augmented image segmentation. We highlight how these innovations can enhance road scene understanding for autonomous driving, traffic monitoring, and infrastructure maintenance. We identify key challenges, including real-time performance and safety-critical reliability.
- Score: 6.908972852063454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The integration of Large Language Models (LLMs) with computer vision is profoundly transforming perception tasks like image segmentation. For intelligent transportation systems (ITS), where accurate scene understanding is critical for safety and efficiency, this new paradigm offers unprecedented capabilities. This survey systematically reviews the emerging field of LLM-augmented image segmentation, focusing on its applications, challenges, and future directions within ITS. We provide a taxonomy of current approaches based on their prompting mechanisms and core architectures, and we highlight how these innovations can enhance road scene understanding for autonomous driving, traffic monitoring, and infrastructure maintenance. Finally, we identify key challenges, including real-time performance and safety-critical reliability, and outline a perspective centered on explainable, human-centric AI as a prerequisite for the successful deployment of this technology in next-generation transportation systems.
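The prompting mechanisms the survey's taxonomy covers can be illustrated with a minimal, purely hypothetical sketch: given a per-pixel semantic map (as a segmentation model would output) and a free-form text prompt, select the pixels whose class name appears in the prompt. The label set and matching rule here are illustrative assumptions, not drawn from any surveyed system.

```python
# Toy label set for a road scene; illustrative only.
LABELS = {0: "background", 1: "road", 2: "car", 3: "pedestrian"}

def prompt_to_mask(class_map, prompt):
    """Return a binary mask selecting pixels whose class name occurs in the prompt."""
    wanted = {cid for cid, name in LABELS.items() if name in prompt.lower()}
    return [[1 if cid in wanted else 0 for cid in row] for row in class_map]

# A 3x3 "scene": road on top, a car on the right, a pedestrian bottom-left.
scene = [
    [1, 1, 2],
    [1, 2, 2],
    [3, 0, 0],
]
mask = prompt_to_mask(scene, "Segment every car in the scene")
# Only the pixels labeled "car" (class 2) are selected.
```

Real LLM-augmented systems replace the keyword match with learned vision-language grounding, but the input/output contract (text prompt in, binary mask out) is the same.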
Related papers
- Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems [75.78934957242403]
Self-driving vehicles and drones require true Spatial Intelligence from multi-modal onboard sensor data. This paper presents a framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal.
arXiv Detail & Related papers (2025-12-30T17:58:01Z) - AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios [64.51320327698231]
We introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios. We develop an innovative semi-automated collaborative agent-based labeling assistant framework. We also propose HawkEyeTrack, a novel method that collaboratively enhances vision-language representation learning.
arXiv Detail & Related papers (2025-11-26T04:44:27Z) - All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles [7.863490977061713]
Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. Their success is tied to one core capability: reliable object detection in complex and multimodal environments. Recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs.
arXiv Detail & Related papers (2025-10-30T16:08:25Z) - Edge General Intelligence Through World Models and Agentic AI: Fundamentals, Solutions, and Challenges [87.02855999212817]
Edge General Intelligence (EGI) represents a transformative evolution of edge computing, where distributed agents possess the capability to perceive, reason, and act autonomously. World models act as proactive internal simulators that not only predict but also actively imagine future trajectories, reason under uncertainty, and plan multi-step actions with foresight. This survey bridges the gap by offering a comprehensive analysis of how world models can empower agentic artificial intelligence (AI) systems at the edge.
arXiv Detail & Related papers (2025-08-13T07:29:40Z) - Large Language Models and Their Applications in Roadway Safety and Mobility Enhancement: A Comprehensive Review [14.611584622270405]
This paper reviews the application and customization of Large Language Models (LLMs) for enhancing roadway safety and mobility. A key focus is how LLMs are adapted -- via architectural, training, prompting, and multimodal strategies -- to bridge the "modality gap" with transportation's unique spatio-temporal and physical data. Despite significant potential, challenges persist regarding inherent LLM limitations (hallucinations, reasoning deficits), data governance (privacy, bias), deployment complexities (sim-to-real transfer, latency), and rigorous safety assurance.
arXiv Detail & Related papers (2025-05-19T21:51:18Z) - Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision [49.073964142139495]
We systematically review the applications and advancements of multimodal fusion methods and vision-language models. For semantic scene understanding tasks, we categorize fusion approaches into encoder-decoder frameworks, attention-based architectures, and graph neural networks. We identify key challenges in current research, including cross-modal alignment, efficient fusion, real-time deployment, and domain adaptation.
arXiv Detail & Related papers (2025-04-03T10:53:07Z) - A Survey on (M)LLM-Based GUI Agents [62.57899977018417]
Graphical User Interface (GUI) Agents have emerged as a transformative paradigm in human-computer interaction. Recent advances in large language models and multimodal learning have revolutionized GUI automation across desktop, mobile, and web platforms. This survey identifies key technical challenges, including accurate element localization, effective knowledge retrieval, long-horizon planning, and safety-aware execution control.
arXiv Detail & Related papers (2025-03-27T17:58:31Z) - Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap [51.198001060683296]
Large Language Models (LLMs) offer transformative potential to address transportation challenges. This survey first presents LLM4TR, a novel conceptual framework that systematically categorizes the roles of LLMs in transportation. For each role, our review spans diverse applications, from traffic prediction and autonomous driving to safety analytics and urban mobility optimization.
arXiv Detail & Related papers (2025-03-27T11:56:27Z) - Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey [61.39993881402787]
World models and video generation are pivotal technologies in the domain of autonomous driving.
This paper investigates the relationship between these two technologies.
By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions.
arXiv Detail & Related papers (2024-11-05T08:58:35Z) - GenAI-powered Multi-Agent Paradigm for Smart Urban Mobility: Opportunities and Challenges for Integrating Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) with Intelligent Transportation Systems [10.310791311301962]
This paper explores the transformative potential of large language models (LLMs) and emerging Retrieval-Augmented Generation (RAG) technologies.
We propose a conceptual framework aimed at developing multi-agent systems capable of intelligently and conversationally delivering smart mobility services.
arXiv Detail & Related papers (2024-08-31T16:14:42Z) - A Survey of Generative AI for Intelligent Transportation Systems: Road Transportation Perspective [7.770651543578893]
We introduce the principles of different generative AI techniques.
We classify tasks in ITS into four types: traffic perception, traffic prediction, traffic simulation, and traffic decision-making.
We illustrate how generative AI techniques address key issues in these four different types of tasks.
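The four-way task taxonomy above lends itself to a simple lookup structure. The technique pairings below are an illustrative sketch for exposition, not the paper's actual mapping.

```python
# Hypothetical mapping of the four ITS task types to example generative
# AI techniques; the pairings are assumptions for illustration only.
ITS_TASKS = {
    "traffic perception": ["diffusion-based data augmentation"],
    "traffic prediction": ["autoregressive sequence models"],
    "traffic simulation": ["world models", "video generation"],
    "traffic decision-making": ["LLM-based planners"],
}

def techniques_for(task: str) -> list[str]:
    """Look up example generative techniques for one of the four ITS task types."""
    if task not in ITS_TASKS:
        raise ValueError(f"unknown ITS task type: {task!r}")
    return ITS_TASKS[task]
```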
arXiv Detail & Related papers (2023-12-13T16:13:23Z) - Representation Engineering: A Top-Down Approach to AI Transparency [130.33981757928166]
We identify and characterize the emerging area of representation engineering (RepE). RepE places population-level representations, rather than neurons or circuits, at the center of analysis. We showcase how these methods can provide traction on a wide range of safety-relevant problems.
arXiv Detail & Related papers (2023-10-02T17:59:07Z) - Core Challenges in Embodied Vision-Language Planning [11.896110519868545]
Embodied Vision-Language Planning tasks leverage computer vision and natural language for interaction in physical environments.
We propose a taxonomy to unify these tasks and provide an analysis and comparison of the current and new algorithmic approaches.
We advocate for task construction that enables model generalisability and furthers real-world deployment.
arXiv Detail & Related papers (2023-04-05T20:37:13Z) - Camera-Radar Perception for Autonomous Vehicles and ADAS: Concepts, Datasets and Metrics [77.34726150561087]
This work aims to carry out a study on the current scenario of camera and radar-based perception for ADAS and autonomous vehicles.
Concepts and characteristics related to both sensors, as well as to their fusion, are presented.
We give an overview of the Deep Learning-based detection and segmentation tasks, and the main datasets, metrics, challenges, and open questions in vehicle perception.
arXiv Detail & Related papers (2023-03-08T00:48:32Z) - Self-supervised Video Object Segmentation by Motion Grouping [79.13206959575228]
We develop a computer vision system able to segment objects by exploiting motion cues.
We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background.
We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59).
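Benchmarks like DAVIS2016 score segmentation quality primarily by region similarity, i.e. mask intersection-over-union. A minimal sketch of that measure for binary masks stored as nested lists:

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of equal shape."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0

pred = [[1, 1, 0],
        [0, 1, 0]]
gt   = [[1, 0, 0],
        [0, 1, 1]]
score = mask_iou(pred, gt)  # intersection 2, union 4 -> 0.5
```

The official DAVIS protocol averages this per-frame score (its "J" measure) over a sequence and also reports a contour-accuracy "F" measure; this sketch covers only the single-frame region term.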
arXiv Detail & Related papers (2021-04-15T17:59:32Z) - DEEVA: A Deep Learning and IoT Based Computer Vision System to Address Safety and Security of Production Sites in Energy Industry [0.0]
This paper tackles computer vision problems such as scene classification, object detection, semantic segmentation, and scene captioning.
We developed Deep ExxonMobil Eye for Video Analysis (DEEVA) package to handle scene classification, object detection, semantic segmentation and captioning of scenes.
The results reveal that transfer learning with the RetinaNet object detector is able to detect the presence of workers, different types of vehicles/construction equipment, and safety-related objects at a high level of accuracy (above 90%).
arXiv Detail & Related papers (2020-03-02T21:26:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.