TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance
- URL: http://arxiv.org/abs/2504.16505v1
- Date: Wed, 23 Apr 2025 08:32:25 GMT
- Title: TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance
- Authors: Meng Chu, Yukang Chen, Haokun Gui, Shaozuo Yu, Yi Wang, Jiaya Jia
- Abstract summary: We present TraveLLaMA, a specialized multimodal language model designed for urban scene understanding and travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through a novel large-scale dataset of 220k question-answer pairs.
- Score: 48.12326709517022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tourism and travel planning increasingly rely on digital assistance, yet existing multimodal AI systems often lack specialized knowledge and contextual understanding of urban environments. We present TraveLLaMA, a specialized multimodal language model designed for urban scene understanding and travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through a novel large-scale dataset of 220k question-answer pairs. This comprehensive dataset uniquely combines 130k text QA pairs meticulously curated from authentic travel forums with GPT-enhanced responses, alongside 90k vision-language QA pairs specifically focused on map understanding and scene comprehension. Through extensive fine-tuning experiments on state-of-the-art vision-language models (LLaVA, Qwen-VL, Shikra), we demonstrate significant performance improvements ranging from 6.5%-9.4% in both pure text travel understanding and visual question answering tasks. Our model exhibits exceptional capabilities in providing contextual travel recommendations, interpreting map locations, and understanding place-specific imagery while offering practical information such as operating hours and visitor reviews. Comparative evaluations show TraveLLaMA significantly outperforms general-purpose models in travel-specific tasks, establishing a new benchmark for multi-modal travel assistance systems.
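The abstract does not spell out the dataset's serialization format; purely as an illustrative sketch, one of the 90k vision-language QA pairs might be converted into a chat-style fine-tuning record as follows (the field names, prompt template, and file path are all hypothetical):

```python
# Hypothetical sketch of one TraveLLaMA-style vision-language QA pair
# serialized for supervised fine-tuning. Field names and the "<image>"
# prompt convention are assumptions, not the paper's published schema.
import json

qa_pair = {
    "image": "images/cafe_street_view.jpg",  # scene photo or map screenshot
    "question": "What are the opening hours of the cafe in this photo?",
    "answer": "The cafe is open daily from 08:00 to 22:00; weekday "
              "mornings are typically the least crowded time to visit.",
}

def to_chat_record(pair: dict) -> dict:
    """Convert a raw QA pair into a conversation-style training record."""
    return {
        "image": pair["image"],
        "conversations": [
            {"role": "user", "content": "<image>\n" + pair["question"]},
            {"role": "assistant", "content": pair["answer"]},
        ],
    }

print(json.dumps(to_chat_record(qa_pair), indent=2))
```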
Related papers
- GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance [18.467461615621872]
Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV). We introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities.
arXiv Detail & Related papers (2025-03-17T05:43:40Z)
- InsTALL: Context-aware Instructional Task Assistance with Multi-modal Large Language Models [11.913271486031201]
We develop a context-aware instructional task assistant with multi-modal large language models (InsTALL). InsTALL responds in real time to user queries related to the task at hand. We show InsTALL achieves state-of-the-art performance across the proposed sub-tasks for multimodal activity understanding.
arXiv Detail & Related papers (2025-01-21T15:55:06Z)
- VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks [48.67062958311173]
VL-GLUE is a multitask benchmark for visuo-linguistic understanding.
We show that this benchmark is quite challenging for existing large-scale vision-language models.
arXiv Detail & Related papers (2024-10-17T15:27:17Z)
- Advancing Transportation Mode Share Analysis with Built Environment: Deep Hybrid Models with Urban Road Network [12.349403667141559]
We propose deep hybrid models (DHM), which directly combine road networks and sociodemographic features as inputs for travel mode share analysis; a minimal architecture sketch follows this entry.
In experiments on mode share prediction in Chicago, results demonstrate that DHM provides valuable spatial insights into the sociodemographic structure.
arXiv Detail & Related papers (2024-05-23T00:59:00Z)
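A minimal sketch of the deep-hybrid idea, assuming rasterized road-network patches and a small tabular sociodemographic vector; the encoder, layer sizes, and output head are assumptions, not the paper's actual architecture:

```python
# Sketch of a deep hybrid model (DHM): fuse a learned road-network
# embedding with tabular sociodemographic features to predict mode shares.
# All layer sizes and the CNN encoder are illustrative assumptions.
import torch
import torch.nn as nn

class DeepHybridModel(nn.Module):
    def __init__(self, n_socio: int = 12, n_modes: int = 4):
        super().__init__()
        # Encode a rasterized road-network patch around each zone.
        self.road_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 + n_socio, 64), nn.ReLU(),
            nn.Linear(64, n_modes),
        )

    def forward(self, road_img, socio):
        z = self.road_encoder(road_img)               # (B, 32)
        logits = self.head(torch.cat([z, socio], 1))  # (B, n_modes)
        return logits.softmax(dim=-1)                 # shares sum to 1

model = DeepHybridModel()
shares = model(torch.randn(2, 1, 64, 64), torch.randn(2, 12))
print(shares.sum(dim=-1))  # each row sums to ~1.0
```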
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, covering task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on the posed questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- The Urban Toolkit: A Grammar-based Framework for Urban Visual Analytics [5.674216760436341]
The complex nature of urban issues and the overwhelming amount of available data have posed significant challenges in translating these efforts into actionable insights.
When analyzing a feature of interest, an urban expert must transform, integrate, and visualize different thematic (e.g., sunlight access, demographic) and physical (e.g., buildings, street networks) data layers.
This makes the entire visual data exploration and system implementation difficult for programmers and also sets a high entry barrier for urban experts outside of computer science.
arXiv Detail & Related papers (2023-08-15T13:43:04Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language covering 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches; a toy sketch of such a loss follows this entry.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
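The entry above names a multi-instance learning (MIL) loss without giving its form; a common max-pooling MIL objective over bag-level similarities, shown purely as an illustrative assumption rather than PolyViLT's exact formulation, looks like this:

```python
# Toy max-pooling multiple-instance learning (MIL) loss for bag-level
# multimodal alignment. Illustrative assumption, not the paper's exact loss.
import torch
import torch.nn.functional as F

def mil_nce_loss(sim: torch.Tensor) -> torch.Tensor:
    """sim: (B, B, K) similarities between B text bags and B visual bags,
    each holding K candidate instance pairs. The bag score is the max over
    instances; matched bags lie on the diagonal."""
    bag_scores = sim.max(dim=-1).values  # (B, B) bag-level scores
    targets = torch.arange(sim.size(0))  # i-th text matches i-th visual bag
    return F.cross_entropy(bag_scores, targets)

sim = torch.randn(4, 4, 8)  # 4 bags per modality, 8 instances per bag
print(mil_nce_loss(sim))
```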
- Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation [71.67507925788577]
This paper introduces a Multimodal Text Style Transfer (MTST) learning approach for outdoor navigation tasks.
We first enrich the navigation data by transferring the style of instructions generated by the Google Maps API, then pre-train the navigator with the augmented external navigation dataset. Experimental results show that the MTST learning approach is model-agnostic and significantly outperforms the baseline models on the outdoor VLN task; a toy illustration of the style-transfer idea follows this entry.
arXiv Detail & Related papers (2020-07-01T04:29:07Z)
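MTST rewrites terse, template-style map instructions into a richer, vision-grounded style; the learned model does this rewriting, but the hand-written rule below, a purely hypothetical toy, conveys the flavor:

```python
# Toy illustration of instruction style transfer: inject a detected visual
# landmark into a terse, Google-Maps-style instruction. The rewriting rule
# is hypothetical; the actual MTST model learns this mapping from data.
def stylize(instruction: str, landmarks: list[str]) -> str:
    """Ground a template instruction in the first detected landmark."""
    if landmarks and instruction.startswith("Turn"):
        return f"At the {landmarks[0]}, turn" + instruction[len("Turn"):]
    return instruction

print(stylize("Turn left onto Main Street", ["traffic light", "awning"]))
# -> "At the traffic light, turn left onto Main Street"
```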