Location-Aware Visual Question Generation with Lightweight Models
- URL: http://arxiv.org/abs/2310.15129v1
- Date: Mon, 23 Oct 2023 17:33:31 GMT
- Title: Location-Aware Visual Question Generation with Lightweight Models
- Authors: Nicholas Collin Suwono, Justin Chih-Yao Chen, Tun Min Hung, Ting-Hao
Kenneth Huang, I-Bin Liao, Yung-Hui Li, Lun-Wei Ku, Shao-Hua Sun
- Abstract summary: This work introduces a novel task, location-aware visual question generation (LocaVQG).
We represent such location-aware information with surrounding images and a GPS coordinate.
We learn a lightweight model that can address the LocaVQG task and fit on an edge device, such as a mobile phone.
- Score: 21.278164764804536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work introduces a novel task, location-aware visual question generation
(LocaVQG), which aims to generate engaging questions from data relevant to a
particular geographical location. Specifically, we represent such
location-aware information with surrounding images and a GPS coordinate. To
tackle this task, we present a dataset generation pipeline that leverages GPT-4
to produce diverse and sophisticated questions. Then, we aim to learn a
lightweight model that can address the LocaVQG task and fit on an edge device,
such as a mobile phone. To this end, we propose a method which can reliably
generate engaging questions from location-aware information. Our proposed
method outperforms baselines regarding human evaluation (e.g., engagement,
grounding, coherence) and automatic evaluation metrics (e.g., BERTScore,
ROUGE-2). Moreover, we conduct extensive ablation studies to justify our
proposed techniques for both generating the dataset and solving the task.
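The abstract reports BERTScore and ROUGE-2 as automatic metrics. Below is a minimal, hedged sketch of how such metrics can be computed for generated questions against references; it assumes the off-the-shelf `bert-score` and `rouge-score` Python packages, and the question strings are illustrative placeholders rather than actual LocaVQG data.

```python
# Hedged sketch: computing BERTScore and ROUGE-2 for generated questions.
# Assumes `pip install bert-score rouge-score`; the example strings below
# are placeholders, not taken from the LocaVQG dataset.
from bert_score import score as bert_score
from rouge_score import rouge_scorer

# Hypothetical generated question and its reference.
candidates = ["What historical event is this plaza known for?"]
references = ["Do you know what famous event took place in this plaza?"]

# BERTScore: returns precision, recall, and F1 tensors (one value per pair).
P, R, F1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")

# ROUGE-2: bigram overlap between candidate and reference.
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
rouge2 = scorer.score(references[0], candidates[0])["rouge2"]
print(f"ROUGE-2 F1: {rouge2.fmeasure:.4f}")
```

BERTScore captures semantic similarity via contextual embeddings, while ROUGE-2 measures surface bigram overlap, so the two complement each other when judging generated questions.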
Related papers
- Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components: GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric.
We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z)
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks.
Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales.
We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z)
- Granular Privacy Control for Geolocation with Vision Language Models [36.3455665044992]
We develop a new benchmark, GPTGeoChat, to test the ability of Vision Language Models to moderate geolocation dialogues with users.
We collect a set of 1,000 image geolocation conversations between in-house annotators and GPT-4v.
We evaluate the ability of various VLMs to moderate GPT-4v geolocation conversations by determining when too much location information has been revealed.
arXiv Detail & Related papers (2024-07-06T04:06:55Z)
- Identifying User Goals from UI Trajectories [19.492331502146886]
We propose a new task: goal identification from observed UI trajectories.
We also introduce a novel evaluation methodology designed to assess whether two intent descriptions can be considered paraphrases.
To benchmark this task, we compare the performance of humans and state-of-the-art models, specifically GPT-4 and Gemini-1.5 Pro.
arXiv Detail & Related papers (2024-06-20T13:46:10Z)
- VBR: A Vision Benchmark in Rome [1.71787484850503]
This paper presents a vision and perception research dataset collected in Rome, featuring RGB data, 3D point clouds, IMU, and GPS data.
We introduce a new benchmark targeting visual odometry and SLAM, to advance the research in autonomous robotics and computer vision.
arXiv Detail & Related papers (2024-04-17T12:34:49Z)
- Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both fine-grained object details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z)
- Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning [16.538887534958555]
We introduce GRiD-A-3D, a novel diagnostic visual question-answering dataset based on abstract objects.
Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions.
We demonstrate that within a few epochs, the subtasks required to reason over relative directions are learned in the order in which relative directions are intuitively processed.
arXiv Detail & Related papers (2022-07-06T12:31:49Z)
- Learning Implicit Feature Alignment Function for Semantic Segmentation [51.36809814890326]
The Implicit Feature Alignment function (IFA) is inspired by the rapidly expanding topic of implicit neural representations.
We show that IFA implicitly aligns the feature maps at different levels and is capable of producing segmentation maps in arbitrary resolutions.
Our method can be combined with various architectures to improve them, and it achieves a state-of-the-art accuracy trade-off on common benchmarks.
arXiv Detail & Related papers (2022-06-17T09:40:14Z)
- Exploiting Scene Graphs for Human-Object Interaction Detection [81.49184987430333]
Human-Object Interaction (HOI) detection is a fundamental visual task aiming at localizing and recognizing interactions between humans and objects.
We propose a novel method that exploits scene-graph information for Human-Object Interaction detection.
Our method, SG2HOI, incorporates the SG information in two ways: (1) we embed a scene graph into a global context clue, serving as the scene-specific environmental context; and (2) we build a relation-aware message-passing module to gather relationships from objects' neighborhood and transfer them into interactions.
arXiv Detail & Related papers (2021-08-19T09:40:50Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of step-by-step instructions.
This approach deviates from real-world problems in which a human only describes what the object and its surroundings look like and asks the robot to start navigation from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
- Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
- Exploiting Scene-specific Features for Object Goal Navigation [9.806910643086043]
We introduce a new reduced dataset that speeds up the training of navigation models.
Our proposed dataset permits the training of models that do not exploit online-built maps in reasonable times.
We propose the SMTSC model, an attention-based model capable of exploiting the correlation between scenes and objects contained in them.
arXiv Detail & Related papers (2020-08-21T10:16:01Z)
- Location-aware Graph Convolutional Networks for Video Question Answering [85.44666165818484]
We propose to represent the contents in the video as a location-aware graph.
Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action.
Our method significantly outperforms state-of-the-art methods on TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
arXiv Detail & Related papers (2020-08-07T02:12:56Z)
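As a rough illustration of the graph-convolution idea in the last entry above, here is a generic, single-layer GCN update sketch in PyTorch. It is not the authors' architecture; the node features and adjacency matrix are hypothetical placeholders (e.g., region features concatenated with a location encoding).

```python
# Generic graph-convolution sketch (standard GCN update), not the paper's
# exact model: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # feats: (num_nodes, in_dim) node features; for illustration, these
        # could be region features with a location encoding (assumption).
        # adj: (num_nodes, num_nodes) binary adjacency matrix.
        a_hat = adj + torch.eye(adj.size(0))          # add self-loops
        deg = a_hat.sum(dim=1)                        # node degrees
        d_inv_sqrt = torch.diag(deg.pow(-0.5))        # D^-1/2
        norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt    # symmetric normalization
        return torch.relu(norm_adj @ self.linear(feats))

# Toy usage: 4 nodes with 16-dim features connected in a chain.
feats = torch.randn(4, 16)
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
out = SimpleGCNLayer(16, 8)(feats, adj)
print(out.shape)  # torch.Size([4, 8])
```

The symmetric normalization keeps feature magnitudes stable as each node aggregates information from its neighbors.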