@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology
- URL: http://arxiv.org/abs/2409.14215v2
- Date: Mon, 25 Nov 2024 15:36:32 GMT
- Title: @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology
- Authors: Xin Jiang, Junwei Zheng, Ruiping Liu, Jiahang Li, Jiaming Zhang, Sven Matthiesen, Rainer Stiefelhagen,
- Abstract summary: Human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously.
We first create a novel AT benchmark (@Bench) guided by a pre-design user study with PVIs.
Besides, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs.
- Score: 31.779074930032184
- License:
- Abstract: As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously. However, benchmarking VLMs for ATs remains under-explored. To bridge this gap, we first create a novel AT benchmark (@Bench). Guided by a pre-design user study with PVIs, our benchmark includes the five most crucial vision-language tasks: Panoptic Segmentation, Depth Estimation, Optical Character Recognition (OCR), Image Captioning, and Visual Question Answering (VQA). Besides, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs. Our framework exhibits outstanding performance across tasks by integrating multi-modal information, and it offers PVIs a more comprehensive assistance. Extensive experiments prove the effectiveness and generalizability of our framework.
Related papers
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information.
We introduce AutoBench-V, an automated framework for serving evaluation on demand.
Through an extensive evaluation of seven popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs)
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z) - VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks [48.67062958311173]
VL-GLUE is a multitask benchmark for natural language understanding.
We show that this benchmark is quite challenging for existing large-scale vision-language models.
arXiv Detail & Related papers (2024-10-17T15:27:17Z) - Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning [15.296263261737026]
We introduce a Multi-Image MIRB Benchmark to evaluate visual language models' ability to compare, analyze, and reason across multiple images.
Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning.
We demonstrate that while open-source VLMs were shown to approach the GPT-4V in single-image tasks, a significant gap remains in multi-image reasoning tasks.
arXiv Detail & Related papers (2024-06-18T16:02:18Z) - Visual Instruction Tuning towards General-Purpose Multimodal Model: A
Survey [59.95153883166705]
Traditional computer vision generally solves each single task independently by a dedicated model with the task instruction implicitly designed in the model architecture.
Visual Instruction Tuning (VIT) has been intensively studied recently, which finetunes a large vision model with language as task instructions.
This work aims to provide a systematic review of visual instruction tuning, covering (1) the background that presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT that introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; and (3) the commonly used datasets in visual instruction tuning and evaluation.
arXiv Detail & Related papers (2023-12-27T14:54:37Z) - Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z) - A Continual Learning Paradigm for Non-differentiable Visual Programming
Frameworks on Visual Reasoning Tasks [51.053901491986025]
We propose a Continuous Learning paradigm for VisProg across various visual reasoning tasks.
Our CLVP distills the capabilities of well-trained task-specific models into the visual sub-modules in a stepwise and anti-forgetting manner.
arXiv Detail & Related papers (2023-09-18T14:28:47Z) - VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models [21.549122658275383]
Recent advances in vision-language pre-training have demonstrated impressive performance in a range of vision-language tasks.
We introduce the Vision-Language Understanding Evaluation benchmark, a multi-task multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off.
arXiv Detail & Related papers (2022-05-30T16:52:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.