ViLPAct: A Benchmark for Compositional Generalization on Multimodal
Human Activities
- URL: http://arxiv.org/abs/2210.05556v1
- Date: Tue, 11 Oct 2022 15:50:51 GMT
- Title: ViLPAct: A Benchmark for Compositional Generalization on Multimodal
Human Activities
- Authors: Terry Yue Zhuo and Yaqing Liao and Yuecheng Lei and Lizhen Qu and
Gerard de Melo and Xiaojun Chang and Yazhou Ren and Zenglin Xu
- Abstract summary: ViLPAct is a vision-language benchmark for human activity planning.
The dataset consists of 2.9k videos from Charades, extended with intents via crowdsourcing.
- Score: 68.93275430102118
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce ViLPAct, a novel vision-language benchmark for human activity
planning. It is designed for a task in which embodied AI agents reason about and
forecast the future actions of humans based on video clips of their initial
activities and their intents expressed in text. The dataset consists of 2.9k videos from
Charades extended with intents via crowdsourcing, a multi-choice question test
set, and four strong baselines. One of the baselines implements a neurosymbolic
approach based on a multimodal knowledge base (MKB), while the others are
deep generative models adapted from recent state-of-the-art (SOTA) methods.
According to our extensive experiments, the key challenges are compositional
generalization and effective use of information from both modalities.
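To make the multi-choice evaluation protocol concrete, below is a minimal sketch of how accuracy might be computed on a ViLPAct-style test set. The field names (observed_actions, intent, candidates, answer_idx) and the toy scoring function are illustrative assumptions, not the benchmark's actual data schema or any of its baselines.
```python
from typing import Callable, Dict, List

# Hypothetical test item; these field names are assumptions for illustration,
# not ViLPAct's actual schema.
Example = Dict[str, object]


def evaluate_multichoice(
    examples: List[Example],
    score_fn: Callable[[List[str], str, List[str]], float],
) -> float:
    """Multiple-choice accuracy: for each example, pick the candidate
    future-action sequence with the highest score."""
    correct = 0
    for ex in examples:
        observed = ex["observed_actions"]   # actions seen in the video clip
        intent = ex["intent"]               # crowdsourced intent in text
        candidates = ex["candidates"]       # candidate future action sequences
        scores = [score_fn(observed, intent, cand) for cand in candidates]
        prediction = max(range(len(scores)), key=scores.__getitem__)
        correct += int(prediction == ex["answer_idx"])
    return correct / len(examples)


def overlap_score(observed: List[str], intent: str, candidate: List[str]) -> float:
    """Toy scorer: word overlap between the intent and the candidate's action
    labels, standing in for a real multimodal model."""
    intent_words = set(intent.lower().split())
    return float(sum(w in intent_words
                     for action in candidate
                     for w in action.lower().split()))


if __name__ == "__main__":
    toy_set = [{
        "observed_actions": ["open fridge", "take food"],
        "intent": "the person wants to cook dinner",
        "candidates": [["watch television"], ["cook food on stove", "eat dinner"]],
        "answer_idx": 1,
    }]
    print(f"accuracy: {evaluate_multichoice(toy_set, overlap_score):.2f}")
```
In practice, the scoring function would be replaced by a multimodal model that consumes the video clip and intent directly rather than textual action labels.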
Related papers
- VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks [100.3234156027118]
We present VLABench, an open-source benchmark for evaluating universal language-conditioned manipulation (LCM) task learning.
VLABench provides 100 carefully designed task categories, with strong randomization within each category and a total of 2000+ objects.
The benchmark assesses multiple competencies, including understanding of meshes and textures, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning.
arXiv Detail & Related papers (2024-12-24T06:03:42Z)
- @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology [31.779074930032184]
Human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously.
We first create a novel AT benchmark (@Bench) guided by a pre-design user study with PVIs.
In addition, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs.
arXiv Detail & Related papers (2024-09-21T18:30:17Z)
- Keypoints-Integrated Instruction-Following Data Generation for Enhanced Human Pose Understanding in Multimodal Models [1.9890559505377343]
We introduce a new method for generating such data by integrating human keypoints with traditional visual features like captions and bounding boxes.
Our approach produces datasets designed for fine-tuning models to excel in human-centric activities.
Experimental results show an overall improvement of 21.18% compared to the original LLaVA-7B model.
arXiv Detail & Related papers (2024-09-14T05:07:57Z)
- A Survey on Vision-Language-Action Models for Embodied AI [71.16123093739932]
Embodied AI is widely recognized as a key element of artificial general intelligence.
A new category of multimodal models has emerged to address language-conditioned robotic tasks in embodied AI.
We present the first survey on vision-language-action models for embodied AI.
arXiv Detail & Related papers (2024-05-23T01:43:54Z)
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses state-of-the-art methods on the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech [107.81472531864195]
Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions.
We present Dynamic-SUPERB, a benchmark for building universal speech models capable of leveraging instruction tuning to perform multiple tasks in a zero-shot fashion.
arXiv Detail & Related papers (2023-09-18T06:43:30Z)
- Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis [25.482853330324748]
Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention in recent years.
Previous approaches either (i) use separately pre-trained visual and textual models, which ignores cross-modal alignment, or (ii) use vision-language models pre-trained with general pre-training tasks.
We propose a task-specific Vision-Language Pre-training framework for MABSA (VLP-MABSA), which is a unified multimodal encoder-decoder architecture for all the pretraining and downstream tasks.
arXiv Detail & Related papers (2022-04-17T08:44:00Z)
- Versatile Multi-Modal Pre-Training for Human-Centric Perception [32.62404509079062]
We propose the Human-Centric Multi-Modal Contrastive Learning framework HCMoCo for effective representation learning.
It combines Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning targets, hierarchically learning a modal-invariant latent space.
Experiments on four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo.
arXiv Detail & Related papers (2022-03-25T17:58:29Z)
- Vision-Language Intelligence: Tasks, Representation Learning, and Large Models [32.142076223602906]
This paper presents a comprehensive survey of vision-language intelligence from the perspective of time.
We summarize the development in this field into three time periods, namely task-specific methods, vision-language pre-training methods, and larger models empowered by large-scale weakly-labeled data.
arXiv Detail & Related papers (2022-03-03T18:54:59Z)
- LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task Activities [119.88381048477854]
We introduce the LEMMA dataset to provide a single home for addressing the missing dimensions of multi-agent, multi-task activities with meticulously designed settings.
We densely annotate the atomic actions with human-object interactions to provide ground truth for the compositionality, scheduling, and assignment of daily activities.
We hope this effort will drive the machine vision community to examine goal-directed human activities and to further study task scheduling and assignment in the real world.
arXiv Detail & Related papers (2020-07-31T00:13:54Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
- The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.