Visual Transformation Telling
- URL: http://arxiv.org/abs/2305.01928v2
- Date: Tue, 11 Jun 2024 08:49:25 GMT
- Title: Visual Transformation Telling
- Authors: Wanqing Cui, Xin Hong, Yanyan Lan, Liang Pang, Jiafeng Guo, Xueqi Cheng
- Abstract summary: We propose a new visual reasoning task, called Visual Transformation Telling (VTT).
Given a series of states (i.e. images), VTT requires describing the transformation occurring between every two adjacent states.
We collect a novel dataset to support the study of transformation reasoning from two existing instructional video datasets.
- Score: 81.99825888461544
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Humans can naturally reason from superficial state differences (e.g. ground wetness) to transformation descriptions (e.g. raining) according to their life experience. In this paper, we propose a new visual reasoning task to test this transformation reasoning ability in real-world scenarios, called Visual Transformation Telling (VTT). Given a series of states (i.e. images), VTT requires describing the transformation occurring between every two adjacent states. Different from existing visual reasoning tasks that focus on surface state reasoning, the advantage of VTT is that it captures the underlying causes, e.g. actions or events, behind the differences among states. We collect a novel dataset to support the study of transformation reasoning from two existing instructional video datasets, CrossTask and COIN, comprising 13,547 samples. Each sample involves the key state images along with their transformation descriptions. Our dataset covers diverse real-world activities, providing a rich resource for training and evaluation. To construct an initial benchmark for VTT, we test several models, including traditional visual storytelling methods (CST, GLACNet, Densecap) and advanced multimodal large language models (LLaVA v1.5-7B, Qwen-VL-chat, Gemini Pro Vision, GPT-4o, and GPT-4). Experimental results reveal that even state-of-the-art models still face challenges in VTT, highlighting substantial areas for improvement.
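To make the task format concrete, the minimal Python sketch below shows how a single VTT sample (N key-state images paired with N-1 transformation descriptions) might be represented, and how a model could be queried once per adjacent image pair. The `VTTSample` container, the `describe` callback, and the field names are illustrative assumptions for this sketch, not the paper's released data format or code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VTTSample:
    """Hypothetical container for one VTT sample."""
    state_images: List[str]      # paths to the N key-state images
    transformations: List[str]   # N-1 gold transformation descriptions


def predict_transformations(
    sample: VTTSample,
    describe: Callable[[str, str], str],
) -> List[str]:
    """Produce one description per adjacent (before, after) image pair.

    `describe` stands in for any captioning or multimodal-LLM backend;
    its (before_path, after_path) -> text signature is an assumption.
    """
    predictions = []
    for before, after in zip(sample.state_images, sample.state_images[1:]):
        predictions.append(describe(before, after))
    return predictions


if __name__ == "__main__":
    sample = VTTSample(
        state_images=["state_0.jpg", "state_1.jpg", "state_2.jpg"],
        transformations=["crack the eggs into a bowl", "whisk the eggs"],
    )
    # Dummy backend so the sketch runs without any model.
    dummy = lambda before, after: f"transformation between {before} and {after}"
    print(predict_transformations(sample, dummy))
```

Predicted descriptions would then be compared against the gold transformations with standard text-generation metrics; the specific metrics used in the paper's benchmark are not reproduced here.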
Related papers
- Supervised Fine-tuning in turn Improves Visual Foundation Models [74.1760864718129]
A two-stage method, ViSFT (Vision SFT), is proposed to unleash the fine-grained knowledge of vision foundation models.
A vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks.
arXiv Detail & Related papers (2024-01-18T18:58:54Z) - Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the art on various metrics, including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z) - Visual Reasoning: from State to Transformation [80.32402545546209]
Existing visual reasoning tasks ignore an important factor, i.e. transformation.
We propose a novel transformation-driven visual reasoning (TVR) task.
We show that state-of-the-art visual reasoning models perform well on Basic, but are far from human-level intelligence on Event, View, and TRANCO.
arXiv Detail & Related papers (2023-05-02T14:24:12Z) - PointVST: Self-Supervised Pre-training for 3D Point Clouds via View-Specific Point-to-Image Translation [64.858505571083]
This paper proposes a translative pre-training framework, namely PointVST.
It is driven by a novel self-supervised pretext task of cross-modal translation from 3D point clouds to their corresponding diverse forms of 2D rendered images.
arXiv Detail & Related papers (2022-12-29T07:03:29Z) - VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning [19.73126931526359]
Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling.
We first propose a visual-linguistic (VL) feature in which the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements.
We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video.
arXiv Detail & Related papers (2022-11-28T07:39:20Z) - Visuo-Tactile Transformers for Manipulation [4.60687205898687]
We present Visuo-Tactile Transformers (VTTs), a novel multimodal representation learning approach suited for model-based reinforcement learning and planning.
Specifically, VTT uses tactile feedback together with self and cross-modal attention to build latent heatmap representations that focus attention on important task features in the visual domain.
arXiv Detail & Related papers (2022-09-30T22:38:29Z) - Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models [2.9005223064604078]
Transformers are state-of-the-art deep learning models that are composed of stacked attention and point-wise, fully connected layers.
In this work, a Vision Transformer (ViT) is applied to predict the state variables of 2-dimensional Ising model simulations.
arXiv Detail & Related papers (2021-09-28T00:23:31Z) - Transformation Driven Visual Reasoning [80.32402545546209]
This paper defines a new visual reasoning paradigm by introducing an important factor, i.e. transformation.
We argue that this kind of state-driven visual reasoning approach has limitations in reflecting whether the machine has the ability to infer the dynamics between different states.
Experimental results show that the state-of-the-art visual reasoning models perform well on Basic, but are still far from human-level intelligence on Event and View.
arXiv Detail & Related papers (2020-11-26T07:11:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.