The Role of the Input in Natural Language Video Description
- URL: http://arxiv.org/abs/2102.05067v1
- Date: Tue, 9 Feb 2021 19:00:35 GMT
- Title: The Role of the Input in Natural Language Video Description
- Authors: Silvia Cascianelli, Gabriele Costante, Alessandro Devo, Thomas A.
Ciarfuglia, Paolo Valigi, Mario L. Fravolini
- Abstract summary: Natural Language Video Description (NLVD) has recently received strong interest in the Computer Vision, Natural Language Processing, Multimedia, and Autonomous Robotics communities.
This work presents an extensive study of the role of the visual input, evaluated with respect to overall NLP performance.
A t-SNE-based analysis is proposed to evaluate the effects of the considered transformations on the overall visual data distribution.
- Score: 60.03448250024277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural Language Video Description (NLVD) has recently received strong
interest in the Computer Vision, Natural Language Processing (NLP), Multimedia,
and Autonomous Robotics communities. The State-of-the-Art (SotA) approaches
obtained remarkable results when tested on the benchmark datasets. However,
these approaches generalize poorly to new datasets. In addition, none of the
existing works focuses on the processing of the input to NLVD systems, which
is both visual and textual. This work presents an extensive study of the role
of the visual input, evaluated with respect to overall NLP performance. This
is achieved by performing data augmentation on the visual component, applying
common transformations that model the camera distortions, noise, lighting, and
camera positioning typical of real-world operative scenarios. A t-SNE-based
analysis is proposed to evaluate the effects of the considered transformations
on the overall visual data distribution. The study considers the English subset
of the Microsoft Research Video Description (MSVD) dataset, which is commonly
used for NLVD. This dataset was observed to contain a considerable number of
syntactic and semantic errors. These errors were manually corrected, and the
new version of the dataset (called MSVD-v2) is used in the experiments. The
MSVD-v2 dataset is released to help gain insight into the NLVD problem.
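To make the described pipeline concrete, below is a minimal Python sketch of the two steps discussed in the abstract: augmenting video frames with transformations that roughly model camera distortion, noise, lighting, and camera positioning, and then projecting frame features with t-SNE to inspect how the transformations shift the visual data distribution. This is not the authors' released code; the library choices (OpenCV, scikit-learn), the color-histogram features, and all parameter values are illustrative assumptions.

```python
# Hedged sketch: frame-level augmentations + t-SNE inspection of their effect.
# Libraries, feature choice, and parameters are assumptions, not the paper's code.
import cv2
import numpy as np
from sklearn.manifold import TSNE

def radial_warp(frame, k1=-0.3):
    """Radial geometric warp via cv2.undistort: the frame is 'corrected' as if
    shot with a lens of coefficient k1, producing a distortion-like change."""
    h, w = frame.shape[:2]
    cam = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]], dtype=np.float32)
    dist = np.array([k1, 0, 0, 0], dtype=np.float32)
    return cv2.undistort(frame, cam, dist)

def add_noise(frame, sigma=10.0):
    """Additive Gaussian sensor noise."""
    noisy = frame.astype(np.float32) + np.random.normal(0, sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def change_lighting(frame, gamma=1.5):
    """Gamma correction to simulate under- or over-exposure."""
    table = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)
    return cv2.LUT(frame, table)

def reposition(frame, crop=0.8):
    """Center crop + resize, a crude stand-in for a different camera placement."""
    h, w = frame.shape[:2]
    ch, cw = int(h * crop), int(w * crop)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    return cv2.resize(frame[y0:y0 + ch, x0:x0 + cw], (w, h))

def frame_features(frames):
    """Placeholder descriptor: per-frame color histograms (a real NLVD encoder
    would typically use CNN features instead)."""
    feats = []
    for f in frames:
        hist = cv2.calcHist([f], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256]).flatten()
        feats.append(hist / (hist.sum() + 1e-8))
    return np.stack(feats)

# Toy example with random frames: embed original and augmented frames together,
# then scatter-plot the 2-D points by condition to compare the distributions.
frames = [np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8) for _ in range(50)]
augmented = [reposition(change_lighting(add_noise(radial_warp(f)))) for f in frames]
features = frame_features(frames + augmented)
embedding = TSNE(n_components=2, perplexity=20, init="pca").fit_transform(features)
print(embedding.shape)  # (100, 2)
```

The first 50 rows of `embedding` correspond to original frames and the last 50 to augmented ones, so coloring the scatter plot by condition gives a rough visual check of how far the transformations move the data, in the spirit of the t-SNE analysis described above.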
Related papers
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR [51.72751335574947]
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers).
This paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR.
arXiv Detail & Related papers (2024-05-27T08:26:58Z)
- Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study [41.84915013818794]
The Natural Language to Visualization (NL2Vis) task aims to transform natural-language descriptions into visual representations for a grounded table.
Many deep learning-based approaches have been developed for NL2Vis, but challenges persist in visualizing data sourced from unseen databases or spanning multiple tables.
Taking inspiration from the remarkable generation capabilities of Large Language Models (LLMs), this paper conducts an empirical study to evaluate their potential in generating visualizations.
arXiv Detail & Related papers (2024-04-26T03:25:35Z)
- DimVis: Interpreting Visual Clusters in Dimensionality Reduction With Explainable Boosting Machine [3.2748787252933442]
DimVis is a tool that employs supervised Explainable Boosting Machine (EBM) models as an interpretation assistant for DR projections.
Our tool facilitates high-dimensional data analysis by providing an interpretation of feature relevance in visual clusters.
arXiv Detail & Related papers (2024-02-10T04:50:36Z)
- How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection [25.506346503624894]
We propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge.
The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input.
arXiv Detail & Related papers (2023-08-25T04:54:32Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning [25.520406167426135]
We present TraVLR, a synthetic dataset comprising four visio-linguistic (V+L) reasoning tasks.
Each example in TraVLR redundantly encodes the scene in two modalities, allowing either to be dropped or added during training or testing without losing relevant information.
We compare the performance of four state-of-the-art V+L models, finding that while they perform well on test examples from the same modality, they all fail at cross-modal transfer.
arXiv Detail & Related papers (2021-11-21T07:22:44Z)
- e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.