Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models
- URL: http://arxiv.org/abs/2108.02562v1
- Date: Mon, 5 Jul 2021 12:54:05 GMT
- Title: Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models
- Authors: Khazar Khorrami, Okko Räsänen
- Abstract summary: This work studies multimodal learning in the context of visually grounded speech (VGS) models.
We introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words.
We show that cross-modal attention helps the model to achieve higher semantic cross-modal retrieval performance.
- Score: 2.1320960069210484
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Systems that can find correspondences between multiple modalities, such as
between speech and images, have great potential to solve different recognition
and data analysis tasks in an unsupervised manner. This work studies multimodal
learning in the context of visually grounded speech (VGS) models, and focuses
on their recently demonstrated capability to extract spatiotemporal alignments
between spoken words and the corresponding visual objects without ever being
explicitly trained for object localization or word recognition. As the main
contributions, we formalize the alignment problem in terms of an audiovisual
alignment tensor that is based on earlier VGS work, introduce systematic
metrics for evaluating model performance in aligning visual objects and spoken
words, and propose a new VGS model variant for the alignment task utilizing a
cross-modal attention layer. We test our model and a previously proposed model
in the alignment task using SPEECH-COCO captions coupled with MSCOCO images. We
compare the alignment performance using our proposed evaluation metrics to the
semantic retrieval task commonly used to evaluate VGS models. We show that the
cross-modal attention layer not only helps the model to achieve higher semantic
cross-modal retrieval performance, but also leads to substantial improvements
in the alignment performance between image objects and spoken words.
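As an illustrative sketch only (not the authors' implementation), the alignment tensor described above can be thought of as a table of cross-modal similarity scores between speech frames and image regions, from which attention-style weights can be derived. The feature names, shapes, and the cosine-similarity scoring below are assumptions.

```python
# Hedged sketch: an audiovisual "alignment tensor" as frame-by-region
# cosine similarities, optionally turned into attention-style weights.
# Shapes, names, and scoring are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def alignment_scores(speech_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
    """speech_feats: (T, d) per-frame speech embeddings.
    image_feats:  (R, d) per-region visual embeddings.
    Returns a (T, R) tensor of alignment scores."""
    s = F.normalize(speech_feats, dim=-1)   # unit-norm speech frames
    v = F.normalize(image_feats, dim=-1)    # unit-norm image regions
    return s @ v.t()                        # cosine similarity for every frame/region pair

# Toy usage: 50 speech frames and 36 image regions in a 512-d shared space.
scores = alignment_scores(torch.randn(50, 512), torch.randn(36, 512))
attn = F.softmax(scores / 0.1, dim=-1)      # attention-style weights over regions
print(scores.shape, attn.shape)             # torch.Size([50, 36]) torch.Size([50, 36])
```

In a setup like this, semantic retrieval would be scored by pooling the same scores into a single image-caption similarity (e.g., max over regions, mean over frames) and computing recall@k, whereas alignment quality is evaluated directly on the frame-by-region scores; the paper's actual metrics are defined over its audiovisual alignment tensor.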
Related papers
- Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
SIMA is a framework that enhances visual and language modality alignment through self-improvement.
It employs an in-context self-critic mechanism to select response pairs for preference tuning.
We demonstrate that SIMA achieves superior modality alignment, outperforming previous approaches.
arXiv Detail & Related papers (2024-05-24T23:09:27Z) - Improved Baselines for Data-efficient Perceptual Augmentation of LLMs [66.05826802808177]
In computer vision, large language models (LLMs) can be used to prime vision-language tasks such as image captioning and visual question answering.
We present an experimental evaluation of different interfacing mechanisms, across multiple tasks.
We identify a new interfacing mechanism that yields (near) optimal results across different tasks, while obtaining a 4x reduction in training time.
arXiv Detail & Related papers (2024-03-20T10:57:17Z) - Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z) - Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding [36.03994217853856]
The performance of object proposals generated for Vision-Language (VL) tasks is currently evaluated across all available annotations.
Our work serves as a study of this phenomenon and explores the effectiveness of semantic grounding to mitigate its effects.
We show that our method is consistent and demonstrates greatly improved alignment with annotations selected by image captioning metrics and human annotation.
arXiv Detail & Related papers (2023-09-01T02:19:41Z) - Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z) - Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z) - Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval [8.855547063009828]
We propose a Cross-modal Semantic Enhanced Interaction method, termed CMSEI for image-sentence retrieval.
We first design intra- and inter-modal spatial and semantic graph-based reasoning to enhance the semantic representations of objects.
To correlate the context of objects with the textual context, we further refine the visual semantic representation via the cross-level object-sentence and word-image based interactive attention.
arXiv Detail & Related papers (2022-10-17T10:01:16Z) - Detection and Captioning with Unseen Object Classes [12.894104422808242]
Test images may contain visual objects with no corresponding visual or textual training examples.
We propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model.
Our experiments show that the proposed zero-shot detection model obtains state-of-the-art performance on the MS-COCO dataset.
arXiv Detail & Related papers (2021-08-13T10:43:20Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z) - Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)