Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models
- URL: http://arxiv.org/abs/2108.02562v1
- Date: Mon, 5 Jul 2021 12:54:05 GMT
- Title: Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models
- Authors: Khazar Khorrami, Okko Räsänen
- Abstract summary: This work studies multimodal learning in the context of visually grounded speech (VGS) models.
We introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words.
We show that cross-modal attention helps the model to achieve higher semantic cross-modal retrieval performance.
- Score: 2.1320960069210484
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Systems that can find correspondences between multiple modalities, such as
between speech and images, have great potential to solve different recognition
and data analysis tasks in an unsupervised manner. This work studies multimodal
learning in the context of visually grounded speech (VGS) models, and focuses
on their recently demonstrated capability to extract spatiotemporal alignments
between spoken words and the corresponding visual objects without ever being
explicitly trained for object localization or word recognition. As the main
contributions, we formalize the alignment problem in terms of an audiovisual
alignment tensor that is based on earlier VGS work, introduce systematic
metrics for evaluating model performance in aligning visual objects and spoken
words, and propose a new VGS model variant for the alignment task utilizing a
cross-modal attention layer. We test our model and a previously proposed model
in the alignment task using SPEECH-COCO captions coupled with MSCOCO images. We
compare the alignment performance using our proposed evaluation metrics to the
semantic retrieval task commonly used to evaluate VGS models. We show that the
cross-modal attention layer not only helps the model to achieve higher semantic
cross-modal retrieval performance, but also leads to substantial improvements
in the alignment performance between image objects and spoken words.
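As an illustrative sketch only (not the authors' implementation), the alignment tensor described above can be thought of as a table of cross-modal similarity scores between speech frames and image regions, from which attention-style weights can be derived. The feature names, shapes, and the cosine-similarity scoring below are assumptions.

```python
# Hedged sketch: an audiovisual "alignment tensor" as frame-by-region
# cosine similarities, optionally turned into attention-style weights.
# Shapes, names, and scoring are assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def alignment_scores(speech_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
    """speech_feats: (T, d) per-frame speech embeddings.
    image_feats:  (R, d) per-region visual embeddings.
    Returns a (T, R) tensor of alignment scores."""
    s = F.normalize(speech_feats, dim=-1)   # unit-norm speech frames
    v = F.normalize(image_feats, dim=-1)    # unit-norm image regions
    return s @ v.t()                        # cosine similarity for every frame/region pair

# Toy usage: 50 speech frames and 36 image regions in a 512-d shared space.
scores = alignment_scores(torch.randn(50, 512), torch.randn(36, 512))
attn = F.softmax(scores / 0.1, dim=-1)      # attention-style weights over regions
print(scores.shape, attn.shape)             # torch.Size([50, 36]) torch.Size([50, 36])
```

In a setup like this, semantic retrieval would be scored by pooling the same scores into a single image-caption similarity (e.g., max over regions, mean over frames) and computing recall@k, whereas alignment quality is evaluated directly on the frame-by-region scores; the paper's actual metrics are defined over its audiovisual alignment tensor.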
Related papers
- Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
SIMA is a framework that enhances visual and language modality alignment through self-improvement.
It employs an in-context self-critic mechanism to select response pairs for preference tuning.
We demonstrate that SIMA achieves superior modality alignment, outperforming previous approaches.
arXiv Detail & Related papers (2024-05-24T23:09:27Z) - Improved Baselines for Data-efficient Perceptual Augmentation of LLMs [66.05826802808177]
In computer vision, large language models (LLMs) can be used to prime vision-language tasks such as image captioning and visual question answering.
We present an experimental evaluation of different interfacing mechanisms, across multiple tasks.
We identify a new interfacing mechanism that yields (near) optimal results across different tasks, while obtaining a 4x reduction in training time.
arXiv Detail & Related papers (2024-03-20T10:57:17Z) - Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z) - Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding [36.03994217853856]
The performance of object proposals generated for Vision-Language (VL) tasks is currently evaluated across all available annotations.
Our work serves as a study of this phenomenon and explores the effectiveness of semantic grounding to mitigate its effects.
We show that our method is consistent and demonstrates greatly improved alignment with annotations selected by image captioning metrics and human annotation.
arXiv Detail & Related papers (2023-09-01T02:19:41Z) - Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models [3.86170450233149]
We show that large vision-and-language models (VLMs) trained to match images with text lack fine-grained understanding of spatial relations.
We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses.
arXiv Detail & Related papers (2023-08-18T18:58:54Z) - Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z) - Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval [8.855547063009828]
We propose a Cross-modal Semantic Enhanced Interaction method, termed CMSEI for image-sentence retrieval.
We first design intra- and inter-modal spatial and semantic graph-based reasoning to enhance the semantic representations of objects.
To correlate the context of objects with the textual context, we further refine the visual semantic representation via the cross-level object-sentence and word-image based interactive attention.
arXiv Detail & Related papers (2022-10-17T10:01:16Z) - Detection and Captioning with Unseen Object Classes [12.894104422808242]
Test images may contain visual objects with no corresponding visual or textual training examples.
We propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model.
Our experiments show that the proposed zero-shot detection model obtains state-of-the-art performance on the MS-COCO dataset.
arXiv Detail & Related papers (2021-08-13T10:43:20Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z) - Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)