SpeechCLIP+: Self-supervised multi-task representation learning for
speech via CLIP and speech-image data
- URL: http://arxiv.org/abs/2402.06959v1
- Date: Sat, 10 Feb 2024 14:26:42 GMT
- Title: SpeechCLIP+: Self-supervised multi-task representation learning for
speech via CLIP and speech-image data
- Authors: Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng,
Hung-yi Lee, Hsin-Min Wang, David Harwath
- Abstract summary: SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription.
This paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture.
Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework.
- Score: 69.20254987896674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently proposed visually grounded speech model SpeechCLIP is an
innovative framework that bridges speech and text through images via CLIP
without relying on text transcription. On this basis, this paper introduces two
extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire
(CIF) module to replace a fixed number of CLS tokens in the cascaded
architecture. Second, we propose a new hybrid architecture that merges the
cascaded and parallel architectures of SpeechCLIP into a multi-task learning
framework. Our experimental evaluation is performed on the Flickr8k and
SpokenCOCO datasets. The results show that in the speech keyword extraction
task, the CIF-based cascaded SpeechCLIP model outperforms the previous cascaded
SpeechCLIP model that uses a fixed number of CLS tokens. Furthermore, through our
hybrid architecture, cascaded task learning boosts the performance of the
parallel branch in image-speech retrieval tasks.
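Below is a minimal, hedged PyTorch sketch of the two extensions described in the abstract: a toy Continuous Integrate-and-Fire routine that emits a variable number of segment embeddings (instead of a fixed number of CLS tokens), and a hybrid model that trains a cascaded branch and a parallel branch jointly with CLIP-style contrastive losses against the image embedding. The module names, dimensions, pooling choices, and the simplified CIF routine are assumptions for illustration; the pretrained speech and CLIP encoders used by SpeechCLIP are replaced by linear stand-ins, so this does not reproduce the authors' implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def clip_style_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def cif_segment(frames, alphas, threshold=1.0):
    """Toy Continuous Integrate-and-Fire: accumulate per-frame weights and emit a
    weighted segment embedding whenever the accumulator crosses the threshold,
    so the number of segments adapts to the utterance instead of being fixed."""
    segments, acc, weighted = [], 0.0, torch.zeros_like(frames[0])
    for frame, alpha in zip(frames, alphas):
        acc = acc + alpha
        weighted = weighted + alpha * frame
        if acc >= threshold:
            segments.append(weighted / acc)
            acc, weighted = 0.0, torch.zeros_like(frames[0])
    if not segments:  # guard for very short utterances
        segments.append(weighted)
    return torch.stack(segments)


class HybridSpeechCLIPSketch(nn.Module):
    """Parallel branch: pooled utterance embedding matched to the image embedding.
    Cascaded branch: CIF segments -> text-encoder stand-in -> matched to the image."""

    def __init__(self, frame_dim=768, clip_dim=512):
        super().__init__()
        self.alpha_predictor = nn.Sequential(nn.Linear(frame_dim, 1), nn.Sigmoid())
        self.parallel_proj = nn.Linear(frame_dim, clip_dim)
        self.segment_proj = nn.Linear(frame_dim, clip_dim)  # stand-in for the frozen CLIP text encoder

    def forward(self, speech_frames, image_emb):
        # speech_frames: (batch, time, frame_dim); image_emb: (batch, clip_dim)
        parallel_emb = self.parallel_proj(speech_frames.mean(dim=1))
        cascaded_emb = []
        for utterance in speech_frames:
            alphas = self.alpha_predictor(utterance).squeeze(-1)
            segments = cif_segment(utterance, alphas)  # variable number of keyword-like units
            cascaded_emb.append(self.segment_proj(segments).mean(dim=0))
        cascaded_emb = torch.stack(cascaded_emb)
        # Multi-task objective: both branches are aligned with the image embedding.
        return (clip_style_contrastive_loss(parallel_emb, image_emb)
                + clip_style_contrastive_loss(cascaded_emb, image_emb))


if __name__ == "__main__":
    model = HybridSpeechCLIPSketch()
    loss = model(torch.randn(4, 50, 768), torch.randn(4, 512))
    loss.backward()
    print(loss.item())
```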
Related papers
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experimental results across various datasets and models confirm CODER's effectiveness.
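As a rough illustration of the neighbor-text idea summarized above, the sketch below re-represents each image by its similarity profile over a set of texts; the dimensions and the random stand-ins for CLIP features are illustrative assumptions, not the paper's pipeline.
```python
import torch
import torch.nn.functional as F


def coder_style_representation(image_embs, neighbor_text_embs):
    """Represent each image by its cosine-similarity profile over neighbor texts."""
    image_embs = F.normalize(image_embs, dim=-1)
    neighbor_text_embs = F.normalize(neighbor_text_embs, dim=-1)
    return image_embs @ neighbor_text_embs.t()  # (num_images, num_neighbor_texts)


# Toy usage with random stand-ins for CLIP image/text features.
images = torch.randn(4, 512)    # would come from a CLIP image encoder
texts = torch.randn(100, 512)   # a large, diverse set of candidate texts
profiles = coder_style_representation(images, texts)
print(profiles.shape)           # torch.Size([4, 100])
```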
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new in-context learning (ICL) framework for visual understanding that enables multi-modal output.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
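A minimal sketch of the pipeline summarized above: visual features are quantized to discrete codes, merged with text tokens into one id space, and modeled with a causally masked transformer. The vocabulary sizes, dimensions, nearest-neighbour quantizer, and the small dense transformer standing in for the paper's sparse decoder-only model are all assumptions for illustration.
```python
import torch
import torch.nn as nn

TEXT_VOCAB, VISUAL_CODES, DIM = 1000, 512, 256


class UnifiedInContextSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(VISUAL_CODES, DIM))  # visual quantizer codes
        self.embed = nn.Embedding(TEXT_VOCAB + VISUAL_CODES, DIM)     # shared text+visual token space
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)    # causal mask makes it decoder-only
        self.head = nn.Linear(DIM, TEXT_VOCAB + VISUAL_CODES)

    def quantize(self, visual_feats):
        # Map each continuous visual feature to its nearest codebook id,
        # offset past the text vocabulary so both modalities share one id space.
        dists = ((visual_feats.unsqueeze(-2) - self.codebook) ** 2).sum(-1)
        return TEXT_VOCAB + dists.argmin(dim=-1)

    def forward(self, text_ids, visual_feats):
        tokens = torch.cat([text_ids, self.quantize(visual_feats)], dim=1)
        causal = torch.triu(torch.full((tokens.size(1), tokens.size(1)), float("-inf")), diagonal=1)
        hidden = self.backbone(self.embed(tokens), mask=causal)
        return self.head(hidden)  # next-token logits over the unified vocabulary


logits = UnifiedInContextSketch()(torch.randint(0, TEXT_VOCAB, (2, 8)), torch.randn(2, 4, DIM))
print(logits.shape)  # torch.Size([2, 12, 1512])
```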
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image and text features, integrating the bimodal information.
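A hedged sketch of a Combiner-style fusion: CLIP features of the reference image and the relative caption are concatenated and mapped to a single query embedding, which is then scored against candidate image features. The layer sizes and the concatenation-based fusion are illustrative assumptions, not the paper's exact network.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CombinerSketch(nn.Module):
    """Fuse reference-image and caption features into one retrieval query."""

    def __init__(self, clip_dim=512, hidden_dim=1024):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * clip_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, clip_dim),
        )

    def forward(self, ref_image_feat, caption_feat):
        fused = self.fuse(torch.cat([ref_image_feat, caption_feat], dim=-1))
        return F.normalize(fused, dim=-1)


# Toy retrieval: rank candidate images by cosine similarity with the combined query.
combiner = CombinerSketch()
query = combiner(torch.randn(2, 512), torch.randn(2, 512))  # CLIP feature stand-ins
candidates = F.normalize(torch.randn(10, 512), dim=-1)      # gallery image features
print((query @ candidates.t()).argmax(dim=-1))              # best match per query
```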
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- CLIPPO: Image-and-Language Understanding from Pixels Only [36.433133689137875]
We propose a pure pixel-based model to perform image, text, and multimodal tasks.
Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO).
When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks.
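The sketch below illustrates the single-tower idea in this summary: one pixel encoder handles both photos and captions rendered as images, trained with a symmetric contrastive loss. The tiny CNN and the random tensors standing in for rendered text are assumptions for illustration only.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelEncoderSketch(nn.Module):
    """One shared encoder applied to photos and rendered-text images alike."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, embed_dim),
        )

    def forward(self, pixels):
        return F.normalize(self.net(pixels), dim=-1)


def contrastive_loss(a, b, temperature=0.07):
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


encoder = PixelEncoderSketch()
photos = torch.randn(4, 3, 64, 64)             # natural images
rendered_captions = torch.randn(4, 3, 64, 64)  # captions rendered as images (stand-in)
loss = contrastive_loss(encoder(photos), encoder(rendered_captions))
loss.backward()
print(loss.item())
```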
arXiv Detail & Related papers (2022-12-15T18:52:08Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
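A minimal sketch of a multi-task objective in the spirit of this summary: a CLIP-style contrastive term plus a non-contrastive term computed on matched pairs only. The cosine-based non-contrastive term here is a simplified stand-in for the paper's nCLIP objective, not its actual formulation.
```python
import torch
import torch.nn.functional as F


def clip_loss(img, txt, temperature=0.07):
    """Contrastive term: match each image to its paired text among in-batch negatives."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def non_contrastive_loss(img, txt):
    """Pull matched image-text pairs together without using any negatives."""
    return 1.0 - F.cosine_similarity(img, txt, dim=-1).mean()


img_emb = torch.randn(8, 512, requires_grad=True)  # image-encoder output stand-in
txt_emb = torch.randn(8, 512, requires_grad=True)  # text-encoder output stand-in
total = clip_loss(img_emb, txt_emb) + non_contrastive_loss(img_emb, txt_emb)
total.backward()
print(total.item())
```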
arXiv Detail & Related papers (2022-10-17T17:57:46Z)