SpeechCLIP+: Self-supervised multi-task representation learning for
speech via CLIP and speech-image data
- URL: http://arxiv.org/abs/2402.06959v1
- Date: Sat, 10 Feb 2024 14:26:42 GMT
- Title: SpeechCLIP+: Self-supervised multi-task representation learning for
speech via CLIP and speech-image data
- Authors: Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng,
Hung-yi Lee, Hsin-Min Wang, David Harwath
- Abstract summary: SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription.
This paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture.
Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework.
- Score: 69.20254987896674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently proposed visually grounded speech model SpeechCLIP is an
innovative framework that bridges speech and text through images via CLIP
without relying on text transcription. On this basis, this paper introduces two
extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire
(CIF) module to replace a fixed number of CLS tokens in the cascaded
architecture. Second, we propose a new hybrid architecture that merges the
cascaded and parallel architectures of SpeechCLIP into a multi-task learning
framework. Our experimental evaluation is performed on the Flickr8k and
SpokenCOCO datasets. The results show that in the speech keyword extraction
task, the CIF-based cascaded SpeechCLIP model outperforms the previous cascaded
SpeechCLIP model that uses a fixed number of CLS tokens. Furthermore, through our
hybrid architecture, cascaded task learning boosts the performance of the
parallel branch in image-speech retrieval tasks.
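Below is a minimal, hedged PyTorch sketch of the two extensions described in the abstract: a toy Continuous Integrate-and-Fire routine that emits a variable number of segment embeddings (instead of a fixed number of CLS tokens), and a hybrid model that trains a cascaded branch and a parallel branch jointly with CLIP-style contrastive losses against the image embedding. The module names, dimensions, pooling choices, and the simplified CIF routine are assumptions for illustration; the pretrained speech and CLIP encoders used by SpeechCLIP are replaced by linear stand-ins, so this does not reproduce the authors' implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def clip_style_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def cif_segment(frames, alphas, threshold=1.0):
    """Toy Continuous Integrate-and-Fire: accumulate per-frame weights and emit a
    weighted segment embedding whenever the accumulator crosses the threshold,
    so the number of segments adapts to the utterance instead of being fixed."""
    segments, acc, weighted = [], 0.0, torch.zeros_like(frames[0])
    for frame, alpha in zip(frames, alphas):
        acc = acc + alpha
        weighted = weighted + alpha * frame
        if acc >= threshold:
            segments.append(weighted / acc)
            acc, weighted = 0.0, torch.zeros_like(frames[0])
    if not segments:  # guard for very short utterances
        segments.append(weighted)
    return torch.stack(segments)


class HybridSpeechCLIPSketch(nn.Module):
    """Parallel branch: pooled utterance embedding matched to the image embedding.
    Cascaded branch: CIF segments -> text-encoder stand-in -> matched to the image."""

    def __init__(self, frame_dim=768, clip_dim=512):
        super().__init__()
        self.alpha_predictor = nn.Sequential(nn.Linear(frame_dim, 1), nn.Sigmoid())
        self.parallel_proj = nn.Linear(frame_dim, clip_dim)
        self.segment_proj = nn.Linear(frame_dim, clip_dim)  # stand-in for the frozen CLIP text encoder

    def forward(self, speech_frames, image_emb):
        # speech_frames: (batch, time, frame_dim); image_emb: (batch, clip_dim)
        parallel_emb = self.parallel_proj(speech_frames.mean(dim=1))
        cascaded_emb = []
        for utterance in speech_frames:
            alphas = self.alpha_predictor(utterance).squeeze(-1)
            segments = cif_segment(utterance, alphas)  # variable number of keyword-like units
            cascaded_emb.append(self.segment_proj(segments).mean(dim=0))
        cascaded_emb = torch.stack(cascaded_emb)
        # Multi-task objective: both branches are aligned with the image embedding.
        return (clip_style_contrastive_loss(parallel_emb, image_emb)
                + clip_style_contrastive_loss(cascaded_emb, image_emb))


if __name__ == "__main__":
    model = HybridSpeechCLIPSketch()
    loss = model(torch.randn(4, 50, 768), torch.randn(4, 512))
    loss.backward()
    print(loss.item())
```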
Related papers
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experimental results across various datasets and models confirm CODER's effectiveness.
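As a rough illustration of the neighbor-text idea summarized above, the sketch below re-represents each image by its similarity profile over a set of texts; the dimensions and the random stand-ins for CLIP features are illustrative assumptions, not the paper's pipeline.
```python
import torch
import torch.nn.functional as F


def coder_style_representation(image_embs, neighbor_text_embs):
    """Represent each image by its cosine-similarity profile over neighbor texts."""
    image_embs = F.normalize(image_embs, dim=-1)
    neighbor_text_embs = F.normalize(neighbor_text_embs, dim=-1)
    return image_embs @ neighbor_text_embs.t()  # (num_images, num_neighbor_texts)


# Toy usage with random stand-ins for CLIP image/text features.
images = torch.randn(4, 512)    # would come from a CLIP image encoder
texts = torch.randn(100, 512)   # a large, diverse set of candidate texts
profiles = coder_style_representation(images, texts)
print(profiles.shape)           # torch.Size([4, 100])
```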
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new in-context learning (ICL) framework for visual understanding that enables multi-modal output.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
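A minimal sketch of the pipeline summarized above: visual features are quantized to discrete codes, merged with text tokens into one id space, and modeled with a causally masked transformer. The vocabulary sizes, dimensions, nearest-neighbour quantizer, and the small dense transformer standing in for the paper's sparse decoder-only model are all assumptions for illustration.
```python
import torch
import torch.nn as nn

TEXT_VOCAB, VISUAL_CODES, DIM = 1000, 512, 256


class UnifiedInContextSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(VISUAL_CODES, DIM))  # visual quantizer codes
        self.embed = nn.Embedding(TEXT_VOCAB + VISUAL_CODES, DIM)     # shared text+visual token space
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)    # causal mask makes it decoder-only
        self.head = nn.Linear(DIM, TEXT_VOCAB + VISUAL_CODES)

    def quantize(self, visual_feats):
        # Map each continuous visual feature to its nearest codebook id,
        # offset past the text vocabulary so both modalities share one id space.
        dists = ((visual_feats.unsqueeze(-2) - self.codebook) ** 2).sum(-1)
        return TEXT_VOCAB + dists.argmin(dim=-1)

    def forward(self, text_ids, visual_feats):
        tokens = torch.cat([text_ids, self.quantize(visual_feats)], dim=1)
        causal = torch.triu(torch.full((tokens.size(1), tokens.size(1)), float("-inf")), diagonal=1)
        hidden = self.backbone(self.embed(tokens), mask=causal)
        return self.head(hidden)  # next-token logits over the unified vocabulary


logits = UnifiedInContextSketch()(torch.randint(0, TEXT_VOCAB, (2, 8)), torch.randn(2, 4, DIM))
print(logits.shape)  # torch.Size([2, 12, 1512])
```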
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image and text features, integrating the bimodal information.
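A hedged sketch of a Combiner-style fusion: CLIP features of the reference image and the relative caption are concatenated and mapped to a single query embedding, which is then scored against candidate image features. The layer sizes and the concatenation-based fusion are illustrative assumptions, not the paper's exact network.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CombinerSketch(nn.Module):
    """Fuse reference-image and caption features into one retrieval query."""

    def __init__(self, clip_dim=512, hidden_dim=1024):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * clip_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, clip_dim),
        )

    def forward(self, ref_image_feat, caption_feat):
        fused = self.fuse(torch.cat([ref_image_feat, caption_feat], dim=-1))
        return F.normalize(fused, dim=-1)


# Toy retrieval: rank candidate images by cosine similarity with the combined query.
combiner = CombinerSketch()
query = combiner(torch.randn(2, 512), torch.randn(2, 512))  # CLIP feature stand-ins
candidates = F.normalize(torch.randn(10, 512), dim=-1)      # gallery image features
print((query @ candidates.t()).argmax(dim=-1))              # best match per query
```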
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- CLIPPO: Image-and-Language Understanding from Pixels Only [36.433133689137875]
We propose a pure pixel-based model to perform image, text, and multimodal tasks.
Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO).
When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks.
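The sketch below illustrates the single-tower idea in this summary: one pixel encoder handles both photos and captions rendered as images, trained with a symmetric contrastive loss. The tiny CNN and the random tensors standing in for rendered text are assumptions for illustration only.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelEncoderSketch(nn.Module):
    """One shared encoder applied to photos and rendered-text images alike."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, embed_dim),
        )

    def forward(self, pixels):
        return F.normalize(self.net(pixels), dim=-1)


def contrastive_loss(a, b, temperature=0.07):
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


encoder = PixelEncoderSketch()
photos = torch.randn(4, 3, 64, 64)             # natural images
rendered_captions = torch.randn(4, 3, 64, 64)  # captions rendered as images (stand-in)
loss = contrastive_loss(encoder(photos), encoder(rendered_captions))
loss.backward()
print(loss.item())
```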
arXiv Detail & Related papers (2022-12-15T18:52:08Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
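A minimal sketch of a multi-task objective in the spirit of this summary: a CLIP-style contrastive term plus a non-contrastive term computed on matched pairs only. The cosine-based non-contrastive term here is a simplified stand-in for the paper's nCLIP objective, not its actual formulation.
```python
import torch
import torch.nn.functional as F


def clip_loss(img, txt, temperature=0.07):
    """Contrastive term: match each image to its paired text among in-batch negatives."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def non_contrastive_loss(img, txt):
    """Pull matched image-text pairs together without using any negatives."""
    return 1.0 - F.cosine_similarity(img, txt, dim=-1).mean()


img_emb = torch.randn(8, 512, requires_grad=True)  # image-encoder output stand-in
txt_emb = torch.randn(8, 512, requires_grad=True)  # text-encoder output stand-in
total = clip_loss(img_emb, txt_emb) + non_contrastive_loss(img_emb, txt_emb)
total.backward()
print(total.item())
```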
arXiv Detail & Related papers (2022-10-17T17:57:46Z)