FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
- URL: http://arxiv.org/abs/2507.10095v2
- Date: Tue, 29 Jul 2025 02:40:10 GMT
- Title: FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
- Authors: Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li, Dongsheng Jiang, Jie Yang, Wei Liu
- Abstract summary: We propose FIX-CLIP, which includes three novel modules. A dual-branch training pipeline that aligns short and long texts with masked and raw images, respectively. Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction.
- Score: 13.888406804533535
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on downstream tasks with long-text inputs ($>77$ tokens). To remedy this issue, we propose FIX-CLIP, which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images, respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that FIX-CLIP's text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input. The code is available at https://github.com/bcwang-sjtu/Fix-CLIP.
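The dual-branch objective described in the abstract can be illustrated with two symmetric InfoNCE-style contrastive losses: one aligning long captions with raw images, the other aligning short captions with masked images. The sketch below is a pure-Python illustration of that objective, not the authors' implementation; the toy embeddings and the additive `dual_branch_loss` combination are assumptions.

```python
import math

def cosine(u, v):
    # cosine similarity between two (non-zero) vectors given as lists
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings:
    each image should match its own caption against all others, and
    vice versa."""
    n = len(image_embs)
    sims = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    loss = 0.0
    for k in range(n):
        # image -> text direction
        row = sims[k]
        loss += -row[k] + math.log(sum(math.exp(s) for s in row))
        # text -> image direction
        col = [sims[j][k] for j in range(n)]
        loss += -col[k] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)

def dual_branch_loss(raw_imgs, masked_imgs, long_txts, short_txts):
    # long captions align with raw images; short captions with masked images
    return info_nce(raw_imgs, long_txts) + info_nce(masked_imgs, short_txts)
```

Correctly paired embeddings yield a lower loss than mismatched ones, which is the signal both branches train on.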
Related papers
- FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs [0.351124620232225]
FineLIP enhances cross-modal text-image mapping by incorporating fine-grained alignment with longer text inputs. FineLIP first extends the positional embeddings to handle longer text, followed by dynamic aggregation of local image and text tokens. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation.
arXiv Detail & Related papers (2025-04-02T17:19:59Z) - TULIP: Token-length Upgraded CLIP [57.818513403100326]
We address the challenge of representing long captions in vision-language models, such as CLIP. By design, these models are limited by fixed, absolute positional encodings, restricting inputs to a maximum of 77 tokens. We propose a generalizable method, named TULIP, able to upgrade the token length to any length for CLIP-like models.
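A common baseline for extending a fixed 77-token positional table is linear interpolation of the learned positional embeddings (TULIP itself argues for a different, relative-style scheme, so this is context rather than their method). A minimal sketch, with the table represented as plain lists of floats:

```python
import math

def interpolate_positions(pos_emb, new_len):
    """Linearly resample a positional-embedding table (a list of
    equal-length vectors) to new_len entries, preserving the two
    endpoint embeddings. Assumes new_len > 1."""
    old_len, dim = len(pos_emb), len(pos_emb[0])
    out = []
    for i in range(new_len):
        # map the new index into the old index space
        x = i * (old_len - 1) / (new_len - 1)
        lo = math.floor(x)
        hi = min(lo + 1, old_len - 1)
        w = x - lo  # interpolation weight toward the upper neighbor
        out.append([(1 - w) * pos_emb[lo][d] + w * pos_emb[hi][d]
                    for d in range(dim)])
    return out
```

Resampling a 77-entry table to, say, 248 entries this way keeps the embeddings smooth, at the cost of compressing the positional resolution the model was trained with.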
arXiv Detail & Related papers (2024-10-13T22:34:15Z) - Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval [13.315951821189538]
Scene text retrieval aims to find all images containing the query text from an image gallery.
Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes.
We propose to explore the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval.
arXiv Detail & Related papers (2024-08-01T10:25:14Z) - MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment [53.235290505274676]
Large-scale vision-language models such as CLIP can improve semantic segmentation performance.
We introduce MTA-CLIP, a novel framework employing mask-level vision-language alignment.
MTA-CLIP achieves state-of-the-art results, surpassing prior works by an average of 2.8% and 1.3% on benchmark datasets.
arXiv Detail & Related papers (2024-07-31T14:56:42Z) - Out of Length Text Recognition with Sub-String Matching [54.63761108308825]
In this paper, we term this task Out of Length (OOL) text recognition. We propose a novel method called OOL Text Recognition with sub-String Matching (SMTR). SMTR comprises two cross-attention-based modules: one encodes a sub-string containing multiple characters into next and previous queries, and the other employs the queries to attend to the image features.
arXiv Detail & Related papers (2024-07-17T05:02:17Z) - From Text to Pixel: Advancing Long-Context Understanding in MLLMs [70.78454154014989]
We introduce SEEKER, a multimodal large language model designed to tackle this issue.
SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images.
Our experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach.
arXiv Detail & Related papers (2024-05-23T06:17:23Z) - An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation [21.154973705998945]
Existing methods leverage the text encoder of the CLIP model to represent input prompts.
Large Language Models (LLMs) offer multilingual input, accommodate longer context, and achieve superior text representation.
We propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs.
arXiv Detail & Related papers (2024-05-21T16:35:02Z) - Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text. We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
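The Equal-Info Windows idea can be illustrated with a greedy segmenter that grows each window until its compressed size would exceed a fixed bit budget. In this sketch zlib stands in for the paper's learned neural compressor, and the bit budget is an arbitrary choice for the example:

```python
import zlib

def equal_info_windows(text, bit_budget=300):
    """Greedily split text into windows whose zlib-compressed size stays
    within bit_budget bits each (zlib is a stand-in for the learned
    compressor used in the paper)."""
    windows, start = [], 0
    while start < len(text):
        end = start + 1
        # grow the window while the next extension still fits the budget
        while end < len(text):
            bits = len(zlib.compress(text[start:end + 1].encode())) * 8
            if bits > bit_budget:
                break
            end += 1
        windows.append(text[start:end])
        start = end
    return windows
```

Each window then maps to a roughly equal amount of information, which is what makes the compressed stream learnable as a token sequence.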
arXiv Detail & Related papers (2024-04-04T17:48:28Z) - Long-CLIP: Unlocking the Long-Text Capability of CLIP [47.13547303843929]
Long-CLIP is a plug-and-play alternative to Contrastive Language-Image Pre-training.
Long-CLIP supports long-text input and retains or even surpasses CLIP's zero-shot generalizability.
It offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
arXiv Detail & Related papers (2024-03-22T17:58:16Z) - TF-CLIP: Learning Text-free CLIP for Video-based Person Re-Identification [60.5843635938469]
We propose a novel one-stage text-free CLIP-based learning framework named TF-CLIP for video-based person ReID.
More specifically, we extract the identity-specific sequence feature as the CLIP-Memory to replace the text feature.
Our proposed method shows much better results than other state-of-the-art methods on MARS, LS-VID and iLIDS-VID.
arXiv Detail & Related papers (2023-12-15T09:10:05Z) - LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval [71.01982683581572]
The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream encoders.
We propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts.
We introduce a novel pre-training framework that learns importance-aware lexicon representations.
Our framework achieves 5.5-221.3X faster retrieval speed and 13.2-48.8X less index storage memory.
arXiv Detail & Related papers (2023-02-06T16:24:41Z) - SCATTER: Selective Context Attentional Scene Text Recognizer [16.311256552979835]
Scene Text Recognition (STR) is the task of recognizing text against complex image backgrounds.
Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes.
We introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER).
arXiv Detail & Related papers (2020-03-25T09:20:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.