Related papers: TextME: Bridging Unseen Modalities Through Text Descriptions

TextME: Bridging Unseen Modalities Through Text Descriptions

URL: http://arxiv.org/abs/2602.03098v1
Date: Tue, 03 Feb 2026 04:43:13 GMT
Title: TextME: Bridging Unseen Modalities Through Text Descriptions
Authors: Soyeon Hong, Jinchan Kim, Jaegook You, Seungtaek Choi, Suha Kwak, Hyunsouk Cho,
Abstract summary: We introduce TextME, the first text-only modality expansion framework.<n>Our approach exploits the geometric structure of pretrained contrastive encoders to enable zero-shot cross-modal transfer.<n>Results establish text-only training as a practical alternative to paired supervision for modality expansion.
Score: 37.33304279891978
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Expanding multimodal representations to novel modalities is constrained by reliance on large-scale paired datasets (e.g., text-image, text-audio, text-3D, text-molecule), which are costly and often infeasible in domains requiring expert annotation such as medical imaging and molecular analysis. We introduce TextME, the first text-only modality expansion framework, to the best of our knowledge, projecting diverse modalities into LLM embedding space as a unified anchor. Our approach exploits the geometric structure of pretrained contrastive encoders to enable zero-shot cross-modal transfer using only text descriptions, without paired supervision. We empirically validate that such consistent modality gaps exist across image, video, audio, 3D, X-ray, and molecular domains, demonstrating that text-only training can preserve substantial performance of pretrained encoders. We further show that our framework enables emergent cross-modal retrieval between modality pairs not explicitly aligned during training (e.g., audio-to-image, 3D-to-image). These results establish text-only training as a practical alternative to paired supervision for modality expansion.

Related papers

Multimodal Medical Image Binding via Shared Text Embeddings [15.504918331492716]
Multimodal Medical Image Binding with Text (Mtextsuperscript3Bind) is a novel pre-training framework that enables seamless alignment of medical imaging modalities.<n>Mtextsuperscript3Bind first fine-tunes CLIP-like image-text models to align their modality-specific text embedding space.<n>We show that Mtextsuperscript3Bind achieves state-of-the-art performance in zero-shot, few-shot classification and cross-modal retrieval tasks.
arXiv Detail & Related papers (2025-06-22T15:39:25Z)
Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. We introduce a novel method named Decoder Pre-training with only text for STR (DPTR) DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types, textiti.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used. We propose an effective yet straightforward scheme named PTUnifier to unify the two types. We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller amount of image-text pairs. Our model has unique properties, most notably, deploying a new version with updated training samples can be done in a matter of seconds.
arXiv Detail & Related papers (2022-10-04T16:56:22Z)
Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts. We introduce LO, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
Multimodal Semi-Supervised Learning for Text Recognition [10.33262222726707]
We present semi-supervised learning for multimodal text recognizers (SemiMTR) that leverages unlabeled data at each modality training phase. Our algorithm starts by pretraining the vision model through a single-stage training that unifies self-supervised learning with supervised training. In a novel setup, consistency is enforced on each modality separately.
arXiv Detail & Related papers (2022-05-08T13:55:30Z)
Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection. We propose to learn contextualized, joint representations through vision-language pre-training. The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model. During the training phase, the modality transition network is optimised by the proposed modality loss. Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.