A New Method to Capturing Compositional Knowledge in Linguistic Space
- URL: http://arxiv.org/abs/2412.15632v1
- Date: Fri, 20 Dec 2024 07:48:09 GMT
- Title: A New Method to Capturing Compositional Knowledge in Linguistic Space
- Authors: Jiahe Wan
- Abstract summary: ZS-CU is a novel task that enhances compositional understanding without requiring hard negative training data.
We propose YUKINO, which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model.
YUKINO outperforms the existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark.
- Score: 0.0
- Abstract: Compositional understanding allows visual language models to interpret complex relationships between objects, attributes, and relations in images and text. However, existing methods often rely on hard negative examples and fine-tuning, which can overestimate improvements and are limited by the difficulty of obtaining hard negatives. In this work, we introduce Zero-Shot Compositional Understanding (ZS-CU), a novel task that enhances compositional understanding without requiring hard negative training data. We propose YUKINO (Yielded Compositional Understanding Knowledge via Textual Inversion with NO), which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model. We further introduce a "no" logical regularization to address the issue of token interaction in inversion, and we use knowledge distillation to reduce the time complexity of textual inversion. Experimental results show that YUKINO outperforms existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark, and it also achieves significant improvements in image retrieval tasks.
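To make the abstract's core mechanism concrete, here is a minimal, hedged sketch of textual inversion into a frozen CLIP-style model. It is not the authors' implementation: `encode_image` and `encode_text_from_embeds` are assumed hooks (stock CLIP text encoders take token ids, so a thin wrapper that consumes embeddings directly would be needed), and the paper's "no" regularization and distillation steps are omitted.

```python
import torch
import torch.nn.functional as F

def textual_inversion(encode_image, encode_text_from_embeds, prompt_embeds,
                      image, steps=500, lr=1e-2):
    """Optimize a single pseudo-token S* so that "<prompt> S*" matches the
    image in the shared CLIP space. All model weights stay frozen."""
    with torch.no_grad():
        img = F.normalize(encode_image(image), dim=-1)         # (1, d)
    # Initialize S* from the last prompt token's embedding.
    pseudo = prompt_embeds[:, -1:, :].clone().requires_grad_(True)
    opt = torch.optim.AdamW([pseudo], lr=lr)
    for _ in range(steps):
        seq = torch.cat([prompt_embeds, pseudo], dim=1)        # append S*
        # Assumed hook: a text encoder that consumes embeddings directly.
        txt = F.normalize(encode_text_from_embeds(seq), dim=-1)
        loss = (1.0 - (img * txt).sum(-1)).mean()              # cosine distance
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pseudo.detach()
```

The knowledge distillation mentioned in the abstract would plausibly amortize this per-image optimization into a single forward pass of a trained predictor; that part is not sketched here.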
Related papers
- Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation [7.742746565876165]
The interpretability of LVLMs remains an under-explored area.
In models such as LLaVA1.5, image tokens that are semantically related to the text are more likely to exhibit information flow convergence.
We propose a new image token reduction method, Simignore, which aims to improve the complex reasoning ability of LVLMs.
arXiv Detail & Related papers (2024-12-13T03:13:44Z)
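A minimal sketch of the similarity-based image-token reduction that the Simignore summary describes; the function name, the pooled text feature, and the top-k keep policy are illustrative assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def reduce_image_tokens(image_tokens, text_embed, keep_ratio=0.5):
    """Keep the image tokens most similar to the text embedding.

    image_tokens: (N, d) per-token visual features
    text_embed:   (d,)   pooled text feature
    """
    sims = F.cosine_similarity(image_tokens, text_embed.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * image_tokens.size(0)))
    keep = sims.topk(k).indices.sort().values   # preserve positional order
    return image_tokens[keep]
```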
- Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation [70.95783968368124]
We introduce a novel multi-modal autoregressive model, dubbed InstaManip.
We propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages.
Our method surpasses previous few-shot image manipulation models by a notable margin.
arXiv Detail & Related papers (2024-12-02T01:19:21Z)
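One plausible reading of InstaManip's two-stage group self-attention is a block-structured attention mask: exemplar tokens attend only among themselves (a "learning" stage), while query tokens attend to the exemplars and to themselves (an "applying" stage). The sketch below is hypothetical, not the paper's actual mechanism.

```python
import torch

def two_stage_group_mask(n_exemplar: int, n_query: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed); hypothetical."""
    n = n_exemplar + n_query
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_exemplar, :n_exemplar] = True  # stage 1: exemplars <-> exemplars
    mask[n_exemplar:, :] = True            # stage 2: queries -> everything
    return mask
```

Such a mask can be passed to torch.nn.functional.scaled_dot_product_attention via its attn_mask argument, where True marks positions that are allowed to attend.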
- HNCSE: Advancing Sentence Embeddings via Hybrid Contrastive Learning with Hard Negatives [17.654412302780557]
HNCSE is a novel contrastive learning framework that extends the leading SimCSE approach.
The hallmark of HNCSE is its innovative use of hard negative samples to enhance the learning of both positive and negative samples.
arXiv Detail & Related papers (2024-11-19T01:26:20Z)
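The hard-negative objective that HNCSE builds on can be illustrated with a SimCSE-style InfoNCE loss that appends one mined hard negative per anchor. This is the generic formulation, assumed here for illustration rather than taken from the HNCSE paper.

```python
import torch
import torch.nn.functional as F

def hard_negative_infonce(anchor, positive, hard_neg, tau=0.05):
    """anchor, positive, hard_neg: (B, d) sentence embeddings."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    h = F.normalize(hard_neg, dim=-1)
    pos_sim = a @ p.t() / tau                          # in-batch similarities
    hard_sim = (a * h).sum(-1, keepdim=True) / tau     # one hard negative each
    logits = torch.cat([pos_sim, hard_sim], dim=1)     # (B, B + 1)
    labels = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```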
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover a more exact correspondence between text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network that sits on top of existing visual and language encoders, using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
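ComAlign's "lightweight network that sits on top of existing encoders" suggests a small trainable head over frozen image and text features. The architecture below is an assumption for illustration; the paper's component-matching objective is not reproduced here.

```python
import torch.nn as nn

class LightweightAlignHead(nn.Module):
    """Small trainable projections over frozen backbone features (illustrative)."""
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.img_proj = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.txt_proj = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, img_feat, txt_feat):
        # Backbone features stay fixed; only these two MLPs are trained.
        return self.img_proj(img_feat), self.txt_proj(txt_feat)
```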
- NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality [52.08735848128973]
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations.
We propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding.
arXiv Detail & Related papers (2024-08-18T15:27:06Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples [0.6249768559720122]
We introduce a novel pretraining method incorporating synthetic hard negative text examples.
The hard negatives permute terms corresponding to visual concepts, leading to a more fine-grained visual and textual concept alignment.
We also introduce InpaintCOCO, a new dataset for assessing the fine-grained alignment of colors, objects, and sizes in vision-language models.
arXiv Detail & Related papers (2024-03-05T11:38:48Z)
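The "permute terms corresponding to visual concepts" idea reduces, in its simplest form, to swapping two concept words in a caption (e.g. "a red cube on a blue ball" becomes "a blue cube on a red ball"). The helper below is a hypothetical sketch; a real pipeline would identify concept words with POS tagging or a concept lexicon.

```python
import random

def permute_concepts(caption: str, concept_words: set) -> str:
    """Swap two concept words in a caption to build a hard negative text."""
    tokens = caption.split()
    slots = [i for i, t in enumerate(tokens) if t in concept_words]
    if len(slots) < 2:
        return caption  # nothing to permute
    i, j = random.sample(slots, 2)
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)
```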
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
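The "multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models" can be illustrated as matching the new model's image-to-text similarity distribution against the frozen old model's. The temperature and KL formulation below are standard choices assumed for this sketch, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def instance_kd_loss(new_img, new_txt, old_img, old_txt, tau=0.07):
    """*_img, *_txt: (B, d) batch features from the new and frozen old models."""
    def logits(img, txt):
        return F.normalize(img, dim=-1) @ F.normalize(txt, dim=-1).t() / tau
    with torch.no_grad():
        teacher = F.softmax(logits(old_img, old_txt), dim=-1)   # old model
    student = F.log_softmax(logits(new_img, new_txt), dim=-1)   # new model
    return F.kl_div(student, teacher, reduction="batchmean")
```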
- Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos [96.85840365678649]
We tackle the problem of referring expression comprehension in videos with an elegant one-stage framework.
We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency.
Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset.
arXiv Detail & Related papers (2021-03-23T06:42:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.