Clustering-based Image-Text Graph Matching for Domain Generalization
- URL: http://arxiv.org/abs/2310.02692v2
- Date: Mon, 15 Apr 2024 17:01:23 GMT
- Title: Clustering-based Image-Text Graph Matching for Domain Generalization
- Authors: Nokyung Park, Daewon Chae, Jeongyong Shim, Sangpil Kim, Eun-Sol Kim, Jinkyu Kim
- Abstract summary: Domain-invariant visual representations are important for training a model that can generalize well to unseen target task domains.
Recent works demonstrate that text descriptions contain high-level class-discriminative information.
We advocate for the use of local alignment between image regions and corresponding textual descriptions.
- Score: 13.277406473107721
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Learning domain-invariant visual representations is important to train a model that can generalize well to unseen target task domains. Recent works demonstrate that text descriptions contain high-level class-discriminative information, and such auxiliary semantic cues can be used as an effective pivot embedding for the domain generalization problem. However, they use the pivot embedding in a global manner (i.e., aligning an image embedding with a sentence-level text embedding), not fully utilizing the semantic cues of the given text description. In this work, we advocate for the use of local alignment between image regions and corresponding textual descriptions. To this end, we first represent image and text inputs with graphs. We subsequently cluster nodes in those graphs and match the graph-based image node features to textual graphs. This matching process is conducted globally and locally, tightly aligning visual and textual semantic sub-structures. We experiment with large-scale public datasets, such as CUB-DG and DomainBed, and our model matches or surpasses state-of-the-art performance on these datasets. Our code will be publicly available upon publication.
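As a rough, hypothetical sketch of the local matching step described in the abstract: image-region and text-token features are clustered into prototype nodes, and the two node sets are aligned one-to-one. The k-means clustering, Hungarian assignment, cluster count, and feature dimensions below are illustrative stand-ins, not the paper's actual graph construction or matching objective.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def cluster_nodes(features: np.ndarray, k: int) -> np.ndarray:
    """Cluster node features (N, D) into k prototype nodes (k, D)."""
    return KMeans(n_clusters=k, n_init=10).fit(features).cluster_centers_

def local_matching_cost(image_feats: np.ndarray, text_feats: np.ndarray, k: int = 4) -> float:
    """Cluster image and text nodes, then align them one-to-one.

    Returns the mean cosine distance over the optimal (Hungarian)
    assignment, usable as a local alignment loss.
    """
    img_nodes = cluster_nodes(image_feats, k)
    txt_nodes = cluster_nodes(text_feats, k)
    img_nodes /= np.linalg.norm(img_nodes, axis=1, keepdims=True)
    txt_nodes /= np.linalg.norm(txt_nodes, axis=1, keepdims=True)
    cost = 1.0 - img_nodes @ txt_nodes.T        # (k, k) cosine distances
    rows, cols = linear_sum_assignment(cost)    # optimal node pairing
    return float(cost[rows, cols].mean())

# Toy usage: 49 image-region features and 12 token features, both 256-d.
rng = np.random.default_rng(0)
print(local_matching_cost(rng.normal(size=(49, 256)), rng.normal(size=(12, 256))))
```

In the full method, such a local cost would be combined with a global image-sentence alignment term, per the abstract.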
Related papers
- Bridging Local Details and Global Context in Text-Attributed Graphs [62.522550655068336]
GraphBridge is a framework that bridges local and global perspectives by leveraging contextual textual information.
Our method achieves state-of-the-art performance, while our graph-aware token reduction module significantly enhances efficiency and addresses scalability issues.
arXiv Detail & Related papers (2024-06-18T13:35:25Z)
- Text-Guided Image Clustering [15.217924518131268]
We propose Text-Guided Image Clustering: generating text for images with image captioning and visual question-answering (VQA) models, then clustering the resulting text representations.
Across eight diverse image clustering datasets, our results show that the obtained text representations often outperform image features.
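A minimal sketch of the pipeline this entry describes, assuming captions have already been produced per image by a captioning/VQA model (not shown); the encoder model name and cluster count are illustrative choices, not details from the paper:

```python
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

# Captions stand in for the text a captioning/VQA model would generate.
captions = [
    "a red sports car parked on a street",
    "a blue sedan driving on a highway",
    "a tabby cat sleeping on a sofa",
    "a kitten playing with a ball of yarn",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(captions, normalize_embeddings=True)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1 1]: cars vs. cats, recovered from text alone
```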
arXiv Detail & Related papers (2024-02-05T13:34:21Z)
- Adapt Anything: Tailor Any Image Classifiers across Domains And Categories Using Text-to-Image Diffusion Models [82.95591765009105]
We study whether a modern text-to-image diffusion model can tailor any task-adaptive image classifier across domains and categories.
We utilize only one off-the-shelf text-to-image model to synthesize images with category labels derived from the corresponding text prompts.
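A hedged sketch of the synthesis step described above, using the Hugging Face diffusers library; the model id, prompt template, category names, and file layout are assumptions, not details from the paper:

```python
import torch
from diffusers import StableDiffusionPipeline

# One off-the-shelf text-to-image model; the id is an illustrative choice.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

categories = ["golden retriever", "tabby cat", "red fox"]
for label, name in enumerate(categories):
    prompt = f"a photo of a {name}"  # the category label comes from the prompt
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"synth_{label:03d}_{name.replace(' ', '_')}.png")
```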
arXiv Detail & Related papers (2023-10-25T11:58:14Z)
- ConGraT: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings [20.25180279903009]
We propose Contrastive Graph-Text pretraining (ConGraT) for jointly learning separate representations of texts and nodes in a text-attributed graph (TAG).
Our method trains a language model (LM) and a graph neural network (GNN) to align their representations in a common latent space using a batch-wise contrastive learning objective inspired by CLIP.
Experiments demonstrate that ConGraT outperforms baselines on various downstream tasks, including node and text category classification, link prediction, and language modeling.
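A minimal sketch of a CLIP-style batch-wise contrastive objective of the kind this entry describes; the function name, temperature, and embedding dimensions are illustrative, and the LM/GNN encoders themselves are omitted:

```python
import torch
import torch.nn.functional as F

def congrat_style_loss(text_emb: torch.Tensor, node_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric batch-wise contrastive loss in the style of CLIP.

    text_emb / node_emb: (B, D) paired outputs of the LM and GNN encoders.
    Diagonal entries of the similarity matrix are the true (text, node) pairs.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    node_emb = F.normalize(node_emb, dim=-1)
    logits = text_emb @ node_emb.t() / temperature          # (B, B)
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = congrat_style_loss(torch.randn(8, 128), torch.randn(8, 128))
```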
arXiv Detail & Related papers (2023-05-23T17:53:30Z)
- HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture comprehensive multimodal features, we construct feature graphs for the image and text modalities, respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z)
- A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval [4.159666152160874]
Scene graph representation is a suitable approach to the image-text matching challenge.
We introduce the Local and Global Scene Graph Matching (LGSGM) model, which enhances the state-of-the-art method.
Combining the local and global levels improves the baseline method, increasing recall by more than 10% on the Flickr30k dataset.
arXiv Detail & Related papers (2021-06-04T10:33:14Z)
- Disentangled Motif-aware Graph Learning for Phrase Grounding [48.64279161780489]
We propose a novel graph learning framework for phrase grounding in images.
We devise the disentangled graph network to integrate the motif-aware contextual information into representations.
Our model achieves state-of-the-art performance on Flickr30K Entities and ReferIt Game benchmarks.
arXiv Detail & Related papers (2021-04-13T08:20:07Z)
- Similarity Reasoning and Filtration for Image-Text Matching [85.68854427456249]
We propose a novel Similarity Graph Reasoning and Attention Filtration (SGRAF) network for image-text matching.
A Similarity Graph Reasoning (SGR) module, built on a graph convolutional neural network, is introduced to infer relation-aware similarities from both local and global alignments.
We demonstrate the superiority of the proposed method by achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets.
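A loose, hypothetical sketch of one graph-reasoning step over similarity nodes in the spirit of the SGR module; the layer structure, affinity-based adjacency, and dimensions are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityGraphReasoning(nn.Module):
    """One graph-convolution step over similarity nodes (illustrative only).

    Nodes are local region-word similarity vectors (plus, optionally, a
    global alignment node); edges are learned from pairwise node affinities.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.edge_q = nn.Linear(dim, dim)   # query projection for edge weights
        self.edge_k = nn.Linear(dim, dim)   # key projection for edge weights
        self.update = nn.Linear(dim, dim)   # node update after message passing

    def forward(self, sim_nodes: torch.Tensor) -> torch.Tensor:
        # sim_nodes: (B, N, dim). Build a soft adjacency from node affinities.
        adj = torch.softmax(
            self.edge_q(sim_nodes) @ self.edge_k(sim_nodes).transpose(1, 2),
            dim=-1)                                  # (B, N, N)
        return F.relu(self.update(adj @ sim_nodes))  # relation-aware nodes

out = SimilarityGraphReasoning()(torch.randn(2, 10, 256))  # (2, 10, 256)
```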
arXiv Detail & Related papers (2021-01-05T06:29:35Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
- Graph Edit Distance Reward: Learning to Edit Scene Graph [69.39048809061714]
We propose a new method to edit a scene graph according to user instructions, a task that has not been explored before.
Specifically, to learn to edit scene graphs according to the semantics given by text, we propose a Graph Edit Distance Reward.
In the context of text-editing image retrieval, we validate the effectiveness of our method on the CSS and CRIR datasets.
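A minimal sketch of turning graph edit distance into a reward signal, as the entry describes; the scene-graph construction and editing policy are omitted, and networkx's exact-GED solver shown here is only practical for small graphs:

```python
import networkx as nx

def ged_reward(pred: nx.Graph, target: nx.Graph) -> float:
    """Smaller edit distance -> larger reward. Exact GED; small graphs only."""
    ged = nx.graph_edit_distance(
        pred, target,
        node_match=lambda a, b: a.get("label") == b.get("label"))
    return -float(ged)

# Toy scene graphs: "red cube" predicted, "blue cube" desired.
pred, target = nx.Graph(), nx.Graph()
pred.add_node(0, label="cube"); pred.add_node(1, label="red"); pred.add_edge(0, 1)
target.add_node(0, label="cube"); target.add_node(1, label="blue"); target.add_edge(0, 1)
print(ged_reward(pred, target))  # -1.0: one node relabel away from the target
```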
arXiv Detail & Related papers (2020-08-15T04:52:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.