SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo
and Text
- URL: http://arxiv.org/abs/2204.11964v3
- Date: Sun, 26 Mar 2023 13:01:15 GMT
- Title: SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo
and Text
- Authors: Pinaki Nath Chowdhury and Ayan Kumar Bhunia and Aneeshan Sain and
Subhadeep Koley and Tao Xiang and Yi-Zhe Song
- Abstract summary: In this paper, we extend scene understanding to include that of human sketch.
We focus on learning a flexible joint embedding that fully supports the "optionality" that this complementarity brings.
- Score: 109.69076457732632
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we extend scene understanding to include that of human sketch.
The result is a complete trilogy of scene representation from three diverse and
complementary modalities -- sketch, photo, and text. Instead of learning a
rigid three-way embedding and being done with it, we focus on learning a flexible
joint embedding that fully supports the "optionality" that this
complementarity brings. Our embedding supports optionality on two axes: (i)
optionality across modalities -- use any combination of modalities as query for
downstream tasks like retrieval, (ii) optionality across tasks --
simultaneously utilising the embedding for either discriminative (e.g.,
retrieval) or generative tasks (e.g., captioning). This provides flexibility to
end-users by exploiting the best of each modality, therefore serving the very
purpose behind our proposal of a trilogy in the first place. First, a
combination of an information bottleneck and conditional invertible neural
networks disentangles the modality-specific component from the
modality-agnostic one in sketch, photo, and text. Second, the modality-agnostic
instances from sketch, photo, and text are synergised using a modified
cross-attention. Once learned, we show our embedding can accommodate a
multitude of scene-related tasks,
including those enabled for the first time by the inclusion of sketch, all
without any task-specific modifications. Project Page:
\url{http://www.pinakinathc.me/scenetrilogy}
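As a concrete, heavily simplified illustration of the two-step recipe above, the following PyTorch-style sketch splits each modality's feature into a modality-specific and a modality-agnostic code, then fuses the agnostic codes of whichever modalities are supplied with cross-attention. It swaps the paper's conditional invertible neural networks for a simple variational information-bottleneck head; every module name, dimension, and hyper-parameter is an illustrative assumption rather than the authors' actual architecture.

```python
# Minimal sketch of a flexible joint embedding over sketch/photo/text.
# NOTE: the variational bottleneck below is a stand-in assumption, not the
# paper's conditional invertible networks.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Encodes one modality and splits it into specific / agnostic codes."""
    def __init__(self, in_dim: int, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU())
        self.to_specific = nn.Linear(dim, dim)
        # Information-bottleneck-style head: mean and log-variance, so the
        # agnostic code can be regularised towards a standard-normal prior.
        self.to_agnostic_mu = nn.Linear(dim, dim)
        self.to_agnostic_logvar = nn.Linear(dim, dim)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.to_agnostic_mu(h), self.to_agnostic_logvar(h)
        agnostic = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
        return self.to_specific(h), agnostic, kl

class TrilogyEmbedding(nn.Module):
    """Fuses the agnostic codes of any subset of {sketch, photo, text}."""
    def __init__(self, dims: dict, dim: int = 256):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: ModalityEncoder(d, dim) for m, d in dims.items()})
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, inputs: dict):
        # `inputs` may contain any non-empty subset of the modalities.
        agnostic, kl_total = [], 0.0
        for name, x in inputs.items():
            _, a, kl = self.encoders[name](x)
            agnostic.append(a)
            kl_total = kl_total + kl
        tokens = torch.stack(agnostic, dim=1)          # (B, n_modalities, dim)
        q = self.query.expand(tokens.size(0), -1, -1)  # one learned query per sample
        fused, _ = self.fuse(q, tokens, tokens)        # cross-attention over modalities
        return fused.squeeze(1), kl_total

# Example: query with sketch + text only (photo omitted).
model = TrilogyEmbedding({"sketch": 512, "photo": 512, "text": 300})
emb, kl = model({"sketch": torch.randn(4, 512), "text": torch.randn(4, 300)})
print(emb.shape)  # torch.Size([4, 256])
```

Because the fusion attends over however many modality tokens are present, the same embedding can serve sketch-only, text-only, or any combined query, which is the "optionality across modalities" the abstract refers to.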
Related papers
- Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding [39.55810156545949]
We propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space.
Our DMA method produces highly competitive open-vocabulary segmentation performance on various indoor and outdoor tasks.
arXiv Detail & Related papers (2024-07-13T05:39:17Z) - Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval [10.202562518113677]
We propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval.
Our key innovation lies in the usage of text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers.
arXiv Detail & Related papers (2024-07-01T05:32:06Z) - You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z) - Universal Multimodal Representation for Language Understanding [110.98786673598015]
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z) - Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image
Retrieval [55.21569389894215]
We propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it.
Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
arXiv Detail & Related papers (2022-10-19T11:50:14Z) - Learning Semantic-Aligned Feature Representation for Text-based Person
Search [8.56017285139081]
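A compact sketch of the "fuse, then distil" idea in the XModalViT entry above: attention over paired photo and sketch features yields a fused teacher representation, and an independent single-modality student is trained to match it with a contrastive, InfoNCE-style distillation loss. Module choices, dimensions, and the loss form are illustrative assumptions rather than the authors' exact formulation.

```python
# Sketch of cross-modal fusion + contrastive distillation into a unimodal student.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Teacher that unifies paired photo and sketch features via attention."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, photo, sketch):
        # photo, sketch: (B, dim) paired features from any frozen backbones.
        tokens = torch.stack([photo, sketch], dim=1)   # (B, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)   # each modality attends to the other
        return self.proj(fused.mean(dim=1))            # (B, dim)

def contrastive_distillation(student, teacher, temperature: float = 0.07):
    """Pull each student embedding towards its own teacher embedding and away
    from the other samples' teacher embeddings in the batch."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.T / temperature                     # (B, B)
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)

# Usage: a sketch-only student head learns to mimic the fused teacher,
# so photo and sketch can later be embedded by independent encoders.
teacher = CrossModalFusion()
student_head = nn.Linear(256, 256)
photo_feat, sketch_feat = torch.randn(8, 256), torch.randn(8, 256)
loss = contrastive_distillation(student_head(sketch_feat),
                                teacher(photo_feat, sketch_feat).detach())
```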
- Learning Semantic-Aligned Feature Representation for Text-based Person Search [8.56017285139081]
We propose a semantic-aligned embedding method for text-based person search.
The feature alignment across modalities is achieved by automatically learning the semantic-aligned visual features and textual features.
Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performances.
arXiv Detail & Related papers (2021-12-13T14:54:38Z) - Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task which aims to generate captions with respect to the relational information between objects in a visual scene.
This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z) - A Novel Attention-based Aggregation Function to Combine Vision and
Language [55.7633883960205]
We propose a novel fully-attentive reduction method for vision and language.
Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention.
We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices.
arXiv Detail & Related papers (2020-04-27T18:09:46Z)