Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set
Alignment
- URL: http://arxiv.org/abs/2305.12218v1
- Date: Sat, 20 May 2023 15:48:47 GMT
- Title: Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set
Alignment
- Authors: Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan,
Chang Liu, Jie Chen
- Abstract summary: We propose the Disentangled Conceptualization and Set-to-set Alignment (DiCoSA) to simulate the conceptualizing and reasoning process of human beings.
For disentangled conceptualization, we divide the coarse feature into multiple latent factors related to semantic concepts.
For set-to-set alignment, where a set of visual concepts corresponds to a set of textual concepts, we propose an adaptive pooling method to aggregate semantic concepts.
- Score: 17.423361070781876
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-video retrieval is a challenging cross-modal task that aims to align visual entities with natural language descriptions. Current methods either fail to leverage local details or are computationally expensive. Worse, they fail to exploit the heterogeneous concepts in the data. In this paper, we propose Disentangled Conceptualization and Set-to-set Alignment (DiCoSA) to simulate the conceptualizing and reasoning process of human beings. For disentangled conceptualization, we divide the coarse feature into multiple latent factors related to semantic concepts. For set-to-set alignment, where a set of visual concepts corresponds to a set of textual concepts, we propose an adaptive pooling method to aggregate semantic concepts and address partial matching. In particular, since we encode each concept independently in only a few dimensions, DiCoSA is superior in both efficiency and granularity, ensuring fine-grained interactions at a computational complexity similar to that of coarse-grained alignment. Extensive experiments on five datasets, including MSR-VTT, LSMDC, MSVD, ActivityNet, and DiDeMo, demonstrate that our method outperforms existing state-of-the-art methods.
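To make the two mechanisms in the abstract concrete, here is a minimal PyTorch sketch: a coarse feature is split into K low-dimensional concept factors, concept-wise similarities are computed, and a gating network adaptively pools them for partial matching. The class name, layer choices, gate design, and shapes are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of DiCoSA-style disentangled conceptualization and
# set-to-set alignment. All design details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiCoSASketch(nn.Module):
    def __init__(self, dim=512, num_concepts=8):
        super().__init__()
        assert dim % num_concepts == 0
        self.k = num_concepts
        self.d = dim // num_concepts  # each concept lives in only a few dimensions
        # Projections that "disentangle" a coarse feature into K latent factors.
        self.text_proj = nn.Linear(dim, dim)
        self.video_proj = nn.Linear(dim, dim)
        # Gate that adaptively weights concept-level similarities (adaptive pooling).
        self.gate = nn.Linear(2 * self.d, 1)

    def conceptualize(self, x, proj):
        # (B, dim) -> (B, K, d): split the coarse feature into K concept factors.
        z = proj(x).view(x.size(0), self.k, self.d)
        return F.normalize(z, dim=-1)

    def forward(self, text_feat, video_feat):
        t = self.conceptualize(text_feat, self.text_proj)    # (Bt, K, d)
        v = self.conceptualize(video_feat, self.video_proj)  # (Bv, K, d)
        # Concept-wise cosine similarity for every text-video pair: (Bt, Bv, K)
        sims = torch.einsum('ikd,jkd->ijk', t, v)
        # Adaptive pooling: weight each concept by a gate over the paired factors,
        # so unmatched concepts (partial matching) can be down-weighted.
        pair = torch.cat([t.unsqueeze(1).expand(-1, v.size(0), -1, -1),
                          v.unsqueeze(0).expand(t.size(0), -1, -1, -1)], dim=-1)
        w = torch.softmax(self.gate(pair).squeeze(-1), dim=-1)  # (Bt, Bv, K)
        return (w * sims).sum(-1)  # (Bt, Bv) retrieval similarity matrix

# Usage: score a batch of 4 captions against 6 videos.
model = DiCoSASketch()
scores = model(torch.randn(4, 512), torch.randn(6, 512))
print(scores.shape)  # torch.Size([4, 6])
```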
Related papers
- Conceptual Codebook Learning for Vision-Language Models [27.68834532978939]
We propose Conceptual Codebook Learning (CoCoLe) to address the challenge of improving the generalization capability of vision-language models (VLMs).
We learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values.
We observe that this conceptual codebook learning method achieves enhanced alignment between the visual and linguistic modalities.
arXiv Detail & Related papers (2024-07-02T15:16:06Z)
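The CoCoLe entry above describes a key-value codebook pairing visual concepts with conceptual prompts. Below is a hedged sketch of how such a lookup could work; the codebook size, top-k retrieval rule, and softmax weighting are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a conceptual codebook in the spirit of CoCoLe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptCodebook(nn.Module):
    def __init__(self, num_entries=64, feat_dim=512, prompt_dim=512, topk=4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_entries, feat_dim))      # visual concepts
        self.values = nn.Parameter(torch.randn(num_entries, prompt_dim))  # conceptual prompts
        self.topk = topk

    def forward(self, img_feat):
        # Match image features against concept keys, retrieve prompt values.
        sim = F.normalize(img_feat, dim=-1) @ F.normalize(self.keys, dim=-1).T
        w, idx = sim.topk(self.topk, dim=-1)           # (B, topk)
        w = torch.softmax(w, dim=-1).unsqueeze(-1)     # (B, topk, 1)
        return (w * self.values[idx]).sum(1)           # (B, prompt_dim)

codebook = ConceptCodebook()
prompts = codebook(torch.randn(2, 512))
print(prompts.shape)  # torch.Size([2, 512])
```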
- Towards Compositionality in Concept Learning [20.960438848942445]
We show that existing unsupervised concept extraction methods find concepts which are not compositional.
We propose Compositional Concept Extraction (CCE) for finding concepts which obey these properties.
CCE finds more compositional concept representations than baselines and yields better accuracy on four downstream classification tasks.
arXiv Detail & Related papers (2024-06-26T17:59:30Z)
- PaCE: Parsimonious Concept Engineering for Large Language Models [57.740055563035256]
Parsimonious Concept Engineering (PaCE) is a novel activation engineering framework for alignment.
We construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept.
We show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:10Z)
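The PaCE summary above describes decomposing activations over a concept dictionary. The toy sketch below illustrates the idea with a least-squares decomposition standing in for PaCE's sparse coding; the dictionary contents and the set of undesired atoms are hypothetical.

```python
# Rough sketch of PaCE-style activation editing: represent a hidden
# activation as a combination of concept-dictionary atoms, then rebuild
# it without the undesired ones.
import torch

def edit_activation(h, dictionary, bad_atoms):
    # h: (d,) activation; dictionary: (d, n) matrix whose columns are concept atoms.
    coef = torch.linalg.lstsq(dictionary, h.unsqueeze(-1)).solution.squeeze(-1)  # (n,)
    coef[bad_atoms] = 0.0            # remove undesired semantic concepts
    return dictionary @ coef         # re-synthesized, "aligned" activation

d, n = 64, 16
D = torch.randn(d, n)                # hypothetical concept dictionary
h = torch.randn(d)
h_clean = edit_activation(h, D, bad_atoms=[3, 7])
print(h_clean.shape)  # torch.Size([64])
```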
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
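The GLSCL summary speaks of "latent shared semantics across modalities." One plausible reading, sketched below with clearly assumed details, is a set of learnable queries that summarizes both text and video into a shared concept space before matching; this is an interpretation of the one-line summary, not the paper's actual architecture.

```python
# Speculative sketch of latent shared semantics for text-video retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSemanticQueries(nn.Module):
    def __init__(self, dim=256, num_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def summarize(self, tokens):
        # tokens: (B, L, dim) text tokens or video frame features.
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)  # shared queries attend to the sequence
        return F.normalize(out, dim=-1)        # (B, Q, dim)

    def forward(self, text_tokens, video_frames):
        t, v = self.summarize(text_tokens), self.summarize(video_frames)
        # Average query-wise cosine similarity for every text-video pair.
        return torch.einsum('iqd,jqd->ij', t, v) / t.size(1)

model = SharedSemanticQueries()
sim = model(torch.randn(4, 12, 256), torch.randn(5, 16, 256))
print(sim.shape)  # torch.Size([4, 5])
```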
- Non-confusing Generation of Customized Concepts in Diffusion Models [135.4385383284657]
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs).
Existing customized generation methods focus only on fine-tuning the second (text-to-image diffusion) stage while overlooking the first (text-encoding) stage.
We propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning.
arXiv Detail & Related papers (2024-05-11T05:01:53Z)
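CLIF's core ingredient, contrastive image-language fine-tuning, can be illustrated with a standard symmetric InfoNCE objective, sketched below. The encoders, batch construction, and temperature are placeholders; only the loss shape is intended to match the general technique.

```python
# Minimal sketch of contrastive image-language fine-tuning: a symmetric
# InfoNCE loss pulling each caption toward its image and away from the
# rest of the batch. Embeddings here stand in for real encoder outputs.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    targets = torch.arange(img.size(0))      # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = contrastive_loss(torch.randn(8, 512, requires_grad=True),
                        torch.randn(8, 512, requires_grad=True))
loss.backward()  # would update the (placeholder) encoder parameters
print(float(loss))
```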
- ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation [17.019848796027485]
Self-supervised visual pre-training models have shown great promise in representing pixel-level semantic relationships.
In this work, we investigate pixel-level semantic aggregation in self-supervised pre-trained models and explicitly encode concepts into learnable prototypes.
We propose the Adaptive Concept Generator (ACG) which adaptively maps these prototypes to informative concepts for each image.
arXiv Detail & Related papers (2022-10-12T06:16:34Z)
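A rough picture of an Adaptive Concept Generator, per the entry above: global learnable prototypes attend over an image's pixel embeddings to become image-specific concepts, and pixels are grouped by their nearest concept. The attention design and the omission of ACSeg's training losses are simplifications.

```python
# Illustrative sketch of an Adaptive Concept Generator; details assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConceptGenerator(nn.Module):
    def __init__(self, dim=64, num_prototypes=5):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, pixels):  # pixels: (B, H*W, dim)
        p = self.prototypes.unsqueeze(0).expand(pixels.size(0), -1, -1)
        concepts, _ = self.attn(p, pixels, pixels)  # image-specific concepts
        # Assign each pixel to its most similar concept -> segmentation mask.
        sim = F.normalize(pixels, dim=-1) @ F.normalize(concepts, dim=-1).transpose(1, 2)
        return sim.argmax(-1)  # (B, H*W) concept ids

acg = AdaptiveConceptGenerator()
masks = acg(torch.randn(2, 32 * 32, 64))
print(masks.shape)  # torch.Size([2, 1024])
```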
- DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [118.36746273425354]
This paper presents a paralleled visual-concept pre-training method for open-world detection, drawing on knowledge enrichment from a designed concept dictionary.
By enriching the concepts with their descriptions, we explicitly build relationships among various concepts to facilitate open-domain learning.
The proposed framework demonstrates strong zero-shot detection performance; on the LVIS dataset, for example, DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories.
arXiv Detail & Related papers (2022-09-20T02:01:01Z)
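The dictionary-enrichment step described above can be illustrated very simply: expand each category name with a definition before text encoding, so rare concepts carry descriptive context. The example entries below are placeholders, not DetCLIP's actual dictionary.

```python
# Toy sketch of concept-dictionary enrichment in the spirit of DetCLIP.
from typing import Dict, List

# Hypothetical dictionary entries; DetCLIP builds a much larger one.
CONCEPT_DICTIONARY: Dict[str, str] = {
    "axolotl": "a neotenic salamander with external gills",
    "dhole": "a wild dog native to Central and Southeast Asia",
}

def enrich(categories: List[str]) -> List[str]:
    # Append the definition when available; fall back to the bare name.
    return [f"{c}, {CONCEPT_DICTIONARY[c]}" if c in CONCEPT_DICTIONARY else c
            for c in categories]

# The enriched strings would then be fed to a text encoder (e.g. CLIP's)
# to produce concept embeddings for region-text alignment.
print(enrich(["axolotl", "zebra"]))
# ['axolotl, a neotenic salamander with external gills', 'zebra']
```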
- Modeling Temporal Concept Receptive Field Dynamically for Untrimmed Video Analysis [105.06166692486674]
We study the temporal concept receptive field of concept-based event representations.
We introduce temporal dynamic convolution (TDC) to give concept-based event analytics stronger flexibility.
Different coefficients generate appropriate and accurate temporal concept receptive field sizes according to the input videos.
arXiv Detail & Related papers (2021-11-23T04:59:48Z)
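A hedged sketch of input-dependent temporal receptive fields, per the TDC entry above: per-video coefficients mix a bank of convolution branches with different kernel sizes. The kernel sizes and the coefficient head are assumptions rather than TDC's published configuration.

```python
# Sketch of a temporal dynamic convolution with learned mixing coefficients.
import torch
import torch.nn as nn

class TemporalDynamicConv(nn.Module):
    def __init__(self, channels=32, sizes=(3, 7, 15)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2) for k in sizes])
        self.coef = nn.Linear(channels, len(sizes))  # per-video mixing coefficients

    def forward(self, x):  # x: (B, C, T) concept scores over time
        w = torch.softmax(self.coef(x.mean(dim=-1)), dim=-1)       # (B, S)
        outs = torch.stack([b(x) for b in self.branches], dim=1)   # (B, S, C, T)
        return (w[:, :, None, None] * outs).sum(dim=1)             # (B, C, T)

tdc = TemporalDynamicConv()
y = tdc(torch.randn(2, 32, 100))
print(y.shape)  # torch.Size([2, 32, 100])
```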
- Entity Concept-enhanced Few-shot Relation Extraction [35.10974511223129]
Few-shot relation extraction (FSRE) is of great importance for the long-tail distribution problem.
Most existing FSRE algorithms fail to accurately classify relations based merely on the information of the sentences and the recognized entity pairs.
We propose a novel entity concept-enhanced FEw-shot Relation Extraction scheme (ConceptFERE), which introduces the inherent concepts of entities to provide clues for relation prediction.
arXiv Detail & Related papers (2021-06-04T10:36:49Z)
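One way to introduce entity concepts as clues for relation prediction, sketched below under stated assumptions: select, for each entity, the candidate concept embedding most similar to the sentence embedding and feed it to the relation classifier alongside the sentence. The fusion scheme here is a simplification, not ConceptFERE's exact design.

```python
# Speculative sketch of concept-enhanced few-shot relation extraction.
import torch
import torch.nn as nn
import torch.nn.functional as F

def select_concept(sent_emb, concept_embs):
    # sent_emb: (d,); concept_embs: (n_concepts, d) for one entity.
    sims = F.cosine_similarity(sent_emb.unsqueeze(0), concept_embs, dim=-1)
    return concept_embs[sims.argmax()]  # most sentence-relevant concept

d, num_relations = 128, 10
classifier = nn.Linear(3 * d, num_relations)  # sentence + two entity concepts

sent = torch.randn(d)
head_concepts, tail_concepts = torch.randn(4, d), torch.randn(3, d)
features = torch.cat([sent,
                      select_concept(sent, head_concepts),
                      select_concept(sent, tail_concepts)])
logits = classifier(features)
print(logits.shape)  # torch.Size([10])
```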
- Concept Learners for Few-Shot Learning [76.08585517480807]
We propose COMET, a meta-learning method that improves generalization ability by learning to learn along human-interpretable concept dimensions.
We evaluate our model on few-shot tasks from diverse domains, including fine-grained image classification, document categorization and cell type annotation.
arXiv Detail & Related papers (2020-07-14T22:04:17Z)
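Finally, a simplified rendering of concept-dimension meta-learning in the spirit of COMET: one small embedding per human-interpretable concept (defined here by a feature mask), per-concept prototype distances, and a learned concept weighting. The mask construction, prototype handling, and aggregation are assumptions.

```python
# Rough sketch of concept-wise few-shot classification; details assumed.
import torch
import torch.nn as nn

class ConceptLearner(nn.Module):
    def __init__(self, dim, concept_masks):
        super().__init__()
        self.masks = concept_masks  # (K, dim) binary masks defining concepts
        k = concept_masks.size(0)
        self.embed = nn.ModuleList([nn.Linear(dim, 16) for _ in range(k)])
        self.weight = nn.Parameter(torch.ones(k))  # learned concept importance

    def forward(self, query, prototypes):
        # query: (B, dim); prototypes: (C, dim) class prototypes in input space.
        scores = 0.0
        for j, f in enumerate(self.embed):
            q = f(query * self.masks[j])       # concept-restricted view
            p = f(prototypes * self.masks[j])
            dist = torch.cdist(q, p)           # (B, C) per-concept distance
            scores = scores - self.weight[j] * dist  # nearer prototype -> higher score
        return scores                          # (B, C) class logits

masks = (torch.rand(4, 32) > 0.5).float()
model = ConceptLearner(32, masks)
logits = model(torch.randn(8, 32), torch.randn(5, 32))
print(logits.shape)  # torch.Size([8, 5])
```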