Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph
- URL: http://arxiv.org/abs/2202.12307v1
- Date: Thu, 24 Feb 2022 19:00:03 GMT
- Title: Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph
- Authors: Dacheng Yin, Xuanchi Ren, Chong Luo, Yuwang Wang, Zhiwei Xiong, Wenjun Zeng
- Abstract summary: An unsupervised framework, named Retriever, is proposed to learn such representations.
Being modal-agnostic, the proposed Retriever is evaluated in both speech and image domains.
- Score: 89.52990975155579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the unsupervised learning of content-style decomposed
representation. We first give a definition of style and then model the
content-style representation as a token-level bipartite graph. An unsupervised
framework, named Retriever, is proposed to learn such representations. First, a
cross-attention module is employed to retrieve permutation invariant (P.I.)
information, defined as style, from the input data. Second, a vector
quantization (VQ) module is used, together with human-designed constraints, to
produce interpretable content tokens. Last, an innovative link attention module
serves as the decoder to reconstruct data from the decomposed content and
style, with the help of the linking keys. Being modal-agnostic, the proposed
Retriever is evaluated in both speech and image domains. The state-of-the-art
zero-shot voice conversion performance confirms the disentangling ability of
our framework. Top performance is also achieved in the part discovery task for
images, verifying the interpretability of our representation. In addition, the
vivid part-based style transfer quality demonstrates the potential of Retriever
to support various fascinating generative tasks. Project page at
https://ydcustc.github.io/retriever-demo/.
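The abstract sketches a three-stage pipeline: cross-attention pooling of permutation-invariant style, vector quantization of content tokens, and a link-attention decoder that recombines the two. As a rough orientation only, here is a minimal PyTorch sketch of that data flow; every module name and dimension is an assumption, the treatment of the linking keys is simplified, and the paper's losses and human-designed constraints are omitted.
```python
# Minimal sketch of the pipeline described in the abstract. Dimensions, module
# names, and the treatment of the linking keys are assumptions; the paper's
# training losses and human-designed constraints are omitted.
import torch
import torch.nn as nn


class RetrieverSketch(nn.Module):
    def __init__(self, dim=256, n_style=4, codebook_size=512):
        super().__init__()
        # Learned queries that pool permutation-invariant (style) information.
        self.style_queries = nn.Parameter(torch.randn(n_style, dim))
        self.style_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Codebook producing discrete content tokens.
        self.codebook = nn.Embedding(codebook_size, dim)
        # Learned linking keys that pair decoder queries with style values.
        self.link_keys = nn.Parameter(torch.randn(n_style, dim))
        self.link_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def quantize(self, z):
        # Nearest codebook entry per token, with a straight-through gradient.
        d = torch.cdist(z, self.codebook.weight.expand(z.size(0), -1, -1))
        idx = d.argmin(-1)                        # (B, T) discrete content codes
        zq = self.codebook(idx)
        return z + (zq - z).detach(), idx

    def forward(self, x):                         # x: (B, T, dim) token sequence
        b = x.size(0)
        # 1) Style: cross-attention from learned queries; reading a set of
        #    keys/values is invariant to their order (no positions are added).
        style, _ = self.style_attn(self.style_queries.expand(b, -1, -1), x, x)
        # 2) Content: per-token vector quantization.
        content, idx = self.quantize(x)
        # 3) Decode: content tokens query the style values through linking keys.
        recon, _ = self.link_attn(content, self.link_keys.expand(b, -1, -1), style)
        return self.out(recon + content), idx


x = torch.randn(2, 50, 256)                       # dummy input tokens
recon, codes = RetrieverSketch()(x)
print(recon.shape, codes.shape)                   # (2, 50, 256) and (2, 50)
```
Even in this toy form the structural points survive: the style reader is order-invariant because no positional information enters the attention over the input set, and the straight-through estimator lets gradients pass the discrete content bottleneck.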
Related papers
- Self-supervised Cross-view Representation Reconstruction for Change Captioning [113.08380679787247]
Change captioning aims to describe the difference between a pair of similar images.
Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change.
We propose a self-supervised cross-view representation reconstruction network.
arXiv Detail & Related papers (2023-09-28T09:28:50Z)
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points, both when training with cross-entropy alone and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z)
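The prototype-based memory in the entry above amounts to attending over a small set of vectors that summarize past keys and values. Here is a toy sketch under that reading; the prototypes are random stand-ins shared across the batch (how the paper actually builds them, and how many it uses, is not stated here).
```python
# Toy sketch of attention over memory prototypes that summarize past keys and
# values. The prototypes here are random placeholders; a real system would
# derive them from activations of previously processed training samples.
import torch
import torch.nn.functional as F


def memory_attention(q, k, v, proto_k, proto_v):
    """Scaled dot-product attention over the current keys/values
    concatenated with the memory prototypes."""
    k_all = torch.cat([k, proto_k], dim=1)        # (B, T + P, D)
    v_all = torch.cat([v, proto_v], dim=1)
    scores = q @ k_all.transpose(1, 2) / k.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ v_all


B, T, P, D = 2, 10, 8, 64
q, k, v = (torch.randn(B, T, D) for _ in range(3))
# Hypothetical prototypes, e.g. cluster centers of keys/values seen so far.
proto_k = torch.randn(P, D).expand(B, -1, -1)
proto_v = torch.randn(P, D).expand(B, -1, -1)
print(memory_attention(q, k, v, proto_k, proto_v).shape)  # (2, 10, 64)
```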
- Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal [24.537234147678113]
We propose three auxiliary tasks to enhance the CSLR backbones.
A keypoint-guided spatial attention module is developed to make the visual module focus on informative regions.
A sentence embedding consistency constraint is imposed between the visual and sequential modules.
Our model achieves state-of-the-art or competitive performance on five benchmarks.
arXiv Detail & Related papers (2022-12-26T06:38:34Z)
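A sentence-embedding consistency constraint of the kind this entry mentions can be sketched as a distance between pooled embeddings of the two streams. Mean pooling and a cosine distance are assumptions made for illustration; the paper's exact pooling, projection, and loss may differ.
```python
# Illustrative consistency constraint between two module outputs. Mean pooling
# and cosine distance are assumed choices, not the paper's formulation.
import torch
import torch.nn.functional as F


def consistency_loss(visual_feats, sequential_feats):
    # Pool each stream's frame-level features into one sentence embedding.
    v = F.normalize(visual_feats.mean(dim=1), dim=-1)      # (B, D)
    s = F.normalize(sequential_feats.mean(dim=1), dim=-1)  # (B, D)
    # Penalize disagreement between the two sentence embeddings.
    return (1 - (v * s).sum(dim=-1)).mean()


loss = consistency_loss(torch.randn(4, 120, 512), torch.randn(4, 120, 512))
print(loss.item())
```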
- Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition [80.74495836502919]
In this work, we focus on joint human fashion segmentation and attribute recognition.
We introduce the object query for segmentation and the attribute query for attribute prediction.
For the attribute stream, we design a novel Multi-Layer Rendering module to explore more fine-grained features.
arXiv Detail & Related papers (2022-04-10T11:11:10Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Partitioning Image Representation in Contrastive Learning [0.0]
We introduce a new representation, partitioned representation, which can learn both common and unique features of the anchor and positive samples in contrastive learning.
We show that our approach can separate the two types of information within a VAE framework and outperforms conventional BYOL on downstream linear-separability and few-shot learning tasks.
arXiv Detail & Related papers (2022-03-20T04:55:39Z)
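The partitioned representation described above splits an embedding into a common part, aligned across anchor and positive views, and a unique part left unconstrained. A minimal sketch under that assumption follows; the split ratio and the BYOL-style alignment loss are illustrative choices, and the paper's VAE machinery is omitted.
```python
# Toy partitioned representation: the first `common_dim` channels are treated
# as shared content and pulled together across views; the remainder is left
# free as view-specific information. All choices here are assumptions.
import torch
import torch.nn.functional as F


def partitioned_loss(z_anchor, z_positive, common_dim=64):
    c_a = z_anchor[:, :common_dim]      # common partition of the anchor
    c_p = z_positive[:, :common_dim]    # common partition of the positive
    # Align only the common partition (cosine similarity, BYOL-style).
    return (2 - 2 * F.cosine_similarity(c_a, c_p, dim=-1)).mean()


z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(partitioned_loss(z1, z2).item())
```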
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
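The dual-encoder contrastive alignment in the last entry is, in essence, a symmetric InfoNCE objective over in-batch image-text pairs. Below is a generic sketch of that loss, not the paper's exact implementation; the temperature is an assumed hyperparameter.
```python
# Generic symmetric contrastive (InfoNCE) loss for dual-encoder alignment of
# image and text embeddings. The temperature value is an assumption.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0))           # matching pairs on the diagonal
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


loss = contrastive_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())
```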
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.