Contextual Representation Learning beyond Masked Language Modeling
- URL: http://arxiv.org/abs/2204.04163v1
- Date: Fri, 8 Apr 2022 16:18:06 GMT
- Title: Contextual Representation Learning beyond Masked Language Modeling
- Authors: Zhiyi Fu, Wangchunshu Zhou, Jingjing Xu, Hao Zhou, Lei Li
- Abstract summary: We analyze how masked language models (MLMs) such as BERT learn contextual representations.
We find that MLMs adopt sampled embeddings as anchors to inject contextual semantics, which limits their efficiency and effectiveness; to address these issues, we propose TACO, a representation learning approach that directly models global semantics.
TACO extracts and aligns contextual semantics hidden in contextualized representations to encourage models to attend to global semantics.
- Score: 45.46220173487394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How do masked language models (MLMs) such as BERT learn contextual
representations? In this work, we analyze the learning dynamics of MLMs. We
find that MLMs adopt sampled embeddings as anchors to estimate and inject
contextual semantics to representations, which limits the efficiency and
effectiveness of MLMs. To address these issues, we propose TACO, a simple yet
effective representation learning approach to directly model global semantics.
TACO extracts and aligns contextual semantics hidden in contextualized
representations to encourage models to attend to global semantics when generating
contextualized representations. Experiments on the GLUE benchmark show that
TACO achieves up to 5x speedup and up to 1.2 points average improvement over
existing MLMs. The code is available at https://github.com/FUZHIYI/TACO.
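The abstract does not spell out TACO's training objective (the repository linked above contains the authors' implementation). As a rough, non-authoritative illustration of "aligning contextual semantics to attend to global semantics", the sketch below adds an auxiliary loss that pulls each token's contextual component toward a sentence-level average of those components. The subtraction-based proxy for contextual semantics, the tensor names, and the cosine objective are assumptions for illustration, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def global_alignment_loss(hidden_states, static_embeddings, attention_mask):
    """Auxiliary loss pulling each token's contextual component toward a
    sentence-level (global) average of those components.

    hidden_states:     (batch, seq_len, dim) contextualized representations
    static_embeddings: (batch, seq_len, dim) input (non-contextual) token embeddings
    attention_mask:    (batch, seq_len) 1 for real tokens, 0 for padding
    """
    # Assumed proxy for "contextual semantics": whatever the encoder added
    # on top of each token's static embedding.
    contextual = hidden_states - static_embeddings                    # (B, L, D)
    mask = attention_mask.unsqueeze(-1).float()                       # (B, L, 1)

    # Global contextual vector per sentence: masked mean over tokens.
    global_ctx = (contextual * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

    # Encourage every real (non-padding) token to agree with the global vector.
    sim = F.cosine_similarity(contextual, global_ctx.unsqueeze(1), dim=-1)  # (B, L)
    token_loss = (1.0 - sim) * attention_mask.float()
    return token_loss.sum() / attention_mask.float().sum().clamp(min=1.0)
```

In practice such a term would be added to the MLM loss with a weighting coefficient; the weighting and where the static embeddings come from are left open here.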
Related papers
- ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models [11.997499811414837]
Masked Language Models (MLMs) are trained by randomly masking portions of the input sequences with [MASK] tokens and learning to reconstruct the original content based on the remaining context (a minimal masking sketch appears after this list).
arXiv Detail & Related papers (2025-01-23T05:46:50Z) - Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce the Attention-Guided Alignment (AGA) framework, featuring two innovative components: Attention-Guided Mask (AGM) Modeling and a Text Enrichment Module (TEM).
AGA achieves new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTP, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z) - Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings [69.35226485836641]
Excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation.
We propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE)
DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer.
arXiv Detail & Related papers (2024-11-29T11:24:23Z) - Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks? [6.7065734065794835]
We introduce a novel learning paradigm termed MLLM4WTAL.
It harnesses the potential of MLLM to offer temporal action key semantics and complete semantic priors.
It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR)
arXiv Detail & Related papers (2024-11-13T09:37:24Z) - ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual referring into Multimodal Large Language Models (MLLMs)
We observe the relationship between text prompt tokens and visual tokens in MLLMs, where attention layers model the connection between them.
We optimize a learnable visual token based on an energy function, enhancing the strength of referential regions in the attention map.
arXiv Detail & Related papers (2024-07-31T11:40:29Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.
Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.
We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z) - IERL: Interpretable Ensemble Representation Learning -- Combining
CrowdSourced Knowledge and Distributed Semantic Representations [11.008412414253662]
Large Language Models (LLMs) encode meanings of words in the form of distributed semantics.
Recent studies have shown that LLMs tend to generate unintended, inconsistent, or wrong texts as outputs.
We propose a novel ensemble learning method, Interpretable Ensemble Representation Learning (IERL), that systematically combines LLM and crowdsourced knowledge representations.
arXiv Detail & Related papers (2023-06-24T05:02:34Z) - MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
A Multimodal Large Language Model (MLLM) relies on a powerful LLM to perform multimodal tasks.
This paper presents the first comprehensive MLLM Evaluation benchmark MME.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z) - Masked and Permuted Implicit Context Learning for Scene Text Recognition [8.742571493814326]
Scene Text Recognition (STR) is difficult because of variations in text styles, shapes, and backgrounds.
We propose a masked and permuted implicit context learning network for STR, within a single decoder.
arXiv Detail & Related papers (2023-05-25T15:31:02Z) - Representation Deficiency in Masked Language Modeling [107.39136254013042]
We propose MAE-LM, which pretrains the Masked Autoencoder architecture where [MASK] tokens are excluded from the encoder.
We show that MAE-LM consistently outperforms pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
arXiv Detail & Related papers (2023-02-04T01:54:17Z) - Fast, Effective and Self-Supervised: Transforming Masked Language Models
into Universal Lexical and Sentence Encoders [66.76141128555099]
We show that it is possible to turn MLMs into universal lexical and sentence encoders even without any additional data and without any supervision.
We propose an extremely simple, fast and effective contrastive learning technique, termed Mirror-BERT.
Mirror-BERT relies on fully identical or slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples.
We report huge gains over off-the-shelf MLMs with Mirror-BERT in both lexical-level and sentence-level tasks, across different domains and different languages (a contrastive-loss sketch appears after this list).
arXiv Detail & Related papers (2021-04-16T10:49:56Z) - Universal Sentence Representation Learning with Conditional Masked
Language Model [7.334766841801749]
We present Conditional Masked Language Modeling (CMLM) to effectively learn sentence representations.
Our English CMLM model achieves state-of-the-art performance on SentEval.
As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains.
arXiv Detail & Related papers (2020-12-28T18:06:37Z)
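Several of the entries above (ExLM, MAE-LM, CMLM, and the main paper) build on the standard masked-language-modeling corruption step. For background, here is a minimal sketch of the usual BERT-style recipe (mask about 15% of tokens; of those, 80% become [MASK], 10% a random token, 10% stay unchanged). The function name and token-id arguments are placeholders, and the percentages follow the original BERT setup rather than any specific paper listed above.

```python
import torch

def mlm_mask(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """BERT-style MLM corruption: returns (corrupted_ids, labels).

    Labels are -100 (ignored by the loss) everywhere except selected positions.
    """
    labels = input_ids.clone()
    # Select ~15% of positions to predict.
    selected = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    labels[~selected] = -100

    corrupted = input_ids.clone()
    # 80% of selected positions -> [MASK]
    replace_mask = selected & (torch.rand_like(input_ids, dtype=torch.float) < 0.8)
    corrupted[replace_mask] = mask_token_id
    # 10% of selected positions -> a random token (half of the remaining 20%)
    random_mask = selected & ~replace_mask & (torch.rand_like(input_ids, dtype=torch.float) < 0.5)
    random_tokens = torch.randint(vocab_size, input_ids.shape, device=input_ids.device)
    corrupted[random_mask] = random_tokens[random_mask]
    # The remaining 10% of selected positions stay unchanged.
    return corrupted, labels
```

Special tokens (e.g., [CLS], [SEP], padding) would normally be excluded from selection; that filtering is omitted here for brevity.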
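The Mirror-BERT entry describes contrastive fine-tuning with fully identical or slightly modified string pairs as positives. The sketch below is one plausible, simplified reading: the same batch of strings is encoded twice (dropout makes the two views differ slightly), and an InfoNCE loss pulls the two views of each string together against the rest of the batch. The temperature value and the two-forward-pass setup are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def mirror_contrastive_loss(view_a, view_b, temperature=0.05):
    """InfoNCE over two encodings of the same strings.

    view_a, view_b: (batch, dim) embeddings of the identical input batch,
    produced by two forward passes (dropout yields slightly different views).
    """
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature                 # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # The positive pair is the same index in both views; all other rows
    # in the batch act as in-batch negatives.
    return F.cross_entropy(logits, targets)
```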
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.