Bringing Emerging Architectures to Sequence Labeling in NLP
- URL: http://arxiv.org/abs/2509.25918v1
- Date: Tue, 30 Sep 2025 08:12:02 GMT
- Title: Bringing Emerging Architectures to Sequence Labeling in NLP
- Authors: Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares
- Abstract summary: We study how alternative architectures to Transformer encoders adapt across tagging tasks that vary in structural complexity, label space, and token dependencies. We find that the strong performance previously observed in simpler settings does not always generalize well across languages or datasets, nor does it extend to more complex structured tasks.
- Score: 9.660348625678001
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretrained Transformer encoders are the dominant approach to sequence labeling. While some alternative architectures, such as xLSTMs, structured state-space models, diffusion models, and adversarial learning, have shown promise in language modeling, few have been applied to sequence labeling, and mostly on flat or simplified tasks. We study how these architectures adapt across tagging tasks that vary in structural complexity, label space, and token dependencies, with evaluation spanning multiple languages. We find that the strong performance previously observed in simpler settings does not always generalize well across languages or datasets, nor does it extend to more complex structured tasks.
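As a minimal sketch of the flat sequence labeling setup being compared (the module name, dimensions, and label count below are illustrative assumptions, not code from the paper), any backbone that emits per-token hidden states, whether a Transformer encoder, an xLSTM, or a state-space model, can feed a shared linear tagging head:

```python
import torch
import torch.nn as nn

class TaggingHead(nn.Module):
    """Linear token-classification head over any backbone's per-token states."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) produced by any encoder
        # (Transformer, xLSTM, state-space model, ...); one logit vector per token.
        return self.classifier(self.dropout(hidden_states))

# Toy usage with random states standing in for real encoder output.
head = TaggingHead(hidden_size=768, num_labels=5)
hidden = torch.randn(2, 8, 768)   # batch of 2 sequences, 8 tokens each
logits = head(hidden)             # shape (2, 8, 5)
```

Under this view the candidate architectures differ only in the backbone that produces hidden_states; for flat tagging the head and per-token objective can stay the same, while more complex structured tasks are typically handled through richer label encodings or decoding.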
Related papers
- Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models [3.382910438968506]
Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. We investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing. We demonstrate, at scales up to 7 billion parameters, that hierarchical transformers match the downstream task performance of subword-tokenizer-based models.
arXiv Detail & Related papers (2025-01-17T17:51:53Z) - EpiCoder: Encompassing Diversity and Complexity in Code Generation [49.170195362149386]
Existing methods for code generation use code snippets as seed data. We introduce a novel feature tree-based synthesis framework, which revolves around hierarchical code features. Our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios.
arXiv Detail & Related papers (2025-01-08T18:58:15Z) - Adaptive Large Language Models By Layerwise Attention Shortcuts [46.76681147411957]
The proposed LLM-like setup allows the final layer to attend to all intermediate layers as it deems fit through the attention mechanism.
We showcase results on four different datasets spanning acoustic tokens, natural language, and symbolic music, and we achieve superior performance for the GPT-like architecture.
arXiv Detail & Related papers (2024-09-17T03:46:01Z) - Learning Syntax Without Planting Trees: Understanding Hierarchical Generalization in Transformers [74.96551626420188]
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures. We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
arXiv Detail & Related papers (2024-04-25T07:10:29Z) - Neural Architecture Search for Sentence Classification with BERT [4.862490782515929]
We perform an AutoML search to find architectures that outperform the standard single-layer classification head at only a small compute cost.
We validate our classification architecture on a variety of NLP benchmarks from the GLUE dataset.
arXiv Detail & Related papers (2024-03-27T13:25:43Z) - Inducing Systematicity in Transformers by Attending to Structurally Quantized Embeddings [60.698130703909804]
Transformers generalize to novel compositions of structures and entities after being trained on a complex dataset.
We propose SQ-Transformer that explicitly encourages systematicity in the embeddings and attention layers.
We show that SQ-Transformer achieves stronger compositional generalization than the vanilla Transformer on multiple low-complexity semantic parsing and machine translation datasets.
arXiv Detail & Related papers (2024-02-09T15:53:15Z) - Structural Concept Learning via Graph Attention for Multi-Level Rearrangement Planning [2.7195102129095003]
We propose a deep learning approach to perform multi-level object rearrangement planning for scenes with structural dependency hierarchies.
It is trained on a self-generated simulation data set with intuitive structures and works for unseen scenes with an arbitrary number of objects.
We compare our method with a range of classical and model-based baselines to show that our method leverages its scene understanding to achieve better performance, flexibility, and efficiency.
arXiv Detail & Related papers (2023-09-05T19:35:44Z) - Real-World Compositional Generalization with Disentangled Sequence-to-Sequence Learning [81.24269148865555]
A recently proposed Disentangled sequence-to-sequence model (Dangle) shows promising generalization capability.
We introduce two key modifications to this model which encourage more disentangled representations and improve its compute and memory efficiency.
Specifically, instead of adaptively re-encoding source keys and values at each time step, we disentangle their representations and only re-encode keys periodically.
arXiv Detail & Related papers (2022-12-12T15:40:30Z) - Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach that models structures as sequences of actions generated autoregressively with PLMs; a toy linearization sketch appears after this list.
Our approach achieves a new state of the art on all the structured prediction tasks we looked at.
arXiv Detail & Related papers (2022-10-26T13:27:26Z) - Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z) - Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)
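To make the action-sequence view from the "Autoregressive Structured Prediction with Language Models" entry above concrete, here is a toy, hypothetical linearization; the action names and span format are illustrative assumptions, not that paper's actual transition system. It turns labeled spans into a sequence a language model could generate left to right:

```python
def spans_to_actions(tokens, spans):
    """Convert (start, end, label) spans (inclusive indices) into a flat action sequence."""
    actions = []
    for i, tok in enumerate(tokens):
        for start, _end, label in spans:
            if start == i:
                actions.append(f"OPEN-{label}")    # open a bracket for spans starting here
        actions.append(f"SHIFT({tok})")            # consume the current token
        for _start, end, label in spans:
            if end == i:
                actions.append(f"CLOSE-{label}")   # close brackets for spans ending here
    return actions

tokens = ["Barack", "Obama", "visited", "Paris"]
spans = [(0, 1, "PER"), (3, 3, "LOC")]
print(spans_to_actions(tokens, spans))
# ['OPEN-PER', 'SHIFT(Barack)', 'SHIFT(Obama)', 'CLOSE-PER',
#  'SHIFT(visited)', 'OPEN-LOC', 'SHIFT(Paris)', 'CLOSE-LOC']
```

Decoding reverses the mapping: a well-formed action sequence can be parsed back into the original labeled spans.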