Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis
- URL: http://arxiv.org/abs/2512.00214v1
- Date: Fri, 28 Nov 2025 21:27:58 GMT
- Title: Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis
- Authors: Matej Klemen, Tjaša Arčon, Luka Terčon, Marko Robnik-Šikonja, Kaja Dobrovoljc
- Abstract summary: We explore how agentic large language models (LLMs) can streamline the systematic analysis of annotated corpora. We introduce an agentic framework for corpus-grounded grammatical analysis that integrates concepts such as natural-language task interpretation. We test the system on multilingual grammatical tasks inspired by the World Atlas of Language Structures (WALS).
- Score: 0.5545791216381869
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Empirical grammar research has become increasingly data-driven, but the systematic analysis of annotated corpora still requires substantial methodological and technical effort. We explore how agentic large language models (LLMs) can streamline this process by reasoning over annotated corpora and producing interpretable, data-grounded answers to linguistic questions. We introduce an agentic framework for corpus-grounded grammatical analysis that integrates concepts such as natural-language task interpretation, code generation, and data-driven reasoning. As a proof of concept, we apply it to Universal Dependencies (UD) corpora, testing it on multilingual grammatical tasks inspired by the World Atlas of Language Structures (WALS). The evaluation spans 13 word-order features and over 170 languages, assessing system performance across three complementary dimensions - dominant-order accuracy, order-coverage completeness, and distributional fidelity - which reflect how well the system generalizes, identifies, and quantifies word-order variations. The results demonstrate the feasibility of combining LLM reasoning with structured linguistic data, offering a first step toward interpretable, scalable automation of corpus-based grammatical inquiry.
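To make the abstract's three evaluation dimensions concrete, the sketch below shows the kind of corpus-grounded check such a system could generate: classifying object-verb order from UD-style annotations and reporting both the dominant order and its distribution. The toy sentences, field layout, and `object_verb_orders` helper are illustrative assumptions, not the paper's actual pipeline.

```python
from collections import Counter

# Toy stand-in for a UD corpus: one token per line with
# (ID, FORM, LEMMA, UPOS, _, _, HEAD, DEPREL, _, _) columns.
# Real CoNLL-U is tab-separated; whitespace splitting suffices here.
CONLLU = """
1 She she PRON _ _ 2 nsubj _ _
2 reads read VERB _ _ 0 root _ _
3 books book NOUN _ _ 2 obj _ _

1 I I PRON _ _ 2 nsubj _ _
2 like like VERB _ _ 0 root _ _
3 tea tea NOUN _ _ 2 obj _ _

1 Knjige knjiga NOUN _ _ 2 obj _ _
2 bere brati VERB _ _ 0 root _ _
"""

def object_verb_orders(conllu: str) -> Counter:
    """For every `obj` dependent of a VERB head, record whether the
    object follows (VO) or precedes (OV) its verb."""
    counts = Counter()
    for sent in conllu.strip().split("\n\n"):
        rows = [line.split() for line in sent.strip().splitlines()]
        upos = {r[0]: r[3] for r in rows}  # token ID -> part of speech
        for r in rows:
            if r[7] == "obj" and upos.get(r[6]) == "VERB":
                # Verb before object in linear order -> VO, else OV.
                counts["VO" if int(r[6]) < int(r[0]) else "OV"] += 1
    return counts

counts = object_verb_orders(CONLLU)
total = sum(counts.values())
dominant = counts.most_common(1)[0][0]                    # dominant-order decision
distribution = {k: v / total for k, v in counts.items()}  # distributional fidelity
print(dominant, distribution)
```

In this toy corpus VO occurs twice and OV once, so the dominant order is VO with a 2:1 distribution; the paper's system would additionally need to generate such probes automatically and cover all attested orders (order-coverage completeness).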
Related papers
- LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder [47.81850176849213]
We propose a framework for analyzing the linguistic mechanisms of large language models, based on Sparse Auto-Encoders (SAEs). We extract a broad set of Chinese and English linguistic features across four dimensions (morphology, syntax, semantics, and pragmatics). Our findings reveal intrinsic representations of linguistic knowledge in LLMs, uncover patterns of cross-layer and cross-lingual distribution, and demonstrate the potential to control model outputs.
arXiv Detail & Related papers (2025-02-27T18:16:47Z) - Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration [42.95840730800478]
This paper presents a complete explainable system that interprets a set of data, abstracts the underlying features and describes them in a natural language of choice. The system relies on two crucial stages: (i) identifying emerging properties from data and transforming them into abstract concepts, and (ii) converting these concepts into natural language.
arXiv Detail & Related papers (2025-02-13T11:49:48Z) - Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions [3.0906699069248806]
Construction Grammar (CxG) is a psycholinguistically grounded framework for testing generalization. Our dataset consists of English phrasal constructions, for which speakers are known to be able to abstract over commonplace instantiations. Our results demonstrate that state-of-the-art models, including GPT-o1, exhibit a performance drop of over 40% on our second task.
arXiv Detail & Related papers (2025-01-08T18:15:10Z) - Evaluating Distributed Representations for Multi-Level Lexical Semantics: A Research Proposal [3.3585951129432323]
This thesis builds a bridge between computational models and lexical semantics, aiming to complement each other. Modern neural networks (NNs) construct distributed representations by compressing individual words into dense, continuous, high-dimensional vectors.
arXiv Detail & Related papers (2024-06-02T14:08:51Z) - Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models [54.58989938395976]
We introduce a decomposed prompting approach for sequence labeling tasks. We test our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages.
arXiv Detail & Related papers (2024-02-28T15:15:39Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests that LMs may serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z) - Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks.
We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking.
We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
arXiv Detail & Related papers (2022-11-15T01:13:39Z) - Compositional Generalization in Grounded Language Learning via Induced Model Sparsity [81.38804205212425]
We consider simple language-conditioned navigation problems in a grid world environment with disentangled observations.
We design an agent that encourages sparse correlations between words in the instruction and attributes of objects, composing them together to find the goal.
Our agent maintains a high level of performance on goals containing novel combinations of properties even when learning from a handful of demonstrations.
arXiv Detail & Related papers (2022-07-06T08:46:27Z) - A Knowledge-Enhanced Adversarial Model for Cross-lingual Structured Sentiment Analysis [31.05169054736711]
The cross-lingual structured sentiment analysis task aims to transfer knowledge from a source language to a target language.
We propose a Knowledge-Enhanced Adversarial Model (KEAM) with both implicit distributed and explicit structural knowledge.
We conduct experiments on five datasets and compare KEAM with both supervised and unsupervised methods.
arXiv Detail & Related papers (2022-05-31T03:07:51Z) - AUTOLEX: An Automatic Framework for Linguistic Exploration [93.89709486642666]
We propose an automatic framework that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena.
Specifically, we apply this framework to extract descriptions for three phenomena: morphological agreement, case marking, and word order.
We evaluate the descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible.
arXiv Detail & Related papers (2022-03-25T20:37:30Z) - ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning [97.10875695679499]
We propose a novel contrastive learning framework named ERICA for the pre-training phase, to obtain a deeper understanding of the entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
arXiv Detail & Related papers (2020-12-30T03:35:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.