A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models
- URL: http://arxiv.org/abs/2507.22187v1
- Date: Tue, 29 Jul 2025 19:30:11 GMT
- Title: A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models
- Authors: Adam M. Morgan, Adeen Flinker
- Abstract summary: We present an automated pipeline for estimating Verb Frame Frequencies (VFFs). VFFs provide a powerful window into syntax in both human and machine language systems. We use large language models (LLMs) to generate a corpus of sentences containing 476 English verbs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an automated pipeline for estimating Verb Frame Frequencies (VFFs), the frequency with which a verb appears in particular syntactic frames. VFFs provide a powerful window into syntax in both human and machine language systems, but existing tools for calculating them are limited in scale, accuracy, or accessibility. We use large language models (LLMs) to generate a corpus of sentences containing 476 English verbs. Next, by instructing an LLM to behave like an expert linguist, we had it analyze the syntactic structure of the sentences in this corpus. This pipeline outperforms two widely used syntactic parsers across multiple evaluation datasets. Furthermore, it requires far fewer resources than manual parsing (the gold-standard), thereby enabling rapid, scalable VFF estimation. Using the LLM parser, we produce a new VFF database with broader verb coverage, finer-grained syntactic distinctions, and explicit estimates of the relative frequencies of structural alternates commonly studied in psycholinguistics. The pipeline is easily customizable and extensible to new verbs, syntactic frames, and even other languages. We present this work as a proof of concept for automated frame frequency estimation, and release all code and data to support future research.
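As a rough illustration of the two-stage design described in the abstract, the sketch below generates sentences for a verb, has an LLM label each sentence's syntactic frame, and tallies relative frequencies. The `complete` stub, the prompts, and the frame inventory are hypothetical stand-ins, not the authors' actual prompts, frame set, or model choices.

```python
# Illustrative sketch of the two-stage VFF pipeline described in the abstract.
# `complete` is a hypothetical stand-in for an LLM API call; the prompts and
# frame labels are invented for illustration.
from collections import Counter

def complete(prompt: str) -> str:
    """Stand-in for an LLM call (plug in your own client here)."""
    raise NotImplementedError

FRAMES = ["NP V", "NP V NP", "NP V NP NP", "NP V NP PP", "NP V S"]  # illustrative

def generate_sentences(verb: str, n: int = 50) -> list[str]:
    # Stage 1: elicit a small corpus of sentences containing the target verb.
    prompt = f"Write {n} varied English sentences using the verb '{verb}'. One per line."
    return complete(prompt).strip().splitlines()

def label_frame(sentence: str, verb: str) -> str:
    # Stage 2: instruct the LLM to behave like an expert linguist and pick a frame.
    prompt = (
        "You are an expert linguist. Which syntactic frame does the verb "
        f"'{verb}' appear in here? Answer with one of {FRAMES}.\nSentence: {sentence}"
    )
    return complete(prompt).strip()

def estimate_vffs(verbs: list[str]) -> dict[str, dict[str, float]]:
    vffs = {}
    for verb in verbs:
        counts = Counter(label_frame(s, verb) for s in generate_sentences(verb))
        total = sum(counts.values())
        vffs[verb] = {frame: c / total for frame, c in counts.items()}
    return vffs
```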
Related papers
- Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
We train and evaluate neural networks directly as binary classifiers of strings. We provide results on a variety of languages across the Chomsky hierarchy for three neural architectures. Our contributions will facilitate theoretically sound empirical testing of language recognition claims in future work.
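A minimal sketch of this setup (not the paper's code): train a small LSTM as a binary recognizer of a toy formal language, here Dyck-1 (balanced parentheses).

```python
# Toy sketch: an LSTM trained as a binary recognizer of balanced parentheses.
import random
import torch
import torch.nn as nn

def is_balanced(s: str) -> bool:
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def sample(n: int = 12) -> str:
    # Uniform random strings; the positive class is rare, which is fine for a sketch.
    return "".join(random.choice("()") for _ in range(n))

class Recognizer(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.embed = nn.Embedding(2, 8)        # vocabulary: "(" and ")"
        self.lstm = nn.LSTM(8, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)       # accept/reject logit

    def forward(self, x):
        _, (h, _) = self.lstm(self.embed(x))
        return self.head(h[-1]).squeeze(-1)

model = Recognizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(2000):
    batch = [sample() for _ in range(32)]
    x = torch.tensor([[0 if c == "(" else 1 for c in s] for s in batch])
    y = torch.tensor([float(is_balanced(s)) for s in batch])
    loss = loss_fn(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```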
arXiv Detail & Related papers (2024-11-11T16:33:25Z)
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at rates as low as 5 Hz and 60 bps, and achieves state-of-the-art (SotA) syllabic segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- LongFNT: Long-form Speech Recognition with Factorized Neural Transducer [64.75547712366784]
We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor.
The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively.
arXiv Detail & Related papers (2022-11-17T08:48:27Z)
- Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z)
- The Limitations of Limited Context for Constituency Parsing [27.271792317099045]
The Parsing-Reading-Predict architecture of Shen et al. (2018a) was the first to perform unsupervised syntactic parsing.
What kind of syntactic structure can current neural approaches to syntax represent?
We ground this question in the sandbox of probabilistic context-free grammars (PCFGs).
We identify a key aspect of the representational power of these approaches: the amount and directionality of context that the predictor has access to.
arXiv Detail & Related papers (2021-06-03T03:58:35Z)
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
- Applying Occam's Razor to Transformer-Based Dependency Parsing: What Works, What Doesn't, and What is Really Necessary [9.347252855045125]
We study the choice of pre-trained embeddings and whether LSTM layers are necessary in graph-based dependency parsers.
We propose a simple but widely applicable architecture and configuration, achieving new state-of-the-art results (in terms of LAS) for 10 out of 12 diverse languages.
arXiv Detail & Related papers (2020-10-23T22:58:26Z)
- GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or those that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
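A hedged sketch of that idea in plain NumPy (not GATE's implementation): compute all-pairs distances in the dependency tree and subtract a distance penalty from the attention scores, so attention can still relate words that are far apart in the tree.

```python
# Sketch: self-attention with scores biased by pairwise syntactic distance.
import numpy as np
from collections import deque

def tree_distances(heads: list[int]) -> np.ndarray:
    """All-pairs distances in a dependency tree; heads[i] is token i's head (-1 = root)."""
    n = len(heads)
    adj = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i].append(h); adj[h].append(i)
    dist = np.full((n, n), np.inf)
    for s in range(n):
        dist[s, s], q = 0, deque([s])
        while q:  # BFS from each token
            u = q.popleft()
            for v in adj[u]:
                if dist[s, v] == np.inf:
                    dist[s, v] = dist[s, u] + 1
                    q.append(v)
    return dist

def distance_biased_attention(x: np.ndarray, heads: list[int], beta: float = 0.5):
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)            # plain dot-product self-attention
    scores -= beta * tree_distances(heads)   # penalize syntactically distant pairs
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # row-wise softmax
    return w @ x
```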
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
- Investigating representations of verb bias in neural language models [7.455546102930909]
We introduce DAIS, a benchmark dataset containing 50K human judgments for 5K distinct sentence pairs in the English dative alternation.
This dataset includes 200 unique verbs and systematically varies the definiteness and length of arguments.
We use this dataset, as well as an existing corpus of naturally occurring data, to evaluate how well recent neural language models capture human preferences.
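One common way to run such an evaluation (a sketch, not necessarily the paper's exact protocol) is to compare a causal LM's total log-probability for the two variants of an alternation pair, here with GPT-2 via Hugging Face transformers; the sentences below are illustrative, not DAIS items.

```python
# Sketch: compare an LM's log-probabilities for a dative alternation pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def logprob(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # loss = mean NLL over predicted tokens
    return -out.loss.item() * (ids.shape[1] - 1)  # total log-prob of the tokens

do = logprob("The teacher gave the student a book.")    # double-object variant
po = logprob("The teacher gave a book to the student.") # prepositional variant
print("model prefers DO" if do > po else "model prefers PO")
```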
arXiv Detail & Related papers (2020-10-05T22:39:08Z)
- Evaluating the reliability of acoustic speech embeddings [10.5754802112615]
Speech embeddings are fixed-size acoustic representations of variable-length speech sequences.
Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods.
We find that overall, ABX and MAP correlate with one another and with frequency estimation.
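For concreteness, here is a small NumPy sketch of one of the two metrics, MAP for same-word retrieval under cosine similarity (an assumed setup, not the paper's evaluation code): `embs` holds one fixed-size embedding per token, and `labels` holds the word type each token realizes.

```python
# Sketch: mean average precision (MAP) for same-word retrieval.
import numpy as np

def mean_average_precision(embs: np.ndarray, labels: list[str]) -> float:
    x = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = x @ x.T                                      # cosine similarities
    aps = []
    for i in range(len(labels)):
        order = np.argsort(-sim[i])
        order = order[order != i]                      # exclude the query itself
        rel = np.array([labels[j] == labels[i] for j in order])
        if not rel.any():
            continue                                   # no other token of this word
        # precision at each rank where a same-word token is retrieved
        precision_at_hits = np.cumsum(rel)[rel] / (np.flatnonzero(rel) + 1)
        aps.append(precision_at_hits.mean())
    return float(np.mean(aps))
```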
arXiv Detail & Related papers (2020-07-27T13:24:09Z)
- pyBART: Evidence-based Syntactic Transformations for IE [52.93947844555369]
We present pyBART, an easy-to-use open-source Python library for converting English UD trees to Enhanced UD graphs or to our representation.
When evaluated in a pattern-based relation extraction scenario, our representation results in higher extraction scores than Enhanced UD, while requiring fewer patterns.
arXiv Detail & Related papers (2020-05-04T07:38:34Z)
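For flavor, a generic sketch of pattern-based extraction over dependency edges; this is not pyBART's API, only an illustration of the kind of subject-verb-object pattern that richer graph representations can make shorter.

```python
# Generic sketch: match (subject, predicate, object) patterns over dependency edges.
from typing import NamedTuple

class Edge(NamedTuple):
    head: str   # governing token
    rel: str    # dependency relation label
    dep: str    # dependent token

def extract_svo(edges: list[Edge]) -> list[tuple[str, str, str]]:
    subj = {e.head: e.dep for e in edges if e.rel == "nsubj"}
    obj = {e.head: e.dep for e in edges if e.rel == "obj"}
    return [(subj[v], v, obj[v]) for v in subj if v in obj]

# "Alice founded a company": founded -nsubj-> Alice, founded -obj-> company
edges = [Edge("founded", "nsubj", "Alice"), Edge("founded", "obj", "company")]
print(extract_svo(edges))  # [('Alice', 'founded', 'company')]
```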