Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks
- URL: http://arxiv.org/abs/2403.17534v1
- Date: Tue, 26 Mar 2024 09:39:53 GMT
- Title: Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks
- Authors: Santiago Herrera, Caio Corro, Sylvain Kahane,
- Abstract summary: We propose a new method to extract and explore significant fine-grained grammar patterns from treebanks.
We extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order.
Our method captures both well-known and less well-known significant grammar rules in Spanish, French, and Wolof.
- Score: 6.390468088226495
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Descriptive grammars are highly valuable, but writing them is time-consuming and difficult. Furthermore, while linguists typically use corpora to create them, grammar descriptions often lack quantitative data. As for formal grammars, they can be challenging to interpret. In this paper, we propose a new method to extract and explore significant fine-grained grammar patterns and potential syntactic grammar rules from treebanks, in order to create an easy-to-understand corpus-based grammar. More specifically, we extract descriptions and rules across different languages for two linguistic phenomena, agreement and word order, using a large search space and paying special attention to the ranking order of the extracted rules. For that, we use a linear classifier to extract the most salient features that predict the linguistic phenomena under study. We associate statistical information to each rule, and we compare the ranking of the model's results to those of other quantitative and statistical measures. Our method captures both well-known and less well-known significant grammar rules in Spanish, French, and Wolof.
Related papers
- Principles of semantic and functional efficiency in grammatical patterning [1.6267479602370545]
Grammatical features such as number and gender serve two central functions in human languages.
Number and gender encode salient semantic attributes like numerosity and animacy, but offload sentence processing cost by predictably linking words together.
Grammars exhibit consistent organizational patterns across diverse languages, invariably rooted in a semantic foundation.
arXiv Detail & Related papers (2024-10-21T10:49:54Z) - CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z) - AUTOLEX: An Automatic Framework for Linguistic Exploration [93.89709486642666]
We propose an automatic framework that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena.
Specifically, we apply this framework to extract descriptions for three phenomena: morphological agreement, case marking, and word order.
We evaluate the descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible.
arXiv Detail & Related papers (2022-03-25T20:37:30Z) - Learning grammar with a divide-and-concur neural network [4.111899441919164]
We implement a divide-and-concur iterative projection approach to context-free grammar inference.
Our method requires a relatively small number of discrete parameters, making the inferred grammar directly interpretable.
arXiv Detail & Related papers (2022-01-18T22:42:43Z) - Dependency Induction Through the Lens of Visual Perception [81.91502968815746]
We propose an unsupervised grammar induction model that leverages word concreteness and a structural vision-based to jointly learn constituency-structure and dependency-structure grammars.
Our experiments show that the proposed extension outperforms the current state-of-the-art visually grounded models in constituency parsing even with a smaller grammar size.
arXiv Detail & Related papers (2021-09-20T18:40:37Z) - GrammarTagger: A Multilingual, Minimally-Supervised Grammar Profiler for
Language Education [7.517366022163375]
We present GrammarTagger, an open-source grammar profiler which, given an input text, identifies grammatical features useful for language education.
The model architecture enables it to learn from a small amount of texts annotated with spans and their labels.
We also build Octanove Learn, a search engine of language learning materials indexed by their reading difficulty and grammatical features.
arXiv Detail & Related papers (2021-04-07T15:31:20Z) - VLGrammar: Grounded Grammar Induction of Vision and Language [86.88273769411428]
We study grounded grammar induction of vision and language in a joint learning framework.
We present VLGrammar, a method that uses compound probabilistic context-free grammars (compound PCFGs) to induce the language grammar and the image grammar simultaneously.
arXiv Detail & Related papers (2021-03-24T04:05:08Z) - Word Frequency Does Not Predict Grammatical Knowledge in Language Models [2.1984302611206537]
We investigate whether there are systematic sources of variation in the language models' accuracy.
We find that certain nouns are systematically understood better than others, an effect which is robust across grammatical tasks and different language models.
We find that a novel noun's grammatical properties can be few-shot learned from various types of training data.
arXiv Detail & Related papers (2020-10-26T19:51:36Z) - Investigating Cross-Linguistic Adjective Ordering Tendencies with a
Latent-Variable Model [66.84264870118723]
We present the first purely corpus-driven model of multi-lingual adjective ordering in the form of a latent-variable model.
We provide strong converging evidence for the existence of universal, cross-linguistic, hierarchical adjective ordering tendencies.
arXiv Detail & Related papers (2020-10-09T18:27:55Z) - Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.