Related papers: Improving Unsupervised Constituency Parsing via Maximizing Semantic Information

URL: http://arxiv.org/abs/2410.02558v1
Date: Thu, 3 Oct 2024 15:04:00 GMT
Title: Improving Unsupervised Constituency Parsing via Maximizing Semantic Information
Authors: Junjie Chen, Xiangheng He, Yusuke Miyao, Danushka Bollegala
Abstract summary: Unsupervised constituency parsers organize phrases within a sentence into a tree-shaped syntactic constituent structure. The traditional objective of maximizing sentence log-likelihood (LL) does not explicitly account for the close relationship between the constituent structure and the semantics. We introduce a novel objective for training unsupervised parsers: maximizing the information between constituent structures and sentence semantics (SemInfo).
Score: 35.63321102040579
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unsupervised constituency parsers organize phrases within a sentence into a tree-shaped syntactic constituent structure that reflects the organization of sentence semantics. However, the traditional objective of maximizing sentence log-likelihood (LL) does not explicitly account for the close relationship between the constituent structure and the semantics, resulting in a weak correlation between LL values and parsing accuracy. In this paper, we introduce a novel objective for training unsupervised parsers: maximizing the information between constituent structures and sentence semantics (SemInfo). We introduce a bag-of-substrings model to represent the semantics and apply the probability-weighted information metric to estimate the SemInfo. Additionally, we develop a Tree Conditional Random Field (TreeCRF)-based model to apply the SemInfo maximization objective to Probabilistic Context-Free Grammar (PCFG) induction, the state-of-the-art method for unsupervised constituency parsing. Experiments demonstrate that SemInfo correlates more strongly with parsing accuracy than LL. Our algorithm significantly enhances parsing accuracy by an average of 7.85 points across five PCFG variants and in four languages, achieving new state-of-the-art results in three of the four languages.
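As a rough illustration of the span-based view of constituency trees that the abstract relies on, the sketch below finds the highest-scoring binary bracketing of a sentence given per-span scores, using CKY-style dynamic programming. It is not the authors' method: the function name `best_binary_tree` and the `span_score` values are hypothetical, and in the paper the training signal comes from the SemInfo estimate and a TreeCRF model rather than hand-set numbers.

```python
from functools import lru_cache
from typing import Dict, FrozenSet, Tuple

Span = Tuple[int, int]  # half-open token span [i, j)


def best_binary_tree(n: int, span_score: Dict[Span, float]) -> Tuple[float, FrozenSet[Span]]:
    """Return (total score, spans) of the highest-scoring binary tree over n tokens.

    span_score holds hypothetical scores; spans missing from it score 0.0.
    """
    if n < 1:
        raise ValueError("need at least one token")

    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> Tuple[float, FrozenSet[Span]]:
        own = span_score.get((i, j), 0.0)
        if j - i == 1:
            # A single token is a leaf span.
            return own, frozenset({(i, j)})
        best_score, best_spans = None, None
        # Try every split point and keep the best pair of subtrees.
        for k in range(i + 1, j):
            left_score, left_spans = solve(i, k)
            right_score, right_spans = solve(k, j)
            cand = left_score + right_score
            if best_score is None or cand > best_score:
                best_score, best_spans = cand, left_spans | right_spans
        return own + best_score, best_spans | {(i, j)}

    return solve(0, n)


# Toy example over the tokens ["the", "cat", "sat"] with made-up scores.
score, spans = best_binary_tree(3, {(0, 2): 1.0, (1, 3): 0.2})
print(score, sorted(spans))
```

In the toy example the span covering "the cat" scores higher than the one covering "cat sat", so the selected tree brackets the first two tokens together.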

Related papers

Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark [0.29687381456163997]
Tokenization is a fundamental preprocessing step in NLP, directly impacting large language models' ability to capture syntactic, morphosyntactic, and semantic structures. This paper introduces a novel framework for evaluating tokenization strategies, addressing challenges in morphologically rich and low-resource languages.
arXiv Detail & Related papers (2025-02-10T21:47:49Z)
Structural Entropy Guided Probabilistic Coding [52.01765333755793]
We propose a novel structural entropy-guided probabilistic coding model, named SEPC. We incorporate the relationship between latent variables into the optimization by proposing a structural entropy regularization loss. Experimental results across 12 natural language understanding tasks, including both classification and regression tasks, demonstrate the superior performance of SEPC.
arXiv Detail & Related papers (2024-12-12T00:37:53Z)
Towards a theory of how the structure of language is acquired by deep neural networks [6.363756171493383]
We use a tree-like generative model that captures many of the hierarchical structures found in natural languages. We show that token-token correlations can be used to build a representation of the grammar's hidden variables. We conjecture that the relationship between training set size and effective range of correlations holds beyond our synthetic datasets.
arXiv Detail & Related papers (2024-05-28T17:01:22Z)
Empirical Sufficiency Lower Bounds for Language Modeling with Locally-Bootstrapped Semantic Structures [4.29295838853865]
We design a concise binary vector representation of semantic structure at the lexical level. We evaluate in depth how good an incremental tagger needs to be in order to achieve better-than-baseline performance.
arXiv Detail & Related papers (2023-05-30T10:09:48Z)
CPTAM: Constituency Parse Tree Aggregation Method [6.011216641982612]
This paper adopts the truth discovery idea to aggregate constituency parse trees from different parsers. We formulate the constituency parse tree aggregation problem in two steps: structure aggregation and constituent label aggregation. Experiments are conducted on benchmark datasets in different languages and domains.
arXiv Detail & Related papers (2022-01-19T23:05:37Z)
Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation. This paper aims to address the issue with a mask-and-predict strategy. We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions. Experiments on Semantic Textual Similarity show the resulting metric, Neighboring Distribution Divergence (NDD), to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
Discrete representations in neural models of spoken language [56.29049879393466]
We compare the merits of four commonly used metrics in the context of weakly supervised models of spoken language. We find that the different evaluation metrics can give inconsistent results.
arXiv Detail & Related papers (2021-05-12T11:02:02Z)
Introducing Syntactic Structures into Target Opinion Word Extraction with Deep Learning [89.64620296557177]
We propose to incorporate the syntactic structures of the sentences into the deep learning models for targeted opinion word extraction. We also introduce a novel regularization technique to improve the performance of the deep learning models. The proposed model is extensively analyzed and achieves state-of-the-art performance on four benchmark datasets.
arXiv Detail & Related papers (2020-10-26T07:13:17Z)
A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction. We show that different embedding spaces have different degrees of strength for the structural and semantic properties. These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
Open-set Short Utterance Forensic Speaker Verification using Teacher-Student Network with Explicit Inductive Bias [59.788358876316295]
We propose a pipeline solution to improve speaker verification on a small actual forensic field dataset. By leveraging large-scale out-of-domain datasets, a knowledge distillation based objective function is proposed for teacher-student learning. We show that the proposed objective function can efficiently improve the performance of teacher-student learning on short utterances.
arXiv Detail & Related papers (2020-09-21T00:58:40Z)
Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach [78.77265671634454]
We make use of a multi-task objective, i.e., the models simultaneously predict words as well as ground truth parse trees in a form called "syntactic distances". Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground truth parse trees are provided as additional training signals, the model is able to achieve lower perplexity and induce trees with better quality.
arXiv Detail & Related papers (2020-05-12T15:35:00Z)
Discrete Optimization for Unsupervised Sentence Summarization with Word-Level Extraction [31.648764677078837]
Automatic sentence summarization produces a shorter version of a sentence, while preserving its most important information. We model these two aspects in an unsupervised objective function, consisting of language modeling and semantic similarity metrics. Our proposed method achieves a new state-of-the-art for unsupervised sentence summarization according to ROUGE scores.
arXiv Detail & Related papers (2020-05-04T19:01:55Z)
Discontinuous Constituent Parsing with Pointer Networks [0.34376560669160383]
Discontinuous constituent trees are crucial for representing all grammatical phenomena of languages such as German. Recent advances in dependency parsing have shown that Pointer Networks excel in efficiently parsing syntactic relations between words in a sentence. We propose a novel neural network architecture that is able to generate the most accurate discontinuous constituent representations.
arXiv Detail & Related papers (2020-02-05T15:12:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.