Segmenting Natural Language Sentences via Lexical Unit Analysis
- URL: http://arxiv.org/abs/2012.05418v3
- Date: Fri, 16 Apr 2021 08:30:03 GMT
- Title: Segmenting Natural Language Sentences via Lexical Unit Analysis
- Authors: Yangming Li, Lemao Liu, Shuming Shi
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present Lexical Unit Analysis (LUA), a framework for general sequence segmentation tasks. Given a natural language sentence, LUA scores all valid segmentation candidates and uses dynamic programming (DP) to extract the maximum-scoring one. LUA has a number of appealing properties: it inherently guarantees that the predicted segmentation is valid, and it facilitates globally optimal training and inference. Moreover, the practical time complexity of LUA can be reduced to linear time, making it highly efficient. We conducted extensive experiments on 5 tasks, including syntactic chunking, named entity recognition (NER), slot filling, Chinese word segmentation, and Chinese part-of-speech (POS) tagging, across 15 datasets. Our models achieve state-of-the-art performance on 13 of them. The results also show that the F1 score for identifying long segments improves notably.
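To make the DP inference step concrete, below is a minimal sketch of maximum-scoring segmentation in Python. It is not the paper's implementation: `score_segment` stands in for LUA's neural scorer, and the `max_len` cap on segment length is an assumption that yields the linear practical complexity mentioned in the abstract.

```python
# Minimal sketch of maximum-scoring segmentation via dynamic programming,
# in the spirit of LUA's inference step. `score_segment` is a placeholder
# (assumption): the paper uses a neural scorer, whereas here any callable
# mapping a span (j, i) to a float will do.
from typing import Callable, List, Tuple

def segment(tokens: List[str],
            score_segment: Callable[[int, int], float],
            max_len: int = 10) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the highest-scoring segmentation of `tokens`.

    best[i] holds the best total score of any segmentation of the prefix
    tokens[:i]; back[i] remembers where that prefix's last segment starts.
    Capping segment length at `max_len` makes the loop O(n * max_len),
    i.e. linear in n for a fixed cap.
    """
    n = len(tokens)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            cand = best[j] + score_segment(j, i)  # score of span tokens[j:i]
            if cand > best[i]:
                best[i], back[i] = cand, j
    # Recover the argmax segmentation by following back-pointers.
    spans, i = [], n
    while i > 0:
        spans.append((back[i], i))
        i = back[i]
    return best[n], spans[::-1]

# Toy usage: a scorer that favours two-token segments.
if __name__ == "__main__":
    toks = "we saw the black cat".split()
    print(segment(toks, lambda j, i: 1.0 if i - j == 2 else 0.0))
```

Because every candidate the DP considers is a contiguous, non-overlapping cover of the sentence, the returned segmentation is valid by construction, which is the validity guarantee the abstract refers to.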
Related papers
- Musical Phrase Segmentation via Grammatical Induction (2024-05-29)
  We analyze the performance of five grammatical induction algorithms on three datasets using various musical viewpoint combinations. Our experiments show that the LONGESTFIRST algorithm achieves the best F1 scores across all three datasets.
- Universal Segmentation at Arbitrary Granularity with Language Instruction (2023-12-04)
  We present UniLSeg, a universal segmentation model that can perform segmentation at any semantic level under the guidance of language instructions. To train UniLSeg, we reorganize a group of tasks from their original diverse distributions into a unified data format, where images paired with texts describing the segmentation targets are the input and the corresponding masks are the output.
- LISA: Reasoning Segmentation via Large Language Model (2023-08-01)
  We propose a new segmentation task, reasoning segmentation, which is designed to output a segmentation mask given a complex and implicit query text. We present LISA (Large Language Instructed Segmentation Assistant), which inherits the language generation capabilities of multimodal large language models.
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks (2022-12-20)
  Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, yet there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmarks for several tasks.
- Pre-training Universal Language Representation (2021-05-30)
  This work introduces universal language representation learning, i.e., embeddings of linguistic units of different levels, or text of quite diverse lengths, in a uniform vector space. We empirically verify that a well-designed pre-training scheme can effectively yield universal language representations.
- LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction (2021-05-18)
  This paper describes team LCP-RIT's submission to SemEval-2021 Task 1: Lexical Complexity Prediction (LCP). Our system uses logistic regression and a wide range of linguistic features to predict the complexity of single words in this dataset. We evaluate the results in terms of mean absolute error, mean squared error, Pearson correlation, and Spearman correlation.
- Neural Sequence Segmentation as Determining the Leftmost Segments (2021-04-15)
  We propose a novel framework that incrementally segments natural language sentences at the segment level. At every step, it recognizes the leftmost segment of the remaining sequence (a minimal sketch of this decoding style follows the list). We conducted extensive experiments on syntactic chunking and Chinese part-of-speech tagging across 3 datasets.
- Learning Universal Representations from Word to Sentence (2020-09-10)
  This work introduces and explores universal representation learning, i.e., embeddings of linguistic units of different levels in a uniform vector space. We present our approach to constructing analogy datasets in terms of words, phrases, and sentences. We empirically verify that well pre-trained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
- BURT: BERT-inspired Universal Representation from Twin Structure (2020-04-29)
  BURT is capable of generating universal, fixed-size representations for input sequences of any granularity. It adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset. We evaluate BURT across text similarity tasks of different granularities, including STS tasks, SemEval-2013 Task 5(a), and several commonly used word similarity tasks.
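For contrast with LUA's global DP, here is a minimal, illustrative sketch of the incremental "leftmost segment" decoding described in the Neural Sequence Segmentation entry above. The `score_prefix` callable and the `max_len` cap are placeholders (assumptions), not that paper's actual model.

```python
# Illustrative greedy decoding: repeatedly pick the best-scoring leftmost
# segment of the remaining sequence. `score_prefix` is a stand-in
# (assumption) for a learned scorer over candidate leftmost segments.
from typing import Callable, List, Tuple

def leftmost_segments(tokens: List[str],
                      score_prefix: Callable[[List[str], int], float],
                      max_len: int = 10) -> List[Tuple[int, int]]:
    spans, start, n = [], 0, len(tokens)
    while start < n:
        # Choose the end position of the leftmost segment of the remainder.
        end = max(range(start + 1, min(start + max_len, n) + 1),
                  key=lambda e: score_prefix(tokens[start:e], start))
        spans.append((start, end))
        start = end  # the rest of the sentence is segmented recursively
    return spans
```

Unlike the DP above, this greedy loop commits to each leftmost segment as it goes, so it runs in a single left-to-right pass rather than optimizing over all candidate segmentations.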