A Multilingual Study of Multi-Sentence Compression using Word
Vertex-Labeled Graphs and Integer Linear Programming
- URL: http://arxiv.org/abs/2004.04468v1
- Date: Thu, 9 Apr 2020 10:35:16 GMT
- Title: A Multilingual Study of Multi-Sentence Compression using Word
Vertex-Labeled Graphs and Integer Linear Programming
- Authors: Elvys Linhares Pontes, Stéphane Huet, Juan-Manuel Torres-Moreno,
Thiago G. da Silva, and Andréa Carneiro Linhares
- Abstract summary: Multi-Sentence Compression (MSC) aims to generate a short sentence with the key information from a cluster of similar sentences.
This paper describes an Integer Linear Programming method for MSC using a vertex-labeled graph to select different keywords.
Our system is of good quality and outperforms the state of the art in evaluations conducted on news datasets in three languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Sentence Compression (MSC) aims to generate a short sentence with the
key information from a cluster of similar sentences. MSC enables summarization
and question-answering systems to generate outputs combining fully formed
sentences from one or several documents. This paper describes an Integer Linear
Programming method for MSC using a vertex-labeled graph to select different
keywords, with the goal of generating more informative sentences while
maintaining their grammaticality. Our system is of good quality and outperforms
the state of the art in evaluations conducted on news datasets in three languages:
French, Portuguese and Spanish. We conducted both automatic and manual evaluations to
determine the informativeness and the grammaticality of compressions for each
dataset. In additional tests, which take advantage of the fact that the length
of compressions can be modulated, we still improve ROUGE scores with shorter
output sentences.
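The paper's approach selects keywords with an ILP over a vertex-labeled word graph. As a rough illustration of the underlying word-graph idea only (a simplified Filippova-style baseline, not the authors' ILP formulation; the function names and the `min_words` constraint are illustrative assumptions), identical words across sentences are merged into shared nodes and the compression is the cheapest start-to-end path, where word pairs that co-occur often are cheap to traverse:

```python
import heapq
from collections import defaultdict

def build_word_graph(sentences):
    # Merge identical (lowercased) words into one node, Filippova-style;
    # each edge counts how often the bigram occurs across sentences.
    edges = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            edges[(a, b)] += 1
    return edges

def compress(sentences, min_words=5):
    edges = build_word_graph(sentences)
    graph = defaultdict(list)
    for (a, b), freq in edges.items():
        graph[a].append((b, 1.0 / freq))  # frequent transitions are cheap
    # Dijkstra from <s> to </s>: the cheapest sufficiently long path
    # follows the most redundant wording in the cluster.
    heap = [(0.0, ["<s>"])]
    seen = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == "</s>":
            if len(path) - 2 >= min_words:  # crude length constraint
                return " ".join(path[1:-1])
            continue
        state = (node, len(path))
        if state in seen:
            continue
        seen.add(state)
        for nxt, weight in graph[node]:
            heapq.heappush(heap, (cost + weight, path + [nxt]))
    return ""
```

The minimum-length constraint stands in for the grammaticality and informativeness terms that the paper's ILP objective handles properly; without it, the merged graph happily produces degenerate two-word "compressions".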
Related papers
- Statistical Analysis of Sentence Structures through ASCII, Lexical Alignment and PCA [0.0]
It proposes a novel statistical method that uses American Standard Code for Information Interchange (ASCII) codes to represent the text of 11 corpora.
It analyzes the results through histograms and normality tests such as Shapiro-Wilk and Anderson-Darling Tests.
arXiv Detail & Related papers (2025-03-13T15:42:44Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - Are the Best Multilingual Document Embeddings simply Based on Sentence
Embeddings? [18.968571816913208]
We provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models.
We show that a clever combination of sentence embeddings is usually better than encoding the full document as a single unit.
arXiv Detail & Related papers (2023-04-28T12:11:21Z) - M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning)
It introduces open words from WordNet to extend the prompt texts beyond closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z) - Modeling Sequential Sentence Relation to Improve Cross-lingual Dense
Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
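A masked-sentence objective of this kind can be illustrated with a toy InfoNCE-style contrastive loss over sentence vectors (a single-level simplification of MSM's hierarchical contrastive loss; the function name and vectors are illustrative, not the authors' implementation):

```python
import numpy as np

def masked_sentence_loss(predicted, positive, negatives, tau=0.1):
    """InfoNCE-style loss: the predicted vector for a masked sentence
    should score the true sentence higher than sampled negatives."""
    cands = np.vstack([positive] + list(negatives))  # row 0 = positive
    # Cosine similarity between the prediction and each candidate.
    sims = cands @ predicted / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(predicted))
    logits = sims / tau
    # Numerically stable log-softmax; loss is -log p(positive).
    logits -= logits.max()
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

The loss shrinks as the predicted vector aligns with the true sentence vector and grows when a negative is closer, which is the signal that trains the document encoder.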
arXiv Detail & Related papers (2023-02-03T09:54:27Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - ClueGraphSum: Let Key Clues Guide the Cross-Lingual Abstractive
Summarization [5.873920727236548]
Cross-Lingual Summarization (CLS) is the task of generating a summary in one language for an article in a different language.
Previous studies on CLS mainly take pipeline methods or train the end-to-end model using translated parallel data.
We propose a clue-guided cross-lingual abstractive summarization method to improve the quality of cross-lingual summaries.
arXiv Detail & Related papers (2022-03-05T18:01:11Z) - A Unified Strategy for Multilingual Grammatical Error Correction with
Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves the state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian)
arXiv Detail & Related papers (2022-01-26T02:10:32Z) - SgSum: Transforming Multi-document Summarization into Sub-graph
Selection [27.40759123902261]
Most existing extractive multi-document summarization (MDS) methods score each sentence individually and extract salient sentences one by one to compose a summary.
We propose a novel MDS framework (SgSum) to formulate the MDS task as a sub-graph selection problem.
Our model can produce significantly more coherent and informative summaries compared with traditional MDS methods.
arXiv Detail & Related papers (2021-10-25T05:12:10Z) - Using BERT Encoding and Sentence-Level Language Model for Sentence
Ordering [0.9134244356393667]
We propose an algorithm for sentence ordering in a corpus of short stories.
Our proposed method uses a language model based on Universal Transformers (UT) that captures sentences' dependencies by employing an attention mechanism.
The proposed model includes three components: Sentence, Language Model, and Sentence Arrangement with Brute Force Search.
arXiv Detail & Related papers (2021-08-24T23:03:36Z) - MultiGBS: A multi-layer graph approach to biomedical summarization [6.11737116137921]
We propose a domain-specific method that models a document as a multi-layer graph to enable multiple features of the text to be processed at the same time.
The unsupervised method selects sentences from the multi-layer graph based on the MultiRank algorithm and the number of concepts.
The proposed MultiGBS algorithm employs UMLS and extracts the concepts and relationships using different tools such as SemRep, MetaMap, and OGER.
arXiv Detail & Related papers (2020-08-27T04:22:37Z) - SummPip: Unsupervised Multi-Document Summarization with Sentence Graph
Compression [61.97200991151141]
SummPip is an unsupervised method for multi-document summarization.
We convert the original documents to a sentence graph, taking both linguistic and deep representation into account.
We then apply spectral clustering to obtain multiple clusters of sentences, and finally compress each cluster to generate the final summary.
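A toy version of this pipeline can be sketched as follows (Jaccard token overlap as the sentence-graph similarity, a spectral split via the Fiedler vector instead of full k-way spectral clustering, and shortest-sentence selection standing in for SummPip's cluster compression; all names are illustrative assumptions):

```python
import numpy as np

def fiedler_clusters(sentences):
    # Sentence graph: Jaccard overlap between token sets.
    toks = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i, j] = len(toks[i] & toks[j]) / len(toks[i] | toks[j])
    # Unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    # The Fiedler vector (eigenvector of the 2nd-smallest eigenvalue)
    # bisects the graph into two weakly connected clusters.
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1] >= 0  # boolean cluster assignment

def summarize(sentences):
    labels = fiedler_clusters(sentences)
    summary = []
    for side in (True, False):
        cluster = [s for s, l in zip(sentences, labels) if l == side]
        if cluster:
            # Stand-in for the cluster-compression step: keep the
            # shortest sentence of each cluster.
            summary.append(min(cluster, key=len))
    return summary
```

Real spectral clustering would use several eigenvectors plus k-means to get more than two clusters, and each cluster would be compressed with a word-graph method rather than truncated to one sentence.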
arXiv Detail & Related papers (2020-07-17T13:01:15Z) - The Paradigm Discovery Problem [121.79963594279893]
We formalize the paradigm discovery problem and develop metrics for judging systems.
We report empirical results on five diverse languages.
Our code and data are available for public use.
arXiv Detail & Related papers (2020-05-04T16:38:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.