GAE-ISumm: Unsupervised Graph-Based Summarization of Indian Languages
- URL: http://arxiv.org/abs/2212.12937v1
- Date: Sun, 25 Dec 2022 17:20:03 GMT
- Title: GAE-ISumm: Unsupervised Graph-Based Summarization of Indian Languages
- Authors: Lakshmi Sireesha Vakada, Anudeep Ch, Mounika Marreddy, Subba Reddy
Oota, Radhika Mamidi
- Abstract summary: Document summarization aims to create a precise and coherent summary of a text document.
Many deep learning summarization models are developed mainly for English, often requiring a large training corpus.
We propose GAE-ISumm, an unsupervised Indic summarization model that extracts summaries from text documents.
- Score: 5.197307534263253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document summarization aims to create a precise and coherent summary of a
text document. Many deep learning summarization models are developed mainly for
English, often requiring a large training corpus and efficient pre-trained
language models and tools. However, adapting such English-centric models to
low-resource Indian languages is hampered by their rich morphological
variation and by syntactic and semantic differences. In this paper, we propose
GAE-ISumm, an unsupervised Indic summarization model that extracts summaries
from text documents. In particular, our proposed model, GAE-ISumm uses Graph
Autoencoder (GAE) to learn text representations and a document summary jointly.
We also provide TELSUM, a manually annotated Telugu summarization dataset, to
experiment with our model GAE-ISumm. Further, we evaluate GAE-ISumm on
publicly available Indian-language summarization datasets to investigate its
effectiveness on other Indian languages. Our experiments with GAE-ISumm across
seven languages show that: (i) it is competitive with or better than
state-of-the-art results on all datasets, (ii) it sets new benchmark results on
TELSUM, and (iii) adding positional and cluster information to the model
improves summary quality.
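
As a concrete illustration of the abstract's idea, the following is a minimal, hypothetical sketch of a GAE-based unsupervised extractive summarizer: build a sentence-similarity graph, train a graph autoencoder to reconstruct its edges, and select the most central sentences, with a small positional prior echoing observation (iii). The TF-IDF features, layer sizes, threshold, and scoring heuristic are this sketch's assumptions, not details taken from the paper.

```python
# Minimal, hypothetical sketch of the GAE-ISumm idea: learn sentence
# embeddings with a Graph Autoencoder (GAE) over a sentence-similarity
# graph, then extract the most central sentences as the summary.
# All hyperparameters and features here are assumptions, not the paper's.
import torch
import torch.nn.functional as F
from sklearn.feature_extraction.text import TfidfVectorizer

def sentence_graph(sentences, threshold=0.1):
    """Normalized adjacency from cosine similarity of TF-IDF vectors."""
    X = torch.tensor(TfidfVectorizer().fit_transform(sentences).toarray(),
                     dtype=torch.float)
    Xn = F.normalize(X, dim=1)
    A = Xn @ Xn.T                      # cosine similarity, nonnegative
    A = (A > threshold).float() * A    # drop weak edges
    A.fill_diagonal_(1.0)              # self-loops
    d_inv_sqrt = torch.diag(A.sum(1).pow(-0.5))
    return X, d_inv_sqrt @ A @ d_inv_sqrt   # symmetric normalization

class GAE(torch.nn.Module):
    """Two-layer GCN encoder with an inner-product edge decoder."""
    def __init__(self, in_dim, hid=64, out=32):
        super().__init__()
        self.w1 = torch.nn.Linear(in_dim, hid, bias=False)
        self.w2 = torch.nn.Linear(hid, out, bias=False)

    def forward(self, X, A_hat):
        h = F.relu(A_hat @ self.w1(X))
        Z = A_hat @ self.w2(h)                 # node embeddings
        return torch.sigmoid(Z @ Z.T), Z       # reconstructed adjacency

def summarize(sentences, k=3, epochs=200):
    X, A_hat = sentence_graph(sentences)
    model = GAE(X.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    target = (A_hat > 0).float()               # edges to reconstruct
    for _ in range(epochs):
        A_rec, _ = model(X, A_hat)
        loss = F.binary_cross_entropy(A_rec, target)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        _, Z = model(X, A_hat)
    # Score sentences by embedding-space centrality plus a mild
    # positional prior, echoing the paper's observation (iii).
    sim = F.normalize(Z, dim=1) @ F.normalize(Z, dim=1).T
    pos = torch.tensor([1.0 / (1 + i) for i in range(len(sentences))])
    scores = sim.mean(1) + 0.1 * pos
    idx = scores.topk(min(k, len(sentences))).indices.sort().values
    return [sentences[i] for i in idx.tolist()]
```

Here `summarize` returns the selected sentences in document order; the actual GAE-ISumm model also uses cluster information, which is omitted from this sketch for brevity.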
Related papers
- L3Cube-MahaSum: A Comprehensive Dataset and BART Models for Abstractive Text Summarization in Marathi [0.4194295877935868] (2024-10-11)
We present the MahaSUM dataset, a large-scale collection of diverse news articles in Marathi.
The dataset was created by scraping articles from a wide range of online news sources and manually verifying the abstract summaries.
We train an IndicBART model, a variant of the BART model tailored for Indic languages, using the MahaSUM dataset.
- Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378] (2024-07-05)
We propose a systematically created, human-annotated dataset consisting of coherent summaries for five publicly available datasets, together with natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant improvements (10% ROUGE-L) in producing coherent summaries.
- GIELLM: Japanese General Information Extraction Large Language Model Utilizing Mutual Reinforcement Effect [0.0] (2023-11-12)
We introduce the General Information Extraction Large Language Model (GIELLM).
It integrates text classification, sentiment analysis, named entity recognition, relation extraction, and event extraction under a uniform input-output schema.
This marks the first instance of a single model handling such a diverse array of IE subtasks.
- Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI Chat-GPT Models [0.0] (2023-10-16)
Leveraging Large Language Models (LLMs) has shown remarkable promise for enhancing summarization techniques.
This paper explores text summarization with a diverse set of LLMs, including the MPT-7b-instruct, Falcon-7b-instruct, and OpenAI ChatGPT text-davinci-003 models.
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944] (2023-01-22)
A problem that frequently occurs when working with non-English languages is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356] (2022-12-20)
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still generate factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
- Scientific Paper Extractive Summarization Enhanced by Citation Graphs [50.19266650000948] (2022-12-08)
We focus on leveraging citation graphs to improve scientific paper extractive summarization under different settings.
Preliminary results demonstrate that the citation graph is helpful even in a simple unsupervised framework (a toy variant is sketched after this list).
Motivated by this, we propose a Graph-based Supervised Summarization model (GSS) to achieve more accurate results when large-scale labeled data are available.
- GRETEL: Graph Contrastive Topic Enhanced Language Model for Long Document Extractive Summarization [22.053942327552583] (2022-08-21)
We propose a graph contrastive topic enhanced language model (GRETEL) for capturing global semantic information.
GRETEL integrates a hierarchical transformer encoder with graph contrastive learning to fuse semantic information from the global document context and the gold summary.
Experimental results on both general-domain and biomedical datasets demonstrate that our proposed method outperforms SOTA methods.
- Liputan6: A Large-scale Indonesian Dataset for Text Summarization [43.375797352517765] (2020-11-02)
We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document-summary pairs.
We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset.
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481] (2020-10-20)
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model with character language models trained on varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
- Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642] (2020-05-20)
We develop a neural abstractive multi-document summarization (MDS) model that can leverage well-known graph representations of documents.
Our model uses graphs to encode documents and capture cross-document relations, which is crucial for summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
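
The claim in the citation-graphs entry above, that a citation graph helps even a simple unsupervised framework, can be made concrete with a TextRank-style toy: rank a paper's sentences by PageRank over a similarity graph whose edge weights are boosted when both endpoints also overlap with text from citing papers. This is an illustrative reconstruction, not the authors' method; the Jaccard measure and the boost factor are assumptions.

```python
# Hypothetical toy for citation-graph-aware unsupervised extractive
# summarization: PageRank over a sentence-similarity graph whose edges
# are boosted when citing papers discuss both sentences. The weighting
# scheme is an assumption, not the cited paper's design.
import itertools
import networkx as nx

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / (len(wa | wb) or 1)

def rank_with_citations(paper_sents, citing_sents, boost=0.5, k=3):
    G = nx.Graph()
    G.add_nodes_from(range(len(paper_sents)))
    for i, j in itertools.combinations(range(len(paper_sents)), 2):
        w = jaccard(paper_sents[i], paper_sents[j])
        # Reward sentence pairs that citing papers also talk about.
        cite = max((min(jaccard(paper_sents[i], c),
                        jaccard(paper_sents[j], c)) for c in citing_sents),
                   default=0.0)
        w += boost * cite
        if w > 0:
            G.add_edge(i, j, weight=w)
    scores = nx.pagerank(G, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [paper_sents[i] for i in sorted(top)]  # document order
```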