A Novel Word Pair-based Gaussian Sentence Similarity Algorithm For Bengali Extractive Text Summarization
- URL: http://arxiv.org/abs/2411.17181v1
- Date: Tue, 26 Nov 2024 07:42:16 GMT
- Title: A Novel Word Pair-based Gaussian Sentence Similarity Algorithm For Bengali Extractive Text Summarization
- Authors: Fahim Morshed, Md. Abdur Rahman, Sumon Ahmed
- Abstract summary: We propose a novel Word pair-based Gaussian Sentence Similarity (WGSS) algorithm for calculating the semantic relation between two sentences.
It compares two sentences on a word-to-word basis which rectifies the sentence representation problem faced by the word averaging method.
The proposed method is validated using four different datasets, and it outperformed other recent models by 43.2% on average ROUGE scores.
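The word pair-based similarity described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian bandwidth `sigma`, the best-match pairing rule (each word in the first sentence paired with its most similar word in the second), and the small epsilon guarding the logarithm are all choices made here for the sketch.

```python
import numpy as np

def gaussian_similarity(u, v, sigma=1.0):
    """Gaussian (RBF) similarity between two word embedding vectors."""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def wgss(sent_a, sent_b, sigma=1.0):
    """Word pair-based Gaussian Sentence Similarity (sketch).

    sent_a, sent_b: lists of word embedding vectors (np.ndarray).
    Each word in sent_a is paired with its most similar word in
    sent_b; the sentence similarity is the geometric mean of the
    individual pair similarities.
    """
    pair_sims = []
    for u in sent_a:
        best = max(gaussian_similarity(u, v, sigma) for v in sent_b)
        pair_sims.append(best)
    # geometric mean via mean of logs (epsilon avoids log(0))
    return float(np.exp(np.mean(np.log(np.asarray(pair_sims) + 1e-12))))
```

Because every word is compared individually, one poorly matched word drags the geometric mean down, which is what distinguishes this from averaging all word vectors into a single sentence vector first.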
- Abstract: Extractive Text Summarization is the process of selecting the most representative parts of a larger text without losing any key information. Recent attempts at extractive text summarization in Bengali either relied on statistical techniques like TF-IDF or used naive sentence similarity measures such as word averaging. All of these strategies fail to express semantic relationships correctly. Here, we propose a novel Word pair-based Gaussian Sentence Similarity (WGSS) algorithm for calculating the semantic relation between two sentences. WGSS takes the geometric mean of the individual Gaussian similarity values of word embedding vectors to get the semantic relationship between sentences. It compares two sentences on a word-to-word basis, which rectifies the sentence representation problem faced by the word averaging method. The summarization process extracts key sentences by grouping semantically similar sentences into clusters using the Spectral Clustering algorithm. After clustering, we use TF-IDF ranking to pick the best sentence from each cluster. The proposed method is validated on four different datasets, where it outperformed other recent models by 43.2% on average ROUGE scores (ranging from 2.5% to 95.4%). It is also evaluated on other low-resource languages, i.e. Turkish, Marathi, and Hindi, where we find that the proposed method performs similarly to Bengali. In addition, a new high-quality Bengali dataset is curated which contains 250 articles and a pair of summaries for each of them. We believe this research is a crucial addition to Bengali Natural Language Processing (NLP) research and can easily be extended to other low-resource languages. We made the implementation of the proposed model and the data public at https://github.com/FMOpee/WGSS.
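The summarization pipeline from the abstract (spectral clustering over a sentence-similarity matrix, then a TF-IDF pick per cluster) can be sketched as below. This is a hedged reconstruction using scikit-learn; the cluster count and the scoring rule (summing each sentence's TF-IDF weights) are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize(sentences, sim_matrix, n_clusters):
    """Extractive summary sketch.

    sentences: list of sentence strings.
    sim_matrix: precomputed pairwise sentence-similarity matrix
                (e.g. WGSS scores), used as the clustering affinity.
    Groups semantically similar sentences into clusters, then picks
    the highest TF-IDF-scoring sentence from each cluster.
    """
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=0
    ).fit_predict(sim_matrix)

    # Score each sentence by its total TF-IDF weight.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()

    picked = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        picked.append(members[np.argmax(scores[members])])
    # Preserve original document order in the summary.
    return [sentences[i] for i in sorted(picked)]
```

Clustering first guarantees topic coverage (one sentence per semantic group), while the TF-IDF step decides which sentence best represents its group.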
Related papers
- Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation
We propose a systematic way to estimate the intrinsic capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step.
Our work provides a comprehensive comparison between existing truncation sampling methods, as well as their recommended parameters as a guideline for users.
arXiv Detail & Related papers (2024-08-24T14:14:32Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Relational Sentence Embedding for Flexible Semantic Matching
We present Relational Sentence Embedding (RSE), a new paradigm to further explore the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z)
- Query Expansion Using Contextual Clue Sampling with Language Models
We propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context.
Our lexical-matching-based approach achieves similar top-5/top-20 retrieval accuracy and higher top-100 accuracy compared with the well-established dense retrieval model DPR.
For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
arXiv Detail & Related papers (2022-10-13T15:18:04Z)
- Text Summarization with Oracle Expectation
Extractive summarization produces summaries by identifying and concatenating the most important sentences in a document.
Most summarization datasets do not come with gold labels indicating whether document sentences are summary-worthy.
We propose a simple yet effective labeling algorithm that creates soft, expectation-based sentence labels.
arXiv Detail & Related papers (2022-09-26T14:10:08Z)
- Pruned Graph Neural Network for Short Story Ordering
Organizing sentences into an order that maximizes coherence is known as sentence ordering.
We propose a new method for constructing sentence-entity graphs of short stories to create the edges between sentences.
We also observe that replacing pronouns with their referring entities effectively encodes sentences in sentence-entity graphs.
arXiv Detail & Related papers (2022-03-13T22:25:17Z)
- Using BERT Encoding and Sentence-Level Language Model for Sentence Ordering
We propose an algorithm for sentence ordering in a corpus of short stories.
Our proposed method uses a language model based on Universal Transformers (UT) that captures sentences' dependencies by employing an attention mechanism.
The proposed model includes three components: Sentence, Language Model, and Sentence Arrangement with Brute Force Search.
arXiv Detail & Related papers (2021-08-24T23:03:36Z)
- A novel hybrid methodology of measuring sentence similarity
It is necessary to measure the similarity between sentences accurately.
Deep learning methodology shows a state-of-the-art performance in many natural language processing fields.
Considering the structure of the sentence or the word structure that makes up the sentence is also important.
arXiv Detail & Related papers (2021-05-03T06:50:54Z)
- Combining Word Embeddings and N-grams for Unsupervised Document Summarization
Graph-based extractive document summarization relies on the quality of the sentence similarity graph.
We employ off-the-shelf deep embedding features and tf-idf features, and introduce a new text similarity metric.
Our approach can outperform the tf-idf based approach and achieve state-of-the-art performance on the DUC04 dataset.
arXiv Detail & Related papers (2020-04-25T00:22:46Z)
- Extractive Summarization as Text Matching
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)