Punctuation restoration in Swedish through fine-tuned KB-BERT
- URL: http://arxiv.org/abs/2202.06769v1
- Date: Mon, 14 Feb 2022 14:39:40 GMT
- Title: Punctuation restoration in Swedish through fine-tuned KB-BERT
- Authors: John Björkman Nilsson
- Abstract summary: The method is based on KB-BERT, a publicly available, neural network language model pre-trained on a Swedish corpus.
With a lower-case and unpunctuated Swedish text as input, the model is supposed to return a grammatically correct punctuated copy of the text as output.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Presented here is a method for automatic punctuation restoration in Swedish
using a BERT model. The method is based on KB-BERT, a publicly available,
neural network language model pre-trained on a Swedish corpus by the National
Library of Sweden. The model was then fine-tuned for this specific task
using a corpus of government texts. With a lower-case and unpunctuated Swedish
text as input, the model is supposed to return a grammatically correct
punctuated copy of the text as output. A successful solution to this problem
brings benefits for an array of NLP domains, such as speech-to-text and
automated text. Only the punctuation marks period, comma, and question mark
were considered for the project, due to a lack of data for rarer marks such as
the semicolon. Additionally, some marks are somewhat interchangeable with more
common ones, such as exclamation points and periods. Thus, the data set had all
exclamation points replaced with periods. The fine-tuned Swedish BERT model,
dubbed prestoBERT, achieved an overall F1-score of 78.9. The proposed model
scored similarly to international counterparts, with Hungarian and Chinese
models obtaining F1-scores of 82.2 and 75.6 respectively. As further
comparison, a human evaluation case study was carried out. The human test group
achieved an overall F1-score of 81.7, but scored substantially worse than
prestoBERT on both period and comma. Inspecting output sentences from the model
and humans shows satisfactory results, despite the difference in F1-score. The
disconnect seems to stem from an unnecessary focus on replicating the exact
punctuation used in the test set, rather than accepting any of several
correct interpretations. If the loss function could be rewritten to reward
all grammatically correct outputs, rather than only the one original example,
the performance could improve significantly for both prestoBERT and the human
group.
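The paper frames punctuation restoration as per-word tagging: the model labels each word with the punctuation mark that should follow it, and a decoding step rebuilds punctuated, cased text. A minimal sketch of that decoding step is shown below; the label names and the `restore()` helper are illustrative, not taken from the paper's code.

```python
# Hypothetical sketch of the punctuation-restoration decoding step:
# the model tags each word with the punctuation mark that follows it,
# and the tagged sequence is turned back into punctuated, cased text.
# Label names and restore() are illustrative, not from the paper.

LABELS = {"O": "", "PERIOD": ".", "COMMA": ",", "QUESTION": "?"}

def restore(words, labels):
    """Rebuild punctuated text from per-word punctuation tags."""
    out = []
    capitalize_next = True  # start of the first sentence
    for word, label in zip(words, labels):
        token = word.capitalize() if capitalize_next else word
        out.append(token + LABELS[label])
        # Sentence-final marks trigger capitalization of the next word.
        capitalize_next = label in ("PERIOD", "QUESTION")
    return " ".join(out)

words = ["vad", "heter", "du", "jag", "heter", "anna"]
labels = ["O", "O", "QUESTION", "O", "O", "PERIOD"]
print(restore(words, labels))  # → "Vad heter du? Jag heter anna."
```

In the actual system, the per-word labels would come from the fine-tuned KB-BERT token classifier rather than being given by hand.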
Related papers
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - FullStop: Punctuation and Segmentation Prediction for Dutch with
Transformers [1.2246649738388389]
The model we present is an extension of the models of Guhr et al. (2021) for Dutch and is made publicly available.
For every word in the input sequence, the model predicts the punctuation marker that follows the word.
Results are much better than a machine-translation baseline approach.
arXiv Detail & Related papers (2023-01-09T13:12:05Z) - Punctuation Restoration for Singaporean Spoken Languages: English,
Malay, and Mandarin [0.0]
This paper presents the work of restoring punctuation for ASR transcripts generated by multilingual ASR systems.
The focus languages are English, Mandarin, and Malay which are three of the most popular languages in Singapore.
To the best of our knowledge, this is the first system that can tackle punctuation restoration for these three languages simultaneously.
arXiv Detail & Related papers (2022-12-10T19:54:53Z) - TRScore: A Novel GPT-based Readability Scorer for ASR Segmentation and
Punctuation model evaluation and selection [1.4720080476520687]
Punctuation and segmentation are key to the readability of Automatic Speech Recognition transcripts.
Human evaluation is expensive, time-consuming, and suffers from large inter-observer variability.
We present TRScore, a novel readability measure using the GPT model to evaluate different segmentation and punctuation systems.
arXiv Detail & Related papers (2022-10-27T01:11:32Z) - RuArg-2022: Argument Mining Evaluation [69.87149207721035]
This paper is a report of the organizers on the first competition of argumentation analysis systems dealing with Russian language texts.
A corpus containing 9,550 sentences (comments on social media posts) on three topics related to the COVID-19 pandemic was prepared.
The system that won the first place in both tasks used the NLI (Natural Language Inference) variant of the BERT architecture.
arXiv Detail & Related papers (2022-06-18T17:13:37Z) - IIT_kgp at FinCausal 2020, Shared Task 1: Causality Detection using
Sentence Embeddings in Financial Reports [0.0]
This work is associated with the first sub-task of identifying causality in sentences.
BERT (Large) performed the best, giving an F1 score of 0.958 in the task of detecting causality in sentences from financial texts and reports.
arXiv Detail & Related papers (2020-11-16T00:57:14Z) - Improving the Efficiency of Grammatical Error Correction with Erroneous
Span Detection and Correction [106.63733511672721]
We propose a novel language-independent approach to improve the efficiency of Grammatical Error Correction (GEC) by dividing the task into two subtasks: Erroneous Span Detection (ESD) and Erroneous Span Correction (ESC).
ESD identifies grammatically incorrect text spans with an efficient sequence tagging model. ESC leverages a seq2seq model that takes the sentence with annotated erroneous spans as input and outputs corrected text only for these spans.
Experiments show our approach performs comparably to conventional seq2seq approaches in both English and Chinese GEC benchmarks with less than 50% time cost for inference.
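The two-stage pipeline above can be sketched as follows, with simple stand-in rules in place of the paper's trained detector and corrector; the `ERRORS` table and both helper functions are toy illustrations, not the authors' models.

```python
# Toy illustration of the two-stage ESD -> ESC pipeline (stand-in rules,
# not the paper's trained models): ESD flags erroneous spans, ESC rewrites
# only the flagged spans, and the rest of the sentence is copied through.

ERRORS = {"goed": "good", "recieve": "receive"}  # toy correction table

def detect_spans(words):
    """ESD stand-in: return indices of words flagged as erroneous."""
    return [i for i, w in enumerate(words) if w in ERRORS]

def correct_spans(words, spans):
    """ESC stand-in: rewrite only the flagged spans, copy the rest."""
    flagged = set(spans)
    return [ERRORS[w] if i in flagged else w for i, w in enumerate(words)]

sentence = "did you recieve my goed news".split()
spans = detect_spans(sentence)
print(" ".join(correct_spans(sentence, spans)))
# → "did you receive my good news"
```

The efficiency gain the paper reports comes from the same division of labor: the expensive seq2seq model only generates text for the short flagged spans instead of re-emitting the whole sentence.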
arXiv Detail & Related papers (2020-10-07T08:29:11Z) - Unsupervised Parsing via Constituency Tests [49.42244463346612]
We propose a method for unsupervised parsing based on the linguistic notion of a constituency test.
To produce a tree given a sentence, we score each span by aggregating its constituency test judgments, and we choose the binary tree with the highest total score.
The refined model achieves 62.8 F1 on the Penn Treebank test set, an absolute improvement of 7.6 points over the previous best published result.
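The tree-selection step described above can be sketched as a CKY-style dynamic program: each span receives a score (in the paper, aggregated constituency-test judgments), and the binary tree maximizing the total span score is chosen. The span scorer below is a toy stand-in, not a real constituency test.

```python
import functools

# Sketch of the tree-selection step: each span (i, j) gets a score
# (here a toy scorer, not real constituency-test judgments), and a
# CKY-style dynamic program picks the binary tree whose spans
# maximize the total score.

def best_tree(n, score):
    """Return (total score, list of spans) of the best binary tree over n words."""
    @functools.lru_cache(maxsize=None)
    def solve(i, j):
        if j - i == 1:
            return score(i, j), [(i, j)]
        best = None
        for k in range(i + 1, j):  # try every split point
            ls, lt = solve(i, k)
            rs, rt = solve(k, j)
            cand = (score(i, j) + ls + rs, [(i, j)] + lt + rt)
            if best is None or cand[0] > best[0]:
                best = cand
        return best
    return solve(0, n)

# Toy span scorer: favor the span (0, 2) as a constituent.
toy = lambda i, j: 1.0 if (i, j) == (0, 2) else 0.0
total, spans = best_tree(3, toy)
print(total, sorted(spans))
```

With this scorer the program prefers the left-branching tree, since only that bracketing contains the rewarded span (0, 2).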
arXiv Detail & Related papers (2020-10-07T04:05:01Z) - Efficient Constituency Parsing by Pointing [21.395573911155495]
We propose a novel constituency parsing model that casts the parsing problem into a series of pointing tasks.
Our model supports efficient top-down decoding and our learning objective is able to enforce structural consistency without resorting to the expensive CKY inference.
arXiv Detail & Related papers (2020-06-24T08:29:09Z) - Semi-Supervised Models via Data Augmentation for Classifying Interactive
Affective Responses [85.04362095899656]
We present semi-supervised models with data augmentation (SMDA), a semi-supervised text classification system to classify interactive affective responses.
For labeled sentences, we performed data augmentation to make the label distributions uniform and computed a supervised loss during training.
For unlabeled sentences, we explored self-training by regarding low-entropy predictions over unlabeled sentences as pseudo labels.
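The pseudo-labeling step described above can be sketched with a simple entropy filter: unlabeled examples whose predicted distribution has low entropy are adopted as pseudo-labels for the next training round. The threshold and helper names below are illustrative assumptions, not values from the paper.

```python
import math

# Minimal sketch of low-entropy pseudo-labeling: keep only confident
# (low-entropy) predictions over unlabeled data as pseudo-labels.
# The 0.5-nat threshold is illustrative, not from the paper.

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_pseudo_labels(predictions, threshold=0.5):
    """Return (index, argmax label) pairs for low-entropy predictions."""
    selected = []
    for i, probs in enumerate(predictions):
        if entropy(probs) < threshold:
            selected.append((i, max(range(len(probs)), key=probs.__getitem__)))
    return selected

preds = [
    [0.95, 0.03, 0.02],  # confident -> kept as pseudo-label 0
    [0.40, 0.35, 0.25],  # uncertain -> discarded
]
print(select_pseudo_labels(preds))  # → [(0, 0)]
```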
arXiv Detail & Related papers (2020-04-23T05:02:31Z) - Adversarial Transfer Learning for Punctuation Restoration [58.2201356693101]
Adversarial multi-task learning is introduced to learn task invariant knowledge for punctuation prediction.
Experiments are conducted on IWSLT2011 datasets.
arXiv Detail & Related papers (2020-04-01T06:19:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.