Punctuation restoration in Swedish through fine-tuned KB-BERT
- URL: http://arxiv.org/abs/2202.06769v1
- Date: Mon, 14 Feb 2022 14:39:40 GMT
- Title: Punctuation restoration in Swedish through fine-tuned KB-BERT
- Authors: John Björkman Nilsson
- Abstract summary: The method is based on KB-BERT, a publicly available, neural network language model pre-trained on a Swedish corpus.
With a lower-case and unpunctuated Swedish text as input, the model is supposed to return a grammatically correct punctuated copy of the text as output.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Presented here is a method for automatic punctuation restoration in Swedish
using a BERT model. The method is based on KB-BERT, a publicly available,
neural network language model pre-trained on a Swedish corpus by the National
Library of Sweden. The model was then fine-tuned for this specific task
using a corpus of government texts. With a lower-case and unpunctuated Swedish
text as input, the model is supposed to return a grammatically correct
punctuated copy of the text as output. A successful solution to this problem
brings benefits for an array of NLP domains, such as speech-to-text and
automated text. Only the punctuation marks period, comma, and question mark
were considered for the project, due to a lack of data for rarer marks such as
the semicolon. Additionally, some marks are somewhat interchangeable with more
common ones, such as exclamation points and periods. Thus, the data set had all
exclamation points replaced with periods. The fine-tuned Swedish BERT model,
dubbed prestoBERT, achieved an overall F1-score of 78.9. The proposed model
scored similarly to international counterparts, with Hungarian and Chinese
models obtaining F1-scores of 82.2 and 75.6 respectively. As further
comparison, a human evaluation case study was carried out. The human test group
achieved an overall F1-score of 81.7, but scored substantially worse than
prestoBERT on both period and comma. Inspecting output sentences from the model
and humans shows satisfactory results, despite the difference in F1-score. The
disconnect seems to stem from an unnecessary focus on replicating the exact
punctuation used in the test set, rather than accepting any of several
correct interpretations. If the loss function could be rewritten to reward
all grammatically correct outputs, rather than only the one original example,
the performance could improve significantly for both prestoBERT and the human
group.
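The paper frames punctuation restoration as per-word tagging: the model labels each word with the punctuation mark that should follow it, and a decoding step rebuilds punctuated, cased text. A minimal sketch of that decoding step is shown below; the label names and the `restore()` helper are illustrative, not taken from the paper's code.

```python
# Hypothetical sketch of the punctuation-restoration decoding step:
# the model tags each word with the punctuation mark that follows it,
# and the tagged sequence is turned back into punctuated, cased text.
# Label names and restore() are illustrative, not from the paper.

LABELS = {"O": "", "PERIOD": ".", "COMMA": ",", "QUESTION": "?"}

def restore(words, labels):
    """Rebuild punctuated text from per-word punctuation tags."""
    out = []
    capitalize_next = True  # start of the first sentence
    for word, label in zip(words, labels):
        token = word.capitalize() if capitalize_next else word
        out.append(token + LABELS[label])
        # Sentence-final marks trigger capitalization of the next word.
        capitalize_next = label in ("PERIOD", "QUESTION")
    return " ".join(out)

words = ["vad", "heter", "du", "jag", "heter", "anna"]
labels = ["O", "O", "QUESTION", "O", "O", "PERIOD"]
print(restore(words, labels))  # → "Vad heter du? Jag heter anna."
```

In the actual system, the per-word labels would come from the fine-tuned KB-BERT token classifier rather than being given by hand.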
Related papers
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - FullStop: Punctuation and Segmentation Prediction for Dutch with
Transformers [1.2246649738388389]
The model we present is an extension of the models of Guhr et al. (2021) for Dutch and is made publicly available.
For every word in the input sequence, the model predicts the punctuation marker that follows the word.
Results are much better than a machine-translation baseline approach.
arXiv Detail & Related papers (2023-01-09T13:12:05Z) - Punctuation Restoration for Singaporean Spoken Languages: English,
Malay, and Mandarin [0.0]
This paper presents the work of restoring punctuation for ASR transcripts generated by multilingual ASR systems.
The focus languages are English, Mandarin, and Malay which are three of the most popular languages in Singapore.
To the best of our knowledge, this is the first system that can tackle punctuation restoration for these three languages simultaneously.
arXiv Detail & Related papers (2022-12-10T19:54:53Z) - TRScore: A Novel GPT-based Readability Scorer for ASR Segmentation and
Punctuation model evaluation and selection [1.4720080476520687]
Punctuation and segmentation are key to the readability of Automatic Speech Recognition transcripts.
Human evaluation is expensive, time-consuming, and suffers from large inter-observer variability.
We present TRScore, a novel readability measure using the GPT model to evaluate different segmentation and punctuation systems.
arXiv Detail & Related papers (2022-10-27T01:11:32Z) - RuArg-2022: Argument Mining Evaluation [69.87149207721035]
This paper is a report of the organizers on the first competition of argumentation analysis systems dealing with Russian language texts.
A corpus containing 9,550 sentences (comments on social media posts) on three topics related to the COVID-19 pandemic was prepared.
The system that won the first place in both tasks used the NLI (Natural Language Inference) variant of the BERT architecture.
arXiv Detail & Related papers (2022-06-18T17:13:37Z) - IIT_kgp at FinCausal 2020, Shared Task 1: Causality Detection using
Sentence Embeddings in Financial Reports [0.0]
This work is associated with the first sub-task of identifying causality in sentences.
BERT (Large) performed the best, giving an F1 score of 0.958 in the task of detecting causality in sentences from financial texts and reports.
arXiv Detail & Related papers (2020-11-16T00:57:14Z) - Improving the Efficiency of Grammatical Error Correction with Erroneous
Span Detection and Correction [106.63733511672721]
We propose a novel language-independent approach to improve the efficiency of Grammatical Error Correction (GEC) by dividing the task into two subtasks: Erroneous Span Detection (ESD) and Erroneous Span Correction (ESC).
ESD identifies grammatically incorrect text spans with an efficient sequence tagging model. ESC leverages a seq2seq model that takes the sentence with annotated erroneous spans as input and outputs corrected text only for these spans.
Experiments show our approach performs comparably to conventional seq2seq approaches in both English and Chinese GEC benchmarks with less than 50% time cost for inference.
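The two-stage pipeline above can be sketched as follows, with simple stand-in rules in place of the paper's trained detector and corrector; the `ERRORS` table and both helper functions are toy illustrations, not the authors' models.

```python
# Toy illustration of the two-stage ESD -> ESC pipeline (stand-in rules,
# not the paper's trained models): ESD flags erroneous spans, ESC rewrites
# only the flagged spans, and the rest of the sentence is copied through.

ERRORS = {"goed": "good", "recieve": "receive"}  # toy correction table

def detect_spans(words):
    """ESD stand-in: return indices of words flagged as erroneous."""
    return [i for i, w in enumerate(words) if w in ERRORS]

def correct_spans(words, spans):
    """ESC stand-in: rewrite only the flagged spans, copy the rest."""
    flagged = set(spans)
    return [ERRORS[w] if i in flagged else w for i, w in enumerate(words)]

sentence = "did you recieve my goed news".split()
spans = detect_spans(sentence)
print(" ".join(correct_spans(sentence, spans)))
# → "did you receive my good news"
```

The efficiency gain the paper reports comes from the same division of labor: the expensive seq2seq model only generates text for the short flagged spans instead of re-emitting the whole sentence.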
arXiv Detail & Related papers (2020-10-07T08:29:11Z) - Unsupervised Parsing via Constituency Tests [49.42244463346612]
We propose a method for unsupervised parsing based on the linguistic notion of a constituency test.
To produce a tree given a sentence, we score each span by aggregating its constituency test judgments, and we choose the binary tree with the highest total score.
The refined model achieves 62.8 F1 on the Penn Treebank test set, an absolute improvement of 7.6 points over the previous best published result.
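The tree-selection step described above can be sketched as a CKY-style dynamic program: each span receives a score (in the paper, aggregated constituency-test judgments), and the binary tree maximizing the total span score is chosen. The span scorer below is a toy stand-in, not a real constituency test.

```python
import functools

# Sketch of the tree-selection step: each span (i, j) gets a score
# (here a toy scorer, not real constituency-test judgments), and a
# CKY-style dynamic program picks the binary tree whose spans
# maximize the total score.

def best_tree(n, score):
    """Return (total score, list of spans) of the best binary tree over n words."""
    @functools.lru_cache(maxsize=None)
    def solve(i, j):
        if j - i == 1:
            return score(i, j), [(i, j)]
        best = None
        for k in range(i + 1, j):  # try every split point
            ls, lt = solve(i, k)
            rs, rt = solve(k, j)
            cand = (score(i, j) + ls + rs, [(i, j)] + lt + rt)
            if best is None or cand[0] > best[0]:
                best = cand
        return best
    return solve(0, n)

# Toy span scorer: favor the span (0, 2) as a constituent.
toy = lambda i, j: 1.0 if (i, j) == (0, 2) else 0.0
total, spans = best_tree(3, toy)
print(total, sorted(spans))
```

With this scorer the program prefers the left-branching tree, since only that bracketing contains the rewarded span (0, 2).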
arXiv Detail & Related papers (2020-10-07T04:05:01Z) - Efficient Constituency Parsing by Pointing [21.395573911155495]
We propose a novel constituency parsing model that casts the parsing problem into a series of pointing tasks.
Our model supports efficient top-down decoding and our learning objective is able to enforce structural consistency without resorting to the expensive CKY inference.
arXiv Detail & Related papers (2020-06-24T08:29:09Z) - Semi-Supervised Models via Data Augmentation for Classifying Interactive
Affective Responses [85.04362095899656]
We present semi-supervised models with data augmentation (SMDA), a semi-supervised text classification system to classify interactive affective responses.
For labeled sentences, we performed data augmentation to make the label distributions uniform and computed a supervised loss during training.
For unlabeled sentences, we explored self-training by regarding low-entropy predictions over unlabeled sentences as pseudo labels.
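The pseudo-labeling step described above can be sketched with a simple entropy filter: unlabeled examples whose predicted distribution has low entropy are adopted as pseudo-labels for the next training round. The threshold and helper names below are illustrative assumptions, not values from the paper.

```python
import math

# Minimal sketch of low-entropy pseudo-labeling: keep only confident
# (low-entropy) predictions over unlabeled data as pseudo-labels.
# The 0.5-nat threshold is illustrative, not from the paper.

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_pseudo_labels(predictions, threshold=0.5):
    """Return (index, argmax label) pairs for low-entropy predictions."""
    selected = []
    for i, probs in enumerate(predictions):
        if entropy(probs) < threshold:
            selected.append((i, max(range(len(probs)), key=probs.__getitem__)))
    return selected

preds = [
    [0.95, 0.03, 0.02],  # confident -> kept as pseudo-label 0
    [0.40, 0.35, 0.25],  # uncertain -> discarded
]
print(select_pseudo_labels(preds))  # → [(0, 0)]
```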
arXiv Detail & Related papers (2020-04-23T05:02:31Z) - Adversarial Transfer Learning for Punctuation Restoration [58.2201356693101]
Adversarial multi-task learning is introduced to learn task invariant knowledge for punctuation prediction.
Experiments are conducted on IWSLT2011 datasets.
arXiv Detail & Related papers (2020-04-01T06:19:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.