Take the Hint: Improving Arabic Diacritization with
Partially-Diacritized Text
- URL: http://arxiv.org/abs/2306.03557v2
- Date: Mon, 31 Jul 2023 12:29:10 GMT
- Authors: Parnia Bahar, Mattia Di Gangi, Nick Rossenbach, Mohammad Zeineldeen
- Abstract summary: We propose 2SDiac, a multi-source model that can effectively support optional diacritics in input to inform all predictions.
We also introduce Guided Learning, a training scheme to leverage given diacritics in input with different levels of random masking.
- Score: 4.863310073296471
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic Arabic diacritization is useful in many applications, ranging from
reading support for language learners to accurate pronunciation prediction for
downstream tasks like speech synthesis. While most previous work has
focused on models that operate on raw non-diacritized text, production systems
can gain accuracy by first letting humans partly annotate ambiguous words. In
this paper, we propose 2SDiac, a multi-source model that can effectively
support optional diacritics in input to inform all predictions. We also
introduce Guided Learning, a training scheme to leverage given diacritics in
input with different levels of random masking. We show that hints provided
at test time affect more output positions than those annotated. Moreover,
experiments on two common benchmarks show that our approach i) greatly
outperforms the baseline even when evaluated on non-diacritized text; and ii)
achieves state-of-the-art results while reducing the parameter count by over
60%.
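The random masking of input diacritics mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the masking procedure, so the per-letter masking, the `keep_levels` values, and the names `strip_diacritics`, `mask_diacritics`, and `make_training_example` are all assumptions made for illustration.

```python
import random

# Arabic diacritic marks: fathatan..sukun (U+064B-U+0652) plus superscript alef (U+0670).
ARABIC_DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653)) | {"\u0670"}


def strip_diacritics(text: str) -> str:
    """Return the text with all Arabic diacritic marks removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def mask_diacritics(text: str, keep_prob: float, rng: random.Random) -> str:
    """Keep the diacritics of each letter with probability keep_prob.

    Assumption: the decision is made once per base character, so all marks
    attached to the same letter (e.g. shadda + vowel) are kept or dropped together.
    """
    kept = []
    keep = rng.random() < keep_prob  # applies to marks before the first letter
    for ch in text:
        if ch in ARABIC_DIACRITICS:
            if keep:
                kept.append(ch)
        else:
            kept.append(ch)
            keep = rng.random() < keep_prob  # new decision for this letter's marks
    return "".join(kept)


def make_training_example(text, rng, keep_levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Build one (letters, hints, target) triple with a randomly chosen masking level.

    keep_levels is a hypothetical set of masking levels, not taken from the paper.
    """
    keep_prob = rng.choice(keep_levels)
    return {
        "letters": strip_diacritics(text),              # raw character input
        "hints": mask_diacritics(text, keep_prob, rng),  # partially diacritized input
        "target": text,                                  # fully diacritized reference
        "keep_prob": keep_prob,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # The fully diacritized word "kataba", written with escapes for portability.
    print(make_training_example("\u0643\u064E\u062A\u064E\u0628\u064E", rng))
```

With `keep_prob` set to 0.0 the hint input degenerates to raw non-diacritized text, and with 1.0 it equals the fully diacritized target, so a single training loop covers both the unguided and the fully guided case.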
Related papers
- Don't Touch My Diacritics [6.307256398189243]
We focus on the handling of diacritics in texts originating in many languages and scripts.
We demonstrate, through several case studies, the adverse effects of inconsistent encoding of diacritized characters and of removing diacritics altogether.
arXiv Detail & Related papers (2024-10-31T17:03:44Z)
- Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation.
Our approach can be applied to existing datasets by automatically generating hard negative test captions.
Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"Influence scores" are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z)
- Improving Scene Text Recognition for Character-Level Long-Tailed Distribution [35.14058653707104]
We propose a novel Context-Aware and Free Experts Network (CAFE-Net) using two experts.
CAFE-Net improves STR performance on languages with a large number of characters.
arXiv Detail & Related papers (2023-03-31T06:11:33Z)
- READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
We experiment with a series of strong pretrained language models as well as robust training methods and find that these models often suffer significant performance drops on READIN.
arXiv Detail & Related papers (2023-02-14T20:14:39Z)
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information.
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
- Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community to consider more carefully how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)