Data and Approaches for German Text simplification -- towards an
Accessibility-enhanced Communication
- URL: http://arxiv.org/abs/2312.09966v1
- Date: Fri, 15 Dec 2023 17:23:33 GMT
- Title: Data and Approaches for German Text simplification -- towards an
Accessibility-enhanced Communication
- Authors: Thorben Schomacker, Michael Gille, Jörg von der Hülls, Marina
Tropmann-Frick
- Abstract summary: This paper examines the current state-of-the-art of German text simplification, focusing on parallel and monolingual German corpora.
It reviews neural language models for simplifying German texts and assesses their suitability for legal texts and accessibility requirements.
The authors launched the interdisciplinary OPEN-LS project in April 2023 to address these research gaps.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper examines the current state-of-the-art of German text
simplification, focusing on parallel and monolingual German corpora. It reviews
neural language models for simplifying German texts and assesses their
suitability for legal texts and accessibility requirements. Our findings
highlight the need for additional training data and more appropriate approaches
that consider the specific linguistic characteristics of German, as well as the
importance of the needs and preferences of target groups with cognitive or
language impairments. The authors launched the interdisciplinary OPEN-LS
project in April 2023 to address these research gaps. The project aims to
develop a framework for text formats tailored to individuals with low literacy
levels, integrate legal texts, and enhance comprehensibility for those with
linguistic or cognitive impairments. It will also explore cost-effective ways
to enhance the data with audience-specific illustrations using image-generating
AI.
For further and up-to-date information, please visit our project homepage:
https://open-ls.entavis.com
Related papers
- Linguistic Characteristics of AI-Generated Text: A Survey [0.3007949058551534]
Large language models (LLMs) are solidifying their position in the modern world as effective tools for the automatic generation of text.
There is a growing need to study the linguistic features present in AI-generated text.
arXiv Detail & Related papers (2025-10-01T05:44:28Z)
- Liaozhai through the Looking-Glass: On Paratextual Explicitation of Culture-Bound Terms in Machine Translation [70.43884512651668]
We formalize Genette's (1987) theory of paratexts from literary and translation studies to introduce the task of paratextual explicitation for machine translation.
We construct a dataset of 560 expert-aligned paratexts from four English translations of the classical Chinese short story collection Liaozhai.
Our findings demonstrate the potential of paratextual explicitation in advancing machine translation beyond linguistic equivalence.
arXiv Detail & Related papers (2025-09-27T16:27:36Z)
- Multilingual Self-Taught Faithfulness Evaluators [11.200203292660758]
Self-Taught Evaluators for Multilingual Faithfulness is a framework that learns exclusively from synthetic multilingual summarization data.
Our framework shows improvements over existing baselines, including state-of-the-art English evaluators and machine translation-based approaches.
arXiv Detail & Related papers (2025-07-28T12:01:59Z)
- AI-Driven Generation of Old English: A Framework for Low-Resource Languages [0.0]
Preserving ancient languages is essential for understanding humanity's cultural and linguistic heritage.
Old English remains critically under-resourced, limiting its accessibility to modern natural language processing (NLP) techniques.
We present a scalable framework that uses advanced large language models (LLMs) to generate high-quality Old English texts.
arXiv Detail & Related papers (2025-07-27T03:29:19Z)
- Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies.
We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
- KamerRaad: Enhancing Information Retrieval in Belgian National Politics through Hierarchical Summarization and Conversational Interfaces [55.00702535694059]
KamerRaad is an AI tool that leverages large language models to help citizens interactively engage with Belgian political information.
The tool extracts and concisely summarizes key excerpts from parliamentary proceedings, which users can then explore further through generative-AI-based interaction.
arXiv Detail & Related papers (2024-04-22T15:01:39Z)
- Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models [54.58989938395976]
We introduce a decomposed prompting approach for sequence labeling tasks.
We test our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages.
arXiv Detail & Related papers (2024-02-28T15:15:39Z)
- Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings [22.71166607645311]
We introduce a novel suite of state-of-the-art bilingual text embedding models.
These models are capable of processing lengthy text inputs with up to 8192 tokens.
We have significantly improved the model performance on STS tasks.
We have expanded the Massive Text Embedding Benchmark to include benchmarks for German and Spanish embedding models.
arXiv Detail & Related papers (2024-02-26T20:53:12Z)
- Language Detection for Transliterated Content [0.0]
We study the widespread use of transliteration, where the English alphabet is employed to convey messages in native languages.
This paper addresses this challenge through a dataset of phone text messages in Hindi and Russian transliterated into English.
The research pioneers innovative approaches to identify and convert transliterated text.
arXiv Detail & Related papers (2024-01-09T15:40:54Z)
- Enhancing Essay Scoring with Adversarial Weights Perturbation and Metric-specific AttentionPooling [18.182517741584707]
This study explores the application of BERT-related techniques to enhance the assessment of ELLs' writing proficiency.
To address the specific needs of ELLs, we propose the use of DeBERTa, a state-of-the-art neural language model.
arXiv Detail & Related papers (2024-01-06T06:05:12Z)
- Automatic and Human-AI Interactive Text Generation [27.05024520190722]
This tutorial aims to provide an overview of the state-of-the-art natural language generation research.
Text-to-text generation tasks are more constrained in terms of semantic consistency and targeted language styles.
arXiv Detail & Related papers (2023-10-05T20:26:15Z)
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models [56.93604813379634]
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels.
We propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels.
We highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
arXiv Detail & Related papers (2023-06-02T12:54:38Z)
- An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP.
We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z)
- A Transfer Learning Based Model for Text Readability Assessment in German [4.550811027560416]
We propose a new model for text complexity assessment for German text based on transfer learning.
The best model, based on the pre-trained BERT language model, achieved a Root Mean Square Error (RMSE) of 0.483.
arXiv Detail & Related papers (2022-07-13T15:15:44Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
- ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning [97.10875695679499]
We propose a novel contrastive learning framework named ERICA for the pre-training phase to obtain a deeper understanding of entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
arXiv Detail & Related papers (2020-12-30T03:35:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.