Bangla AI: A Framework for Machine Translation Utilizing Large Language
Models for Ethnic Media
- URL: http://arxiv.org/abs/2402.14179v1
- Date: Wed, 21 Feb 2024 23:43:04 GMT
- Title: Bangla AI: A Framework for Machine Translation Utilizing Large Language
Models for Ethnic Media
- Authors: MD Ashraful Goni, Fahad Mostafa, Kerk F. Kee
- Abstract summary: Ethnic media caters to diaspora communities in host nations.
Rather than utilizing the language of the host nation, ethnic media delivers news in the language of the immigrant community.
This research delves into the prospective integration of large language models (LLM) and multi-lingual machine translations (MMT) within the ethnic media industry.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ethnic media, which caters to diaspora communities in host nations, serves as
a vital platform for these communities to both produce content and access
information. Rather than utilizing the language of the host nation, ethnic
media delivers news in the language of the immigrant community. For instance,
in the USA, Bangla ethnic media presents news in Bangla rather than English.
This research delves into the prospective integration of large language models
(LLM) and multi-lingual machine translations (MMT) within the ethnic media
industry. It centers on the transformative potential of using LLM in MMT in
various facets of news translation, searching, and categorization. The paper
outlines a theoretical framework elucidating the integration of LLM and MMT
into the news searching and translation processes for ethnic media.
Additionally, it briefly addresses the potential ethical challenges associated
with the incorporation of LLM and MMT in news translation procedures.
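To make the envisioned workflow concrete, the following is a minimal illustrative sketch (not from the paper, which proposes a theoretical framework rather than an implementation) of how an ethnic-media newsroom might chain an LLM through the three facets the framework names: news searching, translation, and categorization. The helper `call_llm`, the `NewsItem` type, and all prompts are hypothetical placeholders for whatever LLM or multilingual MT backend is actually deployed.

```python
# Illustrative sketch only: the paper describes a framework, not code.
# `call_llm` is a hypothetical stand-in for an LLM / multilingual MT backend.
from dataclasses import dataclass
from typing import List


@dataclass
class NewsItem:
    language: str      # e.g. "en" for a host-nation (US) outlet
    headline: str
    body: str


def call_llm(prompt: str) -> str:
    """Hypothetical LLM/MMT backend; replace with a real API call or local model."""
    raise NotImplementedError("plug in an LLM or multilingual translation model here")


def search_relevant_items(items: List[NewsItem],
                          community_topics: List[str]) -> List[NewsItem]:
    """Step 1 (searching): keep host-nation stories relevant to the diaspora community."""
    relevant = []
    for item in items:
        verdict = call_llm(
            "Topics of interest to the Bangla-speaking community in the USA: "
            f"{', '.join(community_topics)}\n"
            f"Headline: {item.headline}\n"
            "Answer 'yes' or 'no': is this story relevant to that community?"
        )
        if verdict.strip().lower().startswith("yes"):
            relevant.append(item)
    return relevant


def translate_item(item: NewsItem, target_language: str = "Bangla") -> NewsItem:
    """Step 2 (translation): render the selected story in the community's language."""
    headline = call_llm(f"Translate this headline into {target_language}:\n{item.headline}")
    body = call_llm(f"Translate this news text into {target_language}:\n{item.body}")
    return NewsItem(language=target_language, headline=headline, body=body)


def categorize_item(item: NewsItem, sections: List[str]) -> str:
    """Step 3 (categorization): assign the translated story to an outlet section."""
    return call_llm(
        f"Sections: {', '.join(sections)}\n"
        f"Headline: {item.headline}\n"
        "Return the single best-matching section name."
    ).strip()
```

In practice a human editor would still review each automated step, in keeping with the ethical challenges the paper flags for incorporating LLM and MMT into news translation procedures.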
Related papers
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z) - Unraveling Code-Mixing Patterns in Migration Discourse: Automated Detection and Analysis of Online Conversations on Reddit [4.019533549688538]
This paper explores the utilization of code-mixing, a communication strategy prevalent among multilingual speakers, in migration-related discourse on social media platforms such as Reddit.
We present Ensemble Learning for Identification of Code-mixed Texts (ELMICT), a novel approach designed to automatically detect code-mixed messages in migration-related discussions.
arXiv Detail & Related papers (2024-06-12T20:30:34Z) - Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo [7.393476206148905]
We conduct a study to investigate the effectiveness of a simple methodology designed to generate a holistic summary for Mizo news articles.
We make available 500 Mizo news articles and corresponding enriched holistic summaries.
Human evaluation confirms that our approach significantly enhances the information coverage of Mizo news articles.
arXiv Detail & Related papers (2024-04-25T17:23:04Z) - Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation), which lets models communicate through embedding representations rather than natural-language tokens.
By deviating from natural language, CIPHER offers the advantage of encoding a broader spectrum of information without any modification to the model weights.
This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
arXiv Detail & Related papers (2023-10-10T03:06:38Z) - Neural Machine Translation for the Indigenous Languages of the Americas:
An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, having a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions, a product of the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z) - MMT: A Multilingual and Multi-Topic Indian Social Media Dataset [1.0413233169366503]
Social media plays a significant role in cross-cultural communication.
A vast amount of this communication occurs in code-mixed and multilingual form.
We introduce a large-scale multilingual and multi-topic dataset (MMT) collected from Twitter.
arXiv Detail & Related papers (2023-04-02T21:39:00Z) - Challenges and Considerations with Code-Mixed NLP for Multilingual
Societies [1.6675267471157407]
This paper discusses the current state of the NLP research, limitations, and foreseeable pitfalls in addressing five real-world applications for social good.
We also propose futuristic datasets, models, and tools that can significantly advance the current research in multilingual NLP applications for the societal good.
arXiv Detail & Related papers (2021-06-15T00:53:55Z) - MetaXL: Meta Representation Transformation for Low-resource
Cross-lingual Learning [91.5426763812547]
Cross-lingual transfer learning is one of the most effective methods for building functional NLP systems for low-resource languages.
We propose MetaXL, a meta-learning based framework that learns to transform representations judiciously from auxiliary languages to a target one.
arXiv Detail & Related papers (2021-04-16T06:15:52Z) - NewsBERT: Distilling Pre-trained Language Model for Intelligent News
Application [56.1830016521422]
We propose NewsBERT, which can distill pre-trained language models for efficient and effective news intelligence.
In our approach, we design a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models.
In our experiments, NewsBERT can effectively improve the model performance in various intelligent news applications with much smaller models.
arXiv Detail & Related papers (2021-02-09T15:41:12Z) - Universal Sentence Representation Learning with Conditional Masked
Language Model [7.334766841801749]
We present Conditional Masked Language Modeling (CMLM) to effectively learn sentence representations.
Our English CMLM model achieves state-of-the-art performance on SentEval.
As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains.
arXiv Detail & Related papers (2020-12-28T18:06:37Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language (a generic sketch of such a loss follows this entry).
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
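The KL-divergence self-teaching loss in the FILTER summary above is only named, not defined. As a rough, generic sketch (in PyTorch, not the authors' code): the student's predictions on translated target-language text are pulled toward detached soft pseudo-labels produced automatically by a teacher model. The temperature parameter is a common distillation-style knob and an assumption here, not a detail from the summary.

```python
import torch
import torch.nn.functional as F


def self_teaching_kl_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """Generic sketch of a KL self-teaching loss: match the student's predictions on
    translated target-language text to soft pseudo-labels auto-generated by a teacher
    model, since no gold labels exist for the translated text."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, soft_targets, reduction="batchmean")
```

Both logit tensors here are over the label set of the downstream cross-lingual task; only the teacher's outputs are detached so that gradients flow to the student alone.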
This list is automatically generated from the titles and abstracts of the papers on this site.