TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla
- URL: http://arxiv.org/abs/2509.09101v1
- Date: Thu, 11 Sep 2025 02:25:49 GMT
- Title: TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla
- Authors: Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri
- Abstract summary: Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs). This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models. We offer three major contributions: (1) comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder family of Code LLMs, achieving significant 11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs.
- Score: 37.210208249613
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs), particularly for code generation. This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models. Hence, we introduce the first dedicated family of Code LLMs for Bangla (1B & 9B). We offer three major contributions: (1) comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder family of Code LLMs, achieving significant ~11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs. Our findings show that curated, high-quality datasets can overcome the limitations of smaller models for low-resource languages. We open-source all resources to further advance Bangla LLM research.
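The Pass@1 figures above follow the standard pass@k protocol for code benchmarks. As a point of reference, here is a minimal sketch of the unbiased pass@k estimator from Chen et al. (2021); the per-problem sample and pass counts in the demo are made-up illustrations, not TigerCoder's results.

```python
# A minimal sketch of the unbiased pass@k estimator (Chen et al., 2021),
# the standard way Pass@1 results on benchmarks like MBPP are computed.
# The counts below are illustrative assumptions, not the paper's data.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given that
    c of n generated samples passed the benchmark's unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem; per-problem pass counts are made up.
correct_per_problem = [3, 0, 17, 200, 45]
scores = [pass_at_k(200, c, 1) for c in correct_per_problem]
print(f"Pass@1 = {np.mean(scores):.3f}")
```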
Related papers
- TigerLLM -- A Family of Bangla Large Language Models [8.258559455995917]
We introduce TigerLLM, a family of Bangla language models. Our results demonstrate that these models surpass all open-source alternatives and also outperform larger proprietary models like GPT-3.5.
arXiv Detail & Related papers (2025-03-14T01:41:16Z)
- Zero-Shot Multi-Label Classification of Bangla Documents: Large Decoders Vs. Classic Encoders [0.0]
Bangla is a language spoken by over 300 million native speakers and ranked as the sixth most spoken language worldwide. Our evaluation of 32 state-of-the-art models reveals that even supposedly powerful existing encoders and decoders still struggle to achieve high accuracy on the Bangla Zero-Shot-MLC task.
arXiv Detail & Related papers (2025-03-04T20:39:07Z)
- TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking [6.070192392563392]
We present TituLLMs, the first large pretrained Bangla LLMs, available in 1B and 3B parameter sizes. To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens. We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge; a hedged sketch of this kind of tokenizer extension appears after this list.
arXiv Detail & Related papers (2025-02-16T16:22:23Z)
- Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn the syntax, semantics, and usage patterns of programming languages. For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
- BanglaLlama: LLaMA for Bangla Language [1.0710988917914002]
Despite being the 5th largest spoken language in the world, Bangla is still a "low-resource" language. Existing pretrained language models often struggle to perform well on Bangla Language Processing (BLP) tasks. This paper introduces two high-quality translated Bangla-instruction datasets totaling 224k samples.
arXiv Detail & Related papers (2024-10-28T16:44:02Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- Introducing Bode: A Fine-Tuned Large Language Model for Portuguese Prompt-Based Task [1.158680734110387]
This work proposes Bode, a fine-tuned LLaMA 2-based model for Portuguese prompts.
We evaluate the performance of this model in classification tasks using the zero-shot approach with in-context learning.
arXiv Detail & Related papers (2024-01-05T17:15:01Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation; a simplified similarity-based stand-in for demonstration selection is sketched after this list.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures; a toy structured-pruning sketch appears after this list.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
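The tokenizer extension mentioned in the TituLLMs entry can be illustrated with the Hugging Face transformers API. This is a minimal sketch under assumed details: the base checkpoint and the example Bangla tokens are placeholders, and the continued pretraining that would train the new embedding rows is not shown.

```python
# A minimal sketch of extending a base tokenizer with language-specific
# tokens, in the spirit of TituLLMs' Llama-3.2 tokenizer extension.
# The checkpoint and the example Bangla tokens are illustrative assumptions;
# the actual TituLLMs token inventory and training setup are not shown here.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-3.2-1B"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical new tokens: frequent Bangla words/subwords mined from a corpus.
new_tokens = ["বাংলা", "ভাষা", "মডেল"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token IDs have trainable vectors;
# these rows are randomly initialized and must be learned during
# continued pretraining on the target-language corpus.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```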
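For the LAIL entry: LAIL itself learns example selection from the LLM's own feedback, which is beyond a short sketch, so the stand-in below uses plain TF-IDF similarity to pick in-context demonstrations. The candidate pool and query are toy data.

```python
# A simplified stand-in for in-context example selection: LAIL learns a
# selector from LLM feedback, whereas this sketch just retrieves the most
# similar problems by TF-IDF cosine similarity. Pool and query are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pool = [
    "Write a function that reverses a string.",
    "Return the sum of all even numbers in a list.",
    "Check whether a string is a palindrome.",
    "Sort a list of tuples by their second element.",
]
query = "Determine if the given word reads the same forwards and backwards."

vectorizer = TfidfVectorizer().fit(pool + [query])
sims = cosine_similarity(vectorizer.transform([query]),
                         vectorizer.transform(pool))[0]

# Pick the top-2 most similar problems as in-context demonstrations.
top_k = sims.argsort()[::-1][:2]
prompt = "\n\n".join(pool[i] for i in top_k) + "\n\n" + query
print(prompt)
```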
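For the LLM-Pruner entry, a toy illustration of gradient-scored structural pruning: score coupled structures with a first-order Taylor term |w * dL/dw| and drop the weakest. The two-layer MLP, random data, and 50% pruning ratio are illustrative assumptions; the paper applies this idea to transformer substructures.

```python
# A toy sketch of structured pruning in the spirit of LLM-Pruner: score
# coupled structures (here, hidden units of a small MLP) by a first-order
# Taylor importance |w * grad(w)| and remove the lowest-scoring units
# together with the weights that depend on them. Model, data, and the
# 50% ratio are made-up assumptions, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))

# One backward pass to obtain the gradients used in the importance scores.
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

fc1, fc2 = model[0], model[2]
# Importance of each hidden unit: Taylor score over its incoming weights
# plus the outgoing weights coupled to it.
score = (fc1.weight * fc1.weight.grad).abs().sum(dim=1) \
      + (fc2.weight * fc2.weight.grad).abs().sum(dim=0)

keep = score.argsort(descending=True)[:16]  # keep the top 50% of units
with torch.no_grad():
    new_fc1 = nn.Linear(16, 16)
    new_fc1.weight.copy_(fc1.weight[keep])
    new_fc1.bias.copy_(fc1.bias[keep])
    new_fc2 = nn.Linear(16, 4)
    new_fc2.weight.copy_(fc2.weight[:, keep])
    new_fc2.bias.copy_(fc2.bias)
pruned = nn.Sequential(new_fc1, nn.ReLU(), new_fc2)
print(pruned)
```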