BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models
- URL: http://arxiv.org/abs/2510.19419v1
- Date: Wed, 22 Oct 2025 09:42:01 GMT
- Title: BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models
- Authors: Yuan Gao, Suchir Salhan, Andrew Caines, Paula Buttery, Weiwei Sun
- Abstract summary: BLiSS 1.0 is a Benchmark of Learner Interlingual Syntactic Structure. It tests whether a model finds a naturalistic learner error more plausible than a matched, artificial error. We provide 136,867 controlled triplets (corrected, learner, artificial) for this purpose.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To bridge the gap between performance-oriented benchmarks and the evaluation of cognitively inspired models, we introduce BLiSS 1.0, a Benchmark of Learner Interlingual Syntactic Structure. Our benchmark operationalizes a new paradigm of selective tolerance, testing whether a model finds a naturalistic learner error more plausible than a matched, artificial error within the same sentence. Constructed from over 2.8 million naturalistic learner sentences, BLiSS provides 136,867 controlled triplets (corrected, learner, artificial) for this purpose. Experiments on a diverse suite of models demonstrate that selective tolerance is a distinct capability from standard grammaticality, with performance clustering strongly by training paradigm. This validates BLiSS as a robust tool for measuring how different training objectives impact a model's alignment with the systematic patterns of human language acquisition.
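As a rough illustration of the selective-tolerance comparison described in the abstract, the sketch below scores each member of a (corrected, learner, artificial) triplet with a causal language model and checks whether the naturalistic learner error is preferred over the matched artificial error. The model choice (gpt2), the example triplet, and summed log-likelihood as the plausibility measure are illustrative assumptions, not the benchmark's prescribed evaluation protocol.

```python
# Minimal sketch of a selective-tolerance check on one BLiSS-style triplet.
# Assumptions: gpt2 as the scorer, summed causal log-likelihood as the
# plausibility measure, and a made-up example triplet.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Summed token log-probability of a sentence under the causal LM."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean negative
        # log-likelihood over predicted tokens; scale back to a sum.
        out = model(**inputs, labels=inputs["input_ids"])
    n_predicted = inputs["input_ids"].size(1) - 1
    return -out.loss.item() * n_predicted

# Hypothetical triplet: corrected sentence, naturalistic learner error,
# and a matched artificial error in the same sentence.
triplet = {
    "corrected": "She has lived in London for three years.",
    "learner": "She has lived in London since three years.",
    "artificial": "She has lived in London for three yearses.",
}
scores = {name: sentence_log_prob(text) for name, text in triplet.items()}

# Selective tolerance: the naturalistic learner error should be judged
# more plausible than the matched artificial error.
print(scores, scores["learner"] > scores["artificial"])
```

Aggregated over all 136,867 triplets, the fraction of cases in which the learner variant outscores the artificial one would then serve as a selective-tolerance accuracy.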
Related papers
- How Large Language Models Get Stuck: Early structure with persistent errors [0.0]
We trained Meta's OPT model on the 100M-word BabyLM dataset. We evaluated it on the BLiMP benchmark, which consists of 67 classes. We tested the model's preference for grammatical over ungrammatical sentences.
arXiv Detail & Related papers (2026-02-27T22:49:11Z) - Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding [53.63482987410292]
We present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks.
arXiv Detail & Related papers (2025-07-13T19:36:17Z) - Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs). We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z) - FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation [24.39952838336609]
FLUKE is a framework for assessing model robustness through systematic minimal variations of test data. We demonstrate FLUKE's utility by evaluating both fine-tuned models and large language models (LLMs) across six diverse NLP tasks.
arXiv Detail & Related papers (2025-04-24T07:12:37Z) - Teaching a Language Model to Distinguish Between Similar Details using a Small Adversarial Training Set [0.0]
We show an increase in accuracy on the adversarial test set (+13%) while still maintaining good performance on the original NLI task.
We also show an increase in accuracy from 91.2% to 92.9% on the most similar contradictions in the SNLI test set (as judged by cosine similarity).
arXiv Detail & Related papers (2024-10-30T15:27:55Z) - ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z) - Evaluating Large Language Models Using Contrast Sets: An Experimental Approach [0.0]
We introduce an innovative technique for generating a contrast set for the Stanford Natural Language Inference dataset.
Our strategy involves the automated substitution of verbs, adverbs, and adjectives with their synonyms to preserve the original meaning of sentences.
This method aims to assess whether a model's performance is based on genuine language comprehension or simply on pattern recognition.
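A minimal sketch of the kind of synonym-substitution step described above is given below, using WordNet to swap verbs, adverbs, and adjectives. The POS mapping and the choice of the first differing synonym are simplifying assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch of contrast-variant generation by synonym substitution:
# swap verbs, adverbs, and adjectives for WordNet synonyms so the sentence
# meaning is (approximately) preserved.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("wordnet", quiet=True)

# Map coarse Penn Treebank tag prefixes to WordNet POS categories.
POS_MAP = {"JJ": wn.ADJ, "RB": wn.ADV, "VB": wn.VERB}

def synonym(word: str, wn_pos: str):
    """Return the first WordNet synonym that differs from the original word."""
    for syn in wn.synsets(word, pos=wn_pos):
        for lemma in syn.lemmas():
            candidate = lemma.name().replace("_", " ")
            if candidate.lower() != word.lower():
                return candidate
    return None

def contrast_variant(sentence: str) -> str:
    tokens = nltk.word_tokenize(sentence)
    out = []
    for word, tag in nltk.pos_tag(tokens):
        wn_pos = POS_MAP.get(tag[:2])
        replacement = synonym(word, wn_pos) if wn_pos else None
        # Note: WordNet returns lemma forms, so inflection is not restored here.
        out.append(replacement or word)
    return " ".join(out)

print(contrast_variant("A quick runner easily wins the long race."))
```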
arXiv Detail & Related papers (2024-04-02T02:03:28Z) - Bridging the Gap Between Training and Inference of Bayesian Controllable Language Models [58.990214815032495]
Large-scale pre-trained language models have achieved great success on natural language generation tasks.
Bayesian controllable language models (BCLMs) have been shown to be efficient in controllable language generation.
We propose a "Gemini Discriminator" for controllable language generation which alleviates the mismatch problem with a small computational cost.
arXiv Detail & Related papers (2022-06-11T12:52:32Z) - An Application of Pseudo-Log-Likelihoods to Natural Language Scoring [5.382454613390483]
A language model with relatively few parameters and training steps can outperform a more modern model on a recent large data set.
We produce some absolute state-of-the-art results for commonsense reasoning in binary choice tasks.
We argue that robustness of the smaller model ought to be understood in terms of compositionality.
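For reference, the pseudo-log-likelihood scoring named in the title can be sketched as follows: mask each token in turn with a masked language model and sum the log-probabilities assigned to the original tokens. The model choice and the handling of special tokens are assumptions; the cited paper's exact scoring setup may differ.

```python
# Sketch of pseudo-log-likelihood (PLL) scoring with a masked LM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip [CLS] and [SEP]; mask one position at a time.
    for i in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Higher PLL means the sentence is more plausible to the masked LM.
print(pseudo_log_likelihood("The cats sit on the mat."))
```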
arXiv Detail & Related papers (2022-01-23T22:00:54Z) - Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the criterion considered here is self-normalized, so no further correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
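As background for the technique named above, the snippet below sketches a generic self-normalized importance sampling estimator, in which the importance weights are normalized by their own sum so that no separately computed normalizing constant (or correction step) is required. This illustrates the estimator in isolation and does not reproduce the paper's language-model training criterion.

```python
# Generic self-normalized importance sampling: estimate E_p[f(x)] using samples
# from a proposal q, with weights normalized by their own sum.
import numpy as np

rng = np.random.default_rng(0)

def self_normalized_is(f, p_unnorm, q_sample, q_pdf, n=100_000):
    """E_p[f] estimated with weights w = p_unnorm(x) / q(x), normalized by sum(w)."""
    x = q_sample(n)
    w = p_unnorm(x) / q_pdf(x)
    return np.sum(w * f(x)) / np.sum(w)

# Target: standard normal known only up to a constant; proposal: N(1, 2^2).
p_unnorm = lambda x: np.exp(-0.5 * x**2)
q_sample = lambda n: rng.normal(1.0, 2.0, size=n)
q_pdf = lambda x: np.exp(-0.5 * ((x - 1.0) / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

# Should be close to 0.0, the mean of N(0, 1).
print(self_normalized_is(lambda x: x, p_unnorm, q_sample, q_pdf))
```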
arXiv Detail & Related papers (2021-11-11T16:57:53Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)