SC-Ques: A Sentence Completion Question Dataset for English as a Second
Language Learners
- URL: http://arxiv.org/abs/2206.12036v2
- Date: Fri, 7 Apr 2023 11:55:04 GMT
- Authors: Qiongqiong Liu, Yaying Huang, Zitao Liu, Shuyan Huang, Jiahao Chen,
Xiangyu Zhao, Guimin Lin, Yuyu Zhou, Weiqi Luo
- Abstract summary: Sentence completion (SC) questions present a sentence with one or more blanks that need to be filled in, with three to five possible words or phrases as options.
We present a large-scale SC dataset, SC-Ques, which is made up of 289,148 ESL SC questions from real-world standardized English examinations.
- Score: 22.566710467490182
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sentence completion (SC) questions present a sentence with one or more blanks
that need to be filled in, with three to five possible words or phrases as options.
SC questions are widely used for students learning English as a Second Language
(ESL). In this paper, we present a large-scale SC dataset, SC-Ques,
which is made up of 289,148 ESL SC questions from real-world standardized
English examinations. Furthermore, we build a comprehensive benchmark of
automatically solving the SC questions by training the large-scale pre-trained
language models on the proposed SC-Ques dataset. We conduct a detailed
analysis of the baseline models' performance, limitations and trade-offs. The
data and our code are available for research purposes from:
https://github.com/ai4ed/SC-Ques.
Related papers
- Multi-label Sequential Sentence Classification via Large Language Model [4.012351415340318]
This paper proposes LLM-SSC, a large language model (LLM)-based framework for both single- and multi-label SSC tasks.
Unlike previous approaches that employ small- or medium-sized language models, the proposed framework utilizes LLMs to generate SSC labels through designed prompts.
We also present a multi-label contrastive learning loss with an auto-weighting scheme, enabling the multi-label classification task.
arXiv Detail & Related papers (2024-11-23T18:27:35Z)
- ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings [4.68732641979009]
This paper examines the Code-Switching (CS) phenomenon where two languages intertwine within a single utterance.
We highlight that the current Equivalence Constraint (EC) theory for CS in other languages may only partially capture English-Korean CS complexities.
We introduce a novel Koglish dataset tailored for English-Korean CS scenarios to mitigate such challenges.
arXiv Detail & Related papers (2024-08-28T11:27:21Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training
Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational
Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z)
- English Contrastive Learning Can Learn Universal Cross-lingual Sentence
Embeddings [77.94885131732119]
Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space.
In this work, we propose mSimCSE, which extends SimCSE to multilingual settings and reveal that contrastive learning on English data can surprisingly learn high-quality universal cross-lingual sentence embeddings without any parallel data.
arXiv Detail & Related papers (2022-11-11T11:17:56Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- Investigating Post-pretraining Representation Alignment for
Cross-Lingual Question Answering [20.4489424966613]
We investigate the capabilities of multilingually pre-trained language models on cross-lingual question answering systems.
We find that explicitly aligning the representations across languages with a post-hoc fine-tuning step generally leads to improved performance.
arXiv Detail & Related papers (2021-09-24T15:32:45Z)
- Solving ESL Sentence Completion Questions via Pre-trained Neural
Language Models [33.41201869566935]
Sentence completion (SC) questions present a sentence with one or more blanks that need to be filled in.
We propose a neural framework to solve SC questions in English examinations by utilizing pre-trained language models.
arXiv Detail & Related papers (2021-07-15T05:01:39Z)
- Conversations with Search Engines: SERP-based Conversational Response
Generation [77.1381159789032]
We create a suitable dataset, the Search as a Conversation (SaaC) dataset, for the development of pipelines for conversations with search engines.
We also develop a state-of-the-art pipeline for conversations with search engines, the Conversations with Search Engines (CaSE) using this dataset.
CaSE enhances the state of the art by introducing a supporting token identification module and a prior-aware pointer generator.
arXiv Detail & Related papers (2020-04-29T13:07:53Z)