Blow the Dog Whistle: A Chinese Dataset for Cant Understanding with
Common Sense and World Knowledge
- URL: http://arxiv.org/abs/2104.02704v1
- Date: Tue, 6 Apr 2021 17:55:43 GMT
- Title: Blow the Dog Whistle: A Chinese Dataset for Cant Understanding with
Common Sense and World Knowledge
- Authors: Canwen Xu and Wangchunshu Zhou and Tao Ge and Ke Xu and Julian McAuley
and Furu Wei
- Abstract summary: Cant is important for understanding advertising, comedies and dog-whistle politics.
We propose a large and diverse Chinese dataset for creating and understanding cant.
- Score: 49.288196234823005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cant is important for understanding advertising, comedies and dog-whistle
politics. However, computational research on cant is hindered by a lack of
available datasets. In this paper, we propose a large and diverse Chinese
dataset for creating and understanding cant from a computational linguistics
perspective. We formulate a task for cant understanding and provide both
quantitative and qualitative analysis of the tested word-embedding-similarity
baselines and pretrained language models. Experiments suggest that such a task requires deep
language understanding, common sense, and world knowledge and thus can be a
good testbed for pretrained language models and help models perform better on
other tasks. The code is available at https://github.com/JetRunner/dogwhistle.
The data and leaderboard are available at
https://competitions.codalab.org/competitions/30451.
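As a rough illustration of the word-embedding-similarity baseline mentioned in the abstract, the sketch below ranks candidate hidden words by cosine similarity to a cant term. This is a minimal sketch under assumed inputs: the toy vocabulary, the `embedding` lookup, and the random vectors are all hypothetical and are not taken from the DogWhistle dataset or its released code, which would use real pretrained Chinese word vectors.

```python
import numpy as np

# Hypothetical toy embeddings; a real baseline would load pretrained
# Chinese word vectors instead of random ones.
rng = np.random.default_rng(0)
vocab = ["candidate_a", "candidate_b", "candidate_c", "cant_term"]
embedding = {w: rng.normal(size=50) for w in vocab}

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_hidden_words(cant, candidates):
    """Rank candidate hidden words by similarity to the cant term."""
    scores = {c: cosine(embedding[cant], embedding[c]) for c in candidates}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_hidden_words("cant_term", ["candidate_a", "candidate_b", "candidate_c"]))
```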
Related papers
- Is Child-Directed Speech Effective Training Data for Language Models? [34.46268640655943]
We train GPT-2 and RoBERTa models on 29M words of English child-directed speech.
We test whether the global developmental ordering or the local discourse ordering of children's training data supports high performance relative to other datasets.
These findings support the hypothesis that, rather than proceeding from better data, the child's learning algorithm is substantially more data-efficient than current language modeling techniques.
arXiv Detail & Related papers (2024-08-07T08:18:51Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- Conic10K: A Challenging Math Problem Understanding and Reasoning Dataset [38.99073257782012]
We propose Conic10K, a challenging math problem dataset on conic sections in Chinese senior high school education.
Our dataset contains various problems with different reasoning depths, while only the knowledge from conic sections is required.
For each problem, we provide a high-quality formal representation, the reasoning steps, and the final solution.
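To make the description of the annotations concrete, here is a hypothetical record combining a problem, its formal representation, reasoning steps, and final solution. The field names and the example problem are invented for illustration and do not reflect Conic10K's actual schema.

```python
# Hypothetical record shape; field names and content are invented
# for illustration, not Conic10K's actual schema.
record = {
    "problem": "Find the eccentricity of the ellipse x^2/9 + y^2/4 = 1.",
    "formal_representation": "Ellipse(a=3, b=2); query: eccentricity e",
    "reasoning_steps": [
        "c^2 = a^2 - b^2 = 9 - 4 = 5",
        "e = c / a = sqrt(5) / 3",
    ],
    "solution": "sqrt(5)/3",
}
```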
arXiv Detail & Related papers (2023-11-09T02:58:17Z)
- Spoken Language Understanding for Conversational AI: Recent Advances and Future Direction [5.829344935864271]
This tutorial will discuss how the joint task is set up and introduce Spoken Language Understanding/Natural Language Understanding (SLU/NLU) with Deep Learning techniques.
We will describe how machines use the latest NLP and deep learning techniques to address the joint task.
arXiv Detail & Related papers (2022-12-21T02:47:52Z)
- Deep Bidirectional Language-Knowledge Graph Pretraining [159.9645181522436]
DRAGON is a self-supervised approach to pretraining a deeply joint language-knowledge foundation model from text and KG at scale.
Our model takes pairs of text segments and relevant KG subgraphs as input and bidirectionally fuses information from both modalities.
arXiv Detail & Related papers (2022-10-17T18:02:52Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
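For context on the teacher-to-student transfer described above, the sketch below shows the standard soft-label knowledge-distillation objective (KL divergence between temperature-scaled teacher and student distributions). This is a generic illustration of distillation, not VidLanKD's actual training objectives, which differ in detail.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard soft-label distillation: KL divergence between the
    temperature-scaled teacher and student output distributions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t

# Toy usage with random logits over a 10-class output space.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher).item())
```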
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- A Sentence Cloze Dataset for Chinese Machine Reading Comprehension [64.07894249743767]
We propose a new task called Sentence Cloze-style Machine Reading Comprehension (SC-MRC).
The proposed task aims to fill the right candidate sentences into a passage that has several blanks.
We built a Chinese dataset called CMRC 2019 to evaluate the difficulty of the SC-MRC task.
arXiv Detail & Related papers (2020-04-07T04:09:00Z)
- Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information.
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
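To unpack the formulation, probing a representation R for a linguistic property T is cast as estimating their mutual information; any probe from a family Q gives a cross-entropy upper bound on the conditional entropy, and hence a lower bound on the MI. The notation below is an illustrative reconstruction of this standard setup, not a quotation from the paper.

```latex
% Probing as mutual-information estimation (illustrative notation).
% R: contextual representations, T: the linguistic property probed.
\[
  \mathrm{I}(T; R) = \mathrm{H}(T) - \mathrm{H}(T \mid R)
\]
% H(T | R) is unknown, but any probe q in a family Q upper-bounds it
% via its cross-entropy, so the best probe gives the tightest bound:
\[
  \mathrm{I}(T; R) \ge \mathrm{H}(T) - \inf_{q \in \mathcal{Q}}
  \mathbb{E}_{(t, r)}\bigl[-\log q(t \mid r)\bigr]
\]
```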
arXiv Detail & Related papers (2020-04-07T01:06:36Z)
- Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning [30.868309879441615]
We tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on English collections to non-English queries and documents.
Our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for Arabic, Mandarin Chinese, and Spanish.
arXiv Detail & Related papers (2019-12-30T20:46:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.