A New Generation of Perspective API: Efficient Multilingual
Character-level Transformers
- URL: http://arxiv.org/abs/2202.11176v1
- Date: Tue, 22 Feb 2022 20:55:31 GMT
- Title: A New Generation of Perspective API: Efficient Multilingual
Character-level Transformers
- Authors: Alyssa Lees, Vinh Q. Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald
Metzler, Lucy Vasserman
- Abstract summary: We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
- Score: 66.9176610388952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: On the world wide web, toxic content detectors are a crucial line of defense
against potentially hateful and offensive messages. As such, building highly
effective classifiers that enable a safer internet is an important research
area. Moreover, the web is a highly multilingual, cross-cultural community that
develops its own lingo over time. As such, it is crucial to develop models that
are effective across a diverse range of languages, usages, and styles. In this
paper, we present the fundamentals behind the next version of the Perspective
API from Google Jigsaw. At the heart of the approach is a single multilingual
token-free Charformer model that is applicable across a range of languages,
domains, and tasks. We demonstrate that by forgoing static vocabularies, we
gain flexibility across a variety of settings. We additionally outline the
techniques employed to make such a byte-level model efficient and feasible for
productionization. Through extensive experiments on multilingual toxic comment
classification benchmarks derived from real API traffic and evaluation on an
array of code-switching, covert toxicity, emoji-based hate, human-readable
obfuscation, distribution shift, and bias evaluation settings, we show that our
proposed approach outperforms strong baselines. Finally, we present our
findings from deploying this system in production.
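The token-free design can be illustrated with a minimal sketch: instead of a fixed subword vocabulary, a byte-level model consumes raw UTF-8 bytes, so emoji, human-readable obfuscations, and code-switched text all map into the same small input space (256 byte values plus a few reserved special IDs). The helper below is a hypothetical illustration of this input representation in the style of ByT5-like models, not the actual Perspective API pipeline; the `offset`, `pad_id`, and `max_len` values are assumptions for the example.

```python
def to_byte_ids(text: str, pad_id: int = 0, offset: int = 3, max_len: int = 32) -> list[int]:
    """Map text to UTF-8 byte IDs, shifting by `offset` to reserve low IDs
    for special tokens (pad, EOS, ...), then truncate and pad to max_len."""
    ids = [b + offset for b in text.encode("utf-8")][:max_len]
    return ids + [pad_id] * (max_len - len(ids))

# Obfuscated spellings and emoji need no vocabulary updates --
# every possible string is already covered by the 256 byte values:
print(to_byte_ids("t0x1c 😡", max_len=16))
```

Because the input space is fixed at the byte level, a new slang term, leetspeak variant, or emoji never falls out of vocabulary; the cost is longer sequences, which is why the abstract emphasizes efficiency techniques for productionization.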
Related papers
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - Graph Neural Network Enhanced Language Models for Efficient Multilingual
Text Classification [8.147244878591014]
We propose a multilingual disaster-related text classification system capable of operating in monolingual, cross-lingual, and multilingual scenarios.
Our end-to-end trainable framework combines the versatility of graph neural networks, applied over the corpus, with language models.
We evaluate our framework on a total of nine English and non-English datasets across monolingual, cross-lingual, and multilingual classification scenarios.
arXiv Detail & Related papers (2022-03-06T09:05:42Z) - To Augment or Not to Augment? A Comparative Study on Text Augmentation
Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies which perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Role of Artificial Intelligence in Detection of Hateful Speech for
Hinglish Data on Social Media [1.8899300124593648]
The prevalence of Hindi-English code-mixed data (Hinglish) is on the rise among urban populations worldwide.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient detection of hate speech in unstructured code-mixed Hinglish text.
arXiv Detail & Related papers (2021-05-11T10:02:28Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Coarse and Fine-Grained Hostility Detection in Hindi Posts using Fine
Tuned Multilingual Embeddings [4.3012765978447565]
The hostility detection task has been well explored for resource-rich languages like English, but remains unexplored for resource-constrained languages like Hindi due to the unavailability of large, suitable datasets.
We propose an effective neural network-based technique for hostility detection in Hindi posts.
arXiv Detail & Related papers (2021-01-13T11:00:31Z) - Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
Our "vokenization" procedure is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.