The Multilingual Amazon Reviews Corpus
- URL: http://arxiv.org/abs/2010.02573v1
- Date: Tue, 6 Oct 2020 09:34:01 GMT
- Title: The Multilingual Amazon Reviews Corpus
- Authors: Phillip Keung, Yichao Lu, György Szarvas, Noah A. Smith
- Abstract summary: We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification.
MARC contains reviews in English, Japanese, German, French, Spanish, and Chinese, which were collected between 2015 and 2019.
The corpus is balanced across the 5 possible star ratings, so each rating constitutes 20% of the reviews in each language.
- Score: 46.84980931183582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale
collection of Amazon reviews for multilingual text classification. The corpus
contains reviews in English, Japanese, German, French, Spanish, and Chinese,
which were collected between 2015 and 2019. Each record in the dataset contains
the review text, the review title, the star rating, an anonymized reviewer ID,
an anonymized product ID, and the coarse-grained product category (e.g.,
'books', 'appliances', etc.). The corpus is balanced across the 5 possible star
ratings, so each rating constitutes 20% of the reviews in each language. For
each language, there are 200,000, 5,000, and 5,000 reviews in the training,
development, and test sets, respectively. We report baseline results for
supervised text classification and zero-shot cross-lingual transfer learning by
fine-tuning a multilingual BERT model on reviews data. We propose the use of
mean absolute error (MAE) instead of classification accuracy for this task,
since MAE accounts for the ordinal nature of the ratings.
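The abstract's argument for MAE over accuracy can be illustrated with a small sketch. The code below is not taken from the paper; the function names and the toy predictions are hypothetical and chosen only to show that two classifiers with identical accuracy can differ sharply in MAE when star ratings are treated as ordinal.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold star rating exactly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average distance, in stars, between predicted and gold ratings."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): one review per star rating, mirroring the
# balanced 20%-per-rating design described in the abstract.
gold = [1, 2, 3, 4, 5]
near_miss = [2, 3, 4, 5, 4]  # every prediction is off by exactly one star
far_miss = [5, 5, 1, 1, 1]   # every prediction is off by two to four stars

acc_near = accuracy(gold, near_miss)
acc_far = accuracy(gold, far_miss)
mae_near = mean_absolute_error(gold, near_miss)
mae_far = mean_absolute_error(gold, far_miss)

print(f"accuracy: near={acc_near:.2f} far={acc_far:.2f}")
print(f"MAE:      near={mae_near:.2f} far={mae_far:.2f}")
```

Both toy classifiers score zero accuracy, yet their MAE differs (1.0 versus 3.2 stars), which is the distinction the ordinal metric is meant to capture.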
Related papers
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- Evaluating the Effectiveness of Pre-trained Language Models in Predicting the Helpfulness of Online Product Reviews [0.21485350418225244]
We compare the use of RoBERTa and XLM-R language models to predict the helpfulness of online product reviews.
We employ the Amazon review dataset for our experiments.
arXiv Detail & Related papers (2023-02-19T18:22:59Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- Does Summary Evaluation Survive Translation to Other Languages? [0.0]
We translate an existing English summarization dataset, SummEval, into four different languages.
We analyze the scores from the automatic evaluation metrics in translated languages, as well as their correlation with human annotations in the source language.
arXiv Detail & Related papers (2021-09-16T17:35:01Z)
- I Wish I Would Have Loved This One, But I Didn't -- A Multilingual Dataset for Counterfactual Detection in Product Reviews [19.533526638034047]
We consider the problem of counterfactual detection (CFD) in product reviews.
For this purpose, we annotate a multilingual CFD dataset from Amazon product reviews.
The dataset is unique as it contains counterfactuals in multiple languages.
arXiv Detail & Related papers (2021-04-14T14:38:36Z)
- Abstractive Opinion Tagging [65.47649273721679]
In e-commerce, opinion tags refer to a ranked list of tags provided by the e-commerce platform that reflect characteristics of reviews of an item.
Current mechanisms for generating opinion tags rely on either manual labelling or heuristic methods, which are time-consuming and ineffective.
We propose an abstractive opinion tagging framework, named AOT-Net, to generate a ranked list of opinion tags given a large number of reviews.
arXiv Detail & Related papers (2021-01-18T05:08:15Z)
- Mapping Languages: The Corpus of Global Language Use [0.0]
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping.
In total, the corpus contains 423 billion words representing 148 languages and 158 countries.
arXiv Detail & Related papers (2020-04-02T03:42:14Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under the CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.