ContentWise Impressions: An Industrial Dataset with Impressions Included
- URL: http://arxiv.org/abs/2008.01212v2
- Date: Sat, 19 Sep 2020 12:51:09 GMT
- Title: ContentWise Impressions: An Industrial Dataset with Impressions Included
- Authors: Fernando Benjamín Pérez Maurera, Maurizio Ferrari Dacrema, Lorenzo Saule, Mario Scriminaci, Paolo Cremonesi
- Abstract summary: The ContentWise Impressions dataset is a collection of implicit interactions and impressions of movies and TV series from an Over-The-Top media service.
We describe the data collection process, the preprocessing applied, its characteristics, and statistics when compared to other commonly used datasets.
We release software tools to load and split the data, as well as examples of how to use both user interactions and impressions in several common recommendation algorithms.
- Score: 68.5068326729525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this article, we introduce the ContentWise Impressions dataset, a
collection of implicit interactions and impressions of movies and TV series
from an Over-The-Top media service, which delivers its media contents over the
Internet. The dataset is distinguished from other already available multimedia
recommendation datasets by the availability of impressions, i.e., the
recommendations shown to the user, its size, and by being open-source. We
describe the data collection process, the preprocessing applied, its
characteristics, and statistics when compared to other commonly used datasets.
We also highlight several possible use cases and research questions that can
benefit from the availability of user impressions in an open-source dataset.
Furthermore, we release software tools to load and split the data, as well as
examples of how to use both user interactions and impressions in several common
recommendation algorithms.
Related papers
- Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
arXiv Detail & Related papers (2024-10-06T07:56:53Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- Uncovering the Interaction Equation: Quantifying the Effect of User Interactions on Social Media Homepage Recommendations [0.5030361857850012]
We study how prior user interactions influence the content presented on users' homepage feeds across three major platforms: YouTube, Reddit, and X (formerly Twitter).
We use a series of carefully designed experiments to gather data capable of uncovering the influence of specific user interactions on homepage content.
This study provides insights into the behaviors of the content curation algorithms used by each platform, how they respond to user interactions, and also uncovers evidence of deprioritization of specific topics.
arXiv Detail & Related papers (2024-07-09T20:47:34Z)
- Attention-based sequential recommendation system using multimodal data [8.110978727364397]
We propose an attention-based sequential recommendation method that employs multimodal data of items such as images, texts, and categories.
The experimental results obtained from the Amazon datasets show that the proposed method outperforms conventional sequential recommendation systems.
arXiv Detail & Related papers (2024-05-28T08:41:05Z)
- [Citation needed] Data usage and citation practices in medical imaging conferences [1.9702506447163306]
We present two open-source tools that could help with the detection of dataset usage.
We studied the usage of 20 publicly available medical datasets in papers from MICCAI and MIDL.
Our findings demonstrate the concentration of the usage of a limited set of datasets.
arXiv Detail & Related papers (2024-02-05T13:41:22Z)
- Impression-Aware Recommender Systems [57.38537491535016]
Novel data sources bring new opportunities to improve the quality of recommender systems.
Researchers may use impressions to refine user preferences and overcome the current limitations in recommender systems research.
We present a systematic literature review on recommender systems using impressions.
arXiv Detail & Related papers (2023-08-15T16:16:02Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Revisiting Table Detection Datasets for Visually Rich Documents [17.846536373106268]
This study revisits some open datasets with high-quality annotations, identifies and cleans the noise, and aligns the annotation definitions of these datasets to merge them into a larger dataset, termed Open-Tables.
To enrich the data sources, we propose a new ICT-TD dataset using the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets.
Our experimental results show that the domain differences among existing open datasets are minor despite having different data sources.
arXiv Detail & Related papers (2023-05-04T01:08:15Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Media Cloud: Massive Open Source Collection of Global News on the Open Web [40.52153096219742]
We present the first full description of Media Cloud, an open source platform based on crawling hyperlink structure in operation for over 10 years.
We document the key choices behind what data Media Cloud collects and stores, how it processes and organizes these data, and its open API access as well as user-facing tools.
We give an overview of two sample datasets generated using Media Cloud and discuss how researchers can use the platform to create their own datasets.
arXiv Detail & Related papers (2021-04-08T11:51:13Z)