Media Cloud: Massive Open Source Collection of Global News on the Open Web
- URL: http://arxiv.org/abs/2104.03702v3
- Date: Sat, 1 May 2021 23:01:20 GMT
- Title: Media Cloud: Massive Open Source Collection of Global News on the Open Web
- Authors: Hal Roberts, Rahul Bhargava, Linas Valiukas, Dennis Jen, Momin M. Malik, Cindy Bishop, Emily Ndulue, Aashka Dave, Justin Clark, Bruce Etling, Rob Faris, Anushka Shah, Jasmin Rubinovitz, Alexis Hope, Catherine D'Ignazio, Fernando Bermejo, Yochai Benkler, Ethan Zuckerman
- Abstract summary: We present the first full description of Media Cloud, an open source platform based on crawling hyperlink structure in operation for over 10 years.
We document the key choices behind what data Media Cloud collects and stores, how it processes and organizes these data, and its open API access as well as user-facing tools.
We give an overview of two sample datasets generated using Media Cloud and discuss how researchers can use the platform to create their own datasets.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present the first full description of Media Cloud, an open source platform
based on crawling hyperlink structure in operation for over 10 years, that for
many uses will be the best way to collect data for studying the media ecosystem
on the open web. We document the key choices behind what data Media Cloud
collects and stores, how it processes and organizes these data, and its open
API access as well as user-facing tools. We also highlight the strengths and
limitations of the Media Cloud collection strategy compared to relevant
alternatives. We give an overview of two sample datasets generated using Media
Cloud and discuss how researchers can use the platform to create their own
datasets.