A Psycho-linguistic Analysis of BitChute
- URL: http://arxiv.org/abs/2204.08078v2
- Date: Wed, 20 Apr 2022 21:14:07 GMT
- Title: A Psycho-linguistic Analysis of BitChute
- Authors: Benjamin D. Horne
- Abstract summary: This paper describes psycho-linguistic metadata for the videos, comments, and channels in the dataset using LIWC22.
We provide basic analysis and comparison of the language on BitChute to other social media platforms.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order to better support researchers, journalists, and practitioners in
their use of the MeLa-BitChute dataset for exploration and investigative
reporting, we provide new psycho-linguistic metadata for the videos, comments,
and channels in the dataset using LIWC22. This paper describes that metadata
and methods to filter the data using the metadata. In addition, we provide
basic analysis and comparison of the language on BitChute to other social media
platforms. The MeLa-BitChute dataset and LIWC metadata described in this paper
can be found at:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KRD1VS.
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z) - Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z) - [Citation needed] Data usage and citation practices in medical imaging conferences [1.9702506447163306]
We present two open-source tools that could help with the detection of dataset usage.
We studied the usage of 20 publicly available medical datasets in papers from MICCAI and MIDL.
Our findings demonstrate the concentration of the usage of a limited set of datasets.
arXiv Detail & Related papers (2024-02-05T13:41:22Z) - Data Selection for Language Models via Importance Resampling [90.9263039747723]
We formalize the problem of selecting a subset of a large raw unlabeled dataset to match a desired target distribution.
We extend the classic importance resampling approach used in low dimensions for LM data selection.
We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents in 4.5 hours.
arXiv Detail & Related papers (2023-02-06T23:57:56Z) - FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z) - Fine-Grained Scene Graph Generation with Data Transfer [127.17675443137064]
Scene graph generation (SGG) aims to extract (subject, predicate, object) triplets in images.
Recent works have made steady progress on SGG and provide useful tools for high-level vision and language understanding.
We propose a novel Internal and External Data Transfer (IETrans) method, which can be applied in a play-and-plug fashion and expanded to large SGG with 1,807 predicate classes.
arXiv Detail & Related papers (2022-03-22T12:26:56Z) - Multimodal Approach for Metadata Extraction from German Scientific Publications [0.0]
We propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language.
We consider multiple types of input data by combining natural language processing and image vision processing.
Our model for this approach was trained on a dataset consisting of around 8800 documents and is able to obtain an overall F1-score of 0.923.
arXiv Detail & Related papers (2021-11-10T15:19:04Z) - BBC-Oxford British Sign Language Dataset [64.32108826673183]
We introduce the BBC-Oxford British Sign Language (BOBSL) dataset, a large-scale video collection of British Sign Language (BSL).
We describe the motivation for the dataset, together with statistics and available annotations.
We conduct experiments to provide baselines for the tasks of sign recognition, sign language alignment, and sign language translation.
arXiv Detail & Related papers (2021-11-05T17:35:58Z) - MexPub: Deep Transfer Learning for Metadata Extraction from German Publications [1.1549572298362785]
We present a method that extracts metadata from PDF documents with different layouts and styles by viewing the document as an image.
Our method achieved an average accuracy of around 90%, which validates its capability to accurately extract metadata from a variety of PDF documents.
arXiv Detail & Related papers (2021-06-04T09:43:48Z) - MusPy: A Toolkit for Symbolic Music Generation [32.01713268702699]
MusPy is an open source Python library for symbolic music generation.
In this paper, we present statistical analysis of the eleven datasets currently supported by MusPy.
arXiv Detail & Related papers (2020-08-05T06:16:13Z) - Towards Large-Scale Data Mining for Data-Driven Analysis of Sign Languages [0.0]
We show that it is possible to collect the data from social networking services such as TikTok, Instagram, and YouTube.
Using our data collection pipeline, we collect and examine the interpretation of songs in both American Sign Language (ASL) and Brazilian Sign Language (Libras).
arXiv Detail & Related papers (2020-06-03T09:28:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.