Re-calibrating methodologies in social media research: Challenge the visual, work with Speech
- URL: http://arxiv.org/abs/2412.13170v1
- Date: Tue, 17 Dec 2024 18:47:57 GMT
- Title: Re-calibrating methodologies in social media research: Challenge the visual, work with Speech
- Authors: Hongrui Jin
- Abstract summary: This article reflects on how social media scholars can effectively engage with speech-based data in their analyses.
I conclude that the expansion of our methodological repertoire enables richer interpretations of platformised content.
- Score: 0.0
- License:
- Abstract: This article methodologically reflects on how social media scholars can effectively engage with speech-based data in their analyses. While contemporary media studies have embraced textual, visual, and relational data, the aural dimension has remained comparatively under-explored. Building on the notion of secondary orality and a rejection of a purely visual culture, the paper argues that considering voice and speech at scale enriches our understanding of multimodal digital content. The paper presents the TikTok Subtitles Toolkit, which offers accessible speech processing readily compatible with existing workflows. In doing so, it opens new avenues for large-scale inquiries that blend quantitative insights with qualitative precision. Two illustrative cases highlight both the opportunities and the limitations of speech research: while genres like #storytime on TikTok benefit from the exploration of spoken narratives, nonverbal or music-driven content may not yield significant insights from speech data. The article encourages researchers to integrate aural exploration thoughtfully to complement existing methods, rather than replace them. I conclude that the expansion of our methodological repertoire enables richer interpretations of platformised content and strengthens our capacity to unpack digital cultures as they become increasingly multimodal.
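The abstract does not document the toolkit's interface, so the following is only a minimal sketch of the kind of workflow it describes: transcribing a folder of short-form videos at scale so the resulting text can join an existing content-analysis pipeline. The open-source Whisper model, directory layout, model size, and CSV output below are illustrative assumptions, not the toolkit's actual API.

```python
# Illustrative sketch only: approximates the speech-to-text workflow described
# in the abstract with the open-source Whisper model. Paths, model size, and
# the CSV output format are assumptions for demonstration.
import csv
from pathlib import Path

import whisper  # pip install openai-whisper (requires ffmpeg)

model = whisper.load_model("base")  # small checkpoint; larger ones are more accurate

rows = []
for video in sorted(Path("videos").glob("*.mp4")):
    # Whisper accepts video files directly; ffmpeg extracts the audio track.
    result = model.transcribe(str(video))
    rows.append({"video": video.name, "transcript": result["text"].strip()})

# Keep transcripts keyed by video so they can later be joined with platform
# metadata (hashtags, captions, engagement) in an existing analysis workflow.
with open("transcripts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["video", "transcript"])
    writer.writeheader()
    writer.writerows(rows)
```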
Related papers
- Acoustic and linguistic representations for speech continuous emotion recognition in call center conversations [2.0653090022137697]
We explore the use of pre-trained speech representations as a form of transfer learning on the AlloSat corpus.
Our experiments confirm the large gain in performance obtained with the use of pre-trained features.
Surprisingly, we found that the linguistic content is clearly the major contributor to the prediction of satisfaction.
arXiv Detail & Related papers (2023-10-06T10:22:51Z)
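As a rough illustration of what working with pre-trained speech representations can look like in practice, the sketch below extracts clip-level features with a wav2vec 2.0 checkpoint via the Hugging Face transformers library; the specific checkpoint, dummy input, and mean-pooling step are assumptions for demonstration, not details reported in the paper.

```python
# Minimal sketch, assuming a wav2vec 2.0 checkpoint from Hugging Face; the
# paper may use different models, layers, or pooling strategies.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# Dummy 1-second, 16 kHz mono waveform; a real pipeline would load call audio.
waveform = torch.randn(16000)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level contextual representations, mean-pooled into one clip vector
# that a downstream regressor or classifier (e.g. satisfaction) can consume.
features = outputs.last_hidden_state.mean(dim=1)
print(features.shape)  # torch.Size([1, 768]) for the base model
```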
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved by jointly leveraging visual and linguistic similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the typically limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language across 180+ hours of video and 9,000+ slides, covering 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
- Semantic Role Aware Correlation Transformer for Text to Video Retrieval [23.183653281610866]
This paper proposes a novel transformer that explicitly disentangles the text and video into semantic roles of objects, spatial contexts and temporal contexts.
Preliminary results on the popular YouCook2 dataset indicate that our approach surpasses a current state-of-the-art method by a large margin on all metrics.
arXiv Detail & Related papers (2022-06-26T11:28:03Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Emotion Based Hate Speech Detection using Multimodal Learning [0.0]
This paper proposes the first multimodal deep learning framework to combine auditory features representing emotion with semantic features in order to detect hateful content.
Our results demonstrate that incorporating emotional attributes leads to significant improvement over text-based models in detecting hateful multimedia content.
arXiv Detail & Related papers (2022-02-13T05:39:47Z)
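The combination of emotional audio features with textual semantics described above can be realised in many ways; the sketch below shows a generic late-fusion classifier in PyTorch, with embedding dimensions, dropout, and the binary output chosen purely for illustration rather than taken from the paper's reported architecture.

```python
# Generic late-fusion sketch (not the paper's architecture): concatenate an
# acoustic (emotion) embedding with a text embedding and classify with an MLP.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, hidden=256, n_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, audio_emb, text_emb):
        # Simple concatenation fusion of the two modality embeddings.
        return self.mlp(torch.cat([audio_emb, text_emb], dim=-1))

# Dummy batch: in practice these would come from a speech-emotion encoder
# and a text encoder, respectively.
audio_emb = torch.randn(4, 768)
text_emb = torch.randn(4, 768)
logits = LateFusionClassifier()(audio_emb, text_emb)
print(logits.shape)  # torch.Size([4, 2])
```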
- Visual Persuasion in COVID-19 Social Media Content: A Multi-Modal Characterization [30.710295617831015]
This work proposes a computational approach to analyze the outcome of persuasive information in multi-modal content.
It focuses on two aspects, popularity and reliability, in COVID-19-related news articles shared on Twitter.
arXiv Detail & Related papers (2021-12-05T02:15:01Z)
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not yet converged on a definitive solution.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques; more recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
- Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.