Understanding Social Media Cross-Modality Discourse in Linguistic Space
- URL: http://arxiv.org/abs/2302.13311v1
- Date: Sun, 26 Feb 2023 13:04:04 GMT
- Title: Understanding Social Media Cross-Modality Discourse in Linguistic Space
- Authors: Chunpu Xu, Hanzhuo Tan, Jing Li, Piji Li
- Abstract summary: We present a novel concept of cross-modality discourse, reflecting how human readers couple image and text understandings.
We build the very first dataset containing 16K multimedia tweets with manually annotated discourse labels.
- Score: 26.19949919969774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimedia communication combining text and images is popular on social
media. However, few studies examine how images are structured with texts to
form coherent meanings in human cognition. To fill this gap, we present a
novel concept of cross-modality discourse, reflecting how human readers couple
image and text understandings. Text descriptions are first derived from images
(named as subtitles) in the multimedia contexts. Five labels -- entity-level
insertion, projection and concretization and scene-level restatement and
extension -- are further employed to shape the structure of subtitles and texts
and present their joint meanings. As a pilot study, we also build the very
first dataset containing 16K multimedia tweets with manually annotated
discourse labels. The experimental results show that the multimedia encoder
based on multi-head attention with captions obtains
state-of-the-art results.
Related papers
- C-CLIP: Contrastive Image-Text Encoders to Close the Descriptive-Commentative Gap [0.5439020425819]
The interplay between the image and comment on a social media post is one of high importance for understanding its overall message.
Recent strides in multimodal embedding models, namely CLIP, have provided an avenue forward in relating image and text.
The current training regime for CLIP models is insufficient for matching content found on social media, regardless of site or language.
We show that training contrastive image-text encoders on explicitly commentative pairs results in large improvements in retrieval results.
arXiv Detail & Related papers (2023-09-06T19:03:49Z)
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved via jointly leveraging visual and lingual similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the typically limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
arXiv Detail & Related papers (2021-08-14T04:00:42Z)
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not reached a conclusive answer yet.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Text-Free Image-to-Speech Synthesis Using Learned Segmental Units [24.657722909094662]
We present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images.
We connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units.
We conduct experiments on the Flickr8k spoken caption dataset and a novel corpus of spoken audio captions collected for the popular MSCOCO dataset.
arXiv Detail & Related papers (2020-12-31T05:28:38Z)
- Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z)
- Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval [41.505920288928365]
Multimodal data has inspired interest in cross-modal retrieval methods.
We propose novel within-modality losses which encourage semantic coherency in both the text and image subspaces.
Our method ensures that not only are paired images and texts close, but the expected image-image and text-text relationships are also observed.
arXiv Detail & Related papers (2020-07-16T20:32:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.