Multi-modal Representation Learning for Social Post Location Inference
- URL: http://arxiv.org/abs/2306.07935v1
- Date: Sun, 11 Jun 2023 02:35:48 GMT
- Title: Multi-modal Representation Learning for Social Post Location Inference
- Authors: Ruiting Dai, Jiayi Luo, Xucheng Luo, Lisi Mo, Wanlun Ma, Fan Zhou
- Abstract summary: In this work, we propose a novel Multi-modal Representation Learning Framework (MRLF) capable of fusing different modalities of social posts for location inference.
To overcome the noisy user-generated textual content, we introduce a novel attention-based character-aware module.
The experimental results show that MRLF can make accurate location predictions and open a new door to understanding the multi-modal data of social posts for online inference tasks.
- Score: 7.911777986696313
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Inferring geographic locations via social posts is essential for many
practical location-based applications such as product marketing,
point-of-interest recommendation, and infector tracking for COVID-19. Unlike
image-based location retrieval or social-post text embedding-based location
inference, the combined effect of multi-modal information (i.e., post images,
text, and hashtags) for social post positioning receives less attention. In
this work, we collect real datasets of social posts with images, texts, and
hashtags from Instagram and propose a novel Multi-modal Representation Learning
Framework (MRLF) capable of fusing different modalities of social posts for
location inference. MRLF integrates a multi-head attention mechanism to enhance
location-salient information extraction while significantly improving location
inference compared with single domain-based methods. To overcome the noisy
user-generated textual content, we introduce a novel attention-based
character-aware module that considers the relative dependencies between
characters of social post texts and hashtags for flexible multi-modal
information fusion. The experimental results show that MRLF can make accurate
location predictions and open a new door to understanding the multi-modal data
of social posts for online inference tasks.
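The core idea of attention-based multi-modal fusion can be loosely illustrated with a single-head scaled dot-product attention over per-modality embeddings. This is an illustrative sketch, not the paper's actual MRLF architecture: the embedding values, the learned location query, and the function names are all hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_fuse(query, modality_vecs):
    """Scaled dot-product attention over modality embeddings.

    The fused representation is a weighted sum of the modality vectors,
    weighted by each vector's scaled similarity to the query.
    """
    d = len(query)
    scores = [dot(query, v) / math.sqrt(d) for v in modality_vecs]
    weights = softmax(scores)
    fused = [sum(w * v[i] for w, v in zip(weights, modality_vecs))
             for i in range(d)]
    return fused, weights

# Toy 4-d embeddings for the three post modalities (illustrative values).
text_vec    = [0.9, 0.1, 0.0, 0.2]
image_vec   = [0.1, 0.8, 0.3, 0.0]
hashtag_vec = [0.4, 0.4, 0.1, 0.1]
query       = [1.0, 0.0, 0.0, 0.0]  # hypothetical learned location query

fused, weights = attention_fuse(query, [text_vec, image_vec, hashtag_vec])
```

Here the query most resembles the text embedding, so the text modality receives the largest attention weight; a multi-head variant would run several such attentions in parallel with different learned projections and concatenate the results.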
Related papers
- MSSPlace: Multi-Sensor Place Recognition with Visual and Text Semantics [41.94295877935867]
We study the impact of leveraging a multi-camera setup and integrating diverse data sources for multimodal place recognition.
Our proposed method named MSSPlace utilizes images from multiple cameras, LiDAR point clouds, semantic segmentation masks, and text annotations to generate comprehensive place descriptors.
arXiv Detail & Related papers (2024-07-22T14:24:56Z)
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- SoMeR: Multi-View User Representation Learning for Social Media [1.7949335303516192]
We propose SoMeR, a Social Media user representation learning framework that incorporates temporal activities, text content, profile information, and network interactions to learn comprehensive user portraits.
SoMeR encodes user post streams as sequences of timestamped textual features, uses transformers to embed these sequences along with profile data, and jointly trains with link prediction and contrastive learning objectives.
We demonstrate SoMeR's versatility through two applications: 1) Identifying inauthentic accounts involved in coordinated influence operations by detecting users posting similar content simultaneously, and 2) Measuring increased polarization in online discussions after major events by quantifying how users with different beliefs moved farther apart.
arXiv Detail & Related papers (2024-05-02T22:26:55Z)
- Multi-modal Stance Detection: New Datasets and Model [56.97470987479277]
We study multi-modal stance detection for tweets consisting of texts and images.
We propose a simple yet effective Targeted Multi-modal Prompt Tuning framework (TMPT).
TMPT achieves state-of-the-art performance in multi-modal stance detection.
arXiv Detail & Related papers (2024-02-22T05:24:19Z)
- Improving Social Media Popularity Prediction with Multiple Post Dependencies [33.517898847695136]
We propose a novel prediction framework named Dependency-aware Sequence Network (DSN).
For intra-post dependency, DSN adopts a multimodal feature extractor with an efficient fine-tuning strategy to obtain task-specific representations from images and textual information of posts.
For inter-post dependency, DSN uses a hierarchical information propagation method to learn category representations that could better describe the difference between posts.
arXiv Detail & Related papers (2023-07-28T09:06:50Z)
- A Transformer-based Framework for POI-level Social Post Geolocation [4.027087283290081]
We present a transformer-based general framework, which builds upon pre-trained language models and considers non-textual data.
We show that three variants of our proposed framework outperform multiple state-of-the-art baselines by a large margin in terms of accuracy and distance error metrics.
arXiv Detail & Related papers (2022-10-26T10:30:51Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
Text-image person re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval [66.2075707179047]
We propose RoME, a novel mixture-of-experts transformer that disentangles the text and the video into three levels.
We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels.
Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
arXiv Detail & Related papers (2022-06-26T11:12:49Z)
- FiLMing Multimodal Sarcasm Detection with Attention [0.7340017786387767]
Sarcasm detection identifies natural language expressions whose intended meaning differs from their literal surface meaning.
We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes.
Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal detection dataset.
arXiv Detail & Related papers (2021-08-09T06:33:29Z)
- Enhancing Social Relation Inference with Concise Interaction Graph and Discriminative Scene Representation [56.25878966006678]
We propose an approach of PRactical Inference in Social rElation (PRISE).
It concisely learns interactive features of persons and discriminative features of holistic scenes.
PRISE achieves a 6.8% improvement for domain classification on the PIPA dataset.
arXiv Detail & Related papers (2021-07-30T04:20:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.