LLVMs4Protest: Harnessing the Power of Large Language and Vision Models
for Deciphering Protests in the News
- URL: http://arxiv.org/abs/2311.18241v1
- Date: Thu, 30 Nov 2023 04:17:30 GMT
- Title: LLVMs4Protest: Harnessing the Power of Large Language and Vision Models
for Deciphering Protests in the News
- Authors: Yongjun Zhang
- Abstract summary: This article documents how we fine-tuned two large pretrained transformer models, including longformer and swin-transformer v2, to infer potential protests in news articles using textual and imagery data.
We release this short technical report for social movement scholars who are interested in using LLVMs to infer protests in textual and imagery data.
- Score: 3.313485776871956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language and vision models have transformed how social movement
scholars identify protests and extract key protest attributes from multi-modal
data such as texts, images, and videos. This article documents how we
fine-tuned two large pretrained transformer models, including longformer and
swin-transformer v2, to infer potential protests in news articles using textual
and imagery data. First, the longformer model was fine-tuned using the Dynamics
of Collective Action (DoCA) Corpus. We matched New York Times articles with
the DoCA database to obtain a training dataset for downstream tasks. Second,
the swin-transformer v2 model was trained on UCLA-protest imagery data. The
UCLA-protest project contains labeled imagery data with information such as
protest, violence, and sign. Both fine-tuned models will be available via
\url{https://github.com/Joshzyj/llvms4protest}. We release this short technical
report for social movement scholars who are interested in using LLVMs to infer
protests in textual and imagery data.
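For readers who want to apply the released text model, the following is a minimal inference sketch, assuming the fine-tuned longformer is distributed as a standard Hugging Face sequence-classification checkpoint. The checkpoint path and the label order are placeholders, not documented here; consult the GitHub repository above for the actual weights and label mapping.

```python
# Minimal sketch: score a news article for protest content with the
# fine-tuned longformer. MODEL_PATH is a placeholder; obtain the actual
# checkpoint from https://github.com/Joshzyj/llvms4protest.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "path/to/longformer-protest-checkpoint"  # placeholder (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

article = "Hundreds of demonstrators marched through downtown on Saturday ..."

# Longformer accepts long inputs; 4096 tokens covers most news articles.
inputs = tokenizer(article, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()

# Assumes a binary head with index 1 = "protest"; check the released
# model's config.id2label for the actual label order.
print(f"P(protest) = {probs[1]:.3f}")
```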
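Similarly, a minimal sketch for the imagery model, assuming the fine-tuned swin-transformer v2 is released as a Hugging Face image-classification checkpoint whose outputs correspond to UCLA-protest attributes such as protest, violence, and sign. The checkpoint path and the multi-label treatment are assumptions.

```python
# Minimal sketch: score a news photo with the fine-tuned swin-transformer v2.
# MODEL_PATH is a placeholder; the attribute names follow the UCLA-protest
# annotation scheme described above.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

MODEL_PATH = "path/to/swinv2-protest-checkpoint"  # placeholder (assumption)

processor = AutoImageProcessor.from_pretrained(MODEL_PATH)
model = AutoModelForImageClassification.from_pretrained(MODEL_PATH)
model.eval()

image = Image.open("news_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Treating the attribute heads as independent (multi-label) is an assumption;
# inspect model.config.id2label and problem_type on the released checkpoint.
scores = torch.sigmoid(logits).squeeze()
for idx, name in model.config.id2label.items():
    print(f"{name}: {scores[idx]:.3f}")
```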
Related papers
- TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation [22.782099757385804]
TIP-I2V is the first large-scale dataset of user-provided text and image prompts for image-to-video generation.
We provide the corresponding generated videos from five state-of-the-art image-to-video models.
arXiv Detail & Related papers (2024-11-05T18:52:43Z)
- Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model.
We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
arXiv Detail & Related papers (2023-05-22T13:12:16Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z) - CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language
Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Robustness Analysis of Video-Language Models Against Visual and Language Perturbations [10.862722733649543]
This is the first extensive study of the robustness of video-language models against various real-world perturbations.
We propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different text perturbations.
arXiv Detail & Related papers (2022-07-05T16:26:05Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Using Word Embeddings to Analyze Protests News [2.024222101808971]
Two well-performing models have been chosen to replace the existing word embeddings, word2vec and FastText, with ELMo and DistilBERT.
Unlike bag of words or earlier vector approaches, ELMo and DistilBERT represent words as a sequence of vectors by capturing the meaning based on contextual information in the text.
arXiv Detail & Related papers (2022-03-11T12:25:59Z)
- VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.