URL-BERT: Training Webpage Representations via Social Media Engagements
- URL: http://arxiv.org/abs/2310.16303v1
- Date: Wed, 25 Oct 2023 02:22:50 GMT
- Title: URL-BERT: Training Webpage Representations via Social Media Engagements
- Authors: Ayesha Qamar, Chetan Verma, Ahmed El-Kishky, Sumit Binnani, Sneha
Mehta, Taylor Berg-Kirkpatrick
- Abstract summary: We introduce a new pre-training objective that can be used to adapt LMs to understand URLs and webpages.
Our proposed framework consists of two steps: (1) scalable graph embeddings to learn shallow representations of URLs based on user engagement on social media, and (2) a contrastive objective that aligns LM representations with these graph-based representations.
We experimentally demonstrate that our continued pre-training approach improves webpage understanding on a variety of tasks and Twitter internal and external benchmarks.
- Score: 31.6455614291821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding and representing webpages is crucial to online social networks
where users may share and engage with URLs. Common language model (LM) encoders
such as BERT can be used to understand and represent the textual content of
webpages. However, these representations may not model thematic information of
web domains and URLs or accurately capture their appeal to social media users.
In this work, we introduce a new pre-training objective that can be used to
adapt LMs to understand URLs and webpages. Our proposed framework consists of
two steps: (1) scalable graph embeddings to learn shallow representations of
URLs based on user engagement on social media and (2) a contrastive objective
that aligns LM representations with the aforementioned graph-based
representation. We apply our framework to the multilingual version of BERT to
obtain the model URL-BERT. We experimentally demonstrate that our continued
pre-training approach improves webpage understanding on a variety of tasks and
Twitter internal and external benchmarks.
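The second step of the framework above, a contrastive objective aligning LM representations with graph-based URL representations, can be sketched as follows. This is a minimal illustration only, assuming an InfoNCE-style loss with in-batch negatives; the embedding values, dimensionality, and temperature are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_alignment_loss(lm_embs, graph_embs, temperature=0.1):
    """InfoNCE-style alignment loss: each LM embedding should be most
    similar to the graph embedding of the same URL, relative to the
    graph embeddings of the other URLs in the batch (negatives)."""
    n = len(lm_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(lm_embs[i], graph_embs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the positive pair at index i.
        total += -(logits[i] - log_denom)
    return total / n

# Toy batch: 3 URLs with 2-d embeddings (hypothetical values).
lm_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
graph_embs = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.6]]
loss = contrastive_alignment_loss(lm_embs, graph_embs)
```

Minimizing this loss pulls each URL's LM embedding toward its engagement-derived graph embedding while pushing it away from other URLs' embeddings, which is how the contrastive step transfers engagement signal into the language model.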
Related papers
- SocialQuotes: Learning Contextual Roles of Social Media Quotes on the Web [9.130915550141337]
We liken social media embeddings to quotes, formalize the page context as structured natural language signals, and identify a taxonomy of roles for quotes within the page context.
We release SocialQuotes, a new data set built from the Common Crawl of over 32 million social quotes, 8.3k of them with crowdsourced quote annotations.
arXiv Detail & Related papers (2024-07-22T19:21:01Z)
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- SoMeR: Multi-View User Representation Learning for Social Media [1.7949335303516192]
We propose SoMeR, a Social Media user representation learning framework that incorporates temporal activities, text content, profile information, and network interactions to learn comprehensive user portraits.
SoMeR encodes user post streams as sequences of timestamped textual features, uses transformers to embed this along with profile data, and jointly trains with link prediction and contrastive learning objectives.
We demonstrate SoMeR's versatility through two applications: 1) identifying inauthentic accounts involved in coordinated influence operations by detecting users posting similar content simultaneously, and 2) measuring increased polarization in online discussions after major events by quantifying how users with different beliefs moved farther apart.
arXiv Detail & Related papers (2024-05-02T22:26:55Z)
- Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding [22.00873805952277]
WebLM is a multimodal pre-training network designed to address the limitations of solely modeling text and structure modalities of HTML in webpages.
We propose several pre-training tasks to model the interaction among text, structure, and image modalities effectively.
Empirical results demonstrate that the pre-trained WebLM significantly surpasses previous state-of-the-art pre-trained models across several webpage understanding tasks.
arXiv Detail & Related papers (2024-02-28T11:50:36Z)
- URLBERT: A Contrastive and Adversarial Pre-trained Model for URL Classification [10.562100395816595]
URLs play a crucial role in understanding and categorizing web content.
This paper introduces URLBERT, the first pre-trained representation learning model applied to a variety of URL classification or detection tasks.
arXiv Detail & Related papers (2024-02-18T07:51:20Z)
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding [58.70423899829642]
We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
arXiv Detail & Related papers (2022-10-07T06:42:06Z)
- WEDGE: Web-Image Assisted Domain Generalization for Semantic Segmentation [72.88657378658549]
We propose a WEb-image assisted Domain GEneralization scheme, which is the first to exploit the diversity of web-crawled images for generalizable semantic segmentation.
We also present a method which injects styles of the web-crawled images into training images on-the-fly during training, which enables the network to experience images of diverse styles with reliable labels for effective training.
arXiv Detail & Related papers (2021-09-29T05:19:58Z)
- Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.