HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with
Context Augmentation and Visual Assistance
- URL: http://arxiv.org/abs/2311.18273v1
- Date: Thu, 30 Nov 2023 06:23:15 GMT
- Title: HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with
Context Augmentation and Visual Assistance
- Authors: Zhuohao Yin, Xin Huang
- Abstract summary: We propose a multi-modal retrieval framework that maximally leverages pretrained Vision-Language models.
Our system does not produce the most competitive results at SemEval-2023 Task 1, but it still outperforms nearly half of the participating teams.
- Score: 5.5532783549057845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Word Sense Disambiguation (VWSD) is a multi-modal task that aims to
select, among a batch of candidate images, the one that best entails the target
word's meaning within a limited context. In this paper, we propose a
multi-modal retrieval framework that maximally leverages pretrained
Vision-Language models, as well as open knowledge bases and datasets. Our
system consists of the following key components: (1) Gloss matching: a
pretrained bi-encoder model is used to match contexts with proper senses of the
target words; (2) Prompting: matched glosses and other textual information,
such as synonyms, are incorporated using a prompting template; (3) Image
retrieval: semantically matching images are retrieved from large open datasets
using prompts as queries; (4) Modality fusion: contextual information from
different modalities is fused and used for prediction. Although our system
does not produce the most competitive results at SemEval-2023 Task 1, it still
outperforms nearly half of the participating teams. More importantly, our
experiments yield valuable insights for the field of Word Sense Disambiguation
(WSD) and multi-modal learning. Our code is available on GitHub.
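To make the pipeline concrete, here is a minimal sketch of how components (1), (2), and (4) could be wired together with off-the-shelf tooling (WordNet via NLTK for glosses, a sentence-transformers bi-encoder for gloss matching, and CLIP for image scoring). The model names, prompt template, and ranking below are illustrative assumptions, not the authors' exact implementation; component (3), retrieving extra images from large open datasets, is omitted for brevity.

```python
# Sketch of a VWSD pipeline: gloss matching -> prompting -> CLIP scoring.
# All model choices and the prompt template are illustrative assumptions.
import torch
from nltk.corpus import wordnet as wn                 # open knowledge base
from sentence_transformers import SentenceTransformer, util
from transformers import CLIPModel, CLIPProcessor

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # (1) gloss matching
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def match_gloss(target_word: str, context: str) -> str:
    """Pick the WordNet gloss closest to the limited context."""
    synsets = wn.synsets(target_word)
    if not synsets:
        return context                                # fall back to raw context
    glosses = [s.definition() for s in synsets]
    sims = util.cos_sim(bi_encoder.encode(context, convert_to_tensor=True),
                        bi_encoder.encode(glosses, convert_to_tensor=True))
    return glosses[int(sims.argmax())]

def build_prompt(target_word: str, context: str) -> str:
    # (2) Prompting: an assumed template incorporating the matched gloss.
    return f"{context}. Here, {target_word} means: {match_gloss(target_word, context)}"

def rank_candidates(target_word: str, context: str, candidate_images: list) -> list:
    """(4) Score every candidate image against the prompt; best index first."""
    inputs = clip_proc(text=[build_prompt(target_word, context)],
                       images=candidate_images, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image.squeeze(1)  # one score per image
    return scores.argsort(descending=True).tolist()
```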
Related papers
- Large Language Models and Multimodal Retrieval for Visual Word Sense
Disambiguation [1.8591405259852054]
Visual Word Sense Disambiguation (VWSD) is a novel, challenging task whose goal is to retrieve, from a set of candidates, the image that best represents the meaning of an ambiguous word within a given context.
In this paper, we take a substantial step towards unveiling this interesting task by applying a varied set of approaches.
arXiv Detail & Related papers (2023-10-21T14:35:42Z)
- DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition [94.90258603217008]
The MultiCoNER II shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios.
Previous top systems in MultiCoNER I incorporated either knowledge bases or gazetteers.
We propose a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER.
arXiv Detail & Related papers (2023-05-05T16:59:26Z)
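For intuition, a minimal sketch of the retrieval-augmentation idea (a stand-in, not U-RaNER itself): fetch the most similar knowledge passage for the input sentence and hand it to the tagger as extra context. The retriever, corpus, and NER model below are illustrative assumptions.

```python
# Retrieval-augmented NER sketch: retrieve supporting context, then tag.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

retriever = SentenceTransformer("all-MiniLM-L6-v2")
tagger = pipeline("token-classification", model="dslim/bert-base-NER",
                  aggregation_strategy="simple")

knowledge = [
    "Steve Jobs co-founded Apple Inc. in Cupertino, California.",
    "The Amazon river flows through South America.",
]
knowledge_embs = retriever.encode(knowledge, convert_to_tensor=True)

def ner_with_retrieval(sentence: str, k: int = 1):
    sims = util.cos_sim(retriever.encode(sentence, convert_to_tensor=True),
                        knowledge_embs).squeeze(0)
    context = " ".join(knowledge[i] for i in sims.topk(k).indices.tolist())
    # A real system would keep only entity spans inside the original sentence.
    return tagger(f"{sentence} {context}")

print(ner_with_retrieval("Jobs moved back to Cupertino."))
```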
- OPI at SemEval 2023 Task 1: Image-Text Embeddings and Multimodal Information Retrieval for Visual Word Sense Disambiguation [0.0]
We present our submission to SemEval 2023 visual word sense disambiguation shared task.
The proposed system integrates multimodal embeddings, learning-to-rank methods, and knowledge-based approaches.
Our solution was ranked third in the multilingual task and won in the Persian track, one of the three language subtasks.
arXiv Detail & Related papers (2023-04-14T13:45:59Z)
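To make "learning to rank over multimodal features" concrete, here is a toy, hypothetical sketch (not OPI's system): each candidate image is reduced to a small feature vector, e.g. a CLIP similarity and a knowledge-based similarity, and a simple pointwise ranker is fit so gold candidates score highest.

```python
# Toy pointwise learning-to-rank over multimodal features (hypothetical).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.31, 0.20], [0.27, 0.05], [0.55, 0.41],   # instance 1 candidates
              [0.60, 0.52], [0.22, 0.11], [0.18, 0.09]])  # instance 2 candidates
y = np.array([0, 0, 1, 1, 0, 0])          # 1 marks the gold image

ranker = LogisticRegression().fit(X, y)

def rank(candidates: np.ndarray) -> np.ndarray:
    """Indices of one instance's candidates, best first."""
    return np.argsort(-ranker.predict_proba(candidates)[:, 1])

print(rank(X[:3]))   # ranking for the first instance's three candidates
```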
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
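A hedged sketch of that encoding scheme, with assumed component choices (BERT as the Transformer encoder, ResNet-50 as the CNN) and a single attention step standing in for the paper's fusion mechanism:

```python
# Text via a Transformer encoder, retrieved images via a CNN, then fuse.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
text_enc = AutoModel.from_pretrained("bert-base-uncased")
cnn = resnet50(weights=ResNet50_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()              # expose 2048-d pooled image features
proj = torch.nn.Linear(2048, 768)         # map image features to the text dim

def fuse(sentence: str, images: torch.Tensor) -> torch.Tensor:
    """images: (num_retrieved, 3, 224, 224), already preprocessed."""
    with torch.no_grad():
        t = text_enc(**tok(sentence, return_tensors="pt")).last_hidden_state[0]
        v = proj(cnn(images))                           # (num_images, 768)
    attn = F.softmax(t @ v.T, dim=-1)                   # tokens attend to images
    return torch.cat([t, attn @ v], dim=-1)             # (seq_len, 1536)
```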
- Multi-Granularity Cross-Modality Representation Learning for Named Entity Recognition on Social Media [11.235498285650142]
Named Entity Recognition (NER) on social media refers to discovering and classifying entities from unstructured free-form content.
This work introduces multi-granularity cross-modality representation learning.
Experiments show that our proposed approach can achieve the SOTA or approximate SOTA performance on two benchmark datasets of tweets.
arXiv Detail & Related papers (2022-10-19T15:14:55Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
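The following toy block illustrates the general "fusion in the backbone" idea, i.e. cross-attention inserted inside a backbone layer rather than stacked on top of finished uni-modal encoders; the gating and dimensions are illustrative, not FIBER's exact design.

```python
# A backbone block with an optional cross-attention branch to the other
# modality, so fusion happens deep inside the model.
import torch
import torch.nn as nn

class FusedBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))   # learned gate, starts closed
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, other=None):
        h = self.n1(x)
        x = x + self.self_attn(h, h, h)[0]
        if other is not None:                       # fusion inside the backbone
            x = x + self.alpha * self.cross_attn(x, other, other)[0]
        return x + self.ffn(self.n2(x))

text = torch.randn(2, 16, 768)      # (batch, text tokens, dim)
image = torch.randn(2, 49, 768)     # (batch, image patches, dim)
print(FusedBlock()(text, other=image).shape)   # torch.Size([2, 16, 768])
```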
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Connect-the-Dots: Bridging Semantics between Words and Definitions via Aligning Word Sense Inventories [47.03271152494389]
Word Sense Disambiguation aims to automatically identify the exact meaning of a word according to its context.
Existing supervised models struggle to make correct predictions on rare word senses due to limited training data.
We propose a gloss alignment algorithm that can align definition sentences with the same meaning from different sense inventories to collect rich lexical knowledge.
arXiv Detail & Related papers (2021-10-27T00:04:33Z)
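One simple way to realize such an alignment, shown purely as an assumed sketch rather than the paper's actual algorithm: embed the definition sentences from two inventories and keep mutual nearest neighbours above a similarity threshold as aligned sense pairs.

```python
# Gloss alignment across sense inventories via mutual nearest neighbours.
# The encoder, threshold, and toy glosses are assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

wordnet_glosses = ["a financial institution that accepts deposits",
                   "sloping land beside a body of water"]
wiktionary_glosses = ["the land alongside a river or lake",
                      "an institution where one can deposit money"]

def align(glosses_a, glosses_b, threshold=0.5):
    sims = util.cos_sim(encoder.encode(glosses_a, convert_to_tensor=True),
                        encoder.encode(glosses_b, convert_to_tensor=True))
    pairs = []
    for i in range(sims.shape[0]):
        j = int(sims[i].argmax())
        # keep only mutual nearest neighbours with a minimum similarity
        if int(sims[:, j].argmax()) == i and sims[i, j] > threshold:
            pairs.append((glosses_a[i], glosses_b[j], float(sims[i, j])))
    return pairs

for a, b, s in align(wordnet_glosses, wiktionary_glosses):
    print(f"{s:.2f}  {a}  <->  {b}")
```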
- Learning to Prompt for Vision-Language Models [82.25005817904027]
Vision-language pre-training has emerged as a promising alternative for representation learning.
It shifts from the tradition of using images and discrete labels to learn a fixed set of weights, seen as visual concepts, to aligning images and raw text with two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
arXiv Detail & Related papers (2021-09-02T17:57:31Z)
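The zero-shot transfer this paradigm enables fits in a few lines. The sketch below uses CLIP with a hand-crafted "a photo of a ..." prompt, i.e. the baseline that prompt-learning methods like the one above replace with learned context vectors; the image file name is a placeholder.

```python
# Zero-shot classification with two aligned encoders and a text prompt.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {c}" for c in classes]    # hand-crafted context
image = Image.open("example.jpg")                   # placeholder input

inputs = proc(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image      # (1, num_classes)
print(classes[int(logits.argmax())])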
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss.
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
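The hinge-based triplet ranking objective is standard enough to sketch generically: a matched image-text pair should outscore mismatched pairs within the batch by at least a margin. This is a common bidirectional formulation, not necessarily the paper's exact one.

```python
# Bidirectional hinge-based triplet ranking loss over a batch.
import torch
import torch.nn.functional as F

def triplet_ranking_loss(img, txt, margin: float = 0.2) -> torch.Tensor:
    """img, txt: (batch, dim) embeddings where row i of each is a match."""
    img, txt = F.normalize(img, dim=1), F.normalize(txt, dim=1)
    scores = img @ txt.T                            # (batch, batch) similarities
    pos = scores.diag().unsqueeze(1)                # matching-pair scores
    mask = torch.eye(len(scores), dtype=torch.bool)
    cost_im = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)
    cost_tx = (margin + scores - pos.T).clamp(min=0).masked_fill(mask, 0)
    return cost_im.mean() + cost_tx.mean()

print(triplet_ranking_loss(torch.randn(8, 512), torch.randn(8, 512)))
```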