OPI at SemEval 2023 Task 1: Image-Text Embeddings and Multimodal
Information Retrieval for Visual Word Sense Disambiguation
- URL: http://arxiv.org/abs/2304.07127v1
- Date: Fri, 14 Apr 2023 13:45:59 GMT
- Title: OPI at SemEval 2023 Task 1: Image-Text Embeddings and Multimodal
Information Retrieval for Visual Word Sense Disambiguation
- Authors: S{\l}awomir Dadas
- Abstract summary: We present our submission to SemEval 2023 visual word sense disambiguation shared task.
The proposed system integrates multimodal embeddings, learning to rank methods, and knowledge-based approaches.
Our solution was ranked third in the multilingual task and won in the Persian track, one of the three language subtasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of visual word sense disambiguation is to find the image that best
matches the provided description of the word's meaning. It is a challenging
problem, requiring approaches that combine language and image understanding. In
this paper, we present our submission to SemEval 2023 visual word sense
disambiguation shared task. The proposed system integrates multimodal
embeddings, learning to rank methods, and knowledge-based approaches. We build
a classifier based on the CLIP model, whose results are enriched with
additional information retrieved from Wikipedia and lexical databases. Our
solution was ranked third in the multilingual task and won in the Persian
track, one of the three language subtasks.
Related papers
- HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with
Context Augmentation and Visual Assistance [5.5532783549057845]
We propose a multi-modal retrieval framework that maximally leverages pretrained Vision-Language models.
Our system does not produce the most competitive results at SemEval-2023 Task 1, but we are still able to beat nearly half of the teams.
arXiv Detail & Related papers (2023-11-30T06:23:15Z) - Large Language Models and Multimodal Retrieval for Visual Word Sense
Disambiguation [1.8591405259852054]
Visual Word Sense Disambiguation (VWSD) is a novel challenging task with the goal of retrieving an image among a set of candidates.
In this paper, we make a substantial step towards unveiling this interesting task by applying a varying set of approaches.
arXiv Detail & Related papers (2023-10-21T14:35:42Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images either from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z) - I2DFormer: Learning Image to Document Attention for Zero-Shot Image
Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z) - 1Cademy at Semeval-2022 Task 1: Investigating the Effectiveness of
Multilingual, Multitask, and Language-Agnostic Tricks for the Reverse
Dictionary Task [13.480318097164389]
We focus on the Reverse Dictionary Track of the SemEval2022 task of matching dictionary glosses to word embeddings.
Models convert the input of sentences to three types of embeddings: SGNS, Char, and Electra.
Our proposed Elmobased monolingual model achieves the highest outcome.
arXiv Detail & Related papers (2022-06-08T06:39:04Z) - Building a visual semantics aware object hierarchy [0.0]
We propose a novel unsupervised method to build visual semantics aware object hierarchy.
Our intuition in this paper comes from real-world knowledge representation where concepts are hierarchically organized.
The evaluation consists of two parts, firstly we apply the constructed hierarchy on the object recognition task and then we compare our visual hierarchy and existing lexical hierarchies to show the validity of our method.
arXiv Detail & Related papers (2022-02-26T00:10:21Z) - Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z) - Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense captioning, a novel task which aims to generate captions with respect to information between objects in a visual scene.
This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z) - Object Relational Graph with Teacher-Recommended Learning for Video
Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z) - Fine-grained Image Classification and Retrieval by Combining Visual and
Locally Pooled Textual Features [8.317191999275536]
In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks.
In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities.
arXiv Detail & Related papers (2020-01-14T12:06:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.