Detecting Euphemisms with Literal Descriptions and Visual Imagery
- URL: http://arxiv.org/abs/2211.04576v1
- Date: Tue, 8 Nov 2022 21:50:05 GMT
- Title: Detecting Euphemisms with Literal Descriptions and Visual Imagery
- Authors: İlker Kesen, Aykut Erdem, Erkut Erdem and Iacer Calixto
- Abstract summary: This paper describes our two-stage system for the Euphemism Detection shared task hosted by the 3rd Workshop on Figurative Language Processing in conjunction with EMNLP 2022.
In the first stage, we mitigate the ambiguity of euphemistic expressions by incorporating literal descriptions into the input text prompts of our baseline model; this direct supervision yields a remarkable performance improvement.
In the second stage, we integrate visual supervision into our system using visual imagery: two sets of images generated by a text-to-image model from the terms and their descriptions. Our experiments demonstrate that visual supervision also gives a statistically significant performance boost.
- Score: 18.510509701709054
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes our two-stage system for the Euphemism Detection shared
task hosted by the 3rd Workshop on Figurative Language Processing in
conjunction with EMNLP 2022. Euphemisms tone down expressions about sensitive
or unpleasant issues like addiction and death. The ambiguous nature of
euphemistic words or expressions makes it challenging to detect their actual
meaning within a context. In the first stage, we seek to mitigate this
ambiguity by incorporating literal descriptions into the input text prompts of
our baseline model. This kind of direct supervision yields a remarkable
performance improvement. In the second stage, we integrate visual supervision
into our system using visual imagery: two sets of images generated by a
text-to-image model from the terms and their descriptions. Our experiments
demonstrate that visual supervision also gives a statistically significant
performance boost. Our system achieved second place with an F1 score of 87.2%,
about 0.9% below the best submission.
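To make the two stages concrete, here is a minimal sketch (not the authors' released code) of how a literal description can be folded into the text prompt and how visual imagery can be generated from a term and its description. The checkpoints, the description lookup, and the prompt template are illustrative assumptions.

```python
# Minimal sketch, not the authors' released code. Checkpoint names, the
# description lookup, and the prompt template are assumptions for illustration.
from transformers import pipeline
from diffusers import StableDiffusionPipeline

# Stage 1: add the literal description to the text prompt as direct supervision.
descriptions = {"let go": "dismissed from a job"}  # hypothetical lookup table

def build_prompt(sentence: str, term: str) -> str:
    # The classifier sees the potentially euphemistic sentence together with
    # the literal meaning of the marked term.
    return f"{sentence} Here, '{term}' literally means '{descriptions[term]}'."

classifier = pipeline("text-classification", model="roberta-base")  # placeholder checkpoint
print(classifier(build_prompt("After the merger, half the team was let go.", "let go")))

# Stage 2: visual imagery, i.e. two sets of images generated by a
# text-to-image model, one from the term and one from its literal description.
t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
term_image = t2i("let go").images[0]
description_image = t2i("dismissed from a job").images[0]
# These image sets would then serve as additional visual supervision for the
# classifier, which is beyond the scope of this sketch.
```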
Related papers
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised
Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU margins on this task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- OPI at SemEval 2023 Task 1: Image-Text Embeddings and Multimodal Information Retrieval for Visual Word Sense Disambiguation [0.0]
We present our submission to SemEval 2023 visual word sense disambiguation shared task.
The proposed system integrates multimodal embeddings, learning to rank methods, and knowledge-based approaches.
Our solution was ranked third in the multilingual task and won in the Persian track, one of the three language subtasks.
arXiv Detail & Related papers (2023-04-14T13:45:59Z)
- Multimodal Neural Machine Translation with Search Engine Based Image Retrieval [4.662583832063716]
We propose an open-vocabulary image retrieval method to collect descriptive images for bilingual parallel corpus.
Our proposed method achieves significant improvements over strong baselines.
arXiv Detail & Related papers (2022-07-26T08:42:06Z)
- Image Retrieval from Contextual Descriptions [22.084939474881796]
Image Retrieval from Contextual Descriptions (ImageCoDe)
Models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description.
The best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 for humans.
arXiv Detail & Related papers (2022-03-29T19:18:12Z)
- Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding [59.8167502322261]
We propose Word2Pix: a one-stage visual grounding network based on encoder-decoder transformer architecture.
The embedding of each word from the query sentence is treated alike by attending to visual pixels individually.
The proposed Word2Pix outperforms existing one-stage methods by a notable margin.
arXiv Detail & Related papers (2021-07-31T10:20:15Z)
- Connecting What to Say With Where to Look by Modeling Human Attention Traces [30.8226861256742]
We introduce a unified framework to jointly model images, text, and human attention traces.
We propose two novel tasks: (1) predict a trace given an image and caption (i.e., visual grounding), and (2) predict a caption and a trace given only an image.
arXiv Detail & Related papers (2021-05-12T20:53:30Z)
- This is not the Texture you are looking for! Introducing Novel Counterfactual Explanations for Non-Experts using Generative Adversarial Learning [59.17685450892182]
Counterfactual explanation systems try to enable counterfactual reasoning by modifying the input image.
We present a novel approach to generate such counterfactual image explanations based on adversarial image-to-image translation techniques.
Our results show that our approach leads to significantly better results regarding mental models, explanation satisfaction, trust, emotions, and self-efficacy than two state-of-the-art systems.
arXiv Detail & Related papers (2020-12-22T10:08:05Z)
- Visually Grounded Compound PCFGs [65.04669567781634]
Exploiting visual groundings for language understanding has recently been drawing much attention.
We study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual captions.
arXiv Detail & Related papers (2020-09-25T19:07:00Z)
- Grounded and Controllable Image Completion by Incorporating Lexical Semantics [111.47374576372813]
Lexical Semantic Image Completion (LSIC) may have potential applications in art, design, and heritage conservation.
We advocate generating results faithful to both visual and lexical semantic context.
One major challenge for LSIC comes from modeling and aligning the structure of visual-semantic context.
arXiv Detail & Related papers (2020-02-29T16:54:21Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.