Multimodal Word Sense Disambiguation in Creative Practice
- URL: http://arxiv.org/abs/2007.07758v2
- Date: Sun, 17 Jan 2021 17:54:10 GMT
- Title: Multimodal Word Sense Disambiguation in Creative Practice
- Authors: Manuel Ladron de Guevara, Christopher George, Akshat Gupta, Daragh
Byrne, Ramesh Krishnamurti
- Abstract summary: We present a dataset of Ambiguous Descriptions of Art Images (ADARI).
It contains a total of 240k images labeled with descriptive sentences.
It is additionally organized into sub-domains of architecture, art, design, fashion, furniture, product design and technology.
- Score: 2.9398911304923447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language is ambiguous; many terms and expressions can convey the same idea.
This is especially true in creative practice, where ideas and design intents
are highly subjective. We present a dataset, Ambiguous Descriptions of Art
Images (ADARI), of contemporary workpieces, which aims to provide a
foundational resource for subjective image description and multimodal word
disambiguation in the context of creative practice. The dataset contains a
total of 240k images labeled with 260k descriptive sentences. It is
additionally organized into sub-domains of architecture, art, design, fashion,
furniture, product design and technology. In subjective image description,
labels are not deterministic: for example, the ambiguous label 'dynamic' might
correspond to hundreds of different images. To understand this complexity, we
analyze the ambiguity and relevance of text with respect to images using the
state-of-the-art pre-trained BERT model for sentence classification. We provide
a baseline for multi-label classification tasks and demonstrate the potential
of multimodal approaches for understanding ambiguity in design intentions. We
hope that the ADARI dataset and baselines constitute a first step towards
subjective label classification.
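As a rough illustration of the BERT-based multi-label baseline the abstract describes, the sketch below uses the Hugging Face transformers API; the label set, checkpoint, and decision threshold are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal multi-label sentence classification sketch with BERT.
# Everything here (labels, checkpoint, threshold) is illustrative; the
# classifier head is untrained and would first be fine-tuned on ADARI
# sentence-label pairs.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

LABELS = ["dynamic", "organic", "minimal", "playful"]  # hypothetical subset

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # sigmoid + BCE loss
)

sentence = "A flowing staircase that seems to move with the visitor."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Each label gets an independent probability, so several can fire at once:
# the non-deterministic labeling the abstract describes.
probs = torch.sigmoid(logits)[0]
predicted = [label for label, p in zip(LABELS, probs) if p > 0.5]
print(dict(zip(LABELS, probs.tolist())), predicted)
```

A sigmoid per label (rather than a single softmax) fits the abstract's point that subjective labels are not mutually exclusive: several ambiguous adjectives can describe the same image at once.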
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
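For readers unfamiliar with hyperbolic embeddings, the sketch below computes the standard Poincaré-ball geodesic distance such models are built on; this is textbook geometry for illustration, not the paper's compositional entailment objective.

```python
# Sketch only: the Poincare-ball distance that hyperbolic vision-language
# models typically embed with.
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Geodesic distance between points inside the unit Poincare ball."""
    sq = torch.sum((u - v) ** 2, dim=-1)
    denom = (1 - torch.sum(u ** 2, dim=-1)) * (1 - torch.sum(v ** 2, dim=-1))
    return torch.acosh(1 + 2 * sq / denom)

# A general concept can sit near the origin and a specific one near the
# boundary; hierarchy falls out of the geometry rather than extra labels.
general = torch.tensor([0.05, 0.05])    # e.g. "furniture"
specific = torch.tensor([0.70, 0.60])   # e.g. "armchair"
print(poincare_distance(general, specific))
```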
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- A semantics-driven methodology for high-quality image annotation [4.7590051176368915]
We propose vTelos, an integrated Natural Language Processing, Knowledge Representation, and Computer Vision methodology.
Key element of vTelos is the exploitation of the WordNet lexico-semantic hierarchy as the main means for providing the meaning of natural language labels.
The methodology is validated on images populating a subset of the ImageNet hierarchy.
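As a toy illustration of the WordNet exploitation described above, the snippet below (plain NLTK usage, not the vTelos pipeline itself) lists the candidate senses and hypernym chains for a label:

```python
# Illustrative only: how the WordNet hierarchy can supply the meaning of a
# natural language label, in the spirit of vTelos.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

label = "dynamic"
for synset in wn.synsets(label):
    # Each synset is one candidate sense; its hypernym chain places the
    # label in the lexico-semantic hierarchy.
    chain = " -> ".join(h.name() for h in synset.hypernym_paths()[0])
    print(f"{synset.name()}: {synset.definition()}")
    print(f"  path: {chain}")
```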
arXiv Detail & Related papers (2023-07-26T11:38:45Z)
- Edge Guided GANs with Multi-Scale Contrastive Learning for Semantic Image Synthesis [139.2216271759332]
We propose a novel ECGAN for the challenging semantic image synthesis task.
The semantic labels do not provide detailed structural information, making it challenging to synthesize local details and structures.
The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss.
We propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content.
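The pixel-level idea can be sketched as a class-supervised contrastive loss, as below; this is a generic formulation for illustration, not ECGAN's exact multi-scale objective.

```python
# Class-supervised contrastive learning over pixel embeddings: same-class
# pixels are pulled together, different-class pixels pushed apart.
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(emb: torch.Tensor, labels: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """emb: (N, D) pixel embeddings; labels: (N,) semantic class ids.
    Assumes each class appears at least twice in the batch."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / temperature                 # (N, N) similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    not_self = ~torch.eye(len(labels), dtype=torch.bool)
    pos = same & not_self
    # log-softmax over each row's non-self pairs, averaged over positives
    logprob = sim - torch.logsumexp(
        sim.masked_fill(~not_self, -1e9), dim=1, keepdim=True)
    return -(logprob[pos]).mean()

emb = torch.randn(16, 8)
labels = torch.randint(0, 3, (16,))
print(pixel_contrastive_loss(emb, labels))
```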
arXiv Detail & Related papers (2023-07-22T14:17:19Z)
- Vocabulary-free Image Classification [75.38039557783414]
We formalize a novel task, termed Vocabulary-free Image Classification (VIC).
VIC aims to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary.
CaSED is a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner.
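A hedged sketch of the training-free recipe: retrieve candidate class names from external text (a hard-coded stand-in below, not CaSED's actual database), then let a pretrained vision-language model rank them.

```python
# Training-free, vocabulary-free classification in miniature: CLIP ranks
# retrieved candidate names. The candidate list and image path are
# placeholders for the retrieval step.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

candidates = ["cantilever chair", "floor lamp", "sculpture", "pavilion"]
image = Image.open("example.jpg")  # placeholder path

inputs = processor(text=candidates, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1, num_candidates)

probs = logits.softmax(dim=-1)[0]
print(candidates[int(probs.argmax())], probs.tolist())
```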
arXiv Detail & Related papers (2023-06-01T17:19:43Z)
- Exploring Affordance and Situated Meaning in Image Captions: A Multimodal Analysis [1.124958340749622]
We annotate images from the Flickr30k dataset with five perceptual properties: Affordance, Perceptual Salience, Object Number, Gaze Cueing, and Ecological Niche Association (ENA).
Our findings reveal that images with Gibsonian affordance show a higher frequency of captions containing 'holding-verbs' and 'container-nouns' compared to images displaying telic affordance.
arXiv Detail & Related papers (2023-05-24T01:30:50Z)
- Word-As-Image for Semantic Typography [41.380457098839926]
A word-as-image is a semantic typography technique where a word illustration presents a visualization of the meaning of the word.
We present a method to create word-as-image illustrations automatically.
arXiv Detail & Related papers (2023-03-03T09:59:25Z)
- Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning [78.07495777674747]
We argue that by using visual clues to bridge large pretrained vision foundation models and language models, we can generate image paragraph captions without any extra cross-modal training.
Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image.
We use a large language model to produce a series of comprehensive descriptions for the visual content, which is then verified by the vision model again to select the candidate that aligns best with the image.
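The generate-then-verify loop can be sketched as follows; the candidate descriptions stand in for language-model output, and plain CLIP features play the role of the verifying vision model.

```python
# Verification step in miniature: score each candidate description against
# the image and keep the best-aligned one. Candidates are hypothetical
# stand-ins for LLM output; the image path is a placeholder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

candidates = [
    "A sunlit gallery with abstract metal sculptures.",
    "A crowded street market at dusk.",
    "A minimalist room with a single red chair.",
]
image = Image.open("example.jpg")  # placeholder path

img_inputs = processor(images=image, return_tensors="pt")
txt_inputs = processor(text=candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    img_feat = model.get_image_features(**img_inputs)
    txt_feat = model.get_text_features(**txt_inputs)

# Cosine similarity plays the role of the verification score.
scores = torch.cosine_similarity(img_feat, txt_feat)
print(candidates[int(scores.argmax())])
```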
arXiv Detail & Related papers (2022-06-03T22:33:09Z)
- PhraseCut: Language-based Image Segmentation in the Wild [62.643450401286]
We consider the problem of segmenting image regions given a natural language phrase.
Our dataset is collected on top of the Visual Genome dataset.
Our experiments show that the scale and diversity of concepts in our dataset pose significant challenges to the existing state-of-the-art.
arXiv Detail & Related papers (2020-08-03T20:58:53Z)
- Grounded and Controllable Image Completion by Incorporating Lexical Semantics [111.47374576372813]
Lexical Semantic Image Completion (LSIC) may have potential applications in art, design, and heritage conservation.
We advocate generating results faithful to both visual and lexical semantic context.
One major challenge for LSIC comes from modeling and aligning the structure of visual-semantic context.
arXiv Detail & Related papers (2020-02-29T16:54:21Z)
- Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features [8.317191999275536]
In particular, the mere presence of text in an image provides strong guiding content that should be employed to tackle a diversity of computer vision tasks.
In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities.
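A minimal sketch of the general fuse-both-modalities idea, assuming projected-and-concatenated features; the paper's actual architecture with locally pooled textual features differs.

```python
# Toy fusion module: project visual and textual features into a shared
# space, concatenate, and classify. Dimensions are illustrative.
import torch
import torch.nn as nn

class VisualTextualFusion(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=300, hidden=512, n_classes=20):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, vis_feat, txt_tokens):
        # Mean-pool token embeddings as a simple stand-in for the paper's
        # local pooling of text detected in the image.
        txt_feat = txt_tokens.mean(dim=1)
        fused = torch.cat([torch.relu(self.vis_proj(vis_feat)),
                           torch.relu(self.txt_proj(txt_feat))], dim=-1)
        return self.classifier(fused)

model = VisualTextualFusion()
vis = torch.randn(4, 2048)      # e.g. CNN global image features
txt = torch.randn(4, 10, 300)   # e.g. word embeddings of scene text
print(model(vis, txt).shape)    # torch.Size([4, 20])
```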
arXiv Detail & Related papers (2020-01-14T12:06:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.