Who are you referring to? Weakly supervised coreference resolution with
multimodal grounding
- URL: http://arxiv.org/abs/2211.14563v1
- Date: Sat, 26 Nov 2022 13:33:42 GMT
- Title: Who are you referring to? Weakly supervised coreference resolution with
multimodal grounding
- Authors: Arushi Goel, Basura Fernando, Frank Keller and Hakan Bilen
- Abstract summary: Coreference resolution aims at identifying words and phrases which refer to the same entity in a text.
Most existing image-text datasets contain short sentences without coreferent expressions, or coreferences are not annotated.
We propose a new technique that learns to identify coreference chains through weakly supervised grounding from image-text pairs and a regularization using prior linguistic knowledge.
- Score: 44.502102006343094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coreference resolution aims at identifying words and phrases which refer to
the same entity in a text, a core tool in natural language processing. In this
paper, we propose a novel task: resolving coreferences in multimodal data, namely
long-form textual descriptions of visual scenes. Most existing image-text
datasets only contain short sentences without coreferent expressions, or
coreferences are not annotated. To this end, we first introduce a new dataset,
Flickr30k-Coref, in which coreference chains and bounding box localization of
these chains are annotated. We propose a new technique that learns to identify
coreference chains through weakly supervised grounding from image-text pairs
and a regularization using prior linguistic knowledge. Our model yields large
performance gains over prior work in coreference resolution and weakly
supervised grounding of long-form text descriptions.
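To make the proposed technique more concrete, below is a minimal sketch, under an assumed generic formulation, of how weakly supervised grounding can induce coreference structure: each mention is softly aligned to image regions, mentions that attend to the same regions receive a high coreference score, and a toy linguistic prior encourages pronouns to corefer with a preceding mention. All names, shapes, and losses are illustrative assumptions, not the authors' implementation.

```python
import torch

def grounding_attention(mention_emb, region_emb):
    # mention_emb: (M, d) phrase embeddings; region_emb: (R, d) region features.
    # Weak supervision: only the paired image is known, no boxes or chains.
    sim = mention_emb @ region_emb.t() / mention_emb.shape[-1] ** 0.5
    return sim.softmax(dim=-1)                       # (M, R) soft grounding

def coreference_scores(attn):
    # Mentions that ground to the same regions are likely coreferent.
    return attn @ attn.t()                           # (M, M)

def pronoun_prior_loss(scores, is_pronoun, precedes):
    # Toy linguistic prior: a pronoun should corefer with some *preceding* mention.
    masked = scores.masked_fill(~precedes, float("-inf"))
    best_antecedent = masked.max(dim=-1).values      # (M,)
    if not is_pronoun.any():
        return scores.new_zeros(())
    return -best_antecedent[is_pronoun].mean()

# Hypothetical usage with random features standing in for real text/image encoders.
M, R, d = 5, 7, 64
mentions, regions = torch.randn(M, d), torch.randn(R, d)
attn = grounding_attention(mentions, regions)
scores = coreference_scores(attn)
is_pronoun = torch.tensor([False, False, True, False, True])   # e.g. "she", "her"
precedes = torch.ones(M, M).tril(-1).bool()                    # antecedents come earlier
loss = pronoun_prior_loss(scores, is_pronoun, precedes)
```

In the paper itself the model, losses, and priors are more elaborate; this only illustrates how grounding attention can yield coreference signal without chain annotations.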
Related papers
- Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos [69.29778009769862]
We introduce LaGTran, a framework that guides robust transfer of discriminative knowledge from labeled source to unlabeled target data with domain gaps.
Motivated by our observation that the semantically richer text modality has more favorable transfer properties, we devise a transfer mechanism that uses a source-trained text classifier to generate predictions on the target text descriptions (a minimal sketch follows this entry).
Our language-guided approach is surprisingly simple, yet significantly outperforms all prior approaches on challenging datasets like GeoNet and DomainNet.
arXiv Detail & Related papers (2024-03-08T18:58:46Z)
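A rough, hypothetical sketch of the language-guided transfer idea above (not the authors' implementation): fit a text classifier on labeled source-domain descriptions, then use its predictions on unlabeled target-domain descriptions as pseudo-labels for the paired target images. The data and class names below are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data: each image comes with a short text description.
source_texts = ["a satellite view of a harbor", "aerial photo of farmland"]
source_labels = ["harbor", "farmland"]
target_texts = ["overhead shot of boats docked near a pier"]   # unlabeled target domain

# 1) Train a text classifier on the labeled source descriptions.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
text_clf.fit(source_texts, source_labels)

# 2) Predict on target descriptions; the predictions serve as pseudo-labels
#    for the paired target images (which can then supervise an image classifier).
pseudo_labels = text_clf.predict(target_texts)
print(list(zip(target_texts, pseudo_labels)))
```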
- From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models [38.14123683674355]
We propose a method to utilize the attention mechanism in the denoising network of text-to-image diffusion models.
We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under weakly-supervised semantic segmentation setting.
Our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-09-08T04:10:01Z)
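Below is a generic sketch, under assumed shapes and names, of the core idea of localizing entities with diffusion cross-attention: inside the denoising network, the cross-attention between latent spatial positions (queries) and prompt tokens (keys) gives, for a token such as "dog", a spatial map that can be binarized into a rough mask. Real implementations hook into a pretrained text-to-image U-Net; here random tensors stand in for its features.

```python
import torch

def cross_attention_map(latent_feats, token_feats, w_q, w_k):
    # latent_feats: (H*W, d_img) spatial features inside the denoising network
    # token_feats:  (num_tokens, d_txt) text-encoder embeddings of the prompt
    q = latent_feats @ w_q                     # (H*W, d)
    k = token_feats @ w_k                      # (num_tokens, d)
    attn = (q @ k.t()) / q.shape[-1] ** 0.5    # (H*W, num_tokens)
    return attn.softmax(dim=-1)

# Hypothetical sizes standing in for a real U-Net cross-attention layer.
H = W = 16
d_img, d_txt, d = 320, 768, 64
latent = torch.randn(H * W, d_img)
tokens = torch.randn(6, d_txt)                 # e.g. tokens of "a dog on a sofa"
w_q, w_k = torch.randn(d_img, d), torch.randn(d_txt, d)

attn = cross_attention_map(latent, tokens, w_q, w_k)
dog_token_idx = 2                              # hypothetical index of the "dog" token
dog_map = attn[:, dog_token_idx].reshape(H, W) # spatial relevance of "dog"
mask = dog_map > dog_map.mean()                # crude binarization into a mask
```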
- Shatter and Gather: Learning Referring Image Segmentation with Text Supervision [52.46081425504072]
We present a new model that discovers semantic entities in the input image and then combines the entities relevant to the text query to predict the mask of the referent.
Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed existing methods for the same task as well as recent open-vocabulary segmentation models on all the benchmarks.
arXiv Detail & Related papers (2023-08-29T15:39:15Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map in which each region of interest is annotated with a free-form natural language description.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset (a minimal retrieval sketch follows this entry).
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
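As a minimal sketch of zero-shot image-set retrieval with pretrained joint embeddings (an assumed baseline formulation, not necessarily the paper's exact one): embed the article, mean-pool each candidate image set's embeddings, and rank the sets by cosine similarity to the article.

```python
import torch
import torch.nn.functional as F

def rank_image_sets(text_emb, image_sets):
    # text_emb: (d,) article embedding; image_sets: list of (n_i, d) image embeddings.
    # Each candidate set is summarized by the mean of its image embeddings.
    set_embs = torch.stack([imgs.mean(dim=0) for imgs in image_sets])       # (S, d)
    scores = F.cosine_similarity(set_embs, text_emb.unsqueeze(0), dim=-1)   # (S,)
    return scores.argsort(descending=True), scores

# Hypothetical embeddings standing in for a pretrained image-text encoder.
d = 512
article = torch.randn(d)
candidate_sets = [torch.randn(3, d), torch.randn(5, d)]
ranking, scores = rank_image_sets(article, candidate_sets)   # best-matching set first
```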
- Leveraging Natural Supervision for Language Representation Learning and Generation [8.083109555490475]
We describe three lines of work that seek to improve the training and evaluation of neural models using naturally-occurring supervision.
We first investigate self-supervised training losses to help enhance the performance of pretrained language models for various NLP tasks.
We propose a framework that uses paraphrase pairs to disentangle semantics and syntax in sentence representations.
arXiv Detail & Related papers (2022-07-21T17:26:03Z)
- Fine-Grained Visual Entailment [51.66881737644983]
We propose an extension of the visual entailment task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.
Unlike prior work, our method is inherently explainable and makes logical predictions at different levels of granularity.
We evaluate our method on a new dataset of manually annotated knowledge elements and show that our method achieves 68.18% accuracy at this challenging task.
arXiv Detail & Related papers (2022-03-29T16:09:38Z)
- Linguistic Structures as Weak Supervision for Visual Scene Graph Generation [39.918783911894245]
We show how linguistic structures in captions can benefit scene graph generation.
Our method captures the information provided in captions about relations between individual triplets, and context for subjects and objects.
Given the large and diverse sources of multimodal data on the web, linguistic supervision is more scalable than crowdsourced triplets (a minimal parsing sketch follows this entry).
arXiv Detail & Related papers (2021-05-28T17:20:27Z)
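To illustrate how linguistic structure in captions can yield weak supervision for relation triplets, here is a toy sketch using an off-the-shelf dependency parser; the extraction rules and relation format are assumptions for illustration, not the paper's pipeline.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def caption_triplets(caption):
    """Extract crude (subject, predicate, object) triplets from a caption."""
    doc = nlp(caption)
    triplets = []
    for tok in doc:
        # A nominal subject attached to a verb gives us subject + predicate ...
        if tok.dep_ == "nsubj" and tok.head.pos_ == "VERB":
            verb = tok.head
            # ... and a direct object or prepositional object completes the triplet.
            for child in verb.children:
                if child.dep_ == "dobj":
                    triplets.append((tok.text, verb.lemma_, child.text))
                elif child.dep_ == "prep":
                    for pobj in child.children:
                        if pobj.dep_ == "pobj":
                            triplets.append((tok.text, f"{verb.lemma_} {child.text}", pobj.text))
    return triplets

print(caption_triplets("A man is riding a horse near the beach."))
# e.g. [('man', 'ride', 'horse'), ('man', 'ride near', 'beach')] (parser-dependent)
```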