There is a Time and Place for Reasoning Beyond the Image
- URL: http://arxiv.org/abs/2203.00758v1
- Date: Tue, 1 Mar 2022 21:52:08 GMT
- Title: There is a Time and Place for Reasoning Beyond the Image
- Authors: Xingyu Fu, Ben Zhou, Ishaan Preetam Chandratreya, Carl Vondrick, Dan
Roth
- Abstract summary: Images are often more significant than only the pixels to human eyes, as we can infer, associate, and reason with contextual information from other sources to establish a more complete picture.
We introduce TARA: a dataset with 16k images with their associated news, time and location automatically extracted from New York Times (NYT), and an additional 61k examples as distant supervision from WIT.
We show that there exists a 70% gap between a state-of-the-art joint model and human performance, which is slightly filled by our proposed model that uses segment-wise reasoning, motivating higher-level vision-language joint models that can conduct open-ended reasoning with world knowledge.
- Score: 63.96498435923328
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Images are often more significant than only the pixels to human eyes, as we
can infer, associate, and reason with contextual information from other sources
to establish a more complete picture. For example, in Figure 1, we can find a
way to identify the news articles related to the picture through segment-wise
understandings on the signs, the buildings, the crowds, and more. This tells us
the time when and the location where the image is taken, which will help us in
subsequent tasks, such as evidence retrieval for criminal activities, automatic
storyline construction, and upper-stream processing such as image clustering.
In this work, we formulate this problem and introduce TARA: a dataset with 16k
images with their associated news, time and location automatically extracted
from New York Times (NYT), and an additional 61k examples as distant
supervision from WIT. On top of the extractions, we present a crowdsourced
subset in which images are believed to be feasible to find their
spatio-temporal information for evaluation purpose. We show that there exists a
70% gap between a state-of-the-art joint model and human performance, which is
slightly filled by our proposed model that uses segment-wise reasoning,
motivating higher-level vision-language joint models that can conduct
open-ended reasoning with world knowledge.
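
As a rough illustration of the task TARA poses (inferring when and where an image was taken), the sketch below scores a handful of candidate time-and-place descriptions against an image with an off-the-shelf CLIP model via the Hugging Face transformers API. This is not the authors' segment-wise model; the checkpoint, file path, and candidate labels are illustrative assumptions.

```python
# Naive zero-shot baseline sketch for spatio-temporal grounding: rank candidate
# (time, place) captions for a news photo with CLIP. Not the paper's model;
# labels and file names below are illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("news_photo.jpg")  # hypothetical input image
candidates = [
    "a news photo taken in New York City in the 2010s",
    "a news photo taken in Hong Kong in the 2010s",
    "a news photo taken in Paris in the 1990s",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_candidates)
probs = logits.softmax(dim=-1)

for text, p in zip(candidates, probs[0].tolist()):
    print(f"{p:.3f}  {text}")
```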
Related papers
- Stellar: Systematic Evaluation of Human-Centric Personalized
Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that input a single image of an individual and ground the generation process along with text describing the desired visual context.
We introduce a standardized dataset (Stellar) of personalized prompts coupled with images of individuals; it is an order of magnitude larger than existing relevant datasets and comes with rich semantic ground-truth annotations.
We derive a simple yet efficient personalized text-to-image baseline that does not require test-time fine-tuning for each subject and that sets a new SoTA both quantitatively and in human trials.
arXiv Detail & Related papers (2023-12-11T04:47:39Z) - Visual Analytics for Efficient Image Exploration and User-Guided Image
Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z) - Blind Dates: Examining the Expression of Temporality in Historical
Photographs [57.07335632641355]
We investigate the dating of images using OpenCLIP, an open-source implementation of CLIP, a multi-modal language and vision model.
We use the De Boer Scene Detection dataset, containing 39,866 gray-scale historical press photographs from 1950 to 1999.
Our analysis reveals that images featuring buses, cars, cats, dogs, and people are more accurately dated, suggesting the presence of temporal markers.
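A minimal version of this zero-shot dating setup can be sketched with the open_clip package, as shown below; the checkpoint, prompts, and file name are assumptions rather than the paper's exact configuration.

```python
# Sketch of zero-shot decade prediction with OpenCLIP, in the spirit of the
# "Blind Dates" setup. Checkpoint and prompts are illustrative assumptions.
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

decades = ["1950s", "1960s", "1970s", "1980s", "1990s"]
prompts = [f"a press photograph taken in the {d}" for d in decades]

image = preprocess(Image.open("historical_photo.jpg")).unsqueeze(0)  # hypothetical file
text = tokenizer(prompts)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(decades, probs[0].tolist())))
```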
arXiv Detail & Related papers (2023-10-10T13:51:24Z) - Focus! Relevant and Sufficient Context Selection for News Image
Captioning [69.36678144800936]
News Image Captioning requires describing an image by leveraging additional context from a news article.
We propose to use the pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article.
Our experiments demonstrate that by simply selecting a better context from the article, we can significantly improve the performance of existing models.
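In the same spirit, a simplified context-selection step can be sketched with CLIP: score each article sentence against the image and keep the top-scoring ones as the captioning context. This is a stand-in for the paper's method; the sentence splitting and example text are illustrative.

```python
# Sketch of CLIP-based context selection for news image captioning:
# keep the article sentences most similar to the image. Simplified stand-in
# for the paper's approach; example text and file name are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("article_photo.jpg")  # hypothetical news image
article_text = (
    "The mayor opened the new bridge on Monday. "
    "Thousands gathered along the river to watch. "
    "The project took four years to complete."
)  # illustrative article text
sentences = [s.strip() for s in article_text.split(".") if s.strip()]

inputs = processor(
    text=sentences, images=image, return_tensors="pt", padding=True, truncation=True
)
with torch.no_grad():
    scores = model(**inputs).logits_per_image[0]  # one image-text score per sentence

top_k = 2
keep = torch.topk(scores, k=min(top_k, len(sentences))).indices.tolist()
context = " ".join(sentences[i] for i in keep)  # pass this to a captioning model
print(context)
```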
arXiv Detail & Related papers (2022-12-01T20:00:27Z) - Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z) - Vision Models Are More Robust And Fair When Pretrained On Uncurated
Images Without Supervision [38.22842778742829]
Discriminative self-supervised learning allows training models on any random group of internet images.
We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn.
We extensively study and validate our model performance on over 50 benchmarks including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection, and many image classification datasets.
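For readers unfamiliar with discriminative self-supervised objectives, the sketch below shows a generic InfoNCE-style contrastive loss; it is one common choice for this family of methods, not necessarily the exact objective used in this work.

```python
# Minimal InfoNCE-style contrastive loss: two augmented views of each image are
# pulled together and pushed away from the other images in the batch.
# A generic example of a discriminative self-supervised objective.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two views of the same images."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0))      # matching views sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings standing in for an encoder's outputs.
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```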
arXiv Detail & Related papers (2022-02-16T22:26:47Z) - Deep Image Deblurring: A Survey [165.32391279761006]
Deblurring is a classic problem in low-level computer vision, which aims to recover a sharp image from a blurred input image.
Recent advances in deep learning have led to significant progress in solving this problem.
arXiv Detail & Related papers (2022-01-26T01:31:30Z) - Learning to Represent Image and Text with Denotation Graph [32.417311523031195]
We propose learning representations from a set of implied, visually grounded expressions between image and text.
We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations.
arXiv Detail & Related papers (2020-10-06T18:00:58Z)