PETA: Photo Albums Event Recognition using Transformers Attention
- URL: http://arxiv.org/abs/2109.12499v1
- Date: Sun, 26 Sep 2021 05:23:24 GMT
- Title: PETA: Photo Albums Event Recognition using Transformers Attention
- Authors: Tamar Glaser, Emanuel Ben-Baruch, Gilad Sharir, Nadav Zamir, Asaf Noy, Lihi Zelnik-Manor
- Abstract summary: Event recognition in personal photo albums presents the challenge of high-level image understanding.
We propose a tailor-made solution, combining the power of CNNs for image representation and transformers for album representation.
Our solution reaches state-of-the-art results on 3 prominent benchmarks, achieving above 90% mAP on all datasets.
- Score: 10.855070748535688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years the amount of personal photos captured has increased significantly, giving rise to new challenges in multi-image and high-level image understanding. Event recognition in personal photo albums presents one such challenging scenario, where life events must be recognized from a disordered collection of images that includes both relevant and irrelevant ones. Event recognition in images also poses the challenge of high-level image understanding, as opposed to low-level object classification. In the absence of methods for analyzing multiple inputs, previous approaches adopted temporal mechanisms, including various forms of recurrent neural networks. However, their effective temporal window is local, and they are not a natural choice given the disordered character of photo albums. We address this gap with a tailor-made solution, combining the power of CNNs for image representation and transformers for album representation, performing global reasoning on the image collection and offering a practical and efficient solution for photo album event recognition. Our solution reaches state-of-the-art results on 3 prominent benchmarks, achieving above 90% mAP on all datasets. We further explore the related image-importance task in event recognition, demonstrating how the learned attention correlates with human-annotated importance for this subjective task, thus opening the door to new applications.
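To make the described design concrete, here is a minimal PyTorch sketch of the CNN-plus-transformer pattern the abstract outlines: a shared backbone embeds each photo, and a transformer encoder reasons globally over the unordered album. The ResNet-18 backbone, dimensions, mean pooling, and classification head are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a CNN-per-image + transformer-over-album classifier,
# in the spirit of the architecture the abstract describes. The backbone
# choice, dimensions, and pooling are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class AlbumEventClassifier(nn.Module):
    def __init__(self, num_events: int, d_model: int = 512, n_layers: int = 2):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()            # keep 512-d per-image features
        self.cnn = backbone
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, n_layers)
        self.head = nn.Linear(d_model, num_events)

    def forward(self, album: torch.Tensor) -> torch.Tensor:
        # album: (batch, n_images, 3, H, W)
        b, n = album.shape[:2]
        feats = self.cnn(album.flatten(0, 1)).view(b, n, -1)
        feats = self.transformer(feats)        # global reasoning across images
        return self.head(feats.mean(dim=1))    # pool album, predict event logits

logits = AlbumEventClassifier(num_events=14)(torch.randn(2, 8, 3, 224, 224))
```

Because no positional encoding is added, the album representation is insensitive to image order, which matches the disordered nature of photo albums.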
Related papers
- MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training [62.843316348659165]
Deep learning-based image matching algorithms have dramatically outperformed humans in rapidly and accurately finding large numbers of correspondences.
We propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals to train models to recognize and match fundamental structures across images.
Our key finding is that the matching model trained with our framework achieves remarkable generalizability across more than eight unseen cross-modality registration tasks.
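A hypothetical sketch of what "synthetic cross-modal training signals" can look like: one modality is simulated from an RGB image, and a random homography supplies ground-truth correspondences. The edge-map stand-in and warp parameters are assumptions for illustration, not the paper's pipeline.

```python
# Hypothetical sketch of synthesising cross-modality training pairs with
# known ground-truth correspondences; the modality transform and warp are
# illustrative stand-ins, not the paper's actual pipeline.
import numpy as np
import cv2

def synth_cross_modal_pair(img: np.ndarray):
    # Pseudo "other modality": an edge map as a crude stand-in (e.g., for
    # maps, thermal, or sketches); real pipelines may use learned models.
    other = cv2.Canny(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), 100, 200)
    # A random homography supplies dense ground-truth correspondences.
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + np.random.uniform(-0.1 * w, 0.1 * w, src.shape).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(other, H, (w, h))
    return img, warped, H   # supervise the matcher with H-induced matches
```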
arXiv Detail & Related papers (2025-01-13T18:37:36Z)
- Unforgettable Lessons from Forgettable Images: Intra-Class Memorability Matters in Computer Vision [8.210681499876216]
We introduce intra-class memorability, where certain images within the same class are more memorable than others.
We propose the Intra-Class Memorability score (ICMscore), a novel metric that incorporates the temporal intervals between repeated image presentations into its calculation.
We curate the Intra-Class Memorability dataset (ICMD), comprising over 5,000 images across ten object classes with their ICMscores derived from 2,000 participants' responses.
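The summary does not give the ICMscore formula; the following is a hypothetical illustration of a metric that weights recognition of a repeated image by the temporal interval since its previous presentation, with a log-interval weighting chosen purely for demonstration.

```python
# Hypothetical memorability score that weights recognition hits by the
# temporal interval since the previous presentation (longer gaps count more).
# The actual ICMscore formula may differ.
import numpy as np

def icm_like_score(intervals: np.ndarray, remembered: np.ndarray) -> float:
    # intervals: gaps (in trials) between repeated presentations of one image
    # remembered: 1 if the participant recognised the repeat, else 0
    weights = np.log1p(intervals)        # assumption: log-interval weighting
    return float((weights * remembered).sum() / weights.sum())

print(icm_like_score(np.array([3, 20, 80]), np.array([1, 1, 0])))
```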
arXiv Detail & Related papers (2024-12-30T07:09:28Z)
- Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation [90.71613903956451]
Text-to-image retrieval is a fundamental task in multimedia processing.
We propose an autoregressive voken generation method, named AVG.
We show that AVG achieves superior results in both effectiveness and efficiency.
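A schematic sketch of the retrieval-as-generation idea: a decoder conditioned on the text autoregressively emits discrete "voken" ids that identify an image. Vocabulary sizes, the text encoder, and decoder depth are illustrative assumptions.

```python
# Schematic sketch of retrieval as autoregressive generation: a decoder
# conditioned on the text emits a short sequence of discrete "voken" ids
# identifying an image. Sizes and the text encoder are assumptions.
import torch
import torch.nn as nn

class TextToVoken(nn.Module):
    def __init__(self, text_vocab=30000, voken_vocab=8192, d=256):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d)
        self.voken_emb = nn.Embedding(voken_vocab, d)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), 2)
        self.lm_head = nn.Linear(d, voken_vocab)

    def forward(self, text_ids, voken_ids):
        memory = self.text_emb(text_ids)           # (b, t, d) text context
        tgt = self.voken_emb(voken_ids)            # (b, s, d) vokens so far
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                   # next-voken logits
```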
arXiv Detail & Related papers (2024-07-24T13:39:51Z)
- Improving Image Recognition by Retrieving from Web-Scale Image-Text Data [68.63453336523318]
We introduce an attention-based memory module, which learns the importance of each retrieved example from the memory.
Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query.
We show that it achieves state-of-the-art accuracy on the ImageNet-LT, Places-LT and WebVision datasets.
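A minimal sketch of an attention-based memory module of this kind: the query embedding cross-attends over retrieved-example embeddings, so irrelevant retrievals receive low attention weight. The dimensions and residual fusion are assumptions.

```python
# Minimal sketch of an attention-based memory module: the query image
# embedding attends over the embeddings of retrieved examples, so irrelevant
# retrievals receive low weight. Dimensions are illustrative.
import torch
import torch.nn as nn

class RetrievalAttention(nn.Module):
    def __init__(self, d: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

    def forward(self, query: torch.Tensor, retrieved: torch.Tensor):
        # query: (b, d) input image embedding; retrieved: (b, k, d) memory
        fused, weights = self.attn(query.unsqueeze(1), retrieved, retrieved)
        # weights: (b, 1, k) -- learned importance of each retrieved example
        return fused.squeeze(1) + query, weights   # residual fusion
```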
arXiv Detail & Related papers (2023-04-11T12:12:05Z)
- A domain adaptive deep learning solution for scanpath prediction of paintings [66.46953851227454]
This paper focuses on the eye-movement analysis of viewers during the visual experience of a certain number of paintings.
We introduce a new approach to predicting human visual attention, a process that influences several human cognitive functions.
The proposed new architecture ingests images and returns scanpaths, a sequence of points featuring a high likelihood of catching viewers' attention.
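An illustrative sketch of a network that ingests an image and returns a scanpath as a point sequence; the CNN encoder and recurrent point head are generic stand-ins, not the paper's domain-adaptive architecture.

```python
# Illustrative sketch of a scanpath predictor: a CNN encodes the painting
# and a recurrent head emits a sequence of (x, y) fixation points.
import torch
import torch.nn as nn
import torchvision.models as models

class ScanpathNet(nn.Module):
    def __init__(self, n_points: int = 10):
        super().__init__()
        cnn = models.resnet18(weights=None)
        cnn.fc = nn.Identity()                 # 512-d image context vector
        self.cnn, self.n_points = cnn, n_points
        self.rnn = nn.GRUCell(2, 512)          # feeds back the previous point
        self.to_xy = nn.Linear(512, 2)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        h = self.cnn(img)                      # image context as initial state
        pt = torch.zeros(img.size(0), 2, device=img.device)
        points = []
        for _ in range(self.n_points):
            h = self.rnn(pt, h)
            pt = torch.sigmoid(self.to_xy(h))  # normalised (x, y) in [0, 1]
            points.append(pt)
        return torch.stack(points, dim=1)      # (b, n_points, 2) scanpath
```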
arXiv Detail & Related papers (2022-09-22T22:27:08Z)
- Deep Bayesian Image Set Classification: A Defence Approach against Adversarial Attacks [32.48820298978333]
Deep neural networks (DNNs) are susceptible to being fooled with high confidence by an adversary.
In practice, the vulnerability of deep learning systems to carefully perturbed images, known as adversarial examples, poses a dire security threat in physical-world applications.
We propose robust deep Bayesian image set classification as a defence framework against a broad range of adversarial attacks.
arXiv Detail & Related papers (2021-08-23T14:52:44Z)
- Focus on the Positives: Self-Supervised Learning for Biodiversity Monitoring [9.086207853136054]
We address the problem of learning self-supervised representations from unlabeled image collections.
We exploit readily available context data that encodes information such as the spatial and temporal relationships between the input images.
For the critical task of global biodiversity monitoring, this results in image features that can be adapted to challenging visual species classification tasks with limited human supervision.
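A sketch of mining positive pairs from such context data: images captured close together in time and space are treated as contrastive positives. The thresholds, units, and crude degrees-to-kilometres conversion are assumptions.

```python
# Sketch of mining self-supervised positive pairs from capture metadata:
# images taken close together in space and time are treated as positives
# for a contrastive loss. Thresholds and the loss wiring are assumptions.
import torch

def context_positive_pairs(timestamps, latlons, max_dt=3600.0, max_km=1.0):
    # timestamps: (n,) seconds; latlons: (n, 2) degrees (crude km conversion)
    dt = (timestamps[:, None] - timestamps[None, :]).abs()
    dx = (latlons[:, None, :] - latlons[None, :, :]) * 111.0  # deg -> ~km
    dist = dx.norm(dim=-1)
    pos = (dt < max_dt) & (dist < max_km)
    pos.fill_diagonal_(False)          # an image is not its own positive
    return pos                         # (n, n) bool mask for InfoNCE targets
```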
arXiv Detail & Related papers (2021-08-14T01:12:41Z)
- Collaboration among Image and Object Level Features for Image Colourisation [25.60139324272782]
Image colourisation is an ill-posed problem, with multiple correct solutions which depend on the context and object instances present in the input datum.
Previous approaches attacked the problem either by requiring intense user interactions or by exploiting the ability of convolutional neural networks (CNNs) in learning image level (context) features.
We propose a single network, named UCapsNet, that separates image-level features obtained through convolutions from object-level features captured by means of capsules.
Then, through skip connections over different layers, we enforce collaboration between these disentangled factors to produce high-quality and plausible image colourisation.
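A heavily simplified stand-in for the two-branch idea: an image-level convolutional branch and a separate object-level branch, concatenated skip-style to predict chrominance. Real UCapsNet uses capsules with routing, which this sketch omits.

```python
# Highly simplified sketch of the two-branch idea: an image-level conv
# branch and an object-level stand-in branch, merged through a skip-style
# concatenation to predict chrominance. Capsule routing is omitted.
import torch
import torch.nn as nn

class TwoBranchColouriser(nn.Module):
    def __init__(self):
        super().__init__()
        self.context = nn.Conv2d(1, 32, 3, padding=1)   # image-level features
        self.objects = nn.Conv2d(1, 32, 5, padding=2)   # object-level stand-in
        self.fuse = nn.Conv2d(64, 2, 3, padding=1)      # predict ab channels

    def forward(self, grey: torch.Tensor) -> torch.Tensor:
        ctx, obj = self.context(grey), self.objects(grey)
        return self.fuse(torch.cat([ctx, obj], dim=1))  # skip-style concat
```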
arXiv Detail & Related papers (2021-01-19T11:48:12Z)
- City-Scale Visual Place Recognition with Deep Local Features Based on Multi-Scale Ordered VLAD Pooling [5.274399407597545]
We present a fully-automated system for city-scale place recognition based on content-based image retrieval.
First, we present a comprehensive analysis of visual place recognition and sketch out the unique challenges of the task.
Next, we propose a simple pooling approach on top of convolutional neural network activations to embed spatial information into the image representation vector.
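A sketch of soft-assignment VLAD pooling over convolutional activations (NetVLAD-style); the paper's multi-scale ordered variant additionally encodes spatial ordering, which is omitted here.

```python
# Sketch of soft-assignment VLAD pooling over CNN activations
# (NetVLAD-style); the multi-scale ordered variant is omitted.
import torch
import torch.nn as nn

class SoftVLAD(nn.Module):
    def __init__(self, d: int = 512, k: int = 64):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(k, d))
        self.assign = nn.Conv2d(d, k, kernel_size=1)  # soft cluster assignment

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        b, d, h, w = fmap.shape
        a = self.assign(fmap).flatten(2).softmax(dim=1)       # (b, k, h*w)
        x = fmap.flatten(2)                                   # (b, d, h*w)
        # Residual of each local descriptor to every centroid, soft-weighted.
        res = x.unsqueeze(1) - self.centroids.view(1, -1, d, 1)
        vlad = (a.unsqueeze(2) * res).sum(dim=-1)             # (b, k, d)
        return nn.functional.normalize(vlad.flatten(1), dim=1)
```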
arXiv Detail & Related papers (2020-09-19T15:21:59Z)
- Rethinking of the Image Salient Object Detection: Object-level Semantic Saliency Re-ranking First, Pixel-wise Saliency Refinement Latter [62.26677215668959]
We propose a lightweight, weakly supervised deep network to coarsely locate semantically salient regions.
We then fuse multiple off-the-shelf deep models on these semantically salient regions as the pixel-wise saliency refinement.
Our method is simple yet effective, and is the first attempt to treat salient object detection mainly as an object-level semantic re-ranking problem.
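A schematic of the two-stage idea: candidate regions are re-ranked by semantic confidence, and off-the-shelf saliency maps are fused only inside the top regions. The region format and mean-fusion rule are illustrative assumptions.

```python
# Schematic sketch: re-rank candidate regions by semantic confidence first,
# then fuse several off-the-shelf saliency maps inside the top regions.
import torch

def rerank_then_refine(boxes, sem_scores, saliency_maps, top_k=3):
    # boxes: (n, 4) int [x0, y0, x1, y1]; sem_scores: (n,);
    # saliency_maps: list of (H, W) maps from off-the-shelf models
    fused = torch.stack(saliency_maps).mean(dim=0)   # pixel-wise fusion
    out = torch.zeros_like(fused)
    for i in sem_scores.topk(min(top_k, len(boxes))).indices:
        x0, y0, x1, y1 = boxes[i].tolist()
        out[y0:y1, x0:x1] = fused[y0:y1, x0:x1]      # keep top regions only
    return out
```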
arXiv Detail & Related papers (2020-08-10T07:12:43Z)
- Semantic Photo Manipulation with a Generative Image Prior [86.01714863596347]
GANs are able to synthesize images conditioned on inputs such as user sketch, text, or semantic labels.
However, it is hard for GANs to precisely reproduce an input image.
In this paper, we address these issues by adapting the image prior learned by GANs to image statistics of an individual image.
Our method can accurately reconstruct the input image and synthesize new content, consistent with the appearance of the input image.
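A sketch of adapting a pretrained generator prior to one image: first invert the image into the latent space, then briefly fine-tune the generator weights so reconstruction is exact. Here `G`, the step counts, and the learning rates are placeholders, not the paper's settings.

```python
# Sketch of adapting a pretrained generator to a single image: optimise the
# latent code first, then fine-tune the generator so the input is
# reconstructed exactly. `G` is any pretrained generator (placeholder).
import torch

def adapt_gan_prior(G, target, z_dim=512, z_steps=500, g_steps=200):
    z = torch.randn(1, z_dim, requires_grad=True)
    opt_z = torch.optim.Adam([z], lr=1e-2)
    for _ in range(z_steps):                 # stage 1: invert the image
        loss = (G(z) - target).pow(2).mean()
        opt_z.zero_grad(); loss.backward(); opt_z.step()
    z = z.detach()
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    for _ in range(g_steps):                 # stage 2: adapt the prior
        loss = (G(z) - target).pow(2).mean()
        opt_g.zero_grad(); loss.backward(); opt_g.step()
    return z, G                              # edit in latent space afterwards
```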
arXiv Detail & Related papers (2020-05-15T18:22:05Z)
- Fine-grained Image-to-Image Transformation towards Visual Recognition [102.51124181873101]
We aim at transforming an image with a fine-grained category to synthesize new images that preserve the identity of the input image.
We adopt a model based on generative adversarial networks to disentangle the identity related and unrelated factors of an image.
Experiments on the CompCars and Multi-PIE datasets demonstrate that our model preserves the identity of the generated images much better than the state-of-the-art image-to-image transformation models.
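A schematic of identity/non-identity disentanglement: two encoders split an image into an identity code and a nuisance code, and a decoder recombines them, so swapping nuisance codes between images preserves identity. The linear encoders, 64x64 shapes, and absence of adversarial losses are simplifications.

```python
# Schematic of disentangling identity-related and -unrelated factors: two
# encoders split an image into an identity code and a nuisance code, and a
# decoder recombines them. GAN and identity-preserving losses are omitted.
import torch
import torch.nn as nn

class DisentangleAE(nn.Module):
    def __init__(self, d_id=128, d_nuis=64):
        super().__init__()
        enc = lambda d: nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, d))
        self.enc_id, self.enc_nuis = enc(d_id), enc(d_nuis)
        self.dec = nn.Sequential(nn.Linear(d_id + d_nuis, 3 * 64 * 64),
                                 nn.Unflatten(1, (3, 64, 64)))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Keep a's identity, borrow b's identity-unrelated factors.
        code = torch.cat([self.enc_id(a), self.enc_nuis(b)], dim=1)
        return self.dec(code)
```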
arXiv Detail & Related papers (2020-01-12T05:26:47Z)