Learning Human-Human Interactions in Images from Weak Textual
Supervision
- URL: http://arxiv.org/abs/2304.14104v4
- Date: Mon, 18 Sep 2023 17:50:31 GMT
- Title: Learning Human-Human Interactions in Images from Weak Textual
Supervision
- Authors: Morris Alper and Hadar Averbuch-Elor
- Abstract summary: We propose a new paradigm of learning human-human interactions as free text from a single still image.
We show that the pseudo-labels produced by this procedure can be used to train a captioning model.
We will release our code and pseudo-labels along with Waldo and Wenda, a manually-curated test set for still image human-human interaction understanding.
- Score: 15.71242350678679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interactions between humans are diverse and context-dependent, but previous
works have treated them as categorical, disregarding the heavy tail of possible
interactions. We propose a new paradigm of learning human-human interactions as
free text from a single still image, allowing for flexibility in modeling the
unlimited space of situations and relationships between people. To overcome the
absence of data labelled specifically for this task, we use knowledge
distillation applied to synthetic caption data produced by a large language
model without explicit supervision. We show that the pseudo-labels produced by
this procedure can be used to train a captioning model to effectively
understand human-human interactions in images, as measured by a variety of
metrics that capture the textual and semantic faithfulness and factual groundedness
of our predictions. We further show that our approach outperforms SOTA image
captioning and situation recognition models on this task. We will release our
code and pseudo-labels along with Waldo and Wenda, a manually-curated test set
for still image human-human interaction understanding.
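As a rough illustration of the distillation step described in the abstract, the sketch below uses an off-the-shelf instruction-tuned language model to rewrite generic image captions into free-text interaction pseudo-labels. This is a minimal sketch under assumptions, not the authors' released pipeline: the model choice, prompt wording, and filtering rule are illustrative, and the resulting (image, pseudo-label) pairs would then be used to fine-tune a standard image captioning model.

```python
# Hedged sketch of the pseudo-labelling idea: an off-the-shelf instruction-tuned
# LLM (flan-t5 here as a stand-in; model and prompt are assumptions, not the
# paper's setup) rewrites web captions into free-text human-human interaction
# descriptions that can serve as pseudo-labels for a captioning model.
import json
from transformers import pipeline

llm = pipeline("text2text-generation", model="google/flan-t5-base")

PROMPT = (
    "Describe the interaction between the people in this caption as a short "
    "phrase, or answer 'none' if there is no interaction.\n"
    "Caption: {caption}\nInteraction:"
)

def distill_pseudo_labels(captions):
    """Convert raw image captions into free-text interaction pseudo-labels."""
    pseudo_labels = {}
    for image_id, caption in captions.items():
        out = llm(PROMPT.format(caption=caption), max_new_tokens=32)
        label = out[0]["generated_text"].strip()
        if label.lower() != "none":  # keep only captions describing an interaction
            pseudo_labels[image_id] = label
    return pseudo_labels

if __name__ == "__main__":
    captions = {"img_001": "Two children play chess in a park while their parents watch."}
    labels = distill_pseudo_labels(captions)
    with open("interaction_pseudo_labels.json", "w") as f:
        json.dump(labels, f, indent=2)
    # A standard image-to-text captioning model would then be fine-tuned on
    # (image, pseudo-label) pairs; that training loop is omitted here.
```

The design point is that no human labels are required: the language model supplies the interaction descriptions, and the image model inherits them through fine-tuning.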
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in terms of both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z) - See or Guess: Counterfactually Regularized Image Captioning [32.82695612178604]
We present a generic image captioning framework that employs causal inference to make existing models more capable of interventional tasks, and counterfactually explainable.
Our method effectively reduces hallucinations and improves the model's faithfulness to images, demonstrating high portability across both small-scale and large-scale image-to-text models.
arXiv Detail & Related papers (2024-08-29T17:59:57Z) - Self-Explainable Affordance Learning with Embodied Caption [63.88435741872204]
We introduce Self-Explainable Affordance learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning.
We propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z) - Stellar: Systematic Evaluation of Human-Centric Personalized
Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that input a single image of an individual and ground the generation process along with text describing the desired visual context.
We introduce a standardized dataset (Stellar) containing personalized prompts coupled with images of individuals; it is an order of magnitude larger than existing relevant datasets and provides rich semantic ground-truth annotations.
We derive a simple yet efficient personalized text-to-image baseline that does not require test-time fine-tuning for each subject and sets a new SoTA both quantitatively and in human trials.
arXiv Detail & Related papers (2023-12-11T04:47:39Z) - Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z) - ArtWhisperer: A Dataset for Characterizing Human-AI Interactions in Artistic Creations [26.4215586218117]
This work investigates how people use text-to-image models to generate desired target images.
We created ArtWhisperer, an online game where users are given a target image and tasked with iteratively finding a prompt that generates an image similar to the target.
We recorded over 50,000 human-AI interactions; each interaction corresponds to one text prompt created by a user and the corresponding generated image.
arXiv Detail & Related papers (2023-06-13T21:10:45Z) - Borrowing Human Senses: Comment-Aware Self-Training for Social Media
Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved by jointly leveraging visual and linguistic similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the typically limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z) - Who's Waldo? Linking People Across Text and Images [56.40556801773923]
We present a task and benchmark dataset for person-centric visual grounding.
Our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues.
We propose a Transformer-based method that outperforms several strong baselines on this task.
arXiv Detail & Related papers (2021-08-16T17:36:49Z) - Hidden Footprints: Learning Contextual Walkability from 3D Human Trails [70.01257397390361]
Current datasets only tell you where people are, not where they could be.
We first augment the set of valid, labeled walkable regions by propagating person observations between images, utilizing 3D information to create what we call hidden footprints.
We devise a training strategy designed for such sparse labels, combining a class-balanced classification loss with a contextual adversarial loss (a rough sketch of such a combined objective follows after this list).
arXiv Detail & Related papers (2020-08-19T23:19:08Z)