PhotoBot: Reference-Guided Interactive Photography via Natural Language
- URL: http://arxiv.org/abs/2401.11061v4
- Date: Thu, 26 Dec 2024 03:38:10 GMT
- Title: PhotoBot: Reference-Guided Interactive Photography via Natural Language
- Authors: Oliver Limoyo, Jimmy Li, Dmitriy Rivkin, Jonathan Kelly, Gregory Dudek
- Abstract summary: PhotoBot is a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer. We leverage a visual language model (VLM) and an object detector to characterize the reference images, and a large language model (LLM) to retrieve relevant reference images based on a user's language query.
- Score: 15.486784377142314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce PhotoBot, a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer. We propose to communicate photography suggestions to the user via reference images that are selected from a curated gallery. We leverage a visual language model (VLM) and an object detector to characterize the reference images via textual descriptions and then use a large language model (LLM) to retrieve relevant reference images based on a user's language query through text-based reasoning. To correspond the reference image and the observed scene, we exploit pre-trained features from a vision transformer capable of capturing semantic similarity across marked appearance variations. Using these features, we compute suggested pose adjustments for an RGB-D camera by solving a perspective-n-point (PnP) problem. We demonstrate our approach using a manipulator equipped with a wrist camera. Our user studies show that photos taken by PhotoBot are often more aesthetically pleasing than those taken by users themselves, as measured by human feedback. We also show that PhotoBot can generalize to other reference sources such as paintings.
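The pose-adjustment step described in the abstract can be illustrated with a short sketch. The code below is not the authors' implementation; it assumes that semantic correspondences between the reference image and the current RGB-D view have already been found (e.g., by matching pre-trained vision-transformer features), and shows how a suggested camera pose could be recovered by back-projecting the matched scene points with the depth map and solving a perspective-n-point (PnP) problem with OpenCV. The function name and its interface are hypothetical.

```python
# Minimal sketch (not PhotoBot's actual code): estimate the camera pose that would
# reproduce the reference framing, given matched keypoints and an RGB-D depth map.
import numpy as np
import cv2


def suggest_camera_pose(ref_keypoints_2d, scene_keypoints_2d, scene_depth, K):
    """
    ref_keypoints_2d   : (N, 2) pixel locations of matched features in the reference image
    scene_keypoints_2d : (N, 2) pixel locations of the same features in the current RGB-D view
    scene_depth        : (H, W) depth map of the current view, in metres
    K                  : (3, 3) camera intrinsic matrix
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    # Back-project the current view's keypoints to 3D using the depth map.
    pts_3d = []
    for u, v in scene_keypoints_2d.astype(int):
        z = float(scene_depth[v, u])
        pts_3d.append([(u - cx) * z / fx, (v - cy) * z / fy, z])
    pts_3d = np.asarray(pts_3d, dtype=np.float64)

    # Solve PnP: find the camera pose from which these 3D points would project to the
    # pixel locations they occupy in the reference image. RANSAC rejects bad matches.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, ref_keypoints_2d.astype(np.float64), K.astype(np.float64), None)
    if not ok:
        raise RuntimeError("PnP failed: not enough reliable correspondences")
    return rvec, tvec  # suggested rotation (Rodrigues vector) and translation
```

The returned rotation and translation describe how the wrist-mounted camera would need to move so that the matched scene content projects to the same image locations it occupies in the reference photo.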
Related papers
- Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects [3.9825600707172986]
We present a comparative study of captioning strategies for tabletop scenes captured by a robotic arm equipped with an RGB camera. The robot collects images of objects from multiple viewpoints, and we evaluate several models that generate scene descriptions. Our experiments examine the trade-offs between single-view and multi-view captioning, and the differences between recognising real-world and 3D-printed objects.
arXiv Detail & Related papers (2025-06-24T12:45:09Z)
- Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation [0.0]
This survey reviews attention-based image captioning models, categorizing them into transformer-based, deep learning-based, and hybrid approaches. It explores benchmark datasets, discusses evaluation metrics such as BLEU, METEOR, CIDEr, and ROUGE, and highlights challenges in multilingual captioning. We identify future research directions, such as multimodal learning, real-time applications in AI-powered assistants, healthcare, and forensic analysis.
arXiv Detail & Related papers (2025-06-03T22:18:19Z)
- Vision-Speech Models: Teaching Speech Models to Converse about Images [67.62394024470528]
We introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules.
An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics.
We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis.
arXiv Detail & Related papers (2025-03-19T18:40:45Z)
- Multilingual Vision-Language Pre-training for the Remote Sensing Domain [4.118895088882213]
Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data.
This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model.
Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks.
arXiv Detail & Related papers (2024-10-30T18:13:11Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Large Language Models for Captioning and Retrieving Remote Sensing Images [4.499596985198142]
RS-CapRet is a Vision and Language method for remote sensing tasks.
It can generate descriptions for remote sensing images and retrieve images from textual descriptions.
arXiv Detail & Related papers (2024-02-09T15:31:01Z)
- User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning [35.211749514733846]
Traditional image captioning methods often overlook the preferences and characteristics of users.
Most existing methods emphasize the user-context fusion process, relying on memory networks or transformers.
We propose a novel personalized image captioning framework that leverages user context to consider personality factors.
arXiv Detail & Related papers (2023-12-08T02:08:00Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Real-Time Neural Character Rendering with Pose-Guided Multiplane Images [75.62730144924566]
We propose pose-guided multiplane image (MPI) synthesis which can render an animatable character in real scenes with photorealistic quality.
We use a portable camera rig to capture the multi-view images along with the driving signal for the moving subject.
arXiv Detail & Related papers (2022-04-25T17:51:38Z)
- Visual Information Guided Zero-Shot Paraphrase Generation [71.33405403748237]
We propose visual information guided zero-shot paraphrase generation (ViPG) based only on paired image-caption data.
It jointly trains an image captioning model and a paraphrasing model, and leverages the image captioning model to guide the training of the paraphrasing model.
Both automatic evaluation and human evaluation show that our model can generate paraphrases with good relevance, fluency, and diversity.
arXiv Detail & Related papers (2022-01-22T18:10:39Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires users to express not only what content they are looking for but also where in the image it should appear.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z)
- Batteries, camera, action! Learning a semantic control space for expressive robot cinematography [15.895161373307378]
We develop a data-driven framework that enables editing of complex camera positioning parameters in a semantic space.
First, we generate a database of video clips with a diverse range of shots in a photo-realistic simulator.
We use hundreds of participants in a crowd-sourcing framework to obtain scores for a set of semantic descriptors for each clip.
arXiv Detail & Related papers (2020-11-19T21:56:53Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)