ClipCrop: Conditioned Cropping Driven by Vision-Language Model
- URL: http://arxiv.org/abs/2211.11492v1
- Date: Mon, 21 Nov 2022 14:27:07 GMT
- Title: ClipCrop: Conditioned Cropping Driven by Vision-Language Model
- Authors: Zhihang Zhong, Mingxi Cheng, Zhirong Wu, Yuhui Yuan, Yinqiang Zheng,
Ji Li, Han Hu, Stephen Lin, Yoichi Sato, Imari Sato
- Abstract summary: We take advantage of vision-language models as a foundation for creating robust and user-intentional cropping algorithms.
We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance.
Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
- Score: 90.95403416150724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image cropping has progressed tremendously under the data-driven paradigm.
However, current approaches do not account for the intentions of the user,
which is an issue especially when the composition of the input image is
complex. Moreover, labeling of cropping data is costly and hence the amount of
data is limited, leading to poor generalization performance of current
algorithms in the wild. In this work, we take advantage of vision-language
models as a foundation for creating robust and user-intentional cropping
algorithms. By adapting a transformer decoder with a pre-trained CLIP-based
detection model, OWL-ViT, we develop a method to perform cropping with a text
or image query that reflects the user's intention as guidance. In addition, our
pipeline design allows the model to learn text-conditioned aesthetic cropping
with a small cropping dataset, while inheriting the open-vocabulary ability
acquired from millions of text-image pairs. We validate our model through
extensive experiments on existing datasets as well as a new cropping test set
we compiled that is characterized by content ambiguity.
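To make the query-conditioned setup concrete, the sketch below drives a crop from a text query using the publicly released OWL-ViT checkpoint through the Hugging Face transformers API. It is only an illustrative baseline, not the ClipCrop pipeline: ClipCrop adapts a trained transformer decoder on top of OWL-ViT to learn aesthetic cropping, whereas the fixed margin expansion here is a hypothetical placeholder for that learned component, and the file name and query text are invented for the example.

```python
# Minimal sketch (assumptions noted above): text-query cropping via the
# off-the-shelf OWL-ViT detector, not the ClipCrop decoder itself.
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
model.eval()

image = Image.open("photo.jpg").convert("RGB")   # hypothetical input image
text_query = [["a dog playing with a ball"]]     # user intention as text

inputs = processor(text=text_query, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map predictions back to pixel coordinates of the original image.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]

# Keep the highest-scoring box for the query and pad it by a fixed margin.
# ClipCrop would instead predict an aesthetically composed crop directly.
best = detections["scores"].argmax()
x0, y0, x1, y1 = detections["boxes"][best].tolist()
margin = 0.15
w, h = x1 - x0, y1 - y0
crop_box = (
    max(0.0, x0 - margin * w),
    max(0.0, y0 - margin * h),
    min(float(image.width), x1 + margin * w),
    min(float(image.height), y1 + margin * h),
)
image.crop(crop_box).save("crop.jpg")
```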
Related papers
- Debiasing Vision-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
- Evaluating Data Attribution for Text-to-Image Models [62.844382063780365]
We evaluate attribution through "customization" methods, which tune an existing large-scale model toward a given exemplar object or style.
Our key insight is that this allows us to efficiently create synthetic images that are computationally influenced by the exemplar by construction.
By taking into account the inherent uncertainty of the problem, we can assign soft attribution scores over a set of training images.
arXiv Detail & Related papers (2023-06-15T17:59:51Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, as well as qualitatively when editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- Prefix Conditioning Unifies Language and Label Supervision [84.11127588805138]
We show that dataset biases negatively affect pre-training by reducing the generalizability of learned representations.
In experiments, we show that this simple technique improves the performance in zero-shot image recognition accuracy and robustness to the image-level distribution shift.
arXiv Detail & Related papers (2022-06-02T16:12:26Z)
- Simple Open-Vocabulary Object Detection with Vision Transformers [51.57562920090721]
We propose a strong recipe for transferring image-text models to open-vocabulary object detection.
We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning.
We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection (a usage sketch of the image-conditioned case appears after this list).
arXiv Detail & Related papers (2022-05-12T17:20:36Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Zero-Shot Text-to-Image Generation [15.135825501365007]
We describe a transformer that autoregressively models the text and image tokens as a single stream of data.
With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
arXiv Detail & Related papers (2021-02-24T06:42:31Z)
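The one-shot image-conditioned detection mentioned in the OWL-ViT entry above can likewise be exercised through the public Hugging Face interface. The sketch below is again only an illustration of that off-the-shelf capability under placeholder file names; ClipCrop builds its image-query cropping on top of such a detector rather than exposing this call directly.

```python
# Minimal sketch of one-shot, image-conditioned detection with the public
# OWL-ViT API (file names are placeholders); not the ClipCrop pipeline.
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16")
model.eval()

image = Image.open("scene.jpg").convert("RGB")        # image to crop from
query_image = Image.open("query.jpg").convert("RGB")  # exemplar of the target

inputs = processor(images=image, query_images=query_image, return_tensors="pt")
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)

target_sizes = torch.tensor([image.size[::-1]])       # (height, width)
detections = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)[0]

# Boxes matching the exemplar, in pixel coordinates; the best one could be
# expanded into a crop exactly as in the text-query sketch above.
for score, box in zip(detections["scores"], detections["boxes"]):
    print(round(score.item(), 3), [round(v, 1) for v in box.tolist()])
```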
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.