VLPose: Bridging the Domain Gap in Pose Estimation with Language-Vision Tuning
- URL: http://arxiv.org/abs/2402.14456v1
- Date: Thu, 22 Feb 2024 11:21:54 GMT
- Title: VLPose: Bridging the Domain Gap in Pose Estimation with Language-Vision Tuning
- Authors: Jingyao Li, Pengguang Chen, Xuan Ju, Hong Xu, Jiaya Jia
- Abstract summary: We bridge the domain gap between natural and artificial scenarios with efficient tuning strategies.
We develop a novel framework called VLPose to extend the generalization and robustness of pose estimation models.
Our approach has demonstrated improvements of 2.26% and 3.74% on HumanArt and MSCOCO, respectively.
- Score: 53.35114015288077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thanks to advances in deep learning techniques, Human Pose Estimation (HPE)
has achieved significant progress in natural scenarios. However, these models
perform poorly in artificial scenarios such as painting and sculpture due to
the domain gap, constraining the development of virtual reality and augmented
reality. With the growth of model size, retraining the whole model on both
natural and artificial data is computationally expensive and inefficient. Our
research aims to bridge the domain gap between natural and artificial scenarios
with efficient tuning strategies. Leveraging the potential of language models,
we enhance the adaptability of traditional pose estimation models across
diverse scenarios with a novel framework called VLPose. VLPose leverages the
synergy between language and vision to extend the generalization and robustness
of pose estimation models beyond the traditional domains. Our approach has
demonstrated improvements of 2.26% and 3.74% on HumanArt and MSCOCO,
respectively, compared to state-of-the-art tuning strategies.
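The abstract describes VLPose only at a high level (efficient tuning plus a language-vision synergy), so the snippet below is a minimal sketch of what parameter-efficient language-vision tuning of a pose model could look like: a frozen pose backbone, placeholder text features, and a small trainable cross-attention adapter. The module names, dimensions, and the choice of cross-attention are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of parameter-efficient language-vision tuning for pose estimation.
# The actual VLPose architecture is not specified in the abstract; the frozen backbone,
# the cross-attention adapter, and all dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Small trainable module that injects text features into frozen visual features."""
    def __init__(self, vis_dim: int = 256, txt_dim: int = 512, n_heads: int = 4):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, N, vis_dim) flattened feature map; txt_tokens: (B, T, txt_dim)
        txt = self.txt_proj(txt_tokens)
        fused, _ = self.attn(query=vis_tokens, key=txt, value=txt)
        return self.norm(vis_tokens + fused)  # residual keeps the frozen features usable

class TunedPoseModel(nn.Module):
    def __init__(self, backbone: nn.Module, head: nn.Module, adapter: CrossAttentionAdapter):
        super().__init__()
        self.backbone, self.head, self.adapter = backbone, head, adapter
        for p in self.backbone.parameters():  # freeze the pretrained pose backbone
            p.requires_grad_(False)

    def forward(self, images, txt_tokens):
        feats = self.backbone(images)                 # (B, C, H, W)
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)     # (B, H*W, C)
        tokens = self.adapter(tokens, txt_tokens)     # language-conditioned fusion
        feats = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.head(feats)                       # keypoint heatmaps

# Toy stand-ins: a real setup would use a pretrained pose backbone (e.g. HRNet/ViTPose)
# and text features from a pretrained language encoder such as CLIP's.
backbone = nn.Sequential(nn.Conv2d(3, 256, 3, stride=4, padding=1), nn.ReLU())
head = nn.Conv2d(256, 17, 1)                          # 17 COCO keypoints
model = TunedPoseModel(backbone, head, CrossAttentionAdapter())

images = torch.randn(2, 3, 64, 64)
txt_tokens = torch.randn(2, 8, 512)                   # placeholder text-encoder output
heatmaps = model(images, txt_tokens)                  # (2, 17, 16, 16)

# Only the adapter and the lightweight head are optimized, which keeps tuning cheap
# compared to retraining the whole model on natural + artificial data.
optim = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```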
Related papers
- High Quality Human Image Animation using Regional Supervision and Motion Blur Condition [97.97432499053966]
First, we leverage regional supervision for detailed regions to enhance face and hand faithfulness.
Second, we model the motion blur explicitly to further improve the appearance quality.
Third, we explore novel training strategies for high-resolution human animation to improve the overall fidelity.
arXiv Detail & Related papers (2024-09-29T06:46:31Z)
- ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models.
Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios.
We propose a gradual backbone reversal approach founded on model merging (a generic weight-interpolation sketch follows this list).
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
- Regularized Training with Generated Datasets for Name-Only Transfer of Vision-Language Models [36.59260354292177]
Recent advancements in text-to-image generation have inspired researchers to generate datasets tailored for perception models using generative models.
We aim to fine-tune vision-language models to a specific classification model without access to any real images.
Despite the high fidelity of generated images, we observed a significant performance degradation when fine-tuning the model using the generated datasets.
arXiv Detail & Related papers (2024-06-08T10:43:49Z)
- COPAL: Continual Pruning in Large Language Generative Models [23.747878534962663]
COPAL is an algorithm developed for pruning large language generative models under a continual model adaptation setting.
Our empirical evaluation on LLMs of various sizes shows that COPAL outperforms baseline models.
arXiv Detail & Related papers (2024-05-02T18:24:41Z)
- Language-Guided World Models: A Model-Based Approach to AI Control [31.9089380929602]
This paper introduces the concept of Language-Guided World Models (LWMs).
LWMs are probabilistic models that can simulate environments by reading texts.
We take initial steps in developing robust LWMs that can generalize to compositionally novel language descriptions.
arXiv Detail & Related papers (2024-01-24T03:11:36Z)
- Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach [25.927323251675386]
We leverage the meta-features associated with each entity as a source of worldly knowledge and employ entity representations from the models.
We propose using the consistency between these representations and the meta-features as a metric for evaluating pre-trained models.
Our method's effectiveness is demonstrated across various domains, including models with relational datasets, large language models and image models.
arXiv Detail & Related papers (2024-01-02T17:08:26Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Style-Hallucinated Dual Consistency Learning: A Unified Framework for Visual Domain Generalization [113.03189252044773]
We propose a unified framework, Style-HAllucinated Dual consistEncy learning (SHADE), to handle domain shift in various visual tasks.
Our versatile SHADE can significantly enhance the generalization in various visual recognition tasks, including image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-12-18T11:42:51Z)
- Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of the latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z)
- QAGAN: Adversarial Approach To Learning Domain Invariant Language Features [0.76146285961466]
We explore an adversarial training approach to learning domain-invariant features (a gradient-reversal sketch follows this list).
We achieve a 15.2% improvement in EM score and a 5.6% boost in F1 score on the out-of-domain validation dataset.
arXiv Detail & Related papers (2022-06-24T17:42:18Z)
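The ReVLA entry above names model merging as the basis of its gradual backbone reversal, but the one-line summary does not describe the procedure. The following is a generic linear weight-interpolation sketch of model merging; the schedule, the toy modules, and the direction of interpolation are assumptions for illustration, not the paper's method.

```python
# Generic model-merging sketch: linear weight interpolation between two checkpoints.
# ReVLA's actual "gradual backbone reversal" procedure is not given in the summary;
# the toy backbones and the alpha schedule below are illustrative assumptions.
import copy
import torch
import torch.nn as nn

def merge_state_dicts(original: dict, adapted: dict, alpha: float) -> dict:
    """Return weights alpha * original + (1 - alpha) * adapted, key by key."""
    return {k: alpha * original[k] + (1.0 - alpha) * adapted[k] for k in original}

# Toy modules standing in for a pretrained visual backbone and its finetuned copy.
original_backbone = nn.Linear(16, 8)
adapted_backbone = copy.deepcopy(original_backbone)
with torch.no_grad():  # simulate weights drifting during downstream finetuning
    adapted_backbone.weight.add_(0.1 * torch.randn_like(adapted_backbone.weight))

merged = nn.Linear(16, 8)
# Gradually move the deployed weights back toward the original backbone.
for alpha in (0.25, 0.5, 0.75, 1.0):
    merged.load_state_dict(
        merge_state_dicts(original_backbone.state_dict(),
                          adapted_backbone.state_dict(), alpha))
    # ... evaluate the merged backbone on in-domain and out-of-domain data here ...
```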
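The QAGAN entry states only that adversarial training is used to learn domain-invariant features. Below is a standard gradient-reversal (DANN-style) sketch of that general idea; the encoder, heads, and losses are placeholders, and the setup may differ from the paper's actual architecture.

```python
# Standard gradient-reversal sketch of adversarial domain-invariant feature learning.
# This is a common illustration of the general technique, not QAGAN's implementation.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # flip the gradient for the encoder

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
domain_clf = nn.Linear(64, 2)                 # source vs. target domain
task_head = nn.Linear(64, 2)                  # placeholder task logits

x = torch.randn(8, 128)
task_labels = torch.randint(0, 2, (8,))
domain_labels = torch.randint(0, 2, (8,))

feats = encoder(x)
task_loss = nn.functional.cross_entropy(task_head(feats), task_labels)
domain_logits = domain_clf(GradReverse.apply(feats, 1.0))
domain_loss = nn.functional.cross_entropy(domain_logits, domain_labels)

# The domain classifier learns to tell domains apart, while the reversed gradient
# pushes the encoder toward features the classifier cannot separate.
(task_loss + domain_loss).backward()
```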
This list is automatically generated from the titles and abstracts of the papers in this site.