CBIL: Collective Behavior Imitation Learning for Fish from Real Videos
- URL: http://arxiv.org/abs/2504.00234v1
- Date: Mon, 31 Mar 2025 21:15:00 GMT
- Title: CBIL: Collective Behavior Imitation Learning for Fish from Real Videos
- Authors: Yifan Wu, Zhiyang Dou, Yuko Ishiwaka, Shun Ogawa, Yuke Lou, Wenping Wang, Lingjie Liu, Taku Komura
- Abstract summary: We present a scalable approach, Collective Behavior Imitation Learning (CBIL), for learning fish schooling behavior directly from videos. The MVAE effectively maps 2D observations to implicit states that are compact and expressive for the subsequent imitation learning stage. Once trained, CBIL can be used for various animation tasks with the learned collective motion priors.
- Score: 58.81930297206828
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Reproducing realistic collective behaviors presents a captivating yet formidable challenge. Traditional rule-based methods rely on hand-crafted principles, limiting motion diversity and realism in generated collective behaviors. Recent imitation learning methods learn from data but often require ground truth motion trajectories and struggle with authenticity, especially in high-density groups with erratic movements. In this paper, we present a scalable approach, Collective Behavior Imitation Learning (CBIL), for learning fish schooling behavior directly from videos, without relying on captured motion trajectories. Our method first leverages Video Representation Learning, where a Masked Video AutoEncoder (MVAE) extracts implicit states from video inputs in a self-supervised manner. The MVAE effectively maps 2D observations to implicit states that are compact and expressive for the subsequent imitation learning stage. Then, we propose a novel adversarial imitation learning method to effectively capture the complex movements of schools of fish, allowing for efficient imitation of the distribution of motion patterns measured in the latent space. It also incorporates bio-inspired rewards alongside priors to regularize and stabilize training. Once trained, CBIL can be used for various animation tasks with the learned collective motion priors. We further show its effectiveness across different species. Finally, we demonstrate the application of our system in detecting abnormal fish behavior from in-the-wild videos.
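As a reading aid, here is a minimal sketch of the two-stage pipeline the abstract describes: a masked autoencoder compresses 2D observations into compact latent states, and an adversarial discriminator on latent transitions supplies the imitation reward, mixed with a bio-inspired regularizer. This is not the authors' code; all module names, network sizes, and the reward weighting below are illustrative assumptions.

```python
# Hedged sketch of the CBIL idea: latent encoding + adversarial imitation reward.
# Everything here (layer sizes, reward shaping, weight w_bio) is an assumption.
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Stand-in for the MVAE encoder: maps a 2D observation to a latent state."""
    def __init__(self, obs_dim=64, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, obs):
        return self.net(obs)

class TransitionDiscriminator(nn.Module):
    """Adversarial discriminator over latent state transitions (z_t, z_{t+1})."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, z_t, z_next):
        return self.net(torch.cat([z_t, z_next], dim=-1))

def imitation_reward(disc, z_t, z_next, bio_reward, w_bio=0.1):
    """Reward = adversarial term (how 'real' the latent transition looks)
    plus a small bio-inspired term (e.g. a separation/collision penalty)."""
    with torch.no_grad():
        logits = disc(z_t, z_next)
        # GAIL-style shaping: -log(1 - sigmoid(logit)), clamped for stability.
        r_adv = -torch.log(torch.clamp(1.0 - torch.sigmoid(logits), min=1e-4))
    return r_adv.squeeze(-1) + w_bio * bio_reward

# Usage with random tensors standing in for encoded video observations.
enc, disc = LatentEncoder(), TransitionDiscriminator()
obs_t, obs_next = torch.randn(8, 64), torch.randn(8, 64)
r = imitation_reward(disc, enc(obs_t), enc(obs_next), bio_reward=torch.zeros(8))
print(r.shape)  # torch.Size([8])
```

Measuring the discriminator on latent transitions rather than raw pixels is what makes the distribution matching tractable; the bio-inspired term is only a regularizer, weighted well below the adversarial reward in this sketch.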
Related papers
- SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning [50.98341607245458]
Masked video modeling is an effective paradigm for video self-supervised learning (SSL). This paper introduces a novel SSL approach for video representation learning, dubbed SMILE, by infusing both spatial and motion semantics. We establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data.
arXiv Detail & Related papers (2025-04-01T08:20:55Z) - Learning to Animate Images from A Few Videos to Portray Delicate Human Actions [80.61838364885482]
Video generative models still struggle to animate static images into videos that portray delicate human actions. In this paper, we explore the task of learning to animate images to portray delicate human actions using a small number of videos. We propose FLASH, which learns generalizable motion patterns by forcing the model to reconstruct a video using the motion features and cross-frame correspondences of another video.
arXiv Detail & Related papers (2025-03-01T01:09:45Z) - You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations [38.835807227433335]
Bimanual robotic manipulation is a long-standing challenge of embodied intelligence. We propose YOTO, which can extract and then inject patterns of bimanual actions from as few as a single binocular observation. YOTO achieves impressive performance in mimicking 5 intricate long-horizon bimanual tasks.
arXiv Detail & Related papers (2025-01-24T03:26:41Z) - Video2Reward: Generating Reward Function from Videos for Legged Robot Behavior Learning [27.233232260388682]
We introduce a new video2reward method, which directly generates reward functions from videos depicting the behaviors to be mimicked and learned. Our method surpasses the performance of state-of-the-art LLM-based reward generation methods by over 37.6% in terms of human normalized score.
arXiv Detail & Related papers (2024-12-07T03:10:27Z) - Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos [64.48857272250446]
We introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer.
We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge.
To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control.
arXiv Detail & Related papers (2024-12-05T18:57:04Z) - Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer [55.109778609058154]
Existing diffusion-based motion editing methods overlook the profound potential of the prior embedded within the weights of pre-trained models.
We uncover the roles and interactions of attention elements in capturing and representing motion patterns.
We integrate these elements to transfer a leader motion to a follower one while maintaining the nuanced characteristics of the follower, resulting in zero-shot motion transfer.
arXiv Detail & Related papers (2024-06-10T17:47:14Z) - CALM: Conditional Adversarial Latent Models for Directable Virtual Characters [71.66218592749448]
We present Conditional Adversarial Latent Models (CALM), an approach for generating diverse and directable behaviors for user-controlled interactive virtual characters.
Using imitation learning, CALM learns a representation of movement that captures the complexity of human motion, and enables direct control over character movements.
arXiv Detail & Related papers (2023-05-02T09:01:44Z) - Preserve Pre-trained Knowledge: Transfer Learning With Self-Distillation For Action Recognition [8.571437792425417]
We propose a novel transfer learning approach that combines self-distillation in fine-tuning to preserve knowledge from the pre-trained model learned from the large-scale dataset.
Specifically, we fix the encoder from the last epoch as the teacher model to guide the training of the encoder from the current epoch in the transfer learning.
arXiv Detail & Related papers (2022-05-01T16:31:25Z) - Self-supervised Motion Learning from Static Images [36.85209332144106]
Motion from Static Images (MoSI) learns to encode motion information.
We demonstrate that MoSI can discover regions with large motion even without fine-tuning on the downstream datasets.
arXiv Detail & Related papers (2021-04-01T03:55:50Z) - State-Only Imitation Learning for Dexterous Manipulation [63.03621861920732]
In this paper, we explore state-only imitation learning.
We train an inverse dynamics model and use it to predict actions for state-only demonstrations.
Our method performs on par with state-action approaches and considerably outperforms RL alone.
arXiv Detail & Related papers (2020-04-07T17:57:20Z)
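The last entry above (state-only imitation learning for dexterous manipulation) rests on one concrete trick: train an inverse dynamics model on the agent's own transitions, then use it to pseudo-label action-free demonstrations so a standard behavior-cloning loss can be applied. Below is a minimal sketch of that idea under assumed shapes and losses; it is not the paper's code.

```python
# Hedged sketch of state-only imitation via an inverse dynamics model (IDM).
# Shapes, architectures, and losses are illustrative assumptions.
import torch
import torch.nn as nn

state_dim, action_dim = 10, 4  # assumed sizes

# Inverse dynamics model: (s_t, s_{t+1}) -> a_t, and a simple policy: s_t -> a_t.
idm = nn.Sequential(nn.Linear(2 * state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
opt = torch.optim.Adam(list(idm.parameters()) + list(policy.parameters()), lr=1e-3)

# (1) Fit the IDM on self-collected transitions where the actions are known.
s, s_next = torch.randn(256, state_dim), torch.randn(256, state_dim)
a = torch.randn(256, action_dim)
idm_loss = nn.functional.mse_loss(idm(torch.cat([s, s_next], dim=-1)), a)

# (2) Pseudo-label state-only demonstrations and clone the inferred actions.
demo_s, demo_s_next = torch.randn(256, state_dim), torch.randn(256, state_dim)
with torch.no_grad():
    pseudo_a = idm(torch.cat([demo_s, demo_s_next], dim=-1))
bc_loss = nn.functional.mse_loss(policy(demo_s), pseudo_a)

opt.zero_grad()
(idm_loss + bc_loss).backward()
opt.step()
```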