Follow Anything: Open-set detection, tracking, and following in
real-time
- URL: http://arxiv.org/abs/2308.05737v2
- Date: Sat, 10 Feb 2024 03:53:18 GMT
- Title: Follow Anything: Open-set detection, tracking, and following in
real-time
- Authors: Alaa Maalouf and Ninad Jadhav and Krishna Murthy Jatavallabhula and
Makram Chahine and Daniel M. Vogt and Robert J. Wood and Antonio Torralba and
Daniela Rus
- Abstract summary: We present a robotic system to detect, track, and follow any object in real-time.
Our approach, dubbed "follow anything" (FAn), is an open-vocabulary and multimodal model.
FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second.
- Score: 89.83421771766682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tracking and following objects of interest is critical to several robotics
use cases, ranging from industrial automation to logistics and warehousing, to
healthcare and security. In this paper, we present a robotic system to detect,
track, and follow any object in real-time. Our approach, dubbed "follow
anything" (FAn), is an open-vocabulary and multimodal model -- it is not
restricted to concepts seen at training time and can be applied to novel
classes at inference time using text, images, or click queries. Leveraging rich
visual descriptors from large-scale pre-trained models (foundation models), FAn
can detect and segment objects by matching multimodal queries (text, images,
clicks) against an input image sequence. These detected and segmented objects
are tracked across image frames, all while accounting for occlusion and object
re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial
vehicle) and report its ability to seamlessly follow the objects of interest in
a real-time control loop. FAn can be deployed on a laptop with a lightweight
(6-8 GB) graphics card, achieving a throughput of 6-20 frames per second. To
enable rapid adoption, deployment, and extensibility, we open-source all our
code on our project webpage at https://github.com/alaamaalouf/FollowAnything .
We also encourage the reader to watch our 5-minute explainer video at
https://www.youtube.com/watch?v=6Mgt3EPytrw .
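The abstract describes detection as matching a query against visual descriptors extracted from each frame. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes a frame has already been turned into a small grid of feature vectors by some pre-trained model, and that the text, image, or click query has been embedded into the same space. The function names, the cosine-similarity criterion, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_query(
    features: List[List[Vector]], query: Vector, threshold: float = 0.8
) -> List[List[bool]]:
    """Mark every grid cell whose descriptor is close enough to the query.

    `features` is a hypothetical rows x cols grid of per-cell descriptors;
    `threshold` is an illustrative value, not one taken from the paper.
    """
    return [
        [cosine_similarity(cell, query) >= threshold for cell in row]
        for row in features
    ]


def mask_centroid(mask: List[List[bool]]) -> Optional[Tuple[float, float]]:
    """Return the (row, col) centroid of the matched cells, or None.

    None signals that nothing matched in this frame, for example because
    the object is occluded; a follower could then hold its last command.
    """
    cells = [
        (r, c)
        for r, row in enumerate(mask)
        for c, hit in enumerate(row)
        if hit
    ]
    if not cells:
        return None
    n = len(cells)
    return (sum(r for r, _ in cells) / n, sum(c for _, c in cells) / n)


if __name__ == "__main__":
    # Toy 2x2 "frame" with 2-dimensional descriptors.
    frame = [
        [(1.0, 0.0), (0.0, 1.0)],
        [(0.9, 0.1), (0.0, 1.0)],
    ]
    query = (1.0, 0.0)
    mask = match_query(frame, query)
    print(mask)
    print(mask_centroid(mask))
```

In a following loop, the centroid's offset from the image centre could serve as the error signal for a controller. That step is left out here because the abstract does not specify the control law.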
Related papers
- Enhancing In-vehicle Multiple Object Tracking Systems with Embeddable Ising Machines [0.10485739694839666]
We show an in-vehicle multiple object tracking system with a flexible assignment function.
The system relies on an embeddable Ising machine based on a quantum-inspired algorithm called simulated bifurcation.
Using a vehicle-mountable computing platform, we demonstrate a realtime system-wide throughput (23 frames per second on average) with the enhanced functionality.
arXiv Detail & Related papers (2024-10-18T00:18:27Z)
- VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking [61.56592503861093]
Open-vocabulary multi-object tracking (OVMOT) amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT).
Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens.
We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
arXiv Detail & Related papers (2024-10-11T05:01:49Z)
- Track Anything Rapter(TAR) [0.0]
Track Anything Rapter (TAR) is designed to detect, segment, and track objects of interest based on user-provided multimodal queries.
TAR utilizes cutting-edge pre-trained models like DINO, CLIP, and SAM to estimate the relative pose of the queried object.
We showcase how the integration of these foundational models with a custom high-level control algorithm results in a highly stable and precise tracking system.
arXiv Detail & Related papers (2024-05-19T19:51:41Z)
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
- Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models [28.304047711166056]
Large-scale pre-trained models have shown promising advances in detecting and segmenting objects in 2D static images in the wild.
This begs the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking?
In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator, into a model that tracks and segments objects of any category in 2D videos.
arXiv Detail & Related papers (2023-10-10T20:25:30Z)
- RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation [36.43143326197769]
Track-Any-Point (TAP) models isolate the relevant motion in a demonstration, and parameterize a low-level controller to reproduce this motion across changes in the scene configuration.
We show this results in robust robot policies that can solve complex object-arrangement tasks such as shape-matching, stacking, and even full path-following tasks such as applying glue and sticking objects together.
arXiv Detail & Related papers (2023-08-30T11:57:04Z)
- OVTrack: Open-Vocabulary Multiple Object Tracking [64.73379741435255]
OVTrack is an open-vocabulary tracker capable of tracking arbitrary object classes.
It sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark.
arXiv Detail & Related papers (2023-04-17T16:20:05Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- MVLidarNet: Real-Time Multi-Class Scene Understanding for Autonomous
Driving Using Multiple Views [60.538802124885414]
We present Multi-View LidarNet (MVLidarNet), a two-stage deep neural network for multi-class object detection and drivable space segmentation.
MVLidarNet is able to detect and classify objects while simultaneously determining the drivable space using a single LiDAR scan as input.
We show results on both KITTI and a much larger internal dataset, thus demonstrating the method's ability to scale by an order of magnitude.
arXiv Detail & Related papers (2020-06-09T21:28:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.