TaxaBind: A Unified Embedding Space for Ecological Applications
- URL: http://arxiv.org/abs/2411.00683v1
- Date: Fri, 01 Nov 2024 15:41:30 GMT
- Title: TaxaBind: A Unified Embedding Space for Ecological Applications
- Authors: Srikumar Sastry, Subash Khanal, Aayush Dhakal, Adeel Ahmad, Nathan Jacobs
- Abstract summary: We present TaxaBind, a unified embedding space for characterizing any species of interest.
TaxaBind is a multimodal embedding space across six modalities: ground-level images of species, geographic location, satellite images, text, audio, and environmental features.
- Score: 7.291750095728984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present TaxaBind, a unified embedding space for characterizing any species of interest. TaxaBind is a multimodal embedding space across six modalities: ground-level images of species, geographic location, satellite images, text, audio, and environmental features, useful for solving ecological problems. To learn this joint embedding space, we leverage ground-level images of species as a binding modality. We propose multimodal patching, a technique for effectively distilling the knowledge from various modalities into the binding modality. We construct two large datasets for pretraining: iSatNat with species images and satellite images, and iSoundNat with species images and audio. Additionally, we introduce TaxaBench-8k, a diverse multimodal dataset with six paired modalities for evaluating deep learning models on ecological tasks. Experiments with TaxaBind demonstrate its strong zero-shot and emergent capabilities on a range of tasks including species classification, cross-modal retrieval, and audio classification. The datasets and models are made available at https://github.com/mvrl/TaxaBind.
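A minimal sketch of the binding step, assuming a standard InfoNCE-style contrastive objective (names and sizes here are illustrative, and the paper's multimodal patching technique goes beyond plain contrastive alignment):

```python
import torch
import torch.nn.functional as F

def binding_contrastive_loss(img_emb: torch.Tensor,
                             mod_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling a modality's embeddings toward the
    ground-level image embeddings (the binding modality) of the same species."""
    img = F.normalize(img_emb, dim=-1)
    mod = F.normalize(mod_emb, dim=-1)
    logits = img @ mod.t() / temperature      # (B, B) cosine-similarity matrix
    targets = torch.arange(img.size(0))       # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# The image encoder stays frozen as the binding modality; only the other
# encoder (audio, satellite, text, ...) receives gradients.
img_emb = torch.randn(8, 512)                         # from a frozen image encoder
audio_emb = torch.randn(8, 512, requires_grad=True)   # from a trainable audio encoder
binding_contrastive_loss(img_emb, audio_emb).backward()
```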
Related papers
- OpenWildlife: Open-Vocabulary Multi-Species Wildlife Detector for Geographically-Diverse Aerial Imagery [5.612783442210011]
We introduce OpenWildlife, an open-vocabulary wildlife detector designed for multi-species identification in diverse aerial imagery. OW leverages language-aware embeddings and a novel adaptation of the Grounding-DINO framework, enabling it to identify species specified through natural language inputs across both terrestrial and marine environments. OW outperforms most existing methods, achieving up to 0.981 mAP50 with fine-tuning and 0.597 mAP50 on seven datasets featuring novel species.
arXiv Detail & Related papers (2025-06-24T00:10:19Z)
- The iNaturalist Sounds Dataset [60.157076990024606]
iNatSounds is a collection of 230,000 audio files capturing sounds from over 5,500 species, contributed by more than 27,000 recordists worldwide. The dataset encompasses sounds from birds, mammals, insects, reptiles, and amphibians, with audio and species labels derived from observations submitted to iNaturalist. We envision models trained on this data powering next-generation public engagement applications, and assisting biologists, ecologists, and land use managers in processing large audio collections.
arXiv Detail & Related papers (2025-05-31T02:07:37Z)
- CrypticBio: A Large Multimodal Dataset for Visually Confusing Biodiversity [3.73232466691291]
We present CrypticBio, the largest publicly available dataset of visually confusing species. Curated from real-world trends in species misidentification among community annotators of iNaturalist, CrypticBio contains 52K unique cryptic groups spanning 67K species.
arXiv Detail & Related papers (2025-05-16T14:35:56Z)
- MammAlps: A multi-view video behavior monitoring dataset of wild mammals in the Swiss Alps [41.58000025132071]
MammAlps is a dataset for wildlife behavior monitoring, collected from 9 camera traps in the Swiss National Park.
Based on 6135 single-animal clips, we propose the first hierarchical and multimodal animal behavior recognition benchmark.
We also propose a second, ecology-oriented benchmark aimed at identifying activities, species, the number of individuals, and meteorological conditions.
arXiv Detail & Related papers (2025-03-23T21:51:58Z)
- AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities [5.767156832161819]
We propose AnySat, a multimodal model based on a joint embedding predictive architecture (JEPA) and scale-adaptive spatial encoders.
To demonstrate the advantages of this unified approach, we compile GeoPlex, a collection of 5 multimodal datasets.
We then train a single powerful model on these diverse datasets simultaneously.
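A toy sketch of the JEPA objective (illustrative only; AnySat's actual encoders are scale-adaptive and its target branch is handled more carefully than this):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyJEPA(nn.Module):
    """Joint-embedding predictive setup: a predictor maps the context
    embedding to the target branch's embedding, so the loss lives in
    latent space rather than pixel space. Sizes are illustrative."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.context_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.target_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.predictor = nn.Linear(dim, dim)

    def forward(self, context_view: torch.Tensor, target_view: torch.Tensor) -> torch.Tensor:
        pred = self.predictor(self.context_encoder(context_view))
        with torch.no_grad():                  # target branch gets no direct gradient
            tgt = self.target_encoder(target_view)
        return F.mse_loss(pred, tgt)

loss = TinyJEPA()(torch.randn(4, 256), torch.randn(4, 256))
```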
arXiv Detail & Related papers (2024-12-18T18:11:53Z)
- Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery [0.0]
In this paper, we show how contrastive learning over multiple views of image data can be leveraged for representation learning.
In particular, we show how combining these views improves species classification.
arXiv Detail & Related papers (2024-09-28T19:07:22Z)
- OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces [67.07083389543799]
We present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters.
Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together.
Experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications.
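A rough sketch of binding pre-trained spaces (hypothetical names; OmniBind's full recipe for remapping and combining spaces is more involved): the specialist encoders stay frozen, and only small projection heads are trained to map each one into a shared space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceBinder(nn.Module):
    """Remap frozen specialist embedding spaces into one shared space via
    per-model linear projection heads (the only trainable parameters here)."""
    def __init__(self, specialist_dims: dict[str, int], shared_dim: int = 512):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(d, shared_dim) for name, d in specialist_dims.items()}
        )

    def forward(self, name: str, emb: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.heads[name](emb), dim=-1)

binder = SpaceBinder({"clip_vision": 768, "clap_audio": 512})
shared_img = binder("clip_vision", torch.randn(2, 768))  # both outputs now
shared_aud = binder("clap_audio", torch.randn(2, 512))   # live in one space
```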
arXiv Detail & Related papers (2024-07-16T16:24:31Z)
- SatBird: Bird Species Distribution Modeling with Remote Sensing and Citizen Science Data [68.2366021016172]
We present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird.
We also provide a dataset in Kenya representing low-data regimes.
We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks.
arXiv Detail & Related papers (2023-11-02T02:00:27Z)
- Ferret: Refer and Ground Anything Anywhere at Any Granularity [93.80461625100826]
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image.
Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image.
Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes.
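A crude sketch of the hybrid-region idea (illustrative only, not Ferret's actual module): embed the discrete box coordinates and fuse them with continuous features pooled from the region.

```python
import torch
import torch.nn as nn

class HybridRegionRep(nn.Module):
    """Fuse discrete coordinates with continuous region features.
    The mean-pool below is a crude stand-in for proper region sampling."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.coord_embed = nn.Linear(4, feat_dim)       # (x1, y1, x2, y2), normalized
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, box: torch.Tensor, feat_map: torch.Tensor) -> torch.Tensor:
        coord = self.coord_embed(box)                   # (B, feat_dim)
        region = feat_map.mean(dim=(2, 3))              # (B, C) pooled image features
        return self.fuse(torch.cat([coord, region], dim=-1))

rep = HybridRegionRep()(torch.rand(2, 4), torch.randn(2, 256, 16, 16))
```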
arXiv Detail & Related papers (2023-10-11T17:55:15Z)
- Fewshot learning on global multimodal embeddings for earth observation tasks [5.057850174013128]
We pretrain a CLIP/ViT-based model using three different modalities of satellite imagery covering over 10% of Earth's total landmass.
We use the embeddings produced for each modality with a classical machine learning method to attempt different downstream tasks for earth observation.
We visually show that this embedding space, obtained with no labels, is sensitive to the different earth features represented by the labelled datasets we selected.
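In the spirit of this setup, a hedged sketch of a few-shot probe (random arrays stand in for the pretrained embeddings, which are not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(40, 512))   # stand-in for frozen pretrained embeddings
train_y = rng.integers(0, 2, size=40)    # a binary earth-observation label

# Classical classifier on top of the embeddings; no fine-tuning of the encoder.
probe = LogisticRegression(max_iter=1000).fit(train_emb, train_y)
print(probe.predict(rng.normal(size=(10, 512))))
```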
arXiv Detail & Related papers (2023-09-29T20:15:52Z)
- Spatial Implicit Neural Representations for Global-Scale Species Mapping [72.92028508757281]
Given a set of locations where a species has been observed, the goal is to build a model to predict whether the species is present or absent at any location.
Traditional methods struggle to take advantage of emerging large-scale crowdsourced datasets.
We use Spatial Implicit Neural Representations (SINRs) to jointly estimate the geographical ranges of 47k species.
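An illustrative miniature of the idea (the actual SINR encodings, architecture, and losses differ): an MLP maps an encoded location to per-species presence probabilities.

```python
import torch
import torch.nn as nn

class TinySINR(nn.Module):
    """Map (lon, lat) to presence probabilities for many species at once."""
    def __init__(self, num_species: int = 47_000, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_species),
        )

    def forward(self, lonlat: torch.Tensor) -> torch.Tensor:
        rad = torch.deg2rad(lonlat)
        # sin/cos wrapping keeps longitude continuous across the +/-180 line
        enc = torch.cat([torch.sin(rad), torch.cos(rad)], dim=-1)   # (B, 4)
        return torch.sigmoid(self.net(enc))                         # (B, num_species)

probs = TinySINR()(torch.tensor([[90.3, 23.7]]))  # presence scores for one location
```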
arXiv Detail & Related papers (2023-06-05T03:36:01Z)
- Bird Distribution Modelling using Remote Sensing and Citizen Science data [31.375576105932442]
Climate change is a major driver of biodiversity loss.
There are significant knowledge gaps about the distribution of species.
We propose an approach leveraging computer vision to improve species distribution modelling.
arXiv Detail & Related papers (2023-05-01T20:27:11Z)
- I-Nema: A Biological Image Dataset for Nematode Recognition [3.1918817988202606]
Nematode worms are one of the most abundant metazoan groups on Earth, occupying diverse ecological niches.
Accurate recognition or identification of nematodes is of great importance for pest control, soil ecology, bio-geography, habitat conservation and climate change mitigation.
Computer vision and image processing have achieved a few successes in nematode species recognition; however, accurate methods remain in great demand.
arXiv Detail & Related papers (2021-03-15T12:29:37Z)
- PhraseCut: Language-based Image Segmentation in the Wild [62.643450401286]
We consider the problem of segmenting image regions given a natural language phrase.
Our dataset is collected on top of the Visual Genome dataset.
Our experiments show that the scale and diversity of concepts in our dataset pose significant challenges to the existing state-of-the-art.
arXiv Detail & Related papers (2020-08-03T20:58:53Z)
- Two-View Fine-grained Classification of Plant Species [66.75915278733197]
We propose a novel method based on a two-view leaf image representation and a hierarchical classification strategy for fine-grained recognition of plant species.
A deep metric based on Siamese convolutional neural networks is used to reduce the dependence on a large number of training samples and make the method scalable to new plant species.
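A hedged sketch of such a Siamese metric (the encoder below is a placeholder, not the paper's CNN): a shared encoder embeds both views, and a contrastive loss with a margin separates species.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # shared weights

def contrastive_loss(a, b, same_species, margin: float = 1.0):
    """Pull same-species pairs together, push others apart by a margin."""
    d = F.pairwise_distance(encoder(a), encoder(b))
    return torch.where(same_species, d.pow(2), F.relu(margin - d).pow(2)).mean()

x1, x2 = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
loss = contrastive_loss(x1, x2, torch.tensor([True, False, True, False]))
```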
arXiv Detail & Related papers (2020-05-18T21:57:47Z)