Scaling multi-species occupancy models to large citizen science datasets
- URL: http://arxiv.org/abs/2206.08894v1
- Date: Fri, 17 Jun 2022 16:54:56 GMT
- Title: Scaling multi-species occupancy models to large citizen science datasets
- Authors: Martin Ingram, Damjan Vukcevic, Nick Golding
- Abstract summary: We develop approximate Bayesian inference methods to scale multi-species occupancy models to large datasets.
We evaluate the predictions on a spatially separated test set of 59,338 records.
We find that modelling the detection process greatly improves agreement and that the resulting maps agree as closely with expert maps as ones estimated using high quality survey data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Citizen science datasets can be very large and promise to improve species
distribution modelling, but detection is imperfect, risking bias when fitting
models. In particular, observers may not detect species that are actually
present. Occupancy models can estimate and correct for this observation
process, and multi-species occupancy models exploit similarities in the
observation process, which can improve estimates for rare species. However, the
computational methods currently used to fit these models do not scale to large
datasets. We develop approximate Bayesian inference methods and use graphics
processing units (GPUs) to scale multi-species occupancy models to very large
citizen science data. We fit multi-species occupancy models to one month of
data from the eBird project consisting of 186,811 checklist records comprising
430 bird species. We evaluate the predictions on a spatially separated test set
of 59,338 records, comparing two different inference methods -- Markov chain
Monte Carlo (MCMC) and variational inference (VI) -- to occupancy models fitted
to each species separately using maximum likelihood. We fitted models to the
entire dataset using VI, and up to 32,000 records with MCMC. VI fitted to the
entire dataset performed best, outperforming single-species models on both AUC
(90.4% compared to 88.7%) and on log likelihood (-0.080 compared to -0.085). We
also evaluate how well range maps predicted by the model agree with expert
maps. We find that modelling the detection process greatly improves agreement
and that the resulting maps agree as closely with expert maps as ones estimated
using high quality survey data. Our results demonstrate that multi-species
occupancy models are a compelling approach to model large citizen science
datasets, and that, once the observation process is taken into account, they
can model species distributions accurately.
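The detection correction described in the abstract can be made concrete with a minimal sketch of a single-species occupancy likelihood: a site is occupied with probability psi, and each visit detects the species with probability p only if it is occupied, so a history with no detections may reflect either absence or missed detections. This is a generic NumPy formulation for illustration only, not the authors' multi-species GPU implementation; all names are assumptions.

```python
import numpy as np

def occupancy_log_lik(y, psi, p):
    """Marginal log-likelihood of a basic single-species occupancy model.

    y:   (n_sites, n_visits) binary detection histories
    psi: (n_sites,) occupancy probabilities
    p:   (n_sites, n_visits) per-visit detection probabilities
    """
    # Probability of each site's detection history, given the site is occupied
    lik_occupied = np.prod(np.where(y == 1, p, 1.0 - p), axis=1)
    # A site with no detections at all may instead be truly unoccupied
    never_detected = (y.sum(axis=1) == 0)
    # Marginalize the latent occupancy state at each site
    lik = psi * lik_occupied + (1.0 - psi) * never_detected
    return np.log(lik).sum()
```

Maximizing this over psi and p (per species) corresponds to the single-species maximum-likelihood baseline the paper compares against; the multi-species Bayesian variants place shared priors over these parameters and fit them with MCMC or VI.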
Related papers
- Multi-Scale and Multimodal Species Distribution Modeling [4.022195138381868]
Species distribution models (SDMs) aim to predict the distribution of species relating occurrence data with environmental variables.
Recent applications of deep learning to SDMs have enabled new avenues, specifically the inclusion of spatial data.
We develop a modular structure for SDMs that allows us to test the effect of scale in both single- and multi-scale settings.
Results on the GeoLifeCLEF 2023 benchmark indicate that considering multimodal data and learning multi-scale representations leads to more accurate models.
arXiv Detail & Related papers (2024-11-06T15:57:20Z)
- More precise edge detections [0.0]
Edge detection (ED) is a base task in computer vision.
Current models still suffer from unsatisfactory precision rates.
Model architectures for more precise predictions still need investigation.
arXiv Detail & Related papers (2024-07-29T13:24:55Z)
- LD-SDM: Language-Driven Hierarchical Species Distribution Modeling [9.620416509546471]
We focus on the problem of species distribution modeling using global-scale presence-only data.
To capture a stronger implicit relationship between species, we encode the taxonomic hierarchy of species using a large language model.
We propose a novel proximity-aware evaluation metric that enables evaluating species distribution models.
arXiv Detail & Related papers (2023-12-13T18:11:37Z)
- Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Just several anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z)
- Spatial Implicit Neural Representations for Global-Scale Species Mapping [72.92028508757281]
Given a set of locations where a species has been observed, the goal is to build a model to predict whether the species is present or absent at any location.
Traditional methods struggle to take advantage of emerging large-scale crowdsourced datasets.
We use Spatial Implicit Neural Representations (SINRs) to jointly estimate the geographical range of 47k species simultaneously.
arXiv Detail & Related papers (2023-06-05T03:36:01Z)
- Knowledge is a Region in Weight Space for Fine-tuned Language Models [48.589822853418404]
We study how the weight space and the underlying loss landscape of different models are interconnected.
We show that language models that have been finetuned on the same dataset form a tight cluster in the weight space, while models finetuned on different datasets from the same underlying task form a looser cluster.
arXiv Detail & Related papers (2023-02-09T18:59:18Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? [0.2836066255205732]
We contribute to micro-data model-based reinforcement learning (MBRL) by rigorously comparing popular generative models.
We find that on an environment that requires multimodal posterior predictives, mixture density nets outperform all other models by a large margin.
We also found that deterministic models are on par, in fact they consistently (although non-significantly) outperform their probabilistic counterparts.
arXiv Detail & Related papers (2021-07-24T11:38:25Z)
- Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
arXiv Detail & Related papers (2020-09-22T20:19:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.