Related papers: Towards Application-Specific Evaluation of Vision Models: Case Studies in Ecology and Biology

URL: http://arxiv.org/abs/2505.02825v2
Date: Tue, 06 May 2025 10:17:58 GMT
Title: Towards Application-Specific Evaluation of Vision Models: Case Studies in Ecology and Biology
Authors: Alex Hoi Hang Chan, Otto Brookes, Urs Waldmann, Hemal Naik, Iain D. Couzin, Majid Mirmehdi, Noël Adiko Houa, Emmanuelle Normand, Christophe Boesch, Lukas Boesch, Mimi Arandjelovic, Hjalmar Kühl, Tilo Burghardt, Fumihiro Kano
Abstract summary: We show that even models with strong machine learning performance (e.g., 87% mAP) can yield data that leads to discrepancies in abundance estimates compared to expert-derived data. Motivated by these findings, we call for researchers to integrate application-specific metrics in ecological/biological datasets.
Score: 9.281480445109484
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Computer vision methods have demonstrated considerable potential to streamline ecological and biological workflows, with a growing number of datasets and models becoming available to the research community. However, these resources focus predominantly on evaluation using machine learning metrics, with relatively little emphasis on how their application impacts downstream analysis. We argue that models should be evaluated using application-specific metrics that directly represent model performance in the context of its final use case. To support this argument, we present two disparate case studies: (1) estimating chimpanzee abundance and density with camera trap distance sampling when using a video-based behaviour classifier and (2) estimating head rotation in pigeons using a 3D posture estimator. We show that even models with strong machine learning performance (e.g., 87% mAP) can yield data that leads to discrepancies in abundance estimates compared to expert-derived data. Similarly, the highest-performing models for posture estimation do not produce the most accurate inferences of gaze direction in pigeons. Motivated by these findings, we call for researchers to integrate application-specific metrics in ecological/biological datasets, allowing for models to be benchmarked in the context of their downstream application and to facilitate better integration of models into application workflows.

Related papers

Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. We present Prismatic Synthesis, a framework for generating diverse synthetic data.
arXiv Detail & Related papers (2025-05-26T16:05:10Z)
Improving the portability of predicting students performance models by using ontologies [0.0]
One of the main current challenges in Educational Data Mining and Learning Analytics is the portability of predictive models. We propose the use of an ontology built on a taxonomy of actions that summarises students' interactions with the Moodle learning management system. Results indicate that the proposed ontology improves the portability of the models in terms of predictive accuracy.
arXiv Detail & Related papers (2024-10-09T18:18:54Z)
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models [41.76935689355034]
Discriminative and generative pretraining have yielded geometry estimation models with strong generalization capabilities. We build fair and strong baselines for evaluating and analyzing geometry estimation models. We evaluate monocular geometry estimators on more challenging geometry estimation benchmarks with diverse scenes and high-quality annotations.
arXiv Detail & Related papers (2024-06-18T14:44:12Z)
The Paradox of Motion: Evidence for Spurious Correlations in Skeleton-based Gait Recognition Models [4.089889918897877]
This study challenges the prevailing assumption that vision-based gait recognition relies primarily on motion patterns. We show through a comparative analysis that removing height information leads to notable performance degradation. We propose a spatial transformer model processing individual poses, disregarding any temporal information, which achieves unreasonably good accuracy.
arXiv Detail & Related papers (2024-02-13T09:33:12Z)
Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach [25.927323251675386]
We leverage the meta-features associated with each entity as a source of worldly knowledge and employ entity representations from the models. We propose using the consistency between these representations and the meta-features as a metric for evaluating pre-trained models. Our method's effectiveness is demonstrated across various domains, including models with relational datasets, large language models and image models.
arXiv Detail & Related papers (2024-01-02T17:08:26Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the performance of the Llama 2 model by up to 15 percentage points relative to baselines.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Scaling Laws Do Not Scale [54.72120385955072]
Recent work has argued that as the size of a dataset increases, the performance of a model trained on that dataset will increase. We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output. Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations.
arXiv Detail & Related papers (2023-07-05T15:32:21Z)
Representer Point Selection for Explaining Regularized High-dimensional Models [105.75758452952357]
We introduce a class of sample-based explanations we term high-dimensional representers. Our workhorse is a novel representer theorem for general regularized high-dimensional models. We study the empirical performance of our proposed methods on three real-world binary classification datasets and two recommender system datasets.
arXiv Detail & Related papers (2023-05-31T16:23:58Z)
Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDM) automatically identify underperforming groups of datapoints. This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks. Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems. Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning [109.74041512359476]
We study a number of design decisions for the predictive model in visual MBRL algorithms. We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance. We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks will perform the same as the best-performing models when trained on the same training data.
arXiv Detail & Related papers (2020-12-08T18:03:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.