CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping
- URL: http://arxiv.org/abs/2603.04091v1
- Date: Wed, 04 Mar 2026 14:01:47 GMT
- Title: CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping
- Authors: Simon Warmers, Muhammad Zawish, Fayaz Ali Dharejo, Steven Davy, Radu Timofte
- Abstract summary: We propose a level-aware vision language framework that jointly predicts plant age and leaf count using a single multi-task model built on CLIP embeddings. Our method aggregates rotational views into angle-invariant representations and conditions visual features on lightweight text priors encoding viewpoint level for stable prediction under incomplete or unordered inputs. On the GroMo25 benchmark, our approach reduces mean age MAE from 7.74 to 3.91 and mean leaf-count MAE from 5.52 to 3.08 compared to the GroMo baseline, corresponding to improvements of 49.5% and 44.2%, respectively.
- Score: 43.24254323363639
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modeling plant growth dynamics plays a central role in modern agricultural research. However, learning robust predictors from multi-view plant imagery remains challenging due to strong viewpoint redundancy and viewpoint-dependent appearance changes. We propose a level-aware vision language framework that jointly predicts plant age and leaf count using a single multi-task model built on CLIP embeddings. Our method aggregates rotational views into angle-invariant representations and conditions visual features on lightweight text priors encoding viewpoint level for stable prediction under incomplete or unordered inputs. On the GroMo25 benchmark, our approach reduces mean age MAE from 7.74 to 3.91 and mean leaf-count MAE from 5.52 to 3.08 compared to the GroMo baseline, corresponding to improvements of 49.5% and 44.2%, respectively. The unified formulation simplifies the pipeline by replacing the conventional dual-model setup while improving robustness to missing views. The models and code are available at: https://github.com/SimonWarmers/CLIP-MVP
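The abstract's pipeline can be sketched in a few lines: per-view CLIP image features are mean-pooled into an angle-invariant representation, fused with a text-prior embedding, and passed to two regression heads. This is a minimal illustration with made-up weights and function names (`multitask_forward`, `w_age`, `w_leaf` are assumptions), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def multitask_forward(view_embeds, text_prior, params):
    """view_embeds: (V, D) per-view CLIP image features for one plant;
    text_prior: (D,) text embedding encoding the viewpoint level."""
    visual = view_embeds.mean(axis=0)          # angle-invariant pooling over views
    fused = np.concatenate([visual, text_prior])
    h = np.maximum(params["W"] @ fused, 0.0)   # shared hidden layer with ReLU
    return params["w_age"] @ h, params["w_leaf"] @ h  # two regression heads

D, H = 512, 64
params = {
    "W": rng.normal(size=(H, 2 * D)) * 0.02,
    "w_age": rng.normal(size=H),
    "w_leaf": rng.normal(size=H),
}
views = rng.normal(size=(8, D))                # 8 rotational views of one plant
text = rng.normal(size=D)
age, leaves = multitask_forward(views, text, params)

# Mean pooling makes the prediction invariant to view order, which is
# how a method of this shape stays stable under unordered inputs:
age_shuffled, _ = multitask_forward(views[::-1], text, params)
assert np.isclose(age, age_shuffled)
```

Dropping a view only changes the mean slightly, which is consistent with the claimed robustness to incomplete view sets.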
Related papers
- LeafInst - Unified Instance Segmentation Network for Fine-Grained Forestry Leaf Phenotype Analysis: A New UAV-based Benchmark [10.61947524568352]
LeafInst is a novel segmentation framework tailored for irregular and multi-scale leaf structures. It achieves 68.4 mAP, outperforming YOLOv11 by 7.1 percent and MaskDINO by 6.5 percent.
arXiv Detail & Related papers (2026-03-04T01:01:57Z) - MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation [48.84450712826316]
MSVBench is the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models with the fine-grained perceptual rigor of domain-specific expert models.
arXiv Detail & Related papers (2026-02-27T12:26:34Z) - VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling [60.341503853471494]
We show that vision-language-action models degrade sharply under novel camera viewpoints and visual perturbations. We propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates.
arXiv Detail & Related papers (2025-12-02T16:16:13Z) - A Multimodal, Multitask System for Generating E-Commerce Text Listings from Images [0.0]
We propose an end-to-end, multi-task system that generates factually grounded textual listings from a single image. The hierarchical generation process proves highly effective, slashing the factual hallucination rate from 12.7% to 7.1%. One minor caveat is that the model performs 3.5% worse than a direct vision-to-language model on ROUGE-L score.
arXiv Detail & Related papers (2025-10-22T11:50:49Z) - ViewSparsifier: Killing Redundancy in Multi-View Plant Phenotyping [8.348234911002821]
Plant phenotyping involves analyzing observable characteristics of plants to better understand their growth, health, and development. In the context of deep learning, this analysis is often approached through single-view classification or regression models. To address this, the Growth Modelling (GroMo) Grand Challenge at ACM Multimedia 2025 provides a multi-view dataset featuring multiple plants.
arXiv Detail & Related papers (2025-09-10T12:53:38Z) - GroMo: Plant Growth Modeling with Multiview Images [3.7287379829068805]
We present the Growth Modelling (GroMo) challenge, which is designed for two primary tasks: plant age prediction and leaf count estimation. The GroMo Challenge aims to advance plant phenotyping research by encouraging innovative solutions for tracking and predicting plant growth.
arXiv Detail & Related papers (2025-03-09T13:23:16Z) - Fisher Mask Nodes for Language Model Merging [0.0]
We introduce a novel model merging method for Transformers, combining insights from previous work in Fisher-weighted averaging and the use of Fisher information in model pruning.
Our method exhibits a regular and significant performance increase across various models in the BERT family, outperforming full-scale Fisher-weighted averaging in a fraction of the computational cost.
arXiv Detail & Related papers (2024-03-14T21:52:26Z) - Stay on topic with Classifier-Free Guidance [57.28934343207042]
We show that CFG can be used broadly as an inference-time technique in pure language modeling.
We show that CFG improves the performance of Pythia, GPT-2 and LLaMA-family models across an array of tasks.
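The CFG mechanism this entry refers to is the standard classifier-free guidance formula applied to next-token logits: extrapolate the conditional (prompted) logits away from the unconditional ones. A minimal sketch with made-up toy logits (the guidance scale the paper actually uses is not shown here):

```python
import numpy as np

def cfg_logits(cond, uncond, gamma=1.5):
    """Classifier-free guidance at inference time.
    gamma=1 recovers plain conditional decoding; gamma>1 pushes the
    distribution further toward tokens favored by the prompt."""
    return uncond + gamma * (cond - uncond)

# Toy next-token logits from two forward passes of the same model:
cond = np.array([2.0, 1.0, 0.5])    # context includes the prompt
uncond = np.array([1.5, 1.2, 0.9])  # prompt dropped / empty context
guided = cfg_logits(cond, uncond, gamma=2.0)
print(guided)  # [2.5 0.8 0.1] -- the prompt-favored first token is boosted
```

The cost is one extra forward pass per step for the unconditional logits, the same trade-off CFG makes in diffusion models.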
arXiv Detail & Related papers (2023-06-30T17:07:02Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
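The idea described above — cluster the key/value tokens by content, then attend over the much shorter centroid sequence — can be sketched generically. This is an illustrative implementation of content-based clustered attention under that description, not the authors' exact ClusTR algorithm; the k-means routine and all shapes are assumptions:

```python
import numpy as np

def clustered_attention(q, k, v, n_clusters=4, iters=5):
    """Cluster N key tokens into C centroids with a few k-means steps,
    average the value tokens within each cluster, then run dense
    attention over the centroids. Cost drops from O(N^2) to O(N*C)."""
    rng = np.random.default_rng(0)
    centroids = k[rng.choice(len(k), n_clusters, replace=False)]
    for _ in range(iters):
        # assign each key token to its nearest centroid
        assign = np.argmin(((k[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if (assign == c).any():
                centroids[c] = k[assign == c].mean(0)
    # aggregate values per cluster to match the clustered keys
    v_c = np.stack([v[assign == c].mean(0) if (assign == c).any()
                    else np.zeros(v.shape[1]) for c in range(n_clusters)])
    scores = q @ centroids.T / np.sqrt(q.shape[1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)               # softmax over C centroids
    return w @ v_c

q = np.random.default_rng(1).normal(size=(6, 16))   # 6 query tokens
k = np.random.default_rng(2).normal(size=(32, 16))  # 32 key tokens -> 4 clusters
v = np.random.default_rng(3).normal(size=(32, 16))
out = clustered_attention(q, k, v)                  # shape (6, 16)
```

Each query now attends to 4 centroids instead of 32 keys, while the centroids still summarize the content of the original token set.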
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
arXiv Detail & Related papers (2021-11-07T13:09:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information (including all information) and is not responsible for any consequences of its use.