Fugu-MT paper translation (summary): CounTR: Transformer-based Generalised Visual Counting

Paper summary: CounTR: Transformer-based Generalised Visual Counting

arxiv url: http://arxiv.org/abs/2208.13721v3
Date: Fri, 2 Jun 2023 07:51:22 GMT
Status: translation complete
Last updated in system: 2023-06-05 20:56:32.572774
Title: CounTR: Transformer-based Generalised Visual Counting
Authors: Chang Liu, Yujie Zhong, Andrew Zisserman, Weidi Xie
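Abstract summary: We develop a computational model for counting objects from arbitrary semantic categories, using an arbitrary number of "exemplars". We conduct thorough ablation studies on large-scale counting benchmarks such as FSC-147 and demonstrate state-of-the-art performance in both zero-shot and few-shot settings.
Reference score (attention level computed by this site): 94.54725247039441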
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture for generalised visual object counting, termed Counting Transformer (CounTR), which explicitly captures the similarity between image patches, or between image patches and the given "exemplars", with the attention mechanism; (2) We adopt a two-stage training regime that first pre-trains the model with self-supervised learning and then applies supervised fine-tuning; (3) We propose a simple, scalable pipeline for synthesizing training images with a large number of instances, or with instances from different semantic categories, explicitly forcing the model to make use of the given "exemplars"; (4) We conduct thorough ablation studies on large-scale counting benchmarks, e.g. FSC-147, and demonstrate state-of-the-art performance in both zero-shot and few-shot settings.
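The abstract describes using the attention mechanism to capture the similarity between image patches and the given "exemplars". As a rough illustration only, the following is a minimal pure-Python sketch of scaled dot-product cross-attention, in which each image-patch token attends over the exemplar tokens. It is not the CounTR implementation: the function names, the toy 2-dimensional tokens, and the absence of learned query/key/value projections are all assumptions made for brevity, and the sketch assumes at least one exemplar is given.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(patch_tokens, exemplar_tokens):
    """Hypothetical helper: every patch token attends over exemplar tokens.

    patch_tokens:    list of N vectors of dimension d (queries)
    exemplar_tokens: list of M vectors of dimension d (keys and values)
    Returns (outputs, weights): outputs has N vectors of dimension d,
    weights has N rows of M attention weights, each row summing to 1.
    """
    d = len(patch_tokens[0])
    scale = 1.0 / math.sqrt(d)
    # Similarity of each patch to each exemplar, normalised per patch.
    weights = [
        softmax([dot(p, e) * scale for e in exemplar_tokens])
        for p in patch_tokens
    ]
    # Each output is the attention-weighted mix of the exemplar tokens.
    outputs = [
        [sum(w * e[j] for w, e in zip(row, exemplar_tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Toy data (assumed): three patch tokens and two exemplar tokens.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
exemplars = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = cross_attention(patches, exemplars)
```

In this toy setup, a patch that resembles the first exemplar puts more attention weight on it than on the second, which is the sense in which attention acts as a similarity measure here.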

Related papers

Object Counting with GPT-4o and GPT-5: A Comparative Study [2.624902795082451]
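Zero-shot object counting aims to estimate the number of object instances belonging to novel categories that the vision model performing the counting never encountered during training. Existing methods typically require large amounts of annotated data and often need visual exemplars to guide the counting process. Large language models (LLMs) are powerful tools with remarkable reasoning and data-understanding capabilities.
Paper reference translation (metadata) (2025-12-02T21:07:13Z)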
CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting [0.0]
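Humans can effortlessly count a wide variety of objects by perceiving visual repetition and structural relationships rather than relying on class identity. In this work, we introduce CountFormer, a transformer-based framework that learns to recognize repetition and structural coherence for class-agnostic object counting.
Paper reference translation (metadata) (2025-10-27T19:16:02Z)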
Causal Image Modeling for Efficient Visual Understanding [41.87857129429512]
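We introduce the Adventurer series of models, which treat images as sequences of patch tokens and use uni-directional language models to learn visual representations. This modeling paradigm allows images to be processed in a recurrent formulation with complexity linear in the sequence length. We propose two simple designs that seamlessly integrate image inputs into the causal inference framework.
Paper reference translation (metadata) (2024-10-10T04:14:52Z)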
CountGD: Multi-Modal Open-World Counting [54.88804890463491]
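This paper aims to improve the generality and accuracy of open-vocabulary object counting in images. We introduce CountGD, the first open-world counting model.
Paper reference translation (metadata) (2024-07-05T16:20:48Z)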
Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
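Humans can accurately classify new, unseen images after being exposed to only a few examples. For artificial neural network models, determining the most relevant features for distinguishing two images from limited samples is a challenge. We propose an intra-task mutual attention method that divides support and query samples into patches.
Paper reference translation (metadata) (2024-05-06T02:02:57Z)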
With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473]
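We devise a network that can attend to activations obtained while processing other training samples. Our memory models the distribution of past keys and values through the definition of prototype vectors. We show that this approach can improve the performance of an encoder-decoder Transformer by 3.7 CIDEr points.
Paper reference translation (metadata) (2023-08-23T18:53:00Z)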
Counting Like Human: Anthropoid Crowd Counting on Modeling the Similarity of Objects [92.80955339180119]
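Mainstream crowd counting methods regress a density map and integrate it to obtain the count. Inspired by this, we propose a rational and anthropoid crowd counting framework.
Paper reference translation (metadata) (2022-12-02T07:00:53Z)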
ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
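We propose a content-based sparse attention method as an alternative to dense self-attention. Specifically, we cluster and aggregate key and value tokens as a content-based way of reducing the total number of tokens. The resulting clustered token sequence retains the semantic diversity of the original signal but can be processed at lower computational cost.
Paper reference translation (metadata) (2022-08-28T04:18:27Z)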
Shelf-Supervised Mesh Prediction in the Wild [54.01373263260449]
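We propose a learning-based approach that infers the 3D shape and pose of an object from a single image. We first infer a volumetric representation in a canonical frame, along with the camera pose. The coarse volumetric prediction is then converted to a mesh-based representation, which is further refined in the predicted camera frame.
Paper reference translation (metadata) (2021-02-11T18:57:10Z)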
Sequential View Synthesis with Transformer [13.200139959163574]
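We introduce a sequential rendering decoder that predicts an image sequence, including the target view, based on the learned representations. We evaluate the model on various challenging datasets and show that it not only gives consistent predictions but also requires no retraining for fine-tuning.
Paper reference translation (metadata) (2020-04-09T14:15:27Z)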

The related papers list is generated automatically from the titles and abstracts of papers on this site.

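This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from its use (including all information and translations).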