Visual grounding localizes regions (boxes or segments) in the image
corresponding to given referring expressions. In this work we address image
segmentation from referring expressions, a problem that has so far only been
addressed in a fully-supervised setting. A fully-supervised setup, however,
requires pixel-wise supervision and is hard to scale given the expense of
manual annotation. We therefore introduce a new task of weakly-supervised image
segmentation from referring expressions and propose Text grounded semantic
SEGmentation (TSEG) that learns segmentation masks directly from image-level
referring expressions without pixel-level annotations. Our transformer-based
method computes patch-text similarities and guides the classification objective
during training with a new multi-label patch assignment mechanism. The
resulting visual grounding model segments image regions corresponding to given
natural language expressions. Our approach TSEG demonstrates promising results
for weakly-supervised referring expression segmentation on the challenging
PhraseCut and RefCOCO datasets. TSEG also shows competitive performance when
evaluated in a zero-shot setting for semantic segmentation on Pascal VOC.
1 Introduction
Image segmentation is a key component for a wide range of applications including virtual presence, virtual try-on, movie post-production and autonomous driving.
While most of this work addresses semantic segmentation, the more general problem of visual grounding beyond segmentation of pre-defined object classes remains open.
Moreover, the majority of existing methods assume full supervision and require costly pixel-wise manual labeling of training images, which prevents scalability.
Manual supervision has been recognized as a bottleneck in many vision tasks including object detection [5,30,37] and segmentation [2,3,18,68], text-image and text-video matching [44,47] and human action recognition [6,19].
Fig. 1: Given an image and a set of referring expressions such as man sitting on grass and wooden stairway, TSEG segments the image regions corresponding to the input expressions.
Here we show results of our approach TSEG for a test image of the PhraseCut dataset.
Contrary to other existing methods, TSEG only uses image-level referring expressions during training and hence does not require pixel-wise supervision.
In particular, weakly-supervised methods for image segmentation avoid the costly pixel-wise annotation and limit supervision to image-level labels [2,3,18,68].
Despite the promise of scalability, existing approaches to referring expression segmentation require pixel-wise annotation and, hence, remain limited by the size of existing datasets.
Our work aims to advance image segmentation beyond limitations imposed by the pre-defined sets of object classes and the costly pixel-wise manual annotations.
In particular, existing weakly-supervised methods for image segmentation typically rely on the completeness of image-level labels, i.e., the absence of a car in the annotation implies its absence in the image.
This completeness assumption does not hold for referring expression segmentation.
Furthermore, the vocabulary is open and compositional.
To address the above challenges and to learn segmentation from text-based image-level supervision, we introduce a new global weighted pooling mechanism denoted as Multi-label Patch Assignment (MPA).
Our method for Text grounded semantic SEGmentation (TSEG) incorporates MPA and extends the recent transformer-based Segmenter architecture [53] to referring expression segmentation.
We validate our method and demonstrate its encouraging results for the task of weakly-supervised referring expression segmentation on the challenging PhraseCut [59] and RefCOCO [65] datasets.
In summary, our work makes the following three contributions.
(i) We introduce the new task of weakly-supervised referring expression segmentation and propose an evaluation based on the PhraseCut and RefCOCO datasets.
Furthermore, we demonstrate competitive results for zero-shot semantic segmentation on PASCAL VOC.
2 Related Work
Weakly-supervised semantic segmentation.
Given an image as input, the goal of semantic segmentation is to identify and localize classes present in the image, e.g., to annotate each pixel of the input image with a class label.
Zhou et al. [68] use Class Activation Maps (CAMs) of a Fully Convolutional Network (FCN) combined with Global Average Pooling (GAP) to obtain segmentation maps.
As CAMs tend to focus on the most discriminative object parts [57], recent methods deploy more elaborate multi-stage approaches using pixel affinity [1,2], saliency estimation [17,18,27,35,56,66] or seed-and-expand strategies [27,33,57].
While these methods provide improved segmentation, they require multiple standalone and often expensive networks such as saliency detectors [18,27,66] or segmentation networks based on pixel-level affinity [1,2].
Single-stage methods have been developed based on multiple instance learning (MIL) [46] or expectation-maximization (EM) [45] approaches, where masks are inferred from intermediate predictions.
Single-stage methods had been overlooked due to their inferior accuracy until the work of Araslanov et al. [3], which proposed an efficient single-stage method addressing the limitations of CAMs.
Araslanov et al. [3] introduce a global weighted pooling (GWP) mechanism, which we extend in this work with a new multi-label patch assignment mechanism (MPA).
In contrast to prior work on weakly-supervised semantic segmentation, TSEG is a single-stage method that scales to the challenging task of referring expression segmentation.
Referring expression segmentation.
Given an image and a referring expression, the goal of referring expression segmentation is to annotate the input image with a binary mask localizing the referring expression.
To overcome the limitation of FCN to model global context and learn richer cross-modal features, state-of-the-art approaches [13,25,63] use a decoding scheme based on cross-modal attention.
Despite their effectiveness, these methods are fully-supervised, which limits their scalability.
Several weakly-supervised approaches tackle detection tasks such as referring expression comprehension [7,21,40,41,60] by enforcing visual consistency [7], learning language reconstruction [40] or with a contrastive-learning objective [21].
These methods rely on an off-the-shelf object detector, Faster-RCNN [49], to generate region proposals and are thus limited by the object detector accuracy.
TSEG is a novel approach that tackles weakly-supervised referring expression segmentation based on the computation of patch-text similarities with a new multi-label patch assignment mechanism (MPA).
Transformer-based methods capture long-range dependencies among tokens (patches or words) with an attention mechanism and achieve impressive results in the context of vision-language pre-training at scale with methods such as CLIP [47], VisualBERT [38], DALL-E [48] or ALIGN [28].
Specific to referring expressions, MDETR [29] recently proposed a method for visual grounding based on a cross-modal transformer decoder trained on a fully-supervised visual grounding task.
Most similar to our work, GroupViT [61] relies on a large dataset of 30M image-text pairs to learn segmentation masks from text supervision, but the objective function and model architecture are different.
TSEG builds on CLIP [47] and uses separate encoders for different modalities with a cross-modal late-interaction mechanism.
Its segmentation module builds on Segmenter [53] which shows that interpolating patch features output by a Vision Transformer (ViT) [15] is a simple and effective way to perform semantic segmentation.
Here, we extend this work to perform cross-modal segmentation.
TSEG leverages a novel patch-text interaction mechanism to compute both image-text matching scores and pixel-level text-grounded segmentation maps in a single forward pass.
Fig. 2: Overview of our approach TSEG.
(Left) Image patches and referring expressions are mapped with transformers to patch and text embeddings and then compared by computing patch-text cosine similarity scores.
(Right - Training) Our global pooling mechanism with multi-label patch assignment (MPA) reduces patch-text similarity scores to image-level labels to train the model for referring expression classification.
(Right - Inference) Sequences of patch scores (columns) are rearranged into 2D masks and bilinearly interpolated to obtain pixel-level referring expression masks.
3 Method
TSEG takes as input an image and a number of referring expressions and outputs a confidence score (Fig. 2, top-right) along with a segmentation mask (Fig. 2, bottom-right) for each referring expression.
Image patches and referring expressions are mapped to patch embeddings (x_1, ..., x_N) and text embeddings (y_1, ..., y_L) that are compared with cosine similarity. The resulting similarity matrix is S = (x_i · y_j)_{i,j} ∈ R^{N×L}.
See Figure 2, left.
Image encoder.
An image I ∈ R^{H×W×C} is split into a sequence of patches of size (P, P).
Each image patch is then linearly projected and a position embedding is added to produce a sequence of patch tokens (p_1, ..., p_N) ∈ R^{N×D_I}, where N = HW/P² is the number of patches and D_I is the number of features. A transformer encoder then maps these tokens to a sequence of contextualized patch embeddings (x_1, ..., x_N).
Text encoder.
For each referring expression t_j, which can consist of multiple words, we extract one token y_j.
To do so, the text t_j is tokenized using lower-case byte pair encoding (BPE) [51], and [BOS] and [EOS] tokens are added to the beginning and the end of the sequence.
A sequence of position embeddings is added and a transformer encoder maps the input sequence to a sequence of contextualized word tokens, from which the [BOS] token is extracted to serve as a global text representation y_j ∈ R^{D_T}.
The visual and textual tokens are linearly projected to a multi-modal common embedding space and L2-normalized.
From the patch tokens (x_1, ..., x_N) and the global text tokens (y_1, ..., y_L), we compute patch-text cosine similarities as scalar products and obtain the similarity matrix S = (x_i · y_j)_{i,j} ∈ R^{N×L}.
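As an illustration, the following PyTorch-style sketch shows how such patch-text similarities can be computed; the projection layers, feature dimensions and random inputs are illustrative assumptions rather than the exact TSEG implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the exact TSEG code): patch features from an image
# encoder and one global feature per referring expression from a text encoder
# are projected to a common space, L2-normalized and compared with dot products.

def patch_text_similarity(patch_tokens, text_tokens, proj_img, proj_txt):
    """patch_tokens: (N, D_I) ViT patch features for one image.
    text_tokens: (L, D_T) [BOS] features, one per referring expression.
    proj_img, proj_txt: linear layers mapping both modalities to dimension D."""
    x = F.normalize(proj_img(patch_tokens), dim=-1)  # (N, D)
    y = F.normalize(proj_txt(text_tokens), dim=-1)   # (L, D)
    return x @ y.t()                                 # (N, L) cosine similarities S

# Usage with random features standing in for encoder outputs.
N, L, D_I, D_T, D = 576, 4, 384, 512, 256
proj_img, proj_txt = torch.nn.Linear(D_I, D), torch.nn.Linear(D_T, D)
S = patch_text_similarity(torch.randn(N, D_I), torch.randn(L, D_T), proj_img, proj_txt)
print(S.shape)  # torch.Size([576, 4])
```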
3.2 Global Pooling Mechanisms
To leverage image-level text supervision, we need to map the matrix S ∈ R^{N×L} of patch-text similarities to an image-level score for each referring expression, i.e., z ∈ R^L.
The score vector z allows us to compute a classification loss using ground truth referring expressions.
Note that we cannot compute per-pixel losses given the lack of pixel-wise supervision in weakly-supervised settings.
Global average and max pooling (GAP-GMP).
A straightforward way of pooling is global average pooling (GAP), where we average the similarities for a given referring expression over all patches of an image, z_j^{GAP} = (1/N) Σ_{i=1}^{N} S_{i,j}. Global max pooling (GMP) instead takes the maximum similarity over all patches, z_j^{GMP} = max_i S_{i,j}.
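For reference, both baseline poolings reduce to a single reduction over the patch dimension; the sketch below assumes the (N, L) similarity matrix S defined above.

```python
import torch

# Baseline poolings over the patch dimension of S (N patches x L expressions).
def gap_scores(S):
    return S.mean(dim=0)        # z_j^{GAP} = (1/N) sum_i S_ij, shape (L,)

def gmp_scores(S):
    return S.max(dim=0).values  # z_j^{GMP} = max_i S_ij, shape (L,)
```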
Fig. 3: (Left) A patch assignment mechanism computes masks from patch-text similarities; the masks are used as weights in the global weighted pooling.
Single-label patch assignment (SPA) assigns each patch to one of the L referring expressions or to the background. The masks are then soft assignments with Σ_{j=0}^{L} m_{i,j} = 1 for any patch i. This patch assignment can be viewed as multi-class classification, which is typical for semantic segmentation, where one pixel is matched to a single label, as proposed by [3].
In contrast, with our multi-label patch assignment (MPA), each patch i can be assigned a score m^{MPA}_{i,j} independently for every referring expression j. Patch assignment is then viewed as a multi-label classification problem; this property is highly beneficial when performing weakly-supervised referring expression segmentation, as shown in Section 4.
Image-text scores.
We compute GWP scores z^{GWP} with (4) using the masks M defined according to one of the assignment mechanisms defined in (5), (6). Then, we compute mask size scores z^{size} as

z_j^{size} = (1 − m̄_j)^p log(λ + m̄_j),   (7)

with m̄_j = (1/N) Σ_{i=1}^{N} m_{i,j}. This z_j^{size} is a size-penalty term introduced by [3] to enforce mask completeness, e.g., z_j^{size} < 0 for small masks. The magnitude of this penalty is controlled by λ. Due to the normalization, the GWP score z_j^{GWP} is invariant to the mask size and z_j^{size} enforces masks to be complete. The final score defining the presence of a referring expression t_j in the image is defined as the sum

z_j = z_j^{GWP} + z_j^{size}.   (8)
3.3 Training and inference
In the following, we describe our weakly-supervised and fully-supervised training procedures.
Furthermore, we present the approach used for inference.
Weakly-supervised learning. Weakly-supervised segmentation is usually addressed on datasets with a fixed number of classes.
In our case, where visual entities in the image are defined by referring expressions, we use referring expressions of samples in a mini-batch as positive and negative examples.
Finally, we optimize over the scores to match ground-truth pairings ẑ with the multi-label soft-margin loss function [2,3,58] as a classification loss,

L_class = −(1/L) Σ_{j=1}^{L} [ ẑ_j log σ(z_j) + (1 − ẑ_j) log(1 − σ(z_j)) ],

where σ(x) = 1/(1 + exp(−x)) is the sigmoid function. The loss encourages z_j > 0 for positive image-text pairs and z_j < 0 for negative pairs.
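A minimal sketch of this classification objective over a mini-batch is given below; the dense B × K pairing layout and helper names are our assumptions, while the loss itself is PyTorch's multi-label soft-margin loss.

```python
import torch
import torch.nn.functional as F

# Sketch of the weakly-supervised objective: image-level scores z for all
# (image, referring expression) pairings of the mini-batch are pushed positive
# for correct pairings and negative otherwise.

def classification_loss(z, targets):
    """z: (B, K) scores of K candidate expressions for each of B images.
    targets: (B, K) ground-truth pairing matrix with entries in {0, 1}."""
    return F.multilabel_soft_margin_loss(z, targets)

B, K = 8, 24                           # e.g., 8 images with 3 sampled expressions each
z = torch.randn(B, K, requires_grad=True)
targets = torch.zeros(B, K)
for b in range(B):
    targets[b, 3 * b:3 * b + 3] = 1.0  # positives: the expressions of image b
loss = classification_loss(z, targets)
loss.backward()
```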
Fully-supervised learning. In the fully-supervised case, segmentation is learned from a dataset of images annotated with referring expressions and their corresponding segmentation masks.
Only positive referring expressions (y_1, ..., y_L) are passed to the text encoder and the similarity matrix S is bilinearly interpolated to obtain pixel-level similarities of shape R^{H×W×L}.
Then, we minimize the Dice loss between the sigmoid of the pixel-level similarities M̂ = σ(S) and the ground-truth masks M:

L_dice(M̂, M) = 1 − 2 |M̂ ∩ M| / (|M̂| + |M|),   (9)

where |M| = Σ_{i,j} m_{i,j} and M̂ ∩ M = (m̂_{i,j} m_{i,j})_{i,j}.
Inference. To produce segmentation masks, we reshape the patch-text masks M ∈ R^{N×L} into a 2D map and bilinearly interpolate it to the original image size to obtain pixel-level masks of shape R^{H×W×L}.
For SPA, pixel annotations are obtained by adding a background mask to M and applying an argmax over the referring expressions.
For MPA, we threshold the values of M using the background score.
For GAP and GMP, we follow the standard approach from [2] to compute the masks M. Directly interpolating patch-level similarity scores to generate segmentation maps has been proven effective by Segmenter [53] in the context of semantic segmentation.
Our decoding scheme is an extension of Segmenter linear decoding where the set of fixed class embeddings is replaced by text embeddings.
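The inference step can be sketched as follows; the patch-grid size, image resolution and the 0.5 background threshold used for MPA are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masks_to_pixels(M, grid_hw, image_hw):
    """M: (N, L) patch-text masks with N = grid_h * grid_w; returns (H, W, L)."""
    gh, gw = grid_hw
    maps = M.t().reshape(1, -1, gh, gw)                # (1, L, gh, gw) 2D maps
    maps = F.interpolate(maps, size=image_hw, mode="bilinear", align_corners=False)
    return maps[0].permute(1, 2, 0)                    # pixel-level masks (H, W, L)

M = torch.sigmoid(torch.randn(24 * 24, 4))             # e.g., 384x384 image, 16x16 patches
pixel_masks = masks_to_pixels(M, grid_hw=(24, 24), image_hw=(384, 384))
binary_masks = pixel_masks > 0.5                       # MPA-style thresholding (assumed value)
```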
4 Experiments
Pascal VOC. The dataset contains 10.5K images for training and 1.5K images for validation.
PhraseCut. PhraseCut [59] is the largest referring expression segmentation dataset with 77K images annotated with 345K referring expressions from Visual Genome [34].
Compared to RefCOCO(+), RefCOCOg has longer sentences and richer vocabulary.
Metrics. We follow previous work and report mean Intersection over Union (mIoU) for all Pascal classes.
For referring expression segmentation, we use standard metrics where mIoU is the IoU averaged over all image-region pairs, resulting in a balanced evaluation for small and large objects [65,59].
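Concretely, the metric averages the IoU of every (image, referring expression) pair, as in the sketch below with binary NumPy masks.

```python
import numpy as np

def pair_iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def mean_iou(pairs):
    """pairs: list of (predicted_mask, ground_truth_mask), one per image-region pair."""
    return float(np.mean([pair_iou(p, g) for p, g in pairs]))
```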
Our TSEG model contains an image encoder initialized with an ImageNet pre-trained Vision Transformer [15,52] and a text encoder initialized with a pre-trained BERT model [12].
We use ViT-S/16 [52] and BERT-Small [54] which are both expressive models achieving strong performance on vision and language tasks, while remaining fast and compact.
We found this learning scheme to be stable resulting in good results for all three datasets.
Regarding training iterations and the batch size, we use 16K iterations and batches of size 16 for Pascal, 80K iterations and batches of size 32 for RefCOCO, and 120K iterations with batches of size 32 for PhraseCut.
Multi-scale processing and CRF are used for inference.
When training on referring expressions, we randomly sample three positive expressions per image on average.
The resolution of images at train time is set to 384 × 384 and, following standard practice, we use random rescaling, horizontal flipping and random cropping.
The resolution of images at train time is 512 × 512.
4.3 State-of-the-art methods for weakly-supervised semantic segmentation
As we are the first to propose an approach for weakly-supervised learning for referring expression segmentation, we implemented state-of-the-art methods for weakly-supervised semantic segmentation to use as baselines.
We use three single-stage methods presented in Section 3.2, namely GMP [68], the seminal work GAP [2], and the more recent state-of-the-art approach SPA [3].
SPA performs close to the best two-stage weakly-supervised methods, DRS [31] and EPS [36], two more complex methods relying on off-the-shelf saliency detectors, which is not the focus of our work.
Table 1 reports the performance on the Pascal VOC 2012 dataset.
With a language model used for class encoding, as shown in Figure 2, we obtain performance similar to GAP [2] and SPA [3] using the same WideResNet38 backbone.
The models can directly be used to perform referring expression segmentation by replacing the class labels given as input to the language model with referring expressions.
We now perform weakly-supervised referring expression segmentation.
At train time the model has to maximize the score of the image and text embeddings of correct pairings while minimizing the score of incorrect pairings.
At test time, following the standard visual grounding setting, the model is given as input the set of referring expressions present in the image and outputs a mask for each referring expression.
In Table 2c, we consider different definitions for the ground truth.
In the identity setup, two referring expressions of a batch are considered the same if they exactly match.
In the tf-idf setup, the similarity between two referring expressions is computed according to a tf-idf score.
If a tabby cat is present in an image and there is a brown cat in a second image, the ground truth score for brown cat in the first image will be positive because both referring expressions share the word cat.
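A sketch of this tf-idf ground-truth construction is given below; the scikit-learn vectorizer settings and the binarization threshold are our assumptions, with threshold 0 meaning that any shared term yields a positive pairing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_pairings(expressions, threshold=0.0):
    """Ground-truth pairing matrix between referring expressions of a mini-batch."""
    X = TfidfVectorizer().fit_transform(expressions)  # rows are L2-normalized tf-idf vectors
    sim = (X @ X.T).toarray()                         # cosine similarities between expressions
    return (sim > threshold).astype(float)

targets = tfidf_pairings(["tabby cat", "brown cat", "wooden stairway"])
# "tabby cat" and "brown cat" are paired because they share the word "cat".
```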
Fig. 4: Comparison of different pooling mechanisms for weakly-supervised segmentation from referring expressions on example images from the PhraseCut dataset.
Table 2d reports the validation score for an increasing training dataset size.
We observe that TSEG improves with the dataset size, a desirable property of a weakly-supervised segmentation approach, where annotations are much cheaper to collect than in the fully-supervised case.
Finally, Table 2e reports results when pretraining the visual backbone on only ImageNet for classification or by additionally pretraining the visual and language model on RefCOCO for visual grounding.
For pretraining on COCO we use box ground truth annotations as follows.
The model is given as input an image and referring expressions to detect; for each referring expression, the model predicts the patches that lie within the corresponding object bounding box.
We observe that leveraging detection-related information as pretraining improves the result by 3%.
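The conversion from box annotations to patch-level targets can be sketched as follows; the patch-center criterion and the coordinate convention are illustrative assumptions.

```python
import torch

def box_to_patch_targets(box, grid_hw, image_hw):
    """box: (x1, y1, x2, y2) in pixels; returns a (grid_h * grid_w,) 0/1 target vector."""
    gh, gw = grid_hw
    H, W = image_hw
    ys = (torch.arange(gh) + 0.5) * (H / gh)           # patch-center rows
    xs = (torch.arange(gw) + 0.5) * (W / gw)           # patch-center columns
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    x1, y1, x2, y2 = box
    inside = (cx >= x1) & (cx <= x2) & (cy >= y1) & (cy <= y2)
    return inside.flatten().float()

targets = box_to_patch_targets((32, 48, 200, 180), grid_hw=(24, 24), image_hw=(384, 384))
```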
In the following we report results with ImageNet pretraining only, following standard practice from the weakly-supervised semantic segmentation literature.
We now compare TSEG on referring expression datasets to the weakly-supervised state-of-the-art methods presented in Section 4.3; we report results in Table 3 and show qualitative results in Figure 4.
PhraseCut: GMP and GAP achieve an mIoU of 5.7 and 9.3 respectively, showing that it is possible to learn meaningful masks using referring expressions as labels.
This improvement can partly be explained by the fact that our objective allows multiple masks to overlap by design, a highly desirable property that is not satisfied by GMP, GAP and SPA.
From Figure 4d we observe that MPA generates more complete masks with both higher recall, e.g., the thumb on bun instance is detected, and higher precision, e.g., masks achieve better completeness as for the sitting woman instance.
Using CRF [8] further improves the performance to 30.12 mIoU.
Qualitative results are presented in Figure 5.
To obtain an upper-bound, we also train TSEG with full supervision and obtain a 49.6 mIoU.
This is close to the best fully-supervised method, MDETR [29], which obtains 53.1 mIoU while pretraining on a much larger dataset annotated for visual grounding and using a higher training resolution.
While there is still a gap compared to full supervision, we believe our proposed results to be promising and the first step towards large-scale weakly supervised referring expression segmentation.
There is a larger gain from using full supervision than on PhraseCut.
This could be explained by more fine-grained referring expressions such as broccoli stalk that is pointing up and is touching a sliced carrot or a darker brown teddy bear in a row of lighter teddy bears that are harder to localize without pixel-level supervision.
We evaluate the ability of TSEG to detect and localize visual concepts from text supervision by performing zero-shot experiments on the Pascal VOC 2012 dataset, see Fig. 6.
We then pass the names of Pascal classes as input to the text encoder and obtain segmentation masks and confidence scores for all 20 object classes in each image.
Interestingly, TSEG performs well on all classes except the person class.
As can be observed from Figure 7, the model does not detect the person label, but can be improved with label engineering by using more specific labels for the text encoder, such as woman and rider.
The model detects it by using more specific labels such as rider or woman (column 3, pink).
Column 4 shows the ground truth.
This bias partly comes from the annotations of the PhraseCut training set, and we believe that the need for label engineering may be reduced by training TSEG on a larger dataset with richer text annotations.
On the person class, by passing person as input to the text encoder we obtain an IoU of 0.6, while by merging masks for the words man, woman, men, women, child, boy, girl and baby we improve the IoU to 30.4.
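A minimal sketch of this mask merging is shown below; merging soft masks by a pixel-wise maximum is our assumption of a simple union.

```python
import torch

PERSON_WORDS = ["man", "woman", "men", "women", "child", "boy", "girl", "baby"]

def merge_person_masks(masks_by_word, words=PERSON_WORDS):
    """masks_by_word: dict word -> (H, W) soft mask in [0, 1]; returns merged (H, W) mask."""
    stacked = torch.stack([masks_by_word[w] for w in words if w in masks_by_word])
    return stacked.max(dim=0).values
```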
By performing label engineering, TSEG reaches 50.3 mIoU.
In comparison, GroupViT [61] reports an mIoU of 51.2, but it has been trained on a much larger dataset of 30M image-text pairs and was designed for zero-shot segmentation.
TSEG performs comparably to GroupViT, while being trained on only 350K image-text pairs.
This demonstrates the ability of our approach to learn general visual concepts accurately.
5 Acknowledgements
This work was partially supported by the HPC resources from GENCI-IDRIS (Grant 2021-AD011011163R1), the Louis Vuitton ENS Chair on Artificial Intelligence, and the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).
6 Conclusion
This work introduces TSEG for weakly-supervised referring expression segmentation.
We propose a multi-label patch assignment (MPA) mechanism that improves previous methods by a margin on this task.
Future work will address how to reduce the performance gap between weakly supervised and fully supervised methods and segment regions directly from image captions.
References
1. Ahn, J., Cho, S., Kwak, S.: Weakly supervised learning of instance segmentation with inter-pixel relations. In: CVPR (2019)
2. Ahn, J., Kwak, S.: Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In: CVPR (2018)
3. Araslanov, N., Roth, S.: Single-stage semantic segmentation from image labels. In: CVPR (2020)
4. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., Schmid, C.: ViViT: A video vision transformer. In: ICCV (2021)
5. Bilen, H., Pedersoli, M., Tuytelaars, T.: Weakly supervised object detection with posterior regularization. In: BMVC (2014)
6. Bojanowski, P., Lajugie, R., Bach, F., Laptev, I., Ponce, J., Schmid, C., Sivic, J.: Weakly supervised action labeling in videos under ordering constraints. In: ECCV (2014)
7. Chen, K., Gao, J., Nevatia, R.: Knowledge aided consistency for weakly supervised phrase grounding. In: CVPR (2018)
8. Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs.
15. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
16. Everingham, M., Gool, L.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The Pascal visual object classes (VOC) challenge. IJCV 88(2), 303–338 (2010)
17. Fan, R., Cheng, M., Hou, Q., Mu, T., Wang, J., Hu, S.: S4Net: Single stage salient-instance segmentation. In: CVPR (2019)
18. Fan, R., Hou, Q., Cheng, M., Yu, G., Martin, R.R., Hu, S.: Associating inter-image salient instances for weakly supervised semantic segmentation. In: ECCV (2018)
19. Ghadiyaram, D., Tran, D., Mahajan, D.: Large-scale weakly-supervised pre-training for video action recognition. In: CVPR (2019)
20. Ghiasi, G., Gu, X., Cui, Y., Lin, T.: Open-vocabulary image segmentation. CoRR (2021)
21. Gupta, T., Vahdat, A., Chechik, G., Yang, X., Kautz, J., Hoiem, D.: Contrastive learning for weakly supervised phrase grounding.
42. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV (2021)
43. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
44. Miech, A., Alayrac, J.B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-to-end learning of visual representations from uncurated instructional videos. In: CVPR (2020)
45. Papandreou, G., Chen, L., Murphy, K.P., Yuille, A.L.: Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In: ICCV (2015)
46. Pinheiro, P.H.O., Collobert, R.: From image-level to pixel-level labeling with convolutional networks. In: CVPR (2015)
47. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)
48. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I.: Zero-shot text-to-image generation. In: ICML (2021)
49. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. PAMI 39(6), 1137–1149 (2017)
50. Robbins, H., Monro, S.: A stochastic approximation method. Annals of Mathematical Statistics (1951)
51. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: ACL (2016)
52. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
53. Strudel, R., Pinel, R.G., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV (2021)
54. Turc, I., Chang, M., Lee, K., Toutanova, K.: Well-read students learn better: The impact of student initialization on knowledge distillation.
Semantic segmentation in-the-wild without seeing any segmentation examples. CoRR (2021)
68. Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR (2016)
69. Zhou, C., Loy, C.C., Dai, B.: DenseCLIP: Extract free dense labels from CLIP. CoRR (2021)
7 Appendix
Qualitative results.
We present additional qualitative results in Figures 8 and 9.
In particular, we compare TSEG trained with weak supervision to the same model trained with full supervision in Figure 8.
TSEG captures cloth-related concepts, animals and body parts reasonably well; however, it can fail at capturing colors, at distinguishing between a book and a laptop, or between blue jeans and other types of trousers.