TextMatcher: Cross-Attentional Neural Network to Compare Image and Text

We study a novel multimodal-learning problem, which we call text matching:
given an image containing a single-line text and a candidate text
transcription, the goal is to assess whether the text represented in the image
corresponds to the candidate text. We devise the first machine-learning model
specifically designed for this problem. The proposed model, termed TextMatcher,
compares the two inputs by applying a cross-attention mechanism over the
embedding representations of image and text, and it is trained in an end-to-end
fashion. We extensively evaluate the empirical performance of TextMatcher on
the popular IAM dataset. Results attest that, compared to a baseline and
existing models designed for related problems, TextMatcher achieves higher
performance on a variety of configurations, while at the same time running
faster at inference time. We also showcase TextMatcher in a real-world
application scenario concerning the automatic processing of bank cheques.
The research field of multimodal learning is an active and challenging one.
It includes numerous (classes of) tasks – such as multimodal representation learning, modality translation, multimodal alignment, multimodal fusion, and co-learning – and finds application in a wide range of scenarios – such as audio-visual speech recognition, image/video captioning, media description, and multimedia retrieval [2].
In this paper, we introduce a novel multimodal-learning problem, which we term text matching: given an image representing a single line of text (printed or handwritten) and a candidate text transcription, assess whether the text inside the image corresponds to the candidate text.
Applications. The need for designing an ad-hoc model for this task comes from a series of real applications, in which an image containing text is associated with the corresponding text that needs to be verified.
As a first example, consider a software which handles a user-registration procedure.
This kind of software typically needs to collect information regarding personal identity documents.
The user is asked to upload an image of her document and also to enter data that is written in the document, such as document identifier, expiration date, and so on.
At a later stage, back-office operators check if there is a match between the uploaded document and the data inserted in the form, and, based on the outcome of the match, they accept or reject the registration.
Another noteworthy application is the case of bank cheques deposited to an automated teller machine (ATM).
In this context, the user is typically required to insert the cheque into the ATM and, at the same time, fill in some information written on it, such as date, amount, and beneficiary.
Again, the match between what is written on the cheque and the data entered by the user is verified a posteriori by back-office operators, who would benefit from a method that performs this check automatically.
To the best of our knowledge, the text-matching problem has never been studied in the literature before.
The text matching task is related to text recognition, which has been studied extensively in multiple forms, including Optical Character Recognition for documents and Scene Text Recognition for text in natural scenes [3].
It is apparent that text recognition is more difficult than text matching, as it needs to recognize the text within the input image from scratch, rather than simply assessing whether it matches a candidate text.
For this reason, an immediate way to tackle text matching would be to use a text-recognition method to extract the text within the input image and then simply compare the extracted text with the input candidate text.
The proposed TextMatcher, instead, compares the two inputs directly. This is performed by projecting input image and text into separate embedding spaces.
Then, a cross-attention mechanism is employed, which aims to discover local alignments between the characters of the text and the vertical slices of the image.
The ultimate similarity score produced by the model is a weighted cosine similarity between features of the characters and features of the slices of the image, where the weights are the computed attention scores.
The model is trained in an end-to-end fashion and, thanks to the cross-attention mechanism, it produces consistent embedding spaces for both image and text.
Multimodal learning. Referring to the taxonomy reported in the survey [2], the category that better complies with our text-matching problem is the (implicit) alignment one, which encompasses multimodal-learning problems whose goal is to identify relationships between sub-elements from different modalities, possibly as an intermediate step for another task.
To the best of our knowledge, text matching has never been the object of study in the literature before.
As such, there are no prior works that specifically focus on text matching.
In the remainder of this section, we therefore overview the literature of related (but still different) problems.
Text recognition. Recognizing text in images has been an active research topic for decades.
A plethora of different approaches exist.
A major state-of-the-art text-recognition model, which we take as a reference in this work, is ASTER [10, 11], an end-to-end neural network that uses an attentional sequence-to-sequence model to predict a character sequence directly from the input image.
1 The automatic-cheque-processing use case was investigated and developed as a real application at a well-established bank, and is currently used in production.
The main difference between text recognition and our text-matching problem is that the former extracts text from images without relying on any input candidate text.
A naïve approach to text matching would be to run a text-recognition method on the input image, and use the input candidate text only to check the correspondence with the recognized text.
A major limitation of this approach is that it disregards the candidate text altogether while processing the image, thus being intuitively less effective than approaches that, like the proposed TextMatcher, are specifically designed for text matching and profitably exploit the candidate text. In particular:
• While the text-recognition model is trained only on the matching data, TextMatcher is fed with both positive and negative examples during training, allowing it to better learn the frontier between the two sets whenever it is relevant, for instance when a difference of a single character can be important for a large portion of the dataset (e.g., “MR smith” vs. “MS Smith”).
• The similarity score we compute for the text-recognition model, described in Section 5.3, treats all character discrepancies equally and does not consider the similarity between the shapes of the characters; conversely, TextMatcher is able to assign similar embeddings to characters that look alike, especially when trained on handwritten data.
Image-text matching. The fundamental difference lies in the fact that the input images to image-text matching are general-purpose ones, i.e., they are not constrained to represent a (single-line) text, like in our text matching.
This makes image-text matching consider the semantic content of the image, whereas text matching looks solely at (the syntax of) the text within the image.
As a result, image-text matching is typically employed in applications that are far away from the ones targeted by text matching, such as generation of text descriptions from images or image search.
From a methodological point of view, image-text matching and text matching share more similarities, as both problems can in principle be approached with techniques that somehow involve learning a shared representation for the image and the text.
However, important technical differences remain between models designed for text matching, like the proposed TextMatcher, and approaches to image-text matching.
We discuss them in detail in the following. Among the prominent models for image-text matching are the ones proposed in [6, 8], which use a cross-attention mechanism to inspect the alignment between image regions and words in the sentence, and [13], which exploits the correlation of semantic roles with positions (those of objects in an image or words in a sentence).
The proposed TextMatcher uses attention as well, but, unlike [8, 13], it makes a simpler consideration of the horizontal position of a character in the image.
Also, while [6, 8, 13] use pretrained models to generate feature representations for the image regions, our TextMatcher is trained end-to-end, thus being capable of learning the weights of the convolutional layer alongside the attention layer.
This makes TextMatcher able to learn the most meaningful features associated with the shape of the characters and, at the same time, makes it sensitive to the font and handwriting style of the training set.
Finally, while the image-text-matching models in [6, 8, 13] use a recurrent neural network (RNN) to build a feature representation for the text, for our text-matching task we observed that a learnable embedding matrix over the characters of the alphabet is sufficient, and adding an RNN does not yield measurable advantages.
This is expected, as text matching is not concerned with the semantics of the text at hand.
3 Text Matching Problem
We tackle a multimodal-learning problem, which we term text matching and define as follows: given an image containing a single-line text (printed or handwritten), together with a candidate text transcription, assess whether the text inside the image corresponds to the candidate text.
This corresponds to a binary supervised-classification task, in which we are given a dataset of the form {((I^i, t^i), l^i) | i = 1, ..., n}, where I^i and t^i are the image and text inputs of the i-th example, and l^i is the corresponding binary label.
In particular, we adopt the following convention: an (image, text) pair is assigned the “1” label if image and text correspond, and, in this case, the pair is recognized as a matching pair.
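As a tiny illustration of this input format (the file names and words below are made-up placeholders, not taken from any actual dataset):

```python
# Toy instance of a text-matching dataset {((I^i, t^i), l^i) | i = 1, ..., n}:
# each example pairs a single-line text image with a candidate transcription
# and a binary label (1 = matching pair, 0 = non-matching pair).
dataset = [
    (("word_img_001.png", "lively"), 1),   # the image actually shows "lively"
    (("word_img_001.png", "field"), 0),    # same image, wrong candidate text
    (("word_img_002.png", "house"), 1),
]
```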
4 The TextMatcher Model
The image and the text are independently projected into separate embedding spaces, and then these embeddings are compared through a cross-attention mechanism.
The aim of the cross-attention mechanism consists in discovering local alignments between the characters of the input text and the vertical slices of the input image.
These blocks are jointly trained in an end-to-end fashion.
4.1 Image Embedding
In order to produce the image embedding, the input image is first resized to a fixed dimension, and then processed by some convolutional layers, followed by recurrent layers in order to also encode contextual information.
The resulting matrix I has a fixed dimension si × di and contains features related to specific receptive fields from the input image.
In particular, we want the model to analyse the input image by scanning its embedding features over the vertical slices: we denote si as the number of vertical receptive fields, or slices, from the original image, and di as the feature dimension.
In particular, the input image is fed into a set of convolutional layers and batch normalization layers, followed by a bidirectional Long Short-Term Memory (LSTM) module.
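As a rough sketch of an encoder of this kind, the following PyTorch snippet stacks convolutional and batch-normalization layers and a bidirectional LSTM; the layer sizes, the input resolution, and the ImageEncoder class are illustrative assumptions, while the actual model uses the ASTER encoder, which produces a 64 × 512 embedding (see Section 5).

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Minimal sketch: CNN + BiLSTM producing an (si x di) image embedding per example."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),    # halve height and width
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),    # halve height and width again
        )
        # bidirectional LSTM scanning the horizontal (slice) dimension
        self.rnn = nn.LSTM(input_size=128, hidden_size=hidden,
                           bidirectional=True, batch_first=True)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 1, H, W), already resized to a fixed dimension
        feats = self.conv(images)          # (B, C, H', W')
        feats = feats.mean(dim=2)          # collapse the height -> (B, C, W')
        feats = feats.permute(0, 2, 1)     # (B, W', C): one row per vertical slice
        out, _ = self.rnn(feats)           # (B, si, 2*hidden) = (B, si, di)
        return out

# Example: a batch of 8 grayscale images resized to 32 x 256
I = ImageEncoder()(torch.randn(8, 1, 32, 256))
print(I.shape)  # torch.Size([8, 64, 512])
```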
4.2 Text Embedding
Let A be the alphabet, which also includes a special character for the padding.
The embedding matrix Temb is a learnable matrix of dimension |A| × dt.
Given a text c1c2 . . . cl, we first pad it to a fixed length st (or truncate it if l > st).
Then each character is projected into the embedding space through the embedding matrix Temb, taking as embedding representation the row corresponding to the character.
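A minimal sketch of this embedding step is given below; the alphabet handling and the embed_text helper are illustrative assumptions, consistent with the hyperparameters reported in Section 5 (st = 20, dt = 512), and characters are assumed to belong to the alphabet.

```python
import torch
import torch.nn as nn

ALPHABET = list("abcdefghijklmnopqrstuvwxyz-'") + ["<pad>"]   # pad is a special character
CHAR2IDX = {c: i for i, c in enumerate(ALPHABET)}
PAD_IDX = CHAR2IDX["<pad>"]

s_t, d_t = 20, 512                        # max word length and text feature dimension
T_emb = nn.Embedding(len(ALPHABET), d_t)  # learnable |A| x dt embedding matrix

def embed_text(word):
    """Pad/truncate a word to length st and look up per-character embeddings."""
    chars = list(word[:s_t])                           # truncate if longer than st
    idx = [CHAR2IDX[c] for c in chars] + [PAD_IDX] * (s_t - len(chars))
    idx = torch.tensor(idx)
    pad_mask = (idx != PAD_IDX).float()                # 1 for real characters, 0 for padding
    return T_emb(idx), pad_mask                        # shapes: (st, dt), (st,)

T, mask = embed_text("lively")
print(T.shape, mask.sum())  # torch.Size([20, 512]) tensor(6.)
```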
Before the introduction of attention, sequence-to-sequence systems for machine translation suffered from the long-range dependency problems of Recurrent Neural Networks (RNNs), as their performance degrades rapidly as the length of the input sentence increases.
The attention mechanism overcomes these problems and, at the same time, allows giving more importance to some of the input words compared to others while translating the sentence.
Later, this mechanism has been widely applied in other applications concerning sequential inputs, including natural language processing, computer vision and speech processing.
Moreover, in [12] the attention scores are used to compute a weighted sum of the value vectors of each token in the sentence, while in our case the attention scores are used to compute a weighted sum of cosine similarities between each character and the slices of the image, since our goal is to compute a similarity between image and text.
Figure 2: Visual representation of the cross-attention mechanism.
Figure 3: Computation of the attention matrix.
First of all, in order to inject some positional information, we add independent positional embeddings to both image and text embeddings.
The positional embeddings have the same dimension as the corresponding text or image embedding, and so they can simply be added to it.
Inspired by [12], we use sine and cosine functions of different frequencies, where each dimension of the positional encoding corresponds to a sinusoid.
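The sketch below shows the standard sine/cosine construction of [12], which we assume here in its usual form (base 10000); the function name and the dimensions are illustrative.

```python
import torch

def sinusoidal_positions(length: int, dim: int) -> torch.Tensor:
    """Sinusoidal positional encodings as in [12]: each dimension is a sinusoid."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # (length, 1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)               # even dimensions
    freq = torch.pow(10000.0, -i / dim)                            # decreasing frequencies
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

# Same dimensions as the corresponding embeddings, so they can simply be added:
pos_t = sinusoidal_positions(20, 512)   # added to the text embedding T, then frozen
pos_i = sinusoidal_positions(64, 512)   # added to the image embedding I, then frozen
```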
From now on, with a little abuse of notation, we will consider T and I as the text and image embeddings with the addition of the positional embeddings.
Let us consider the perspective of the text: for each character of the text, we want to compute an attention score with respect to each vertical slice of the image embedding, in order to pay more attention to the portion of the image that is expected to contain the corresponding character.
The idea of this attention mechanism is depicted in Figure 2.
We compute attention scores between the embeddings of the characters and those of the vertical slices of the image by first projecting these vectors into separate embedding spaces of dimension datt, and then computing normalized dot products between all pairs of characters and slices of the image.
These vectors are packed together respectively into the query matrix Q = T Qt and the key matrix K = I Ki, where Qt and Ki are learnable parameters of dimension dt × datt and di × datt, respectively.
The resulting matrices are Q of dimension st × datt and K of dimension si × datt.
Then we compute the attention matrix of dimension st × si as the dot product between the query Q and the key K, and we apply a softmax function over the columns of the result, as illustrated in Figure 3:

A = softmax(Q K^T, dim = 1)   (1)

In this way, the i-th row of the attention matrix contains the normalized attention scores of the i-th character of the input text with respect to each vertical slice of the image embedding.
Then, the value vectors are used to compute a weighted cosine similarity between characters and steps of the image embedding.
First, value matrices are computed for both image and text embeddings:

Vtext = normalize(T Vt, dim = 1)   (2)
Vimage = normalize(I Vi, dim = 1)   (3)

with learnable parameters Vt and Vi of dimension dt × datt and di × datt, respectively.
The resulting matrices Vtext of dimension st × datt and Vimage of dimension si × datt are normalized over the columns in order to directly compute cosine similarities as their dot product. The cosine matrix C = Vtext Vimage^T has dimension st × si: the component (i, j) is the cosine similarity between the character at position i and the vertical slice of the image embedding at position j.
Then, the cosine matrix is multiplied element-wise with the attention matrix, and a sum over the columns is performed, in order to compute a weighted cosine similarity of each character with respect to each step of the image embedding:

Catt = sum(C ⊙ A, dim = 1)   (4)

where ⊙ stands for the element-wise multiplication.
Finally, the similarities not related to pad characters are summed up, obtaining the final similarity score between the input image and the candidate text: Stm = sum(Catt[pad = 1]).
Ultimately, given a threshold τ, the predicted binary label is given by:

l̂ = 1 if Stm ≥ τ, and l̂ = 0 if Stm < τ   (5)
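To make Eqs. (1)-(5) concrete, the following PyTorch-style sketch computes the similarity score for a single (image, text) pair; the value of datt, the threshold τ, and the random inputs are illustrative assumptions, and the learnable projections are shown as bare parameters rather than as a full module.

```python
import torch
import torch.nn.functional as F

s_t, d_t = 20, 512    # text: max length and feature dimension
s_i, d_i = 64, 512    # image: number of vertical slices and feature dimension
d_att = 256           # attention dimension (illustrative choice)

# learnable projection matrices Qt, Ki, Vt, Vi
Q_t = torch.nn.Parameter(torch.randn(d_t, d_att))
K_i = torch.nn.Parameter(torch.randn(d_i, d_att))
V_t = torch.nn.Parameter(torch.randn(d_t, d_att))
V_i = torch.nn.Parameter(torch.randn(d_i, d_att))

def textmatcher_score(T, I, pad_mask, tau=0.5):
    """Weighted cosine similarity between characters and image slices (Eqs. 1-5)."""
    Q = T @ Q_t                                   # (st, datt)
    K = I @ K_i                                   # (si, datt)
    A = torch.softmax(Q @ K.t(), dim=1)           # (st, si), Eq. (1): each row sums to 1
    V_text = F.normalize(T @ V_t, dim=1)          # (st, datt), Eq. (2): unit-norm rows
    V_image = F.normalize(I @ V_i, dim=1)         # (si, datt), Eq. (3)
    C = V_text @ V_image.t()                      # (st, si) cosine matrix
    C_att = (C * A).sum(dim=1)                    # (st,), Eq. (4): weighted cosine per character
    S_tm = (C_att * pad_mask).sum()               # sum only over non-pad characters
    label = int(S_tm >= tau)                      # Eq. (5): thresholding
    return S_tm, label

# Toy example with random embeddings and a 6-character word
T, I = torch.randn(s_t, d_t), torch.randn(s_i, d_i)
pad_mask = torch.tensor([1.0] * 6 + [0.0] * (s_t - 6))
print(textmatcher_score(T, I, pad_mask))
```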
4.4 Loss
The resulting TextMatcher network contains the following parameters: Wencoder, Temb, posi, post, Qt, Ki, Vt, Vi, where Wencoder contains the weights of the image encoder and post and posi are the positional embeddings, possibly carefully initialized and then frozen.
Given a dataset of matching and non-matching pairs {((I^i, t^i), l^i) | i = 1, ..., n}, where I^i and t^i are the image and text inputs of the i-th example and l^i is the corresponding binary label, the matching network is trained with the contrastive loss of Eq. (6), originally introduced in [5], where m is the margin and α is used to balance between matching and non-matching pairs.
Notice that this loss pushes matching pairs to have similarity close to 1, and non-matching pairs to have similarity close to 0.
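The formula of Eq. (6) did not survive extraction here; the sketch below is therefore only a plausible reconstruction, obtained by plugging the "distance" 1 − S into the contrastive loss of [5] so that it matches the behaviour described above (with m = 1, matching pairs are pushed towards similarity 1 and non-matching pairs towards 0). Treat it as an assumption rather than the paper's exact loss.

```python
import torch

def contrastive_loss(S, l, m=1.0, alpha=1.0):
    """Assumed similarity-based contrastive loss in the spirit of [5].

    S: predicted similarities, l: binary labels (1 = matching pair),
    m: margin, alpha: weight balancing matching vs. non-matching pairs.
    """
    pos = alpha * l * (1.0 - S) ** 2                              # pull matching pairs towards S = 1
    neg = (1.0 - l) * torch.clamp(S - (1.0 - m), min=0.0) ** 2    # push non-matching pairs below 1 - m
    return (pos + neg).mean()

# With m = 1 and alpha = 1 (the values used in Section 5), non-matching pairs
# are pushed towards similarity 0 and matching pairs towards similarity 1.
S = torch.tensor([0.9, 0.2, 0.8, 0.7])
l = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(contrastive_loss(S, l))
```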
5 Experiments
The experimental analysis was carried out both on a well-known public dataset and in the context of a real case study on bank cheques, based on a proprietary dataset provided by a well-established bank.
In this section we explain in detail the settings for the experiments on the public dataset, and then we briefly introduce the real case study.
5.1 Dataset
In our experiments we use the standard IAM handwriting database [7].
This database consists of 1539 pages of scanned text from 657 different writers.
The database also provides the isolated and labeled words that have been extracted from the pages of scanned text using an automatic segmentation scheme and were verified manually.
We use the dataset at word level, and consider the available splitting for training, validation and test sets proposed for the Large Writer Independent Text Line Recognition Task, in which each writer contributed to one set only.
The cropped words provided in the database consist of the concatenation of characters with white background.
Therefore, we perform the cropping again, starting from the images of the entire pages and using the provided bounding boxes.
We set the alphabet to abcdefghijklmnopqrstuvwxyz-’ and we filter out words with characters outside the alphabet, or words only composed of punctuation marks.
You can see samples of matching pairs in Figure 4.
5.2 Non Matching Pairs Generation
The considered multimodal task depends on a given dataset of matching and non-matching pairs, and is therefore strictly related to a particular distribution of non-matching pairs. Denoting by V the vocabulary of the dataset, we consider the following strategies to generate the text of a non-matching pair from a matching (image, text) pair:
• random: a random word inside V, different from the matching text;
• edit1: a word with Levenshtein distance equal to 1 from the matching text;
• edit12: a word with Levenshtein distance equal to 1 or 2 from the matching text;
• mixed: a random word inside V with probability 1/3, a word with Levenshtein distance equal to 1 from the matching text with probability 1/3, or equal to 2 with probability 1/3.
We built four synthetic datasets in this way, producing one non-matching sample for each matching pair, so that the proportion of examples with labels 1 and 0 is the same.
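A simple way to generate such negative texts is sketched below; the extracted text does not specify how words at a given Levenshtein distance are obtained, so random character-level edits are used here as an assumption (two edits only approximate distance 2), and the helper names are illustrative.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz-'"

def random_edit(word):
    """Apply one random character-level edit (substitution, insertion or deletion)."""
    i = random.randrange(len(word))
    op = random.choice(["sub", "ins", "del"] if len(word) > 1 else ["sub", "ins"])
    if op == "sub":
        c = random.choice([a for a in ALPHABET if a != word[i]])
        return word[:i] + c + word[i + 1:]
    if op == "ins":
        return word[:i] + random.choice(ALPHABET) + word[i:]
    return word[:i] + word[i + 1:]          # deletion

def negative_text(text, vocabulary, mode):
    """Generate the text of a non-matching pair for a given matching text."""
    if mode == "random":
        return random.choice([w for w in vocabulary if w != text])
    if mode == "edit1":
        return random_edit(text)
    if mode == "edit12":                    # one or two edits, chosen uniformly
        return random_edit(text) if random.random() < 0.5 else random_edit(random_edit(text))
    # mixed: random word, ~distance 1, or ~distance 2, each with probability 1/3
    r = random.random()
    if r < 1 / 3:
        return negative_text(text, vocabulary, "random")
    if r < 2 / 3:
        return random_edit(text)
    return random_edit(random_edit(text))

print(negative_text("lively", ["field", "house", "lively"], "mixed"))
```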
5.3 Competing Methods
We compare the text-matching model with two other models: a simple baseline designed for the considered task and a model for text recognition adapted to the task.
In particular, the image embedding I of dimension si × di and the text embedding T of dimension st × dt are defined in the same way as in the TextMatcher model, with the constraint that the feature dimensions di and dt must be equal.
So, I is the encoder of ASTER, and T is computed from an embedding matrix over the alphabet.
Figure 5: Illustration of the choice of the optimal threshold.
The baseline computes the average Tavg of the rows of the text embedding and the average Iavg of the rows of the image embedding, with the convention that rows related to pad characters are not considered in the average of the text embedding.
Finally, the output of the model is the cosine similarity between the average image and text embeddings:

Sb = (Tavg · Iavg) / (‖Tavg‖ ‖Iavg‖)   (9)

The parameters of the convolutional part and the embedding matrix for the text are trained end-to-end in the final multimodal task, using the loss in Section 4.4.
Finally, for both the baseline and the text-recognition model adapted to the text-matching task, we can compute the predicted binary label starting from the computed similarities in Eq. (7) and Eq. (9), in the same way as done for the TextMatcher model in Eq. (5).
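A minimal sketch of the Baseline similarity of Eq. (9), assuming equal feature dimensions for the two embeddings and excluding pad rows from the text average:

```python
import torch
import torch.nn.functional as F

def baseline_score(T, I, pad_mask):
    """Cosine similarity between the average text embedding and the average image embedding."""
    # average only the rows of T that correspond to real (non-pad) characters
    T_avg = (T * pad_mask.unsqueeze(1)).sum(dim=0) / pad_mask.sum()
    I_avg = I.mean(dim=0)                              # average over the image slices
    return F.cosine_similarity(T_avg, I_avg, dim=0)    # Eq. (9)

T = torch.randn(20, 512)                               # text embedding (st x dt)
I = torch.randn(64, 512)                               # image embedding (si x di), with di = dt
pad_mask = torch.tensor([1.0] * 6 + [0.0] * 14)
print(baseline_score(T, I, pad_mask))
```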
5.4 Evaluation Metrics
For the considered multimodal task, we focus on the evaluation as binary classification.
We evaluate the models using the confusion matrix and the F1-score as a global evaluation metric.
For each considered model, we choose the optimal threshold τ of Eq. (5) on the validation set with respect to the F1-score, and report the performance on the test set.
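A minimal sketch of this threshold selection (the candidate-threshold grid and helper name are illustrative; one could equivalently sweep the sorted validation scores):

```python
import numpy as np

def best_threshold(scores, labels, num_candidates=200):
    """Pick the threshold tau maximizing the F1-score on validation (scores, labels)."""
    best_tau, best_f1 = 0.0, -1.0
    for tau in np.linspace(scores.min(), scores.max(), num_candidates):
        pred = (scores >= tau).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1

scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7, 0.2])
labels = np.array([1, 0, 1, 0, 1, 0])
print(best_threshold(scores, labels))   # a tau separating 0.4 from 0.7, with F1 = 1.0
```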
The image embedding part is given by the encoder layer of ASTER [11] with a final bidirectional LSTM with 256 hidden dimension, which produces an image embedding of dimension 64 × 512.
The encoder is initialized with the weights of the pretrained model available from the original source code of [11].
For the text embedding we use dt = 512 and maximum word length st = 20.
We add positional embeddings to both image and text embeddings, using the same initialization strategy proposed in [12], and then we freeze them during training.
The text embedding and the other attention parameters are initialized with the Xavier initialization [4].
The training is performed with SGD with momentum 0.9 using learning rate equal to 0.005 and batch size 8.
The loss has margin m equal to 1 and α equal to 1.
The maximum number of epochs is 50, except for the edit1 configuration, for which it is 100, and we select the best model according to the F1-score on the validation set.
The model is initialized with the weights of the publicly available pretrained model.
All hyperparameters are set to the default values, except for the batch size (64), the height of the input images (32), and the maximum number of epochs (35).
The alphabet is configured as abcdefghijklmnopqrstuvwxyz-’.
We perform a single training using the dataset {(I^i, t^i) | i = 1, ..., n}, where I^i is the i-th image and t^i the corresponding (correct) text.
Finally, for the Baseline model described in Section 5.3, the image embedding part is given by the encoder of ASTER [11] with a final LSTM with hidden dimension 256, which produces an image embedding of dimension 64 × 512.
All the other hyperparameters are initialized with the same values as for TextMatcher.
5.6 Results
As explained at the beginning of this section, we ran experiments on four different configurations and compared our approach with two alternatives, the ASTER model and the Baseline model: you can see the results in Table 1.
Conversely, the edit1 and edit12 datasets are more difficult, because for non-matching pairs the Levenshtein distance between the candidate text and the corresponding matching text is small.
Indeed, the F1-scores are lower than the ones obtained in the random configuration: for instance, the TextMatcher model reaches 85.77 on edit1 and 88.05 on edit12.
Finally, the mixed dataset is a configuration of intermediate complexity: the F1-score in this case lies between those of the random and edit12 experiments, as we expected.
Instead, the ASTER model is a valid competitor for TextMatcher: it reaches lower but comparable F1-scores for the random and edit1 configurations, while the gap is larger on the edit12 and mixed datasets.
The experiments show that, in general, the F1-score of the TextMatcher model is higher than that of ASTER, in particular for the edit12 and mixed configurations, i.e., when errors of different complexity need to be recognised together.
This is partly related to the distribution of similarities produced by the different models.
Indeed, the TextMatcher model produces a continuous distribution of values, treating different kinds of error similarly.
Conversely, the distribution of similarities produced by ASTER is discontinuous and would need different optimal thresholds for different kinds of error.
This can be seen in Figure 6, which shows the distribution of similarities for matching and non matching examples computed by TextMatcher and ASTER on different configurations.
For instance, for the ASTER model on the mixed configuration (image (d) in Figure 6) the optimal threshold is 0.94, but if we consider only the negative examples of type random the optimal threshold would be 0.45, while considering only the edit1 and edit12 types of negative examples the optimal thresholds would be 0.82 and 0.94, respectively.
You can see from the figure that an optimal threshold of 0.95 is necessary to distinguish the edit1 non-matching examples from the matching examples, and therefore the resulting performance is analogous to the edit1 configuration.
Furthermore, the continuous distribution of similarities produced by the TextMatcher model also allows choosing the desired trade-off between false positives and false negatives, while for the ASTER model this is not always possible, at least not with the same flexibility.
Finally, the TextMatcher model can be trained on specific patterns that need to be recognised (e.g., when one wants to distinguish a text like facebook ltd from the matching text facebook inc). A model trained with these kinds of negative examples would specialise in recognising these errors, paying more attention to the relevant part of the text.
Conversely, the ASTER model treats all kinds of typos inside the text in the same way.
Another advantage of the TextMatcher model is speed; we tested the CPU inference time of the two trained models for 1000 random examples taken from the test set of the mixed configuration: ASTER takes around 0.58 seconds on average per image while TextMatcher around 0.07 seconds per image, which is 8.75x faster.
5.7 Visualization
We can visualize the intermediate features computed by the TextMatcher model.
This is particularly interesting since you can analyse what the model is learning in a particular configuration, and it can be helpful to verify whether the model behaves as expected.
Notice that in Figure 7 the rows correspond to the characters of the candidate text, i.e., l-i-v-e-l-y or f-i-e-l-d, and the columns correspond to the vertical slices of the image.
The cosine matrix C computes cosine similarities between characters and slices of the image: for instance you can see that the character l has high similarity in two areas, corresponding to the regions where there is the first and the second l of the word lively.
Conversely, the second row, corresponding to the non-matching text field, shows a different behaviour: the attention matrix A no longer has a quasi-diagonal structure, and you can see that each character tries to find the region of the image containing the corresponding character, or at least the most similar one.
The cosine matrix C shows that the characters f and d are probably missing, since there are no areas with high values.
Then, the combination of the attention and cosine matrices in the third and fourth images highlights the fact that f and d are missing, especially the latter, which has a similarity equal to −0.63.
5.8 Case Study
As mentioned in the introduction, we developed the TextMatcher model in the context of a real application at a well-established bank.
Later, a back office operator manually checks the correctness of the inserted fields.
The purpose of the application consists in automating this procedure with an algorithmic solution able to analyze the scanned image of the cheque and verify the textual fields inserted in the ATM.
Figure 7: Visualization of the intermediate features computed by the TextMatcher model trained on mixed configuration on an image with matching text lively.
On the top, the example with the matching text; on the bottom, the example with the non-matching text field.
We applied the TextMatcher model to the verification of the date field.
The developed solution consists of two main steps: a YOLO model [9] extracts the text region, and then the TextMatcher model processes the scanned image.
In order to mitigate the absence of future dates, the dataset was enlarged with images from the amount field and synthetic images of dates in a larger time interval.
Therefore, we generated the majority of negative examples with difficult cases.
We also prepared a test dataset where all the dates are more recent than the ones in the training and validation sets, in order to estimate the performance on future years.
Therefore, the TextMatcher model was chosen and is now used in production.
6 Conclusions
In this paper, we propose the novel task of text matching, to compare an image containing a single-line text with the corresponding text transcription, together with a model for this task, named TextMatcher.
The model directly processes input image and text, computing a similarity score for the two inputs.
Our approach projects image and text into separate embedding spaces, and exploits a cross-attention mechanism which is able to discover local alignments between image and text.
In addition, we envision possible future work regarding the proposed methodology: the cross-attention mechanism could be adapted to different pairs of vector embeddings, e.g., audio and text embeddings.
References
[7] U.-V. Marti and H. Bunke. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1):39–46, 2002.
[8] Xuefei Qi, Ying Zhang, Jinqing Qi, and Huchuan Lu. Self-attention guided representation learning for image-text matching. Neurocomputing, 450:143–155, 2021.
[9] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
[10] Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. Robust scene text recognition with automatic rectification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 27–30, 2016, pages 4168–4176, 2016.
[11] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai.