The practical aspects of evaluating recommender systems are an actively discussed topic in the research community. While many current evaluation techniques reduce performance to a single-value metric as a straightforward approach for model comparison, this approach rests on a strong assumption of the methods' stable performance over time. In this paper, we argue that leaving out a method's continuous performance can lead to losing valuable insight into joint data-method effects. We propose the Cross-Validation Through Time (CVTT) technique to perform more detailed evaluations, which focus on model cross-validation performance over time. Using the proposed technique, we conduct a detailed analysis of popular RecSys algorithms' performance against various metrics and datasets. We also compare several data preparation and evaluation strategies to analyze their impact on model performance. Our results show that model performance can vary significantly over time, and both data and evaluation setup can have a marked effect on it.
On the one hand, RecSys methods are already actively used for recommendations in a large variety of domains such as movies [54], music [43], news [14], and many more [3, 42, 60].
Various new methods are constantly being developed in order to improve user recommendations with deep learning techniques [35, 45, 61], memory-based [51] and latent factor-based methods [11, 29, 46], or even reinforcement learning [1, 36].
On the other hand, despite all this progress, the RecSys evaluation protocol is still an open question due to various possible data splitting strategies and data preparation approaches [12, 28, 52].
In this way, in order to unlock the full potential of data-driven approaches, we still require more nuanced evaluation techniques to fully assess RecSys methods on historical data.
In early works [6, 8], researchers highlighted the importance of time-based algorithm validation.
In a prior study [57], the authors sampled 85 papers published in 2017-2019 from top conferences, and concluded that random-split-by-ratio and leave-one-out splitting strategies are used in 82% of cases.
At the same time, recent studies [40] pointed out that the most strict and realistic setting for data splitting is a global temporal split, where a fixed time-point separates interactions for training and testing.
Our findings (Figure 1) are embarrassingly simple: (1) RecSys performance changes significantly over time, and (2) our choice and usage of the time component affects the metrics we receive upon evaluation.
While validation over time is demanding in such a setup, a realistic setting for time-based algorithm validation is precisely what current RecSys evaluation progress insists on.
Moreover, we show that methods can exhibit performance trends over time.
• Although it is natural to expand the training dataset with time, we demonstrate that performance can change significantly depending on the chosen data strategy, and continuously extending the dataset can lead to biased results.
2.1 Offline Evaluation
Following recent studies in RecSys evaluation [9, 23, 40], we summarize the five main data splitting strategies for offline RecSys evaluation:
• User Split reserves particular users and all their interactions for training, while a different user set and all their interactions are used for testing.
• Leave One Out Split selects the last transaction per user for testing while keeping all remaining interactions for training.
In the case of next-item recommendations, the last interaction corresponds to the last user-item pair per user [4, 18, 65].
• Temporal User Split splits per-user interactions into train and test sets based on interaction timestamps (e.g., the last 20% of interactions are used for testing).
While this scenario is actively used in the sequential RecSys domain [35, 59], it could lead to the data leakage discussed in [40].
• Global Temporal Split uses a fixed time point to separate interactions for training and testing. Compared to Leave One Out Split or Temporal User Split, this method could sample fewer interactions for testing, since having the same number of users or items in the train and test sets is not guaranteed.
Nevertheless, according to recent studies [23], this is the only strategy that prevents data leakage.
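To make the contrast concrete, here is a minimal sketch of the two splits most relevant to this discussion, assuming a pandas DataFrame of interactions with illustrative user_id and timestamp columns (our own illustration, not code from the paper):

```python
import pandas as pd

def leave_one_out_split(df: pd.DataFrame):
    """Hold out each user's last (most recent) interaction for testing."""
    df = df.sort_values("timestamp")
    # True only for the final interaction of each user.
    is_last = ~df.duplicated(subset="user_id", keep="last")
    return df[~is_last], df[is_last]

def global_temporal_split(df: pd.DataFrame, split_time):
    """A single fixed time point separates train from test for ALL users,
    so no interaction later than split_time can leak into training."""
    return df[df["timestamp"] < split_time], df[df["timestamp"] >= split_time]
```

Note how the global temporal split, unlike the leave-one-out variant, never places a training interaction after a test interaction in time, which is exactly the leakage-prevention property discussed above.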
An overview of where these strategies were used is presented in Table 1.
As noted in previous works [40], there is little consistency in the selection of evaluation protocols.
Even when the same datasets are used, researchers can select different data splitting strategies for model comparison.
For example, the authors of [37] and [64] both used the Amazon and Yelp datasets but chose different data splitting strategies: Random Split and Leave One Out, respectively.
Furthermore, Table 1 shows that very few (3 out of 30) methods were evaluated with the most realistic and leakage-proof strategy [23], the global temporal split.
2.2 Time-Aware Cross-Validation
While RecSys offline evaluation is an actively studied topic, RecSys time-aware cross-validation is not as widely covered.
On the other hand, while recent works [31] focus on sequence-aware approaches, they include only RNN-based methods and lack simple statistical and matrix factorization-based baselines.
Finally, available global temporal-based benchmarks [28] lack time-dependent evaluation, reducing the final performance to a single-value metric.
CVTT extends and generalizes recent works [40, 52] in order to address the challenge of Recommender System evaluation over time, highlighting the corresponding time-dependent nuances.
The results of our experiments demonstrate that CVTT could help identify subtle patterns in the dependence of model performance on input data.
To showcase our methodology, we apply it to the next-period prediction task (Section 3.1) using various RecSys datasets (Section 3.2) and methods (Section 3.3).
The goal of a Recommender System is to learn users’ interests through observed interactions and build a predictive model to recommend new or repeated products, movies, or songs.
If the user can consume sets of items simultaneously and we want to predict the whole set of interactions, this is called the next-basket prediction task [44].
Finally, if we are curious about user interests over time, we could group user interactions into time-based baskets and predict their interactions for the next predefined period.
This is the next-period recommendation task [28, 66].
In this work, we focus on the latter task, as it integrates natively with the global temporal split and is actively used in relevant works [23, 52].
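For illustration, a minimal sketch of bucketing interactions into time-based baskets for the next-period task (column names and monthly granularity are assumptions, not the paper's code):

```python
import pandas as pd

def to_period_baskets(df: pd.DataFrame, freq: str = "M") -> pd.DataFrame:
    """Group interactions into one basket per user per period."""
    df = df.copy()
    df["period"] = df["timestamp"].dt.to_period(freq)
    # For next-period recommendation, the basket of period p+1 is the
    # prediction target given everything observed up to period p.
    return (df.groupby(["user_id", "period"])["item_id"]
              .apply(set)
              .reset_index(name="basket"))
```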
The TTRS dataset contains over 2 million interactions between almost 10,000 users and more than 1,000 merchants over a period of 14 months, from January 2019 to February 2020.
For benchmarking purposes, we select the "Let’s Get Sort-of-Real sample 50K customers" version of the dataset, which is well-known among the research community [15, 20, 21].
The choice of datasets for our research question is limited, as the datasets need to include timestamps so that the data can be split and evaluated based on time.
4 CROSS-VALIDATION THROUGH TIME
To evaluate how the recommender's performance changes over time, we adopt a recently proposed global temporal split for the cross-validation setup, building on the ideas from [23, 28, 52].
While we find these two approaches equally fair for real-life applications (a comparison of the effects of train data strategy selection can be found in Section 5.2), we will use the first option (expand) for presenting CVTT.
"Test" (1 period) represents the data from the last period, "validation" (1 period) the penultimate period, and "train" (N-2 periods) all data up to the penultimate period.
Next, we run a model hyperparameter search on the ("train", "validation") subset and use the best-found hyperparameters to train and evaluate the model on the ("train + validation", "test") one.
This "test" score is then used as the final model performance measure on that fold.
In the end, we get the CVTT performance graph by combining these scores over time.
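A minimal sketch of this fold construction (our own illustration under stated assumptions, with interactions already bucketed into ordered periods as above; on each fold, hyperparameters are tuned on (train, validation) and the final score is computed on test):

```python
def cvtt_folds(periods, window_size=None):
    """Yield (train, validation, test) period lists for each CVTT fold.

    periods: ordered list of period labels, e.g. months [0, 1, ..., N-1].
    window_size: None reproduces the expanding strategy (all past periods);
                 an integer keeps only the last `window_size` train periods.
    """
    for test_idx in range(2, len(periods)):
        val_idx = test_idx - 1
        start = 0 if window_size is None else max(0, val_idx - window_size)
        yield (periods[start:val_idx],  # train: periods before validation
               [periods[val_idx]],      # validation: penultimate period
               [periods[test_idx]])     # test: last period of the fold

# Six monthly periods, expanding train set:
for train, val, test in cvtt_folds(list(range(6))):
    print(train, val, test)  # [0] [1] [2], then [0, 1] [2] [3], and so on
```

Setting window_size to 1, 3, or 5 gives the window data strategies compared against the expanding strategy in Section 5.2.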
5 EXPERIMENTAL EVALUATIONS
Our experimental evaluations aim to address the following questions: (1) How does model performance vary over time?
(2) How do different data preparation strategies affect model performance?
(3) Are models robust to data preparation strategies?
(4) How does the forecasting horizon affect model performance ranking?
(5) How does model predictive performance vary with delayed evaluation?
(6) Which models are robust to delayed evaluation?
Fig. 2. CVTT performance graphs. X-axes correspond to the measured fold number, and Y-axes to the performance metric (MAP@10). Panels (a, b, c) relate to the expanding data strategy, while (d, e, f), (h, i), and (k, l) correspond to window strategies with window lengths of 1, 3, and 5, respectively.
5.1 Implementation Details
Similar to previous studies [12], we search for the optimal parameters through Bayesian search using the implementation of Optuna.
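A minimal sketch of such a search with Optuna's study API (the model, parameter ranges, and evaluation helpers fit_model, evaluate_map_at_10, train_df, and valid_df are hypothetical placeholders, not the paper's exact setup):

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space for a latent factor model.
    params = {
        "factors": trial.suggest_int("factors", 16, 256),
        "regularization": trial.suggest_float("regularization", 1e-4, 1e-1, log=True),
    }
    model = fit_model(train_df, **params)       # placeholder training routine
    return evaluate_map_at_10(model, valid_df)  # placeholder validation metric

study = optuna.create_study(direction="maximize")  # maximize MAP@10 on validation
study.optimize(objective, n_trials=50)
# Retrain with study.best_params on train + validation, then score on test.
```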
Fig. 5. Delayed model performance comparison over time.
At each fold t, we evaluate the resulting model on the t+1, t+2, and t+3 test folds, each corresponding to the three estimation points.
X-axes correspond to the measured fold number, and Y-axes to the performance metric (MAP@10).
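A minimal sketch of this delayed evaluation loop (fold counts and the helpers train_on_periods and score_on_period are illustrative assumptions, in the spirit of the sketches above):

```python
# Measure how a model trained at fold t decays when evaluated with a delay.
n_periods = 14                      # e.g., 14 monthly folds (illustrative)
delays = (1, 2, 3)
results = {d: [] for d in delays}
for t in range(2, n_periods - max(delays)):
    model = train_on_periods(range(t + 1))   # hypothetical: fit on periods 0..t
    for d in delays:
        # hypothetical: MAP@10 of `model` on the interactions of period t+d
        results[d].append(score_on_period(model, t + d))
```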
5.2 Results
In Figure 2 (a, b, c), we contrast the performance of various methods over time following the CVTT evaluation protocol, expanding the training dataset at each fold.
Similar to previous studies [52], our analysis shows that the performance of evaluated approaches often changes over time.
However, in contrast to previous studies, we observe an example of a performance trend over time: ALS scores decrease steadily on both TTRS and DHB (Figure 2 (b, c)).
Moreover, while performance remains roughly the same over time in most cases (e.g., SVD on TTRS), we observe data updates (see footnote 4) which can shift it significantly for some methods (e.g., SVD and NMF on DHB) while having no effect on others (e.g., ItemKNN on DHB).
The results for window lengths of 1, 3, and 5 can be found in Figure 2, rows 2, 3, and 4, respectively.
Several observations can be made based on these results.
First, with the window strategy, the performance of ALS stops decaying over time, mainly concentrating around an intermediate value.
We also observe a correlation between window length and ALS performance, where the larger the length used, the lower the MAP@10 score achieved by ALS (Figure 3 (h, i)).
Secondly, the widely known expand strategy achieves the best performance only for 6 out of 24 methods (Figure 3), suggesting that further research of data preparation approaches is required.
The last interesting insight from this comparison is that it may be possible to avoid having to provide a model with a full history of user interactions.
Instead, it could be enough to use just the amount of history that captures current user preferences.
5.4 Effects of Forecast Horizon
One practical question that practitioners could be interested in concerns the changes in model performance rankings with varied interval steps.
These results once again highlight the importance of model re-training and re-evaluation over time, which is the core idea behind CVTT.
Moreover, based on these results, a possible direction for practitioners would be to combine robust (such as EASE) and adaptive (SVD and others) approaches together for better performance.
Footnote 4: According to recent studies [28], the DHB dataset shows a drop in item novelty rate in the most recent months, but its further investigation can be seen as out of the scope of this paper.
6 STRENGTHS, LIMITATIONS & OPPORTUNITIES
This paper presents an intuitive idea by combining advancements from the fields of data splitting, cross-validation, and RecSys offline evaluation.
Such cross-validation can be considered a very practical evaluation methodology that helps avoid the issues of data leakage.
Moreover, Figure 2 shows that such evaluation highlights important data updates.
Limitations: While this work demonstrates promises of continuous evaluation over time, it does not investigate all possible data leakage problems that might exist.
Opportunities: CVTT offers a simple yet practical evaluation methodology that is compatible with various methods.
This evaluation strategy opens up avenues for cross-domain model comparison, a better understanding of the influence of data factors on model performance, and more informed method selection for one's needs.
[2] Vito Walter Anelli, Amra Delic, Gabriele Sottocornola, Jessie Smith, Nazareno Andrade, Luca Belli, Michael M. Bronstein, Akshay Gupta, Sofia Ira Ktena, Alexandre Lung-Yut-Fong, Frank Portman, Alykhan Tejani, Yuanpu Xie, Xiao Zhu, and Wenzhe Shi. 2020. RecSys 2020 Challenge Workshop: Engagement Prediction on Twitter's Home Timeline. In RecSys 2020: Fourteenth ACM Conference on Recommender Systems, Virtual Event, Brazil, September 22-26, 2020, Rodrygo L. T. Santos, Leandro Balby Marinho, Elizabeth M. Daly, Li Chen, Kim Falk, Noam Koenigstein, and Edleno Silva de Moura (Eds.). ACM, 623–627. https://doi.org/10.1145/3383313.3411532
[3] Khalid Anwar and Shahab Sohail. 2019. Machine Learning Techniques for Book Recommendation: An Overview. SSRN Electronic Journal (01 2019). https://doi.org/10.2139/ssrn.3356349
[4] Ting Bai, Lixin Zou, Wayne Xin Zhao, Pan Du, Weidong Liu, Jian-Yun Nie, and Ji-Rong Wen. 2019. CTRec: A Long-Short Demands Evolution Model for Continuous-Time Recommendation. In SIGIR. 675–684.
[5] Linas Baltrunas, Tadas Makcinskas, and Francesco Ricci.
2019. RepeatNet: A Repeat Aware Neural Recommendation Machine for Session-Based Recommendation. Proceedings of the AAAI Conference on Artificial Intelligence 33 (07 2019), 4806–4813. https://doi.org/10.1609/aaai.v33i01.33014806
[46] Steffen Rendle. 2010. Factorization Machines. In ICDM 2010, The 10th IEEE International Conference on Data Mining, Sydney, Australia, 14-17 December 2010, Geoffrey I. Webb, Bing Liu, Chengqi Zhang, Dimitrios Gunopulos, and Xindong Wu (Eds.). IEEE Computer Society, 995–1000. https://doi.org/10.1109/ICDM.2010.127
[47] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme.