Fugu-MT Paper Translation (Abstract): Cast: Automated Resilience Testing for Production Cloud Service Systems

Paper overview: Cast: Automated Resilience Testing for Production Cloud Service Systems

arxiv url: http://arxiv.org/abs/2602.00972v1
Date: Sun, 01 Feb 2026 02:29:25 GMT
Title: Cast: Automated Resilience Testing for Production Cloud Service Systems
Authors: Zhuangbin Chen, Zhiling Deng, Kaiming Zhang, Yang Liu, Cheng Cui, Jinfeng Zhong, Zibin Zheng
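Abstract summary: Cast is an automated, end-to-end framework for testing microservice resilience in production. It achieves high test fidelity by replaying production traffic against a comprehensive library of application-level faults. Cast has been adopted by many service teams to proactively address resilience vulnerabilities.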
Reference score (independently computed attention score): 38.54479293660192
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The distributed nature of microservice architecture introduces significant resilience challenges. Traditional testing methods, limited by extensive manual effort and oversimplified test environments, fail to capture production system complexity. To address these limitations, we present Cast, an automated, end-to-end framework for microservice resilience testing in production. It achieves high test fidelity by replaying production traffic against a comprehensive library of application-level faults to exercise internal error-handling logic. To manage the combinatorial test space, Cast employs a complexity-driven strategy to systematically prune redundant tests and prioritize high-value tests targeting the most critical service execution paths. Cast automates the testing lifecycle through a three-phase pipeline (i.e., startup, fault injection, and recovery) and uses a multi-faceted oracle to automatically verify system resilience against nuanced criteria. Deployed in Huawei Cloud for over eight months, Cast has been adopted by many service teams to proactively address resilience vulnerabilities. Our analysis on four large-scale applications with millions of traces reveals 137 potential vulnerabilities, with 89 confirmed by developers. To further quantify its performance, Cast is evaluated on a benchmark set of 48 reproduced bugs, achieving a high coverage of 90%. The results show that Cast is a practical and effective solution for systematically improving the reliability of industrial microservice systems.

Related papers