Fugu-MT Paper Translation (Abstract): Comparative analysis of large data processing in Apache Spark using Java, Python and Scala

Paper summary: Comparative analysis of large data processing in Apache Spark using Java, Python and Scala

arxiv url: http://arxiv.org/abs/2510.19012v1
Date: Tue, 21 Oct 2025 18:54:21 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:14.519777
Title: Comparative analysis of large data processing in Apache Spark using Java, Python and Scala
Title (reference translation): Comparative analysis of large-scale data processing in Apache Spark using Java, Python, and Scala
Authors: Ivan Borodii, Illia Fedorovych, Halyna Osukhivska, Diana Velychko, Roman Butsii
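Abstract summary: The analysis was performed by executing several operations, including downloading data from CSV files, transforming it, and loading it into an Apache Iceberg table. The results show that the performance of the Spark algorithm varies significantly depending on the amount of data and the programming language used.
Reference score (independently computed attention score): 0.0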
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: This study presents a comparative analysis of handling large datasets on the Apache Spark platform using the Java, Python, and Scala programming languages. Although prior works have focused on individual stages, comprehensive comparisons of full ETL workflows across programming languages using Apache Iceberg remain limited. The analysis was performed by executing several operations, including downloading data from CSV files, transforming it, and loading it into an Apache Iceberg analytical table. It was found that the performance of the Spark algorithm varies significantly depending on the amount of data and the programming language used. When processing a 5-megabyte CSV file, the best result was achieved in Python at 6.71 seconds, ahead of Scala at 9.13 seconds and Java at 9.62 seconds. When processing a large CSV file of 1.6 gigabytes, all programming languages demonstrated similar results: the fastest performance was shown in Python at 46.34 seconds, while Scala and Java showed results of 47.72 and 50.56 seconds, respectively. When performing a more complex operation that involved combining two CSV files into a single dataset for further loading into an Apache Iceberg table, Scala demonstrated the highest performance, at 374.42 seconds. Java processing was completed in 379.8 seconds, while Python was the least efficient, with a runtime of 398.32 seconds. It follows that the programming language significantly affects the efficiency of data processing by the Apache Spark algorithm, with Scala and Java being more productive for processing large amounts of data and complex operations, while Python demonstrates an advantage when working with small amounts of data. The results obtained can be useful for optimizing data handling processes depending on specific performance requirements and the amount of information being processed.
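Abstract (reference translation): The study obtained the results of a comparative analysis of the process of handling large datasets using the Apache Spark platform in the Java, Python, and Scala programming languages. Although prior work has focused on individual stages, comprehensive comparisons of full ETL workflows across programming languages using Apache Iceberg remain limited. The analysis was performed by executing several operations, including downloading data from CSV files, transforming it, and loading it into an Apache Iceberg analytical table. The results show that the performance of the Spark algorithm varies significantly depending on the amount of data and the programming language used. When processing a 5-megabyte CSV file, Python took 6.71 seconds, better than Scala's 9.13 seconds and Java's 9.62 seconds. When processing a large 1.6-gigabyte CSV file, all programming languages showed similar results: Python was fastest at 46.34 seconds, while Scala and Java took 47.72 and 50.56 seconds, respectively. When performing a more complex operation that combined two CSV files into a single dataset for further loading into an Apache Iceberg table, Scala showed the highest performance at 374.42 seconds. Java processing completed in 379.8 seconds, and Python ran in 398.32 seconds. The programming language significantly affects the efficiency of data processing by the Apache Spark algorithm: Scala and Java are more productive for processing large amounts of data and complex operations, while Python has an advantage when handling small amounts of data. The results obtained are useful for optimizing data processing depending on specific performance requirements and the amount of information being processed.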
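The runtimes quoted in the abstract are easier to compare as relative gaps. The following is a minimal sketch, not part of the paper: it copies the nine reported timings and computes, for each workload, how much slower each language was than the fastest one. The workload labels and the helper names `fastest` and `relative_slowdown` are illustrative assumptions.

```python
# Runtimes in seconds, copied from the abstract.
# The workload labels are illustrative names, not the paper's terminology.
RUNTIMES = {
    "5 MB CSV": {"Python": 6.71, "Scala": 9.13, "Java": 9.62},
    "1.6 GB CSV": {"Python": 46.34, "Scala": 47.72, "Java": 50.56},
    "two-CSV merge": {"Scala": 374.42, "Java": 379.8, "Python": 398.32},
}


def fastest(timings):
    """Return the language with the smallest runtime."""
    return min(timings, key=timings.get)


def relative_slowdown(timings):
    """Return each language's extra runtime over the fastest one, in percent."""
    best = min(timings.values())
    return {
        lang: round((seconds - best) / best * 100, 1)
        for lang, seconds in timings.items()
    }


if __name__ == "__main__":
    for workload, timings in RUNTIMES.items():
        gaps = relative_slowdown(timings)
        print(f"{workload}: fastest = {fastest(timings)}")
        for lang, gap in sorted(gaps.items(), key=lambda item: item[1]):
            print(f"  {lang}: +{gap}%")
```

On these figures the spread between fastest and slowest is about 43% for the 5 MB file and under 10% for both the 1.6 GB file and the two-file workload, so the language effect is largest in relative terms on the smallest input.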


Related papers