Mathematical Problems of Computer Science 59, 35–44, 2023. doi: 10.51408/1963-0100
UDC 004.75

Data Compression-Aware Performance Analysis of Dask and Spark for Earth Observation Data Processing

Arthur G. Lalayan

Institute for Informatics and Automation Problems of NAS RA, Yerevan, Armenia
National Polytechnic University of Armenia, Yerevan, Armenia
e-mail: arthurlalayan97@gmail.com

Abstract

High-performance computing is a good choice for handling Big Earth Observation data, allowing the data to be processed in a distributed and performance-efficient way using in-memory computing frameworks. Data compression reduces storage requirements and network transfer time and improves processing performance. The article investigates the effectiveness of widely used distributed data processing frameworks in conjunction with lossless data compression techniques, in order to find the optimal compression method and processing framework for specific Earth observation workflows. The Normalized Difference Vegetation Index has been evaluated for the territory of Armenia, obtaining data from the Sentinel satellite and considering the supported compression methods to compare the performance of the in-memory Dask and Spark frameworks. Experiments show that the Zstandard compression method and the Dask framework are the best choices for such workflows.

Keywords: Earth observation, HPC, Spark, Dask, Distributed computing, Data compression.

Article info: Received 29 January 2022; sent for review 7 February 2023; received in revised form 15 March 2023; accepted 17 April 2023.

Acknowledgement: The research was supported by the Science Committee of the Republic of Armenia and the University of Geneva Leading House by the projects entitled Self-organized Swarm of UAVs Smart Cloud Platform Equipped with Multi-agent Algorithms and Systems (Nr. 21AG-1B052), Remote sensing data processing methods using neural networks and deep learning to predict changes in weather phenomena (Nr. 21SC-BRFFR-1B009), and ADC4SD: Armenian Data Cube for Sustainable Development.

1. Background and Motivation

Earth Observation (EO) satellite data are essential for environmental monitoring and for gathering vital information about various Earth layers [1]. Specifically, EO data are widely used to monitor the atmosphere, including air pollution [2] and temperature [3], the oceans, considering sea pollution and ocean acidity [4], and the ground, such as deforestation [5] and forest fires [6], as well as to detect climatic changes [7]. To facilitate work with EO data, Australian researchers [8] have provided the open-source Open Data Cube (ODC) [9], which is deployed and widely used by communities in different countries, including Armenia [10]. Nevertheless, the ODC communities still face the challenge of processing Big EO data, which requires high-performance computing (HPC) resources. For instance, the Sentinel-2 satellite [11] provides approximately 200-300 GB, 3 TB, and 36 TB of daily, monthly, and annual data, respectively, for the territory of Armenia. Handling this amount of data is a complex task; therefore, HPC is the right choice for improving data processing performance using distributed computing techniques.
Thus, the Big EO data processing challenge is addressed using the open-source Apache Spark [12] and Dask [13] frameworks, which can process data in parallel by dividing them into chunks, processing the chunks in a distributed way on computational clusters, and aggregating the results. Both frameworks have a master-slave architecture, in which slave nodes are worker nodes executing functions in parallel and the master node is the driver or scheduler that manages them. The Spark ecosystem supports many projects in data streaming, SQL analytics, and machine learning. Spark is a multi-language engine for processing and analyzing data, while Dask is a Python library; consequently, Spark has its own ecosystem of APIs and memory models, whereas Dask relies on those of the Python ecosystem. These frameworks therefore have differences and limitations that matter when looking for an optimal solution for EO data processing workflows.

Besides the use of HPC, the format of EO satellite images also has a crucial influence on performance. Data compression techniques can reduce storage usage and the number of I/O operations, improving processing performance. Recent studies [14, 15] show that compression methods combined with HPC can significantly enhance the performance of Big Data workflows. One of the optimal satellite image formats is Cloud Optimized GeoTIFF (COG) [16], which provides essential advantages compared to traditional formats such as NetCDF [17]. The COG format supports HTTP range requests for extracting only a part of the data. Hence, when extracting EO data from a COG, there is no need to download the entire image and then extract the area of interest, as is the case with the NetCDF format. In addition to this benefit, both the COG and NetCDF formats support data compression.

Several studies [18, 19, 20] evaluate and compare the performance of the frameworks for particular cases, such as data-intensive neuroimaging pipelines [18], different applications of molecular dynamics [20], and scientific image analytics [19]. Nevertheless, they did not consider performance-tuning techniques such as data compression. The main objective of this article is to investigate the efficacy of widely used distributed data processing frameworks, such as Dask and Spark, in combination with lossless data compression methods, in order to enhance the performance of EO data processing. The approach was evaluated on the Armenian hybrid research computing platform, and the results can be used by EO communities to make informed decisions about improving their data processing performance.

2. Methodology

A test-bed platform for EO data processing has been deployed to execute EO data processing functions and compare their performance in Spark and Dask. The platform is a container-based solution on top of Kubernetes, which relies on the computational resources of the Armenian hybrid research computing platform [21] and enables the evaluation and comparison of the two environments' performance. Fig. 1 shows the architecture of the experimental platform.

Fig. 1. Test-bed platform based on Spark and Dask.

As the figure shows, each scheduler/driver or worker/executor node corresponds to a pod in Kubernetes with fixed computational resources, and the resource characteristics of the nodes can be configured through the Kubernetes API. A Jupyter Notebook [22] serves as the front end of the Spark and Dask cluster back end: it connects to the Dask and Spark master nodes, configures the environments by specifying the number of worker nodes and the computational resources of each node, submits EO data processing requests to the Dask and Spark clusters, and visualizes the output. The Dask and Spark clusters fetch data from the repositories of either the local Armenian DataCube [23] or global EO data providers. The Armenian DataCube [10] provides data from the Landsat 5, 7, and 8 [24] and Sentinel-2 satellites, and one of the global EO data providers is the Sentinel-2 Cloud-Optimized GeoTIFFs repository [25].
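To give a sense of how such an environment could be configured from the notebook front end, the sketch below starts a Dask cluster and a Spark session on Kubernetes. It is a minimal sketch rather than the platform's actual configuration: it assumes the dask-kubernetes (classic API) and pyspark packages are available, and the container images, API-server address, and resource figures are illustrative placeholders.

```python
# Minimal sketch: starting Dask and Spark clusters on Kubernetes from a notebook.
# Assumes dask, dask-kubernetes (classic API; newer releases use the operator-based
# KubeCluster), and pyspark; image names, the API-server URL, and resource sizes
# below are illustrative placeholders only.
from dask.distributed import Client
from dask_kubernetes import KubeCluster, make_pod_spec
from pyspark.sql import SparkSession

# --- Dask: one pod per worker, scaled to the desired number of workers ---
pod_spec = make_pod_spec(image="daskdev/dask:latest",    # placeholder worker image
                         memory_limit="8G", cpu_limit=2)
dask_cluster = KubeCluster(pod_spec)
dask_cluster.scale(32)                                   # e.g., 32 worker pods
dask_client = Client(dask_cluster)                       # notebook now drives the cluster

# --- Spark: driver in the notebook pod, executors as Kubernetes pods ---
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")  # placeholder API-server address
    .appName("eo-ndvi-benchmark")
    .config("spark.executor.instances", "32")            # number of executor pods
    .config("spark.executor.memory", "8g")
    .config("spark.kubernetes.container.image", "example/spark-eo:latest")  # placeholder
    .getOrCreate()
)
```

Both clusters expose the same knobs the experiments vary: the number of workers/executors and the per-node resources, so the notebook can reconfigure the environment between runs without redeploying the platform.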
The two frameworks differ considerably in how easily existing EO code can be moved to them. Dask is a flexible Python library, which makes it easy to migrate previously written Python code and execute it in a distributed manner. Moreover, Python is widely used in EO data workflows, and various libraries provide vital tools that make working with EO data easier. Working with EO data in Spark, however, is less straightforward: previously written Python code cannot be executed in the Spark environment as-is, because Spark supports only the APIs of its own ecosystem, so code adjustment is inevitable. The GeoPySpark library [26] makes working with EO data in Spark somewhat easier. Given the limitations and the complexity of using Spark, the data processing function can be easily parallelized only in Dask.

As the EO data processing application, the Normalized Difference Vegetation Index (NDVI) [27] was evaluated during the experiments; it provides information for monitoring the health of vegetation. The index is defined in (1):

NDVI = (NIR - RED) / (NIR + RED),   (1)

where RED is the red band and NIR is the near-infrared band. All bands and the calculation result are matrices (images), and the NDVI is calculated from Sentinel-2 satellite images.

Several experiments were conducted with different parameters to evaluate the performance of Dask and Spark using the developed experimental platform. Table 1 presents all parameters and their values.

Table 1: Experimental parameters and their values.

  Parameter name        Possible values
  Environment           Dask and Spark
  Input data sizes      16, 32, 64 GB
  Number of workers     4, 8, 16, 32
  Applications          NDVI
  Compression methods   None, Deflate, LZW, Packbits, and Zstandard
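To illustrate how the NDVI computation in (1) maps onto chunked, distributed execution, the sketch below opens the red and near-infrared bands as Dask-backed arrays and evaluates the index lazily. It is a minimal sketch assuming the rioxarray library and a running Dask client (such as the one created above); the band file names and chunk sizes are hypothetical placeholders rather than the actual repository layout.

```python
# Minimal sketch of the NDVI computation from (1) on Dask-chunked Sentinel-2 bands.
# Assumes rioxarray/xarray with a Dask backend; the band file names and chunk
# sizes are illustrative placeholders.
import rioxarray

chunks = {"x": 2048, "y": 2048}                      # each chunk becomes a Dask task
red = rioxarray.open_rasterio("B04.tif", chunks=chunks).astype("float32")  # red band
nir = rioxarray.open_rasterio("B08.tif", chunks=chunks).astype("float32")  # near-infrared band

ndvi = (nir - red) / (nir + red)                     # lazy, element-wise task graph
ndvi_result = ndvi.compute()                         # executed in parallel on the workers
```

Because the bands are opened lazily, only the chunks actually touched are transferred and decompressed, which is where the choice of compression method influences the end-to-end time.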
3. Experimental Results

Data compression techniques reduce the actual size of the data, saving storage space, providing faster network transmission, and improving processing performance. EO data repositories that provide satellite images in COG format, such as the Sentinel-2 COGs repository, use the Deflate compression method by default to reduce the download time of satellite images and save some storage space. Besides Deflate, several other compression methods, either lossy or lossless, can be applied to COGs. The accuracy of the satellite image is essential, as the spatial resolution of a Sentinel-2 image is 10 m [10], which corresponds to the surface area on the ground represented by each pixel. Therefore, the compression methods used for optimization should be lossless to ensure accurate results. The COG format supports several lossless compression methods, such as Deflate [28], LZW [29], Packbits [30], and Zstandard [31].

To estimate the average compression ratio of each method, EO band tiles of three different sizes (light, medium, and heavy) were considered. The light band tiles (coastal, water vapor, etc.) are usually up to 5-10 MB in size, the medium tiles (short-wave infrared (SWIR), vegetation red edge, etc.) 50-70 MB, and the heavy tiles (RED, NIR, etc.) 200-250 MB. All supported lossless compression methods were applied to each tile class. The compression ratio is calculated for each method by dividing the compressed data size by the original uncompressed data size. The compression ratios for the various methods are presented in Fig. 2.

Fig. 2. Compression ratio of Deflate, LZW, Packbits, and Zstandard methods for light, medium, and heavy tiles.

The figure shows that the best compression factor is provided by the Zstandard method, whereas the worst is provided by the Packbits method. The Zstandard codec compresses the band images more than Deflate, which is used by default by the Sentinel-2 COGs repository. Therefore, using Zstandard instead of Deflate leads to more storage savings, less network transfer time, and fewer I/O operations: the storage reduction is 34 % compared with the uncompressed data and 16 % compared with Deflate. The compression ratio of the Packbits method for the heavy tiles is close to 1, which means that the method is useless for data size reduction, since the compressed size is essentially the same as the original. Besides storage savings, the subsequent data processing is also essential, as stronger compression needs more CPU time to decompress the data into memory before processing. The majority of the time spent in computing the NDVI is devoted to transferring satellite images over the network and loading them into memory rather than to performing calculations on the CPU.

The comparison of the performance of Dask and Spark for different input data sizes and compression methods, using 32 worker nodes, is shown in Fig. 3. The execution time for a COG tile compressed with the Packbits method is almost the same as for an uncompressed tile: since Packbits provides weak compression, it also uses little CPU time for decompression. For both environments, the worst-performing compression method is Deflate, whereas the best is Zstandard. Hence, the best compression method for satellite images in COG format is Zstandard, as it provides the highest compression ratio and an optimal memory loading time. The performance improvement of Zstandard over the uncompressed mode comes from the reduced network transfer time. Zstandard provides, on average, 2.15 and 1.82 times faster execution than the uncompressed mode, and approximately 4.72 and 3.99 times faster execution than the Deflate method selected by default by global satellite image repositories, for the Dask and Spark environments, respectively.

Comparing Dask and Spark themselves, for the Deflate compression method used by default by EO repositories, Spark and Dask show similar execution times, with Spark slightly faster. For the Dask environment, the LZW compression method performs better than Deflate but worse than either no compression or Zstandard; Spark does not support LZW at all. With uncompressed data, Dask is faster than Spark for the 16 GB input, whereas for the 32 GB and 64 GB inputs Spark is faster. Overall, Dask combined with the Zstandard compression method is the optimal choice.

Fig. 3. Comparison of Dask and Spark for 16, 32, and 64 GB of input data and different compression methods.
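As a sketch of how a band tile might be rewritten with Zstandard compression and how the compression ratio used above could be measured, the snippet below copies a GeoTIFF with rasterio using the ZSTD creation option. It assumes rasterio on top of a GDAL build with ZSTD support; the file paths and tile sizes are illustrative placeholders.

```python
# Minimal sketch: rewrite a band tile as a tiled GeoTIFF compressed with Zstandard
# and estimate the compression ratio (compressed size / original size).
# Assumes rasterio with a GDAL build that supports ZSTD; file names are placeholders.
import os
import rasterio

src_path, dst_path = "B08_uncompressed.tif", "B08_zstd.tif"   # illustrative paths

with rasterio.open(src_path) as src:
    profile = src.profile.copy()
    profile.update(
        compress="zstd",        # GDAL GTiff creation option COMPRESS=ZSTD
        tiled=True,             # internal tiling, as used by the COG layout
        blockxsize=512,
        blockysize=512,
    )
    data = src.read()           # read all bands into memory

with rasterio.open(dst_path, "w", **profile) as dst:
    dst.write(data)

ratio = os.path.getsize(dst_path) / os.path.getsize(src_path)
print(f"compression ratio (compressed/original): {ratio:.2f}")
```

A fully compliant COG additionally needs internal overviews; tools such as GDAL's COG driver or rio-cogeo can take care of that step, with the same compression option applied.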
4. Discussion

The study showed that different data compression methods can reduce storage requirements and network transfer time to different extents. Moreover, processing compressed data with different methods in distributed environments such as Spark and Dask resulted in different execution times, with some compression methods outperforming the processing of uncompressed data. The study aims to determine the data compression method that best balances performance and storage savings in the chosen distributed processing environments. The evaluation shows that the combination of Dask and Zstandard is the best choice of environment and compression method for EO satellite images: it provides the highest compression factor and the best performance among the supported compression methods.

The Armenian DataCube was initially set up with a limited storage capacity of 2 terabytes. To manage this, only the bands essential for the specific EO applications that researchers are interested in during a particular period are downloaded and stored. If the storage capacity is exceeded, the options are to scale vertically or to add external storage. The Zstandard compression technique was used in the experiments to save 34 % of storage, which allows more data to be stored in the allocated DataCube space.

The Zstandard compression method combined with the Dask environment offers benefits such as improved storage efficiency and reduced EO data processing time. However, additional steps are required to achieve these benefits, such as converting the analysis-ready data from the DataCube to the Cloud Optimized GeoTIFF format and compressing them with the Zstandard method. Although this may increase the total time of downloading and preprocessing, it brings benefits such as faster processing and storage savings. Moreover, this method of storing compressed data can be applied to other types of EO data repositories and DataCubes.

In conclusion, data compression methods can effectively reduce the amount of EO data stored and improve processing performance. Among the available compression methods, Zstandard exhibits the best performance and storage efficiency for EO data. Additionally, the Dask environment speeds up distributed processing.

5. Conclusion

The study evaluates the performance of EO data processing in Dask and Spark, taking compression methods into account. Experimental results show that Dask and Spark provide similar data processing performance. The combination of Dask and the Zstandard compression method is optimal, as this compression method provides the best compression factor of all the lossless compression methods considered. It reduces the amount of used storage by 16 % and speeds up execution by 4.72x and 3.99x in Dask and Spark, respectively, compared with the Deflate method used by default by the EO data repositories. In future work, it is planned to store the data in the Armenian DataCube compressed with the Zstandard method and to use the Dask environment for data processing.

References

[1] O. R. Young and M. Onoda, "Satellite Earth Observations in Environmental Problem-Solving", in Satellite Earth Observations and Their Impact on Society and Policy, pp. 3-27, 2017.

[2] D. A. Chu and Y. J. Kaufman, "Global monitoring of air pollution over land from the Earth Observing System-Terra Moderate Resolution Imaging Spectroradiometer (MODIS)", Journal of Geophysical Research Atmospheres, vol. 108, no. 21, November 2003.
[3] R. S. dos Santos, "Estimating spatio-temporal air temperature in London (UK) using machine learning and earth observation satellite data", International Journal of Applied Earth Observation and Geoinformation, vol. 88, June 2020.

[4] T. Krishnamurti and A. Chakraborty, "Impact of Arabian Sea pollution on the Bay of Bengal winter monsoon rains", Journal of Geophysical Research, vol. 114, March 2009.

[5] R. DeFries and F. Achard, "Earth observations for estimating greenhouse gas emissions from deforestation in developing countries", Environmental Science & Policy, vol. 10, no. 4, pp. 385-394, June 2007.

[6] Y. J. Kaufman and C. Ichoku, "Fire and smoke observed from the Earth Observing System MODIS instrument: products, validation, and operational use", International Journal of Remote Sensing, vol. 24, no. 8, pp. 1765-1781, November 2010.

[7] H. D. Guo and L. Zhang, "Earth observation big data for climate change research", Advances in Climate Change Research, vol. 6, no. 2, pp. 108-117, June 2015.

[8] A. Lewis, S. Oliver and L. Lymburner, "The Australian Geoscience Data Cube: Foundations and lessons learned", Remote Sensing of Environment, vol. 202, pp. 276-292, 2017.

[9] Open Data Cube, [Online]. Available: https://www.opendatacube.org/

[10] S. Asmaryan and V. Muradyan, "Paving the Way towards an Armenian Data Cube", Data, vol. 4, no. 1, 2019.

[11] M. Drusch and U. D. Bello, "Sentinel-2: ESA's Optical High-Resolution Mission for GMES Operational Services", Remote Sensing of Environment, vol. 120, pp. 25-36, May 2012.

[12] X. Meng, "MLlib: Machine Learning in Apache Spark", The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1235-1241, 2016.

[13] M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling", Proceedings of the 14th Python in Science Conference, vol. 130, 2015.

[14] H. Astsatryan and A. Kocharyan, "Performance Optimization System for Hadoop and Spark Frameworks", Cybernetics and Information Technologies, vol. 20, no. 6, pp. 5-17, 2020.

[15] H. Astsatryan and A. Lalayan, "Performance-efficient Recommendation and Prediction Service for Big Data Frameworks Focusing on Data Compression and In-memory Data Storage Indicators", Scalable Computing: Practice and Experience, vol. 22, no. 4, pp. 401-412, 2021.

[16] Cloud Optimized GeoTIFF, [Online]. Available: https://www.cogeo.org/

[17] J. Li, "Parallel netCDF: A High-Performance Scientific I/O Interface", Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, 2003.

[18] M. Dugré and V. Hayot-Sasson, "A Performance Comparison of Dask and Apache Spark for Data-Intensive Neuroimaging Pipelines", 2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS), pp. 40-49, 2019.

[19] P. Mehta and S. Dorkenwald, "Comparative evaluation of big-data systems on scientific image analytics workloads", Proceedings of the VLDB Endowment, vol. 10, no. 11, pp. 1226-1237, 2017.

[20] I. Paraskevakos and A. Luckow, "Task-parallel Analysis of Molecular Dynamics Trajectories", ICPP 2018: Proceedings of the 47th International Conference on Parallel Processing, no. 49, pp. 1-10, 2018.

[21] Y. Shoukourian and V. Sahakyan, "E-Infrastructures in Armenia: Virtual research environments", Ninth International Conference on Computer Science and Information Technologies, Revised Selected Papers, pp. 1-7, 2013.
[22] B. M. Randles and I. V. Pasquetto, "Using the Jupyter Notebook as a Tool for Open Science: An Empirical Study", 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 1-2, 2017.

[23] Armenian DataCube, [Online]. Available: http://datacube.sci.am/

[24] M. A. Wulder and T. R. Loveland, "Current status of Landsat program, science, and applications", Remote Sensing of Environment, vol. 225, pp. 127-147, 2019.

[25] Sentinel-2 Cloud-Optimized GeoTIFFs, [Online]. Available: https://registry.opendata.aws/sentinel-2-l2a-cogs

[26] J. Guo and C. Huang, "A Scalable Computing Resources System for Remote Sensing Big Data Processing Using GeoPySpark Based on Spark on K8s", Remote Sensing, vol. 14, no. 3, 2022.

[27] N. Pettorelli and J. O. Vik, "Using the satellite-derived NDVI to assess ecological responses to environmental change", Trends in Ecology & Evolution, vol. 20, no. 9, pp. 503-510, 2005.

[28] S. Oswal and A. Singh, "Deflate compression algorithm", International Journal of Engineering Research and General Science, vol. 4, no. 1, 2016.

[29] M. J. Knieser and F. G. Wolff, "A technique for high ratio LZW compression [logic test vector compression]", Automation and Test in Europe Conference and Exhibition, pp. 116-121, 2003.

[30] G. Feng and C. A. Bouman, "Efficient document rendering with enhanced run length encoding", Color Imaging XI: Processing, Hardcopy, and Applications, January 2006.

[31] Y. Collet and M. Kucherawy, "Zstandard Compression and the 'application/zstd' Media Type", RFC Editor, USA, February 2021.