Performance Optimization in Large-Scale ETL Workloads: Advanced Techniques in Distributed Computing
Keywords:
ETL optimization, distributed computing, performance tuning
Abstract
Performance optimization of large-scale Extract, Transform, Load (ETL) pipelines is a crucial challenge for companies, especially in data-intensive industries such as finance and consulting. This paper explores advanced distributed-computing techniques for improving the efficiency and scalability of ETL pipelines, drawing on real-world implementations in Apache Spark, Presto, and AWS Glue.
Los Angeles Journal of Intelligent Systems and Pattern Recognition
Vol. 2 - 2022