Performance Optimization in Large-Scale ETL Workloads: Advanced Techniques in Distributed Computing

Authors

  • Prabhu Muthusamy, Cognizant Technology Solutions, Canada
  • Arun Ayilliath Keezhadath, Amazon Web Services, USA
  • Ravi Kumar Burila, JPMorgan Chase & Co, USA

Keywords:

ETL optimization, distributed computing, performance tuning

Abstract

Performance optimization of large-scale Extract, Transform, Load (ETL) pipelines is a crucial challenge for companies, especially in data-intensive industries such as finance and consulting. This paper explores advanced distributed computing techniques for improving the efficiency and scalability of ETL pipelines, drawing on real-world implementations in Apache Spark, Presto, and AWS Glue.

Los Angeles Journal of Intelligent Systems and Pattern Recognition

Vol. 2 - 2022

Published

11-11-2022

How to Cite

[1]
Prabhu Muthusamy, Arun Ayilliath Keezhadath, and Ravi Kumar Burila, “Performance Optimization in Large-Scale ETL Workloads: Advanced Techniques in Distributed Computing”, Los Angeles J Intell Syst Pattern Rec, vol. 2, pp. 113–147, Nov. 2022, Accessed: Mar. 07, 2026. [Online]. Available: https://lajispr.org/index.php/publication/article/view/12