Reinforcement Learning Scheduler for Sub-250ms Microservice Chains

Authors

  • Priya Dharshini Kalyanasundaram, Amazon, USA
  • Gnanendra Reddy Muthirevula, Tekvana Inc, USA
  • Vijay Kumar Soni, Discover Financial Services, USA

Keywords:

reinforcement learning, microservice orchestration, latency optimization, cloud infrastructure, PPO, autoscaling

Abstract

This paper presents an RL-based scheduler designed to improve latency and cost-effectiveness in tail-latency-constrained microservice architectures. In the proposed design, a controller for container replica counts and API gateway throttling acts on end-to-end request latency, per-node queue depths, and real-time AWS spot market pricing. In a high-throughput cloud wallet stack, the RL scheduler improves dynamic resource scaling.
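
To make the control loop described in the abstract concrete, the following is a minimal sketch of how the three input signals and the two control knobs might be represented, together with one possible reward. It is not the authors' implementation. Only the 250 ms budget (from the title) and the signal names (from the abstract) come from the source. The class names, the reward shape, the weights, and the replica bounds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

LATENCY_BUDGET_MS = 250.0  # from the paper's title; all other constants are assumed


@dataclass
class Observation:
    """Signals named in the abstract, sampled once per control interval."""
    e2e_latency_ms: float           # end-to-end request latency
    queue_depths: List[int]         # per-node queue depths
    spot_price_usd_per_hour: float  # current AWS spot price per instance


@dataclass
class Action:
    """Hypothetical control knobs for the scheduler."""
    replica_delta: int   # change in container replica count
    throttle_rps: float  # API gateway rate limit (requests per second)


def to_vector(obs: Observation) -> List[float]:
    """Flatten an observation into a feature vector for a policy network."""
    return [
        obs.e2e_latency_ms / LATENCY_BUDGET_MS,  # 1.0 means exactly on budget
        *[float(q) for q in obs.queue_depths],
        obs.spot_price_usd_per_hour,
    ]


def apply_action(replicas: int, action: Action,
                 min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Apply the replica delta, clamped to assumed cluster bounds."""
    return max(min_replicas, min(max_replicas, replicas + action.replica_delta))


def reward(obs: Observation, replicas: int,
           latency_weight: float = 1.0, cost_weight: float = 0.1) -> float:
    """Assumed reward: penalize latency-budget overshoot and hourly spend."""
    overshoot = max(0.0, obs.e2e_latency_ms - LATENCY_BUDGET_MS) / LATENCY_BUDGET_MS
    cost = replicas * obs.spot_price_usd_per_hour
    return -(latency_weight * overshoot + cost_weight * cost)
```

A policy trained with an algorithm such as PPO (listed in the keywords) would map the output of `to_vector` to an `Action`. The two weights express the latency/cost trade-off and would need tuning for any real workload.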

Published

06-05-2024

How to Cite

[1] Priya Dharshini Kalyanasundaram, Gnanendra Reddy Muthirevula, and Vijay Kumar Soni, “Reinforcement Learning Scheduler for Sub-250ms Microservice Chains”, Los Angeles J Intell Syst Pattern Rec, vol. 4, pp. 367–398, May 2024, Accessed: Mar. 07, 2026. [Online]. Available: https://lajispr.org/index.php/publication/article/view/77