Reinforcement Learning Scheduler for Sub-250ms Microservice Chains
Keywords:
reinforcement learning, microservice orchestration, latency optimization, cloud infrastructure, PPO, autoscaling
Abstract
This paper presents an RL-based scheduler designed to improve latency and cost-effectiveness in tail-latency-limited microservice architectures. In the proposed approach, the controller for container replicas and API gateway throttling depends on end-to-end request latency, per-node queue depths, and real-time AWS spot market pricing. The RL scheduler is applied to a high-throughput cloud wallet stack, where it improves dynamic resource scaling.
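As an illustration of how the inputs and control knobs named above could be laid out, the following is a minimal sketch and not the paper's implementation. The three observation fields and the two action fields follow the abstract, and the 250 ms budget follows the title. The class names, the reward shape, the weights, and the replica bounds are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

LATENCY_BUDGET_MS = 250.0  # taken from the title's sub-250 ms target


@dataclass
class Observation:
    """Signals the abstract says the controller depends on."""
    e2e_latency_ms: float            # end-to-end request latency
    queue_depths: List[float]        # one entry per node
    spot_price_usd_per_hour: float   # current AWS spot price per replica


@dataclass
class Action:
    """The two control knobs named in the abstract (field names are hypothetical)."""
    replica_delta: int         # change in container replica count
    throttle_fraction: float   # share of requests the gateway sheds, 0.0 to 1.0


def reward(obs: Observation, replicas: int,
           latency_weight: float = 1.0, cost_weight: float = 0.1) -> float:
    """Assumed reward: penalise latency above the budget and spot spend."""
    overshoot = max(0.0, obs.e2e_latency_ms - LATENCY_BUDGET_MS) / LATENCY_BUDGET_MS
    cost = replicas * obs.spot_price_usd_per_hour
    return -(latency_weight * overshoot + cost_weight * cost)


def apply_action(replicas: int, action: Action,
                 min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Apply the replica change, clamped to assumed lower and upper bounds."""
    return max(min_replicas, min(max_replicas, replicas + action.replica_delta))
```

Under these assumptions, a request path at 200 ms incurs only the cost penalty, while one at 500 ms adds a latency penalty of 1.0 before weighting. A policy such as PPO (listed in the keywords) would be trained to maximise this kind of signal.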