Cloud computing has emerged as an important computing paradigm. One critical issue is guaranteeing service quality for end users. To address this problem, we design algorithms for cloud computing environments that improve the performance of user-submitted jobs. At present, we focus mainly on two job metrics: completion time and resource consumption.
Speculative Execution for a single job in a MapReduce-like cluster
Parallel processing plays an important role in large-scale data analytics. Frameworks such as MapReduce break a job into many small tasks that run in parallel on multiple machines. One fundamental challenge for such parallel processing is straggling tasks, which can seriously delay the completion of a job.
In this project, we focus on speculative execution, the technique used in the literature to deal with the straggler problem. We present a theoretical framework for optimizing a single job, which differs substantially from previous heuristics-based work. More precisely, we propose two schemes for the case where the number of parallel tasks in the job is smaller than the cluster size. The first scheme needs no monitoring and guarantees the job deadline with high probability while achieving the optimal resource consumption level. The second scheme monitors task progress and launches the optimal number of duplicates when a task straggles. For the case where the number of tasks in a job is larger than the cluster size, we propose an Enhanced Speculative Execution (ESE) algorithm that makes the optimal decision whenever a machine becomes available for scheduling. Simulation results show that the ESE algorithm can reduce the job flowtime by 50% while consuming fewer resources compared with the strategy without backup.
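To make the idea behind the monitoring-free scheme concrete, here is a minimal sketch of how one might pick the number of copies per task so that a job meets its deadline with high probability. It is an illustration under simplifying assumptions, not the scheme from the papers: task durations are assumed i.i.d. with a known Pareto distribution, all copies start at time zero, and the names `pareto_cdf`, `job_success_probability`, and `min_copies_for_deadline` are hypothetical.

```python
def pareto_cdf(t, scale=1.0, shape=2.0):
    """P(duration <= t) for a Pareto distribution (an assumed duration model)."""
    if t <= scale:
        return 0.0
    return 1.0 - (scale / t) ** shape


def job_success_probability(num_tasks, copies, deadline, cdf=pareto_cdf):
    """Probability that every task has at least one copy finishing by the deadline.

    Assumes all copies of all tasks are independent and start at time zero.
    """
    p_copy_late = 1.0 - cdf(deadline)
    # A task is late only if all of its copies are late.
    p_task_on_time = 1.0 - p_copy_late ** copies
    return p_task_on_time ** num_tasks


def min_copies_for_deadline(num_tasks, deadline, epsilon,
                            max_copies=64, cdf=pareto_cdf):
    """Smallest number of copies per task with P(job on time) >= 1 - epsilon.

    Returns None if no value up to max_copies meets the target.
    """
    for copies in range(1, max_copies + 1):
        if job_success_probability(num_tasks, copies, deadline, cdf) >= 1.0 - epsilon:
            return copies
    return None


if __name__ == "__main__":
    # 100 tasks, deadline of 5 time units, at most 1% chance of missing it.
    print(min_copies_for_deadline(num_tasks=100, deadline=5.0, epsilon=0.01))
```

Using the smallest feasible number of copies is only a rough proxy for low resource consumption in this toy model; the actual optimization in the project accounts for more than this sketch does.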
Online Job Scheduling
Jobs in modern computing clusters have highly diverse processing durations and heterogeneous resource requirements: small but latency-sensitive jobs coexist with large batch-processing applications that may take from hours to weeks to complete. To satisfy these job demands, a modern computing cluster often scales out to hundreds or even thousands of servers. To this end, various schedulers have been implemented and deployed by industry. However, these schedulers are mostly designed on the basis of heuristic arguments in order to achieve high scalability. Relatively little modeling or analytical effort has gone into characterizing and optimizing the performance trade-offs of the resulting designs in a systematic, theoretically rigorous manner.
In this research, we consider the problem of online job scheduling for a computing cluster comprising multiple servers with heterogeneous computation resources, while taking into account the diversity of resource demands across jobs. Our focus is to achieve a low overall job response time for the system (also referred to as the job flowtime) while providing fairness between small and large jobs.
Approximation jobs, which can deliver valuable results even when only some of their many tasks are executed, play an important role in today’s large-scale data analytics. This property can be exploited to maximize the system utility of a big data computing cluster by choosing the proper tasks to schedule for each approximation job. We model a cluster with heterogeneous machines as a multi-armed bandit in which each machine is treated as an arm. By estimating machine service rates while balancing the exploration-exploitation trade-off, we design an efficient online resource allocation algorithm from a bandit perspective.
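The bandit view of machines can be illustrated with a small simulation. The sketch below uses the textbook UCB1 rule as a stand-in; it is not the allocation algorithm from the papers. It assumes each machine completes an assigned task within one time slot with a fixed, unknown probability (its "service rate"), and the function name `ucb1_allocate` and the rates used are hypothetical.

```python
import math
import random


def ucb1_allocate(service_rates, rounds, seed=0):
    """Assign one task per round to a machine chosen by the UCB1 rule.

    service_rates: true per-slot completion probabilities, hidden from the
    scheduler and used only to simulate outcomes (an assumed reward model).
    Returns (pulls, successes) per machine.
    """
    rng = random.Random(seed)
    n = len(service_rates)
    pulls = [0] * n      # how many tasks each machine has received
    successes = [0] * n  # how many of those finished within the slot

    for t in range(1, rounds + 1):
        if t <= n:
            # Exploration bootstrap: try every machine once.
            machine = t - 1
        else:
            # Exploitation plus an optimism bonus for rarely used machines.
            machine = max(
                range(n),
                key=lambda i: successes[i] / pulls[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )
        completed = rng.random() < service_rates[machine]
        pulls[machine] += 1
        successes[machine] += int(completed)

    return pulls, successes


if __name__ == "__main__":
    pulls, successes = ucb1_allocate([0.2, 0.5, 0.9], rounds=2000)
    for i, (p, s) in enumerate(zip(pulls, successes)):
        print(f"machine {i}: tasks={p}, estimated rate={s / p:.2f}")
```

In this toy setting the scheduler gradually concentrates tasks on the fastest machine while still sampling slower ones occasionally, which is the exploration-exploitation balance described above.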
- Huanle Xu*, Huangting Wu*, Wing Cheong Lau, "Revisiting SRPT for Job Scheduling in Computing Clusters," the 14th International Conference on Queueing Theory and Network Applications (QTNA), August 2019.
- Yang Liu*, Huanle Xu*, Wing Cheong Lau, "Online Job Scheduling with Resource Packing on a Cluster of Heterogeneous Servers," IEEE Infocom, Apr 2019.
- Huanle Xu*, Yang Liu*, Wing Cheong Lau, Jun Guo, Alex X. Liu, "Efficient Online Resource Allocation in Heterogeneous Clusters with Machine Variability," IEEE Infocom, Apr 2019.
- Huanle Xu*, Gustavo De Veciana, Wing Cheong Lau, and Kunxiao Zhou. "Online Job Scheduling with Redundancy and Opportunistic Checkpointing: A Speedup-Function-Based Analysis." IEEE Transactions on Parallel and Distributed Systems 30, no. 4 (2018): 897-909.
- Huanle Xu*, Wing Cheong Lau, Zhibo Yang*, Gustavo de Veciana, Hanxu Hou, "Mitigating Service Variability in MapReduce Clusters via Task Cloning: A Competitive Analysis," in IEEE Transactions on Parallel and Distributed Systems, Vol.28, Issue 10, Oct 2017.
- Huanle Xu*, Gustavo de Veciana, and Wing Cheong Lau. "Addressing job processing variability through redundant execution and opportunistic checkpointing: A competitive analysis." In IEEE INFOCOM 2017-IEEE Conference on Computer Communications, pp. 1-9. IEEE, 2017.
- Huanle Xu*, Wing Cheong Lau, "Optimization for Speculative Execution in Big Data Processing Clusters," in IEEE Transactions on Parallel and Distributed Systems, Vol.28, Issue 2, Feb 2017.
- Huanle Xu*, Ronghai Yang*, Zhibo Yang*, and Wing Cheong Lau. "Solving Large Graph Problems in MapReduce-Like Frameworks via Optimized Parameter Configuration." In International Conference on Algorithms and Architectures for Parallel Processing, pp. 525-539. Springer, Cham, 2015.
- Huanle Xu*, and Wing Cheong Lau. "Task-cloning algorithms in a MapReduce cluster with competitive performance bounds." In 2015 IEEE 35th International Conference on Distributed Computing Systems, pp. 339-348. IEEE, 2015.
- Huanle Xu*, and Wing Cheong Lau. "Optimization for speculative execution in a MapReduce-like cluster." In 2015 IEEE Conference on Computer Communications (INFOCOM), pp. 1071-1079. IEEE, 2015.
- Huanle Xu*, Ronghai Yang*, Zhibo Yang* and Wing Cheong Lau, "Solving Graph Problems in MapReduce-like Frameworks via Optimized Parameter Configuration," The 15th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), Zhangjiajie, China, Nov. 2015.