Authors | Huang Tao, Wang Menghai    Source | Alibaba Cloud official account
As a further evolution of cloud computing, cloud native is becoming the new technical standard of the cloud era. By reshaping the entire software life cycle, it has become the shortest path to realizing the value of the cloud.
Within the enterprise, the cloud has become internal infrastructure. At the same time, it brings compatibility problems caused by integrating a variety of underlying platforms. The larger the enterprise and the more history it has accumulated, the more obvious this “technical debt” becomes.
The experience shared in this article comes from Alibaba's practice of hybrid scheduling over the past few years and has been proven in production. The article starts with the basics and goes progressively deeper into the scheduler, explaining how the scheduler of ASI (Alibaba Serverless infrastructure), the unified infrastructure Alibaba designed for cloud native applications, manages Alibaba's complex and busy resource scheduling workload at large container scale. Concrete cases are used along the way. We hope this offers design ideas and a practical reference to readers facing similar problems, and gives a systematic picture of Alibaba's hybrid resource scheduling in complex task scenarios.
Within Alibaba Group, ASI is leading the effort to move containerized applications fully onto the cloud. It is responsible for the evolution toward a lightweight container architecture and for making the operations system cloud native, and it accelerates the adoption of emerging technologies such as Mesh, Serverless, and FaaS inside the group. It supports the internal businesses of almost the entire Alibaba economy, including Taobao, Tmall, Youku, Amap (Gaode), Ele.me, UC, and Kaola, as well as many Alibaba Cloud product scenarios and ecosystems.
The core of ASI is based on Kubernetes, and it provides complete cloud native technology stack support. Today, ASI has also successfully joined forces with Alibaba Cloud Container Service for Kubernetes (ACK). ACK retains all of its capabilities on the cloud while also being able to handle the complex business environment of Alibaba Group.
The ASI scheduler is one of the core components of ASI. Throughout ASI's cloud native evolution, the scheduler has played a crucial role. The most intuitive way to see this: for Alibaba's huge fleet of online e-commerce transaction containers (shopping cart, orders, Taobao item detail, and so on), the placement of every container, including container orchestration and per-machine CPU and memory resources, is allocated and scheduled by the scheduler. During the midnight peak of Double 11 in particular, even a few container orchestration errors can have a fatal impact on the business. The scheduler is responsible for the quality of the computing power each container gets at peak, so its importance is easy to imagine.
The ASI scheduler originated in 2015 with container scheduling for online e-commerce transactions. The first scheduler that year covered only the online transaction T4 (Alibaba's early custom container technology based on LXC and the Linux kernel) and Alidocker scenarios. It was born with a mission, and it played its part in carrying the Double 11 traffic peak in 2015.
The evolution of the ASI scheduler has accompanied the whole process of cloud native development. It went through the earliest online transaction container scheduler, the Sigma scheduler, the Cerebulum scheduler, and the ASI scheduler. The next-generation scheduler we are building now, Unified-Scheduler, will further absorb and integrate the advanced experience Alibaba has accumulated over the past few years in ODPS (Fuxi), Hippo search, and online scheduling. The scheduler at each stage is explained in the figure below:
The evolution of the ASI scheduler has had to solve many challenges, mainly:
- The scheduler handles many kinds of tasks: massive numbers of long-lived online containers and Pod instances, Batch jobs, and various forms of BestEffort tasks, each at a different SLO level. There are also compute-intensive, storage-intensive, network-intensive, heterogeneous, and other resource types. The demands and scenarios of different tasks differ.
- The host resources being scheduled vary widely. The scheduler manages a huge number of hosts within Alibaba Group, including many models of existing non-cloud physical machines, X-Dragon (Shenlong) servers on the cloud, ECS, and heterogeneous models such as GPU and FPGA.
- The scheduler serves a wide range of scenarios, for example: the most typical pan-transaction scenarios; the most complex middleware scenarios; many emerging computing scenarios such as FaaS, Serverless, Mesh, and Job; new ecosystem scenarios such as Ele.me, Kaola, and Shenma; multi-tenant security isolation on the public cloud; and the very challenging unified scheduling of ODPS (Fuxi), Hippo, Ant, and ASI.
- There are many responsibilities at the infrastructure layer. The scheduler is partly responsible for defining infrastructure machine models, integrating compute, storage, and network resources, converging hardware forms, and making the infrastructure transparent to users.
For the detailed history of Alibaba's cloud native development, interested readers can refer to the article 《A world changing “The box”》. Below, we focus on how the ASI scheduler manages computing resource scheduling at Alibaba's scale, complexity, and load.
A first look at the scheduler
1. What is a scheduler
The scheduler plays a central role among ASI's many components. It is one of the core components of the scheduling system of a cloud native container platform, and it is the cornerstone of resource delivery. The scheduler is the brain of ASI cloud native. Its value is mainly reflected in:
- Powerful, scenario-rich resource delivery (compute, storage)
- Cost-optimal resource delivery
- Optimal stability for the business at runtime
Put more plainly, what the scheduler has to do is:
- Optimal scheduling of a single job: choose the most suitable host in the cluster and, on that host, run the user-submitted computing job with the best use of resources and the least mutual interference (such as CPU allocation and I/O contention).
- Optimal global scheduling of the cluster: keep the global resource layout optimal (for example, minimal fragmentation), keep resources running as stably as possible, and keep the global cost optimal.
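The first duty above, choosing the most suitable host for one job, can be illustrated with a minimal filter-then-score sketch. This is not ASI's algorithm; the `Host` and `Job` types, the two resource dimensions, and the "tightest fit" scoring rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    cpu_free: int  # free CPU cores
    mem_free: int  # free memory in GiB


@dataclass
class Job:
    name: str
    cpu: int
    mem: int


def pick_host(job: Job, hosts: List[Host]) -> Optional[Host]:
    """Filter the hosts that can fit the job, then pick the tightest fit."""
    feasible = [h for h in hosts if h.cpu_free >= job.cpu and h.mem_free >= job.mem]
    if not feasible:
        return None
    # Smaller leftover after placement means less fragmentation left behind.
    return min(feasible, key=lambda h: (h.cpu_free - job.cpu, h.mem_free - job.mem))


hosts = [Host("a", 32, 128), Host("b", 8, 32), Host("c", 2, 8)]
chosen = pick_host(Job("cart", 4, 16), hosts)
print(chosen.name if chosen else "unschedulable")
```

A production scheduler would score on many more dimensions (interference, topology, cost), but the filter-then-score shape stays the same.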
In the ASI cloud native system, the position of the central scheduler is shown in the figure below (marked by the red box):
2. The scheduler in the broad sense
Most of the time, when the industry talks about a scheduler it means the “central scheduler”, for example kube-scheduler in community K8s. But real scheduling scenarios are complex, and every scheduling decision is the result of a complex and flexible collaboration. After a job is submitted, the central scheduler, single-machine scheduling, and kernel scheduling must coordinate across multiple levels, and further cooperate with K8s components such as kubelet and controllers to complete the scheduling. Besides online scheduling, there are also batch schedulers, and repeated rescheduling keeps the cluster in an optimal state.
In the broad sense, the ASI scheduler is an integrated whole made up of the central scheduler, single-machine scheduling, kernel scheduling, rescheduling, large-scale scheduling, and multi-layer scheduling.
1） Central scheduler
The central scheduler computes the resource layout for each job (or each batch of jobs) and ensures that this one-time scheduling decision is optimal. For a specific task it determines the cluster, the region, and the execution node (host), and further refines the allocation of CPU, storage, and network resources.
Working together with the components of the K8s ecosystem, the central scheduler manages the life cycle of most tasks.
In ASI's cloud native evolution, the central scheduler is the Sigma scheduler, Cerebulum scheduler, ASI scheduler, and so on described above.
2） Single machine scheduling
It has two main types of responsibility:
The first type: coordinating the optimal operation of multiple Pods within a single machine. After ASI receives the node selection decision from the central scheduler, it dispatches the task to the specific node for execution, and single-machine scheduling starts to work:
- Single-machine scheduling ensures, immediately, periodically, or in response to operations actions, that the multiple Pods on a machine run optimally. This means it coordinates resources within the machine, for example adjusting the CPU core allocation of each Pod.
- Based on real-time Pod runtime metrics such as load and QPS, it performs in-place VPA scaling of some runtime resources on the machine, or evicts low-priority tasks. For example, it can dynamically expand the CPU capacity of a Pod.
The second type: collecting, reporting, and aggregating single-machine resource information to provide a basis for the central scheduler's decisions. In ASI, the single-machine scheduling components are mainly SLO-Agent and some enhanced capabilities of Kubelet. In the Unified-Scheduler under construction, single-machine scheduling mainly refers to SLO-Agent, Task-Agent, and some enhanced capabilities of Kubelet.
3） Kernel scheduling
Single-machine scheduling coordinates the optimal operation of multiple Pods from the resource perspective, but the running state of a task is actually controlled by the kernel. This requires kernel scheduling.
4） Rescheduling
The central scheduler ensures that each task's placement is optimal at the moment it is scheduled, that is, it solves the one-time scheduling problem. But the central scheduler alone cannot achieve global optimality at the cluster level. This requires rescheduling.
5） Large scale scheduling
Large-scale scheduling is a scenario unique to Alibaba's large-scale online scheduling. Construction started in 2017; it is now very mature and is still being enhanced.
With large-scale orchestration, we can schedule tens of thousands or even hundreds of thousands of containers at once and guarantee a globally optimal layout of all containers at the cluster level in a single pass. This neatly makes up for the drawbacks of one-time central scheduling and avoids the complexity of repeated rescheduling in large-scale site-building scenarios.
Kernel scheduling, rescheduling, and large-scale scheduling are described in detail in later sections.
6） Scheduling hierarchy
Viewed from another dimension, we also define scheduling layers, including layer-1 scheduling, layer-2 scheduling, layer-3 scheduling, and so on. Sigma even introduced the concept of layer-0 scheduling in the online-offline colocation scenario. Each scheduling system understands and defines the scheduling layers differently and has its own concepts. For example, in the earlier Sigma system, scheduling was divided into layer-0, layer-1, and layer-2 scheduling:
- The layer-0 scheduler is responsible for the global resource view and its management, arbitrates between the layer-1 schedulers, and carries out the resulting decisions. Layer-1 scheduling mainly corresponds to the Sigma scheduler and the Fuxi scheduler (other schedulers can also be included).
- In the Sigma system, the Sigma scheduler acts as layer-1 scheduling and is responsible for resource-level allocation.
- Layer-2 scheduling is implemented by the individual onboarded businesses (for example e-commerce, the advertising system Captain, and the database AliDB). Layer-2 scheduling stays close to and fully understands its own business, builds scheduling capabilities from the perspective of business optimization, such as business-aware eviction and automated operations for failures of stateful applications, and provides attentive service.
The fatal drawback of Sigma's layered scheduling system is that the technical capability and investment of the layer-2 schedulers were uneven. For example, advertising's layer-2 scheduling system was excellent, but not every layer-2 scheduler served its business so attentively. ASI learned from this: it sank many capabilities down into ASI itself, further standardized the upper-layer PaaS, and simplified the upper layer while strengthening its capabilities.
The next-generation scheduler being built today is also layered, for example: the compute workload layer (mainly Workload scheduling management), the compute scheduling layer (such as DAG scheduling and MR scheduling), and the business layer (the same concept as Sigma's layer 2).
3. Scheduling resource types
Let me use the Unified-Scheduler being built to explain this. The Unified-Scheduler schedules three tiers of resources: Product resources, Batch resources, and BE computing resources.
Different schedulers define the resource tiers differently, but in essence they are much the same. To make this essence easier to grasp, the ASI scheduler's definitions are also explained in detail in later sections.
1）Product (online) resources
These resources have a Quota budget, and the scheduler must guarantee them the highest level of resource availability. The typical representatives are the long-lived Pod instances of core online e-commerce transactions. The classic example is the Pods of the core transaction businesses on the Double 11 core link, such as shopping cart (Cart2) and orders (tradeplatform2). These resources require strictly guaranteed computing power, high priority, real-time behavior, low response latency, and freedom from interference.
For example, the long-lived Pods of online transactions exist for a long time: days, months, or even years. Most application developers, once an application has been built, apply for several long-lived instances, and these are all Product resources. Developers from Taobao, Tmall, Juhuasuan, Amap, Umeng, Heyi, Cainiao, the international businesses, Xianyu, and many other businesses apply for Pod (or container) instances, and a considerable share of them are Product resources.
Product resources do not only mean long-lived online Pods; any resource request that meets the definition above is a Product resource. Conversely, not every long-lived Pod is a Product resource. For example, the Pods that Alibaba's internal “Aone Lab” uses to run CI build tasks can be long-lived, but they can be evicted and preempted at low cost.
2）Batch resources
The gap between Allocate and Usage for the Product resources used by online businesses is relatively stable over a period of time. This gap, together with unallocated Product resources, is treated as Batch resources and sold to businesses that are less sensitive to latency but still need a certain degree of resource stability. Batch has a quota budget, but only guarantees resource availability with a certain probability (for example 90%) over a period of time (for example 10 minutes).
In other words, Product (online) resource requests take the resources on the books, but judging from load and utilization metrics a lot of that computing power may go unused. This is where the scheduler's differentiated SLO tiered scheduling comes into play: the unused portion is treated as overcommitted resources, put to full use, and sold as Batch resources.
3）Best Effort (BE) resources
These resources have no Quota budget, their availability is not guaranteed, and they can be throttled and preempted at any time. When the Usage of resources already allocated on a node is below a certain level, the scheduler regards this gap as “unstable / unbooked” resources, and this gap is called BE resources.
An analogy: Product and Batch resources eat the big pieces of meat, and BE resources consume the leftovers that Product and Batch do not use. For example, in daily development work, developers need to run many unit test (UT) tasks. Such computing tasks do not demand high-quality computing resources, tolerate delay fairly well, and are hard to budget for. Buying large amounts of Product or Batch resources for this scenario would be very uneconomical, but using the cheapest BE resources brings considerable savings. Here, BE resources are the resources that Product and Batch leave unused at runtime.
It is easy to see that this tiered resource scheduling capability is what allows Unified-Scheduler, technically, to use the resources of a physical node to the fullest.
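To make the three tiers concrete, here is a small worked calculation for one node. It is a simplified sketch under stated assumptions: the function name, its parameters, and the numbers are hypothetical, and a real scheduler would discount the sellable Batch amount to honor its probabilistic availability guarantee rather than sell the full gap.

```python
def tier_capacity(capacity, prod_allocated, prod_usage, batch_usage):
    """Return how many cores each lower tier could use on one node (simplified)."""
    prod_unallocated = capacity - prod_allocated
    prod_gap = prod_allocated - prod_usage  # booked by Product but currently idle
    batch_sellable = prod_gap + prod_unallocated
    # BE only gets what Product and Batch are not actually using right now.
    be_available = max(0, batch_sellable - batch_usage)
    return {"batch_sellable": batch_sellable, "be_available": be_available}


print(tier_capacity(capacity=96, prod_allocated=64, prod_usage=24, batch_usage=30))
```

With these example numbers, 32 unallocated cores plus a 40-core Product gap give 72 cores that could be sold as Batch, and 42 cores remain for BE while Batch is using 30.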
Overview of scheduler capabilities
The figure below gives an overview of the scheduling capabilities ASI has built around the responsibilities of scheduling in the broad sense, the different resource tiers, and the rich business scenarios it serves. It shows the technology panorama of the ASI scheduler.
Typical online scheduling capabilities
1. Business demands of online scheduling
On the ASI cloud native container platform, the online side serves the scheduling scenarios of dozens of BUs, including transactions, shopping guides, live streaming, video, local services, Cainiao, Amap, Heyi, Umeng, and overseas businesses. Scheduling of the highest tier, “Product resources”, accounts for the largest share.
Compared with offline scheduling and the scheduling of large numbers of jobs, online business scheduling has typical differences (while reading about the online scenarios, keep in mind that the world of offline scheduling is just as rich).
1） Life cycle
- Long running: the containers of online applications generally live a long time, at least a few days and mostly months; some long-tail applications survive for years.
- Long startup time: application images are large and take a long time to download, and service startup needs memory warm-up and similar steps, so application startup takes anywhere from seconds to tens of minutes.
The long life cycle differs in essence from typical short-lived task scheduling (such as FaaS function computing), and the technical challenges behind it are also quite different. For short-lived function computing, the challenges are extreme scheduling efficiency, execution within about 100 milliseconds, high scheduling throughput, Pod runtime performance, and so on. For long-lived Pods, the distinctive challenges are that globally optimal scheduling must rely on rescheduling for continuous iterative optimization, and optimal runtime scheduling must rely on continuous single-machine rescheduling. In the pre-cloud era, many businesses could not be migrated at all, which was a nightmare for scheduling. It meant the scheduler faced not only the technical problem of scheduling capability but also the enormous difficulty of governing existing businesses. The long startup time of online applications further reduces the flexibility of rescheduling and adds complexity.
2） Container runtime
The container runtime needs to support real-time business interaction, fast response, and low business RT (response time). Most online containers serve real-time interactive systems and are extremely sensitive to latency; a small delay makes the user experience noticeably worse.
Resource characteristics are pronounced: network-intensive, I/O-intensive, compute-intensive, and so on. When instances with the same characteristics coexist, they easily compete for resources.
Because the runtime of online containers is so sensitive to business behavior and computing power, it poses a severe challenge to scheduling quality.
3） The complex business models unique to Alibaba's online applications
Traffic peaks and troughs: online services generally have obvious peaks. For example, Ele.me peaks at noon and in the evening, and Taobao also has obvious troughs and peaks.
Burst traffic: because of business complexity, burst traffic does not necessarily follow a pattern. For example, a live streaming service may see a traffic surge because of an unexpected event. The technical demand behind burst traffic is usually elasticity; the classic case is DingTalk's elastic scaling during the 2020 outbreak.
Resource redundancy: from the moment an online business is born, redundant resources are defined for it, mainly for disaster recovery. But looking at Alibaba as a whole, a considerable number of long-tail applications are insensitive to cost and utilization because of their small scale. Small amounts add up, and a huge amount of computing power is wasted behind them.
4） Unique large-scale operation and maintenance demands
Complex deployment models: for example, unitized application deployment, disaster recovery across multiple data centers, and the complex scheduling requirements of multi-environment deployment such as small-traffic, grayscale (canary), and production.
Peak scale of big promotions and flash sales: Alibaba runs promotions throughout the year, such as the familiar Double 11, Double 12, and Spring Festival red envelopes. The pressure on the whole link and its resource consumption multiply as peak traffic grows, which requires powerful large-scale scheduling.
Site building for big promotions: the timing of big promotions is planned, and to save on the cost of purchasing cloud resources, the time those resources are held must be as short as possible. The scheduler needs to complete site building before the promotion as quickly as possible and return the resources to Alibaba Cloud quickly afterward. This puts extremely demanding requirements on the efficiency of large-scale scheduling, so that more time is left for the business.
2. One-time scheduling: basic scheduling capabilities
The following table details the most common scheduling capabilities for online scheduling:
Basic application demands: the basic requirements of application scale-out, such as Pod specification and OS. In the ASI scheduler these are abstracted as ordinary label-matching scheduling.
Disaster recovery and spreading: locality scheduling. ASI obtains a lot of detailed topology information by various means, such as the network core and ASW shown in the figure above.
Advanced strategies: ASI standardizes and generalizes business demands as far as possible, but some businesses inevitably have specific requirements for resources and runtime, such as a specific infrastructure environment (for example hardware) or specific container capabilities (for example HostConfig parameters and kernel parameters).
Scheduling rule center: to meet business-specific strategy requirements, a powerful scheduling strategy center sits behind the scheduling decision and guides the scheduler to use the right scheduling rules. The data in the rule center comes from learning or from expert operations experience. The scheduler adopts these rules and applies them to the scale-out allocation of every Pod.
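The first two capabilities, label matching and disaster-recovery spreading, can be sketched together in a few lines. This is an illustrative sketch only: the label keys `os` and `asw`, the node names, and both function names are assumptions, not ASI's actual data model.

```python
from collections import Counter


def candidate_nodes(nodes, required):
    """Keep nodes whose labels contain every required key/value pair."""
    return [
        name for name, labels in nodes.items()
        if all(labels.get(k) == v for k, v in required.items())
    ]


def pick_spread(nodes, required, placed, domain_key="asw"):
    """Among matching nodes, prefer the failure domain holding the fewest replicas."""
    candidates = candidate_nodes(nodes, required)
    if not candidates:
        return None
    per_domain = Counter(nodes[n].get(domain_key) for n in placed)
    return min(candidates, key=lambda n: (per_domain[nodes[n].get(domain_key)], n))


nodes = {
    "n1": {"os": "linux", "asw": "asw-1"},
    "n2": {"os": "linux", "asw": "asw-1"},
    "n3": {"os": "linux", "asw": "asw-2"},
    "n4": {"os": "windows", "asw": "asw-3"},
}
# One replica already runs on n1, so the next one should avoid asw-1.
print(pick_spread(nodes, {"os": "linux"}, placed=["n1"]))
```

Here label matching removes `n4`, and spreading steers the second replica away from the switch that already hosts the first one.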
3. Inter-application orchestration strategy
Because the number of cluster nodes is limited, many applications that potentially interfere with each other have to coexist on the same node. Inter-application orchestration strategies are needed to keep every host node and every Pod running optimally.
In actual production scheduling, “business stability” always comes first, but resources are always limited, and it is hard to balance “optimal resource cost” against “business stability”. In most cases, inter-application orchestration strategies handle this balance well. By defining coexistence strategies between applications (CPU-intensive, network-intensive, I/O-intensive, peak model characteristics, and so on), applications are fully spread across the cluster, or are protected by sufficient policy constraints when they share a node, so that the probability of interference between Pods is minimized.
Furthermore, the scheduler applies additional technical means at runtime, such as network priority control and fine-grained CPU orchestration, to avoid potential runtime interference between applications as far as possible.
Inter-application orchestration also brings another challenge: besides building its own orchestration capability, the scheduler must fully understand the runtime characteristics of every business running on it.
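One simple way to express a coexistence strategy is a per-node cap on how many Pods of the same resource profile may share a host. The sketch below assumes this cap-based form; the profile names, the limits, and the function name are hypothetical and only illustrate the idea.

```python
# Hypothetical per-node caps for Pods that share the same resource profile.
DEFAULT_LIMITS = {"cpu-intensive": 2, "network-intensive": 1, "io-intensive": 1}


def can_colocate(profiles_on_node, new_profile, limits=DEFAULT_LIMITS):
    """Return True if a Pod with new_profile may join the Pods already on the node."""
    limit = limits.get(new_profile)
    if limit is None:
        return True  # no constraint is defined for this profile
    return profiles_on_node.count(new_profile) < limit


print(can_colocate(["cpu-intensive"], "cpu-intensive"))
print(can_colocate(["network-intensive"], "network-intensive"))
```

A scheduler would apply such a check as one more filter when choosing nodes, which is why it needs to know the runtime profile of each application.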
4. Fine-grained CPU orchestration
Fine-grained CPU orchestration is a very interesting topic in online scheduling. It includes CpuSet scheduling and CpuShare scheduling. In other scheduling domains, such as offline scheduling, it may matter less, or even seem incomprehensible. But in online transaction scenarios, theoretical reasoning, lab experiments, and countless stress tests all show how important precise CPU scheduling is.
In one sentence, fine-grained CPU orchestration means tuning core assignment so that CPU cores are used to the maximum and as stably as possible.
Fine-grained CPU orchestration is so important that ASI has spent the past few years pushing this practice to the extreme. After seeing the table below (CpuSet fine-grained scheduling only), you may well marvel at how far ASI has taken it.
Some background: take an x86 physical machine or X-Dragon server with 96 cores (strictly, 96 logical cores) as an example. It has 2 sockets, each socket has 48 logical cores (24 physical cores), and each physical core has 2 logical cores. (Of course, the ARM architecture differs from x86.)
Because of the L1, L2, and L3 cache design of the CPU architecture, the ideal allocation is this: of the 2 logical cores on the same physical core, one is assigned to a core online transaction application such as Carts2 (the shopping cart business), and the other is assigned to a non-core application that is not busy. Then, in daily operation or at the Double 11 midnight peak, Carts2 can take more of the physical core. This practice has been validated repeatedly in production and in stress-test drills.
If instead both logical cores on the same physical core are assigned to Carts2, then because the load peaks at the same time (especially within the same Pod instance), the maximum usable capacity is greatly reduced.
In theory we should also avoid letting two applications that are both core to transactions, such as Carts2 (shopping cart) and tradePlatform2 (orders), share the two logical cores of one physical core. But at the micro level, the peaks of Carts2 and tradePlatform2 do not coincide exactly, so the actual impact is small. Such a CPU allocation may look like “making do”, but physical resources are limited, so we can only make do.
On the NUMA-aware side, to make the most of the L3 cache and improve computing performance, the cores of the same Pod should also avoid spanning sockets as far as possible.
When CPUShare is used, how to set Request and Limit is also a subtle matter. When CPUSet and CPUShare coexist, scheduling becomes more complex (for example, when a CpuSet container is scaled out or taken offline, the CPUs of all Pods on the machine may need to be rescheduled). And in emerging GPU heterogeneous scheduling scenarios, the coexistence and allocation of CPU and GPU also take some skill.
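The two CpuSet rules described above (do not give one Pod both logical cores of a physical core, and keep a Pod inside one socket) can be sketched as follows. This is a toy allocator, not ASI's implementation: the `(socket, core, thread)` addressing and the `used` map are assumptions made for the example.

```python
def allocate_cpuset(n, sockets, cores_per_socket, used):
    """Pick n logical CPUs for one Pod.

    used maps (socket, core, thread) -> owning app for CPUs already taken.
    Policy: stay within a single socket, and take at most one of the two
    logical cores (threads 0 and 1) of each physical core.
    """
    if n <= 0:
        return []
    for s in range(sockets):
        picks = []
        for c in range(cores_per_socket):
            free_threads = [t for t in (0, 1) if (s, c, t) not in used]
            if free_threads:
                picks.append((s, c, free_threads[0]))
            if len(picks) == n:
                return picks
    return None  # no single socket can satisfy the request under this policy


used = {(0, 0, 0): "a", (0, 0, 1): "b", (0, 1, 0): "a"}
print(allocate_cpuset(3, sockets=2, cores_per_socket=4, used=used))
print(allocate_cpuset(4, sockets=2, cores_per_socket=4, used=used))
```

In the first call the Pod fits in socket 0 by taking the free sibling of core 1 plus cores 2 and 3; in the second, socket 0 cannot offer four distinct physical cores, so the whole Pod lands on socket 1 instead of straddling sockets.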
5. Large scale scheduling
Large-scale orchestration is mainly used for site building, site relocation, and large-scale migration, such as Alibaba's frequent site building for big promotions and the very large site relocations required by data center migration. For cost reasons, we need to create hundreds of thousands of Pods quickly, in the shortest possible time and with minimal labor cost.
Scheduling many tasks one after another, in arbitrary and unpredictable order, has many drawbacks for the central scheduler. Before large-scale orchestration existed, large-scale site building at Alibaba often went through a complicated cycle of “business self-service scale-out -> repeated rescheduling”, which took a lot of manpower and weeks of effort. With large-scale orchestration, we achieve hour-level delivery at scale while keeping the resource allocation rate above 99%.
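Why planning a whole batch beats scheduling Pods in arrival order can be seen in a tiny bin-packing sketch. The heuristic used here (largest request first, tightest node first) is a textbook stand-in chosen for illustration, not the algorithm ASI uses, and all names are hypothetical.

```python
def place_batch(requests, free):
    """Place a whole batch at once: largest requests first, tightest node first.

    requests maps pod -> cpu cores needed; free maps node -> free cpu cores.
    Returns (placement, unplaced).
    """
    free = dict(free)  # work on a copy so the caller's view is untouched
    placement, unplaced = {}, []
    for pod, cpu in sorted(requests.items(), key=lambda kv: -kv[1]):
        fits = [n for n, f in free.items() if f >= cpu]
        if not fits:
            unplaced.append(pod)
            continue
        node = min(fits, key=lambda n: free[n])
        free[node] -= cpu
        placement[pod] = node
    return placement, unplaced


placement, unplaced = place_batch({"a": 4, "b": 4, "c": 8}, {"n1": 8, "n2": 8})
print(placement, unplaced)
```

If the same three Pods arrived one by one as `a`, `b`, `c` and each was spread onto the emptiest node, `a` and `b` would occupy half of each node and `c` would no longer fit anywhere. Seeing the full batch lets the planner place `c` first and avoid that dead end.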
General scheduling capability
1. Rescheduling
The central scheduler achieves optimal one-time scheduling, but that is a completely different concept from the global optimum at the cluster level that we ultimately want. Rescheduling includes both global central rescheduling and single-machine rescheduling.
Why must central rescheduling compensate for one-time scheduling? A few examples:
- An ASI scheduling cluster holds large numbers of long-lived Pod instances. Over time, the cluster inevitably accumulates many resource fragments and uneven CPU utilization.
- Allocating large-core Pods requires dynamic “make room” scheduling (evicting some small-core Pods in real time to free up resources) or global rescheduling planned in advance, which pre-frees a number of large-core slots across many nodes.
- Resource supply is always tight. The one-time scheduling of a Pod may involve some “making do”, meaning some flaw or imperfection. But cluster resources change dynamically, so at some later point we can initiate a dynamic migration of that Pod, that is, rescheduling, which gives the business a better runtime experience.
The algorithms and implementation of central rescheduling are often very complex. We need to understand and fully cover the various rescheduling scenarios, clearly define the rescheduling DAG, and execute it dynamically while ensuring the execution success rate.
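The “make room” case for large-core Pods can be sketched as a small planning step: find the node where migrating the fewest smaller Pods frees enough CPU. This is a greedy illustration under assumed data shapes (`free` and `pods` per node); real central rescheduling must also weigh priorities, disruption budgets, and where the migrated Pods will go.

```python
def plan_make_room(nodes, need):
    """Find the node where migrating the fewest smaller Pods frees `need` cores.

    nodes maps node -> {"free": cores, "pods": {pod: cores}}.
    Returns (node, pods_to_migrate) or None if no node can be freed up.
    """
    best = None
    for name, info in nodes.items():
        free, victims = info["free"], []
        smaller = {p: c for p, c in info["pods"].items() if c < need}
        for pod, cpu in sorted(smaller.items(), key=lambda kv: -kv[1]):
            if free >= need:
                break
            victims.append(pod)
            free += cpu
        if free >= need and (best is None or len(victims) < len(best[1])):
            best = (name, victims)
    return best


nodes = {
    "n1": {"free": 2, "pods": {"p1": 2, "p2": 2, "p3": 4}},
    "n2": {"free": 6, "pods": {"p4": 4, "p5": 1}},
}
print(plan_make_room(nodes, need=8))
```

In this example `n2` wins because moving a single Pod is enough there, while `n1` would need two migrations.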
Many scenarios also require single-machine rescheduling, for example SLO optimization of fine-grained CPU orchestration and single-machine rescheduling optimization driven by OQS data.
It is important to stress that before executing single-machine rescheduling, the safety and risk-control problem must be solved first, to avoid an uncontrollable blast radius. Until risk control on the single-machine side is adequate, we recommend not using node autonomy for now and instead using centrally triggered execution under strict protection. In fact, within K8s there are many unavoidable node-autonomy scenarios (for example, when the Pod YAML changes, Kubelet applies the corresponding change). ASI has spent years sorting out every potential risk-control point and iteratively building the Defender system with tiered risk-control management (nuclear-button, high-risk, medium-risk, and so on). For potentially risky items, the node interacts with the central Defender before performing the single-machine action, and disasters are avoided through this security control. We recommend that a scheduler likewise have rigorous, tiered security controls in place before allowing nodes to operate autonomously.
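The idea of asking a central gate before acting locally can be sketched as a per-risk-level budget. The class below only borrows the name Defender from the text; its interface, the risk level names, and the budget mechanism are assumptions for illustration and do not describe the real Defender system.

```python
class Defender:
    """Toy central gate: approve node-side actions within a per-risk-level budget."""

    def __init__(self, limits):
        self.limits = limits  # risk level -> max approvals in the current window
        self.approved = {}    # risk level -> approvals granted so far

    def request(self, node, action, risk):
        """Return True if the node may perform the action, False otherwise."""
        used = self.approved.get(risk, 0)
        if used >= self.limits.get(risk, 0):
            return False  # unknown or exhausted risk levels are denied
        self.approved[risk] = used + 1
        return True


defender = Defender({"high": 1, "medium": 2})
print(defender.request("node-1", "evict-pod", "high"))
print(defender.request("node-2", "evict-pod", "high"))
```

Because every node must ask first, the number of simultaneous high-risk actions across the cluster stays bounded, which is what keeps the blast radius under control.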
2. Kernel scheduling
The background of kernel scheduling is this: on a busy host running many tasks in parallel, even though central scheduling and single-node scheduling have jointly ensured an optimal allocation of resources (such as CPU allocation and IO spreading), at actual run time it is inevitable that tasks compete for resources in kernel mode. The competition is especially fierce in the well-known online-offline colocation scenario. This requires central scheduling, single-node scheduling, and kernel scheduling to cooperate closely, for example by coordinating the resource priorities of tasks and handing them to the kernel to enforce.
This corresponds to many kernel isolation technologies. CPU: the BVT scheduling priority, the Noise Clean mechanism, and so on. Memory: memory reclaim, OOM priority, and so on. Network: gold/silver/bronze network priority, IO, and so on.
Today we also have secure containers. With the isolation between Guest Kernel and Host Kernel that secure containers provide, we can more elegantly avoid part of the contention in kernel run state.
3. Elastic scheduling and time-sharing scheduling
Elastic scheduling and time-sharing scheduling both aim at better resource reuse; they just work along different dimensions.
The ASI scheduler cooperates fully with the Alibaba Cloud infrastructure layer and uses the strong elasticity that ECS provides. In the Ele.me scenario, it returns resources to the cloud during off-peak periods and re-applies for them during peak periods.
We can use the elastic buffer built into ASI's large resource pool (note: the host resources of the ASI resource pool all come from Alibaba Cloud), or use the elasticity of Alibaba Cloud's IaaS layer directly. The balance between the two is a debated topic, and finding it is something of an art.
ASI's time-sharing scheduling pushes resource reuse to the extreme and brings huge cost savings. Every night, a large number of online trading pod instances are shut down to free resources for ODPS offline tasks; every morning the offline tasks are scaled back and the online applications are brought back up. This classic scenario maximizes the value of online-offline colocation technology.
The essence of time-sharing is resource reuse. It relies on building and managing a large resource pool, which combines resource operations with scheduling technology. This requires the scheduler to have accumulated a rich variety of job types and a large volume of tasks.
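The nightly hand-over can be illustrated with a toy capacity split. The function name, the 01:00 to 06:59 night window, and all core counts below are made-up assumptions; the article gives no concrete schedule or numbers.

```python
def split_capacity(total_cores, hour, online_peak_cores, online_night_cores,
                   night_hours=range(1, 7)):
    """Return how a shared pool is divided between online and offline work.

    During the (assumed) night window the online tier shrinks to
    online_night_cores and everything else is lent to offline tasks.
    """
    online = online_night_cores if hour in night_hours else online_peak_cores
    online = min(online, total_cores)  # never hand out more than the pool holds
    return {"online": online, "offline": total_cores - online}

# Hypothetical pool of 1000 cores.
print(split_capacity(1000, hour=3, online_peak_cores=800, online_night_cores=200))
print(split_capacity(1000, hour=12, online_peak_cores=800, online_night_cores=200))
```

In this toy setup the offline share grows from 200 to 800 cores at night without adding any machines, which is the resource reuse the section describes.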
4. Vertical scaling scheduling / X+1 / VPA / HPA
Vertical scaling scheduling is a second-level delivery technology that handles burst traffic very well. It is also the killer tool against the pressure risk of the midnight peak: by vertically adjusting the resources of existing pods, backed by accurate and reliable CPU scheduling and shuffling algorithms, computing resources are delivered within seconds. Vertical scaling scheduling and VPA share the same lineage, and vertical scaling scheduling is one of the scenarios of VPA.
"X+1" horizontal scale-out scheduling can in a sense be understood as one of the scenarios of HPA, except that it is triggered manually. "X+1" focuses on the ultimate efficiency of resource delivery, and behind it is a great improvement in R&D efficiency: an online pod may take "X" (some number of) minutes to start the application and begin serving traffic, and all operations other than application startup must finish within "1" minute.
Vertical scaling scheduling and "X+1" horizontal scale-out scheduling complement each other and together safeguard all kinds of traffic peaks.
ASI is also implementing more VPA and HPA scenarios. For example, through VPA we can provide extra computing power at no additional cost for Ant's Spring Festival red envelope campaign, which is a huge cost saving.
Pushing scheduling technologies such as VPA and HPA to their limit in more scenarios is also where Alibaba will keep striving for perfection.
5. Tiered [differentiated SLO] resource scheduling
Differentiated SLO scheduling is one of the essentials of the scheduler. This section overlaps somewhat with the chapter [Scheduling resource types]. Given the complexity of differentiated SLO, it is placed in the last section of this chapter.
The ASI scheduler defines SLO (service level objective), QoS, and Priority very precisely.
SLO describes the service level objective. ASI provides differentiated SLOs through different QoS and Priority settings, and different SLOs are priced differently. Users can decide, according to their business characteristics, which SLO-guaranteed resources to "subscribe" to. For example, offline data analysis tasks can use a lower SLO level and enjoy a lower price, while important business scenarios can choose a high SLO level at a higher price.
QoS describes the quality of the resource guarantee. The QoS classes defined by the K8s community are Guaranteed, Burstable, and BestEffort. The QoS defined in ASI does not map exactly onto the community's (the community derives it from Request / Limit). To describe the group's scenarios (such as CPUShare) clearly, ASI defines QoS along another dimension, with the classes LSE / LSR / LS / BE, which clearly delineate different resource guarantees. A business can choose a QoS according to how sensitive it is.
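For reference, the community mapping from Request / Limit to a QoS class can be sketched in a few lines of Python. This is a simplified illustration of the upstream Kubernetes rules, not ASI's LSE / LSR / LS / BE logic: it looks only at CPU and memory, compares quantities as literal strings (so "500m" and "0.5" would count as different), and ignores init containers.

```python
def qos_class(containers):
    """Classify a pod from its containers' resource settings.

    containers: list of dicts shaped like
        {"requests": {"cpu": "500m"}, "limits": {"cpu": "1", "memory": "1Gi"}}
    """
    resources = ("cpu", "memory")
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # Kubernetes defaults a missing request to the corresponding limit.
        requests = {**limits, **container.get("requests", {})}
        for resource in resources:
            if resource in requests or resource in limits:
                any_set = True
            # Guaranteed needs a limit for each resource, with request == limit.
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
print(qos_class([{"requests": {"cpu": "500m"}}]))
print(qos_class([{}]))
```

The three calls print Guaranteed, Burstable, and BestEffort respectively. A scheme such as LSE / LSR / LS / BE needs information beyond requests and limits, which is why the article notes that the two definitions do not map exactly onto each other.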
PriorityClass and QoS are concepts along two different dimensions. PriorityClass describes the importance of a task.
Resource allocation strategy and task importance (that is, QoS and PriorityClass) can be combined in different ways, although there needs to be a certain correspondence between them. For example, we can define a PriorityClass named Preemptible, most of whose tasks correspond to the BestEffort QoS.
Each scheduling system defines PriorityClass differently. For example:
- In ASI, the priorities currently defined are System, Production, and Preemptible. The details of each level are not covered here.
- In Search's Hippo, the categories are finer-grained, including System, ServiceHigh, ServiceMedium, ServiceLow, JobHigh, JobMedium, JobLow, and so on. The details of each level are not covered here.
Global optimal scheduling
1. Scheduling simulator
The scheduling simulator is somewhat similar to Alibaba's full-link stress testing system. By replaying real online traffic, or replaying simulated traffic, it verifies new scheduling capabilities in a simulated environment, and then continuously tempers the scheduling algorithms and optimizes various metrics.
Another common use of the scheduling simulator is to reproduce online problems offline, so that they can be located harmlessly and efficiently.
To a certain extent, the scheduling simulator is the foundation of global scheduling optimization. With it, we can repeatedly refine algorithms, technical frameworks, and technical pipelines in a simulated environment, and then optimize global metrics such as the global allocation rate, scheduling performance in different scenarios, and scheduling stability.
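A toy version of such a replay loop is sketched below. It replays a recorded list of pod requests against a set of nodes with a first-fit policy and reports the global allocation rate. The `replay` function, the first-fit policy, and the traces are illustrative assumptions; ASI's simulator and algorithms are not described at this level in the article.

```python
def replay(trace, node_capacities):
    """Replay (pod_name, cores) requests with first-fit placement.

    Returns (allocation_rate, unschedulable_pods). Assumes at least one
    node with non-zero capacity.
    """
    free = list(node_capacities)
    failed = []
    for pod, cores in trace:
        for index, available in enumerate(free):
            if available >= cores:
                free[index] -= cores  # place the pod on the first node that fits
                break
        else:
            failed.append(pod)  # no node had enough free cores
    total = sum(node_capacities)
    return (total - sum(free)) / total, failed

nodes = [8, 8]
# Two hypothetical traces requesting the same total of 16 cores.
print(replay([("a", 6), ("b", 6), ("c", 4)], nodes))
print(replay([("a", 6), ("b", 2), ("c", 4), ("d", 4)], nodes))
```

With the same 16 cores of capacity, the first trace leaves pod "c" unschedulable at a 75% allocation rate because the free cores are fragmented, while the second fills both nodes completely. Comparing such metrics across policies or traces is the kind of experiment a simulator makes cheap and safe.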
2. Elastic Scheduling Platform (ESP)
To achieve globally optimal scheduling, ASI built the Elastic Scheduling Platform (ESP) around the scheduler. It aims to be a one-stop, closed-loop scheduling-efficiency system that combines data-driven scheduling guidance, core scheduling capabilities, and productized scheduling operations.
In the past we built many similar modules, such as scheduling SLO inspection, numerous scheduling tools, and two-tier scheduling platforms for different scenarios. Based on ESP, we will build more two-tier scheduling capabilities, bring globally optimal scheduling quality to ASI, and deliver more attentive service to customers around business stability, resource cost, and user efficiency.
More scheduling capabilities
This article has tried to explain systematically the basic concepts, principles, and various scenarios of the ASI scheduler, and to lead you into the wonderful world of schedulers. Scheduling is a broad and deep subject. Unfortunately, limited by length, many points have not been developed in depth. There is much more to the scheduler that this article does not elaborate on, such as heterogeneous machine scheduling, scheduling profiling, fair scheduling, priority scheduling, transfer scheduling, preemptive scheduling, disk scheduling, Quota, CPU normalization, GANG Scheduling, scheduling tracing, and scheduling diagnosis. For the same reason, this article also does not cover deeper technical topics such as the structure and optimization of ASI's scheduling framework or scheduling performance optimization.
As early as 2019, ASI had optimized a single K8s cluster to an industry-leading scale on the order of ten thousand nodes. Benefiting from the powerful K8s operations system of Alibaba Cloud ACK, Alibaba Group runs a large number of large-scale computing clusters and has accumulated industry-leading production practice with multiple K8s clusters. Within these large-scale K8s clusters, ASI builds on mature container scheduling technology to keep providing computing resources and computing power for many complex task workloads.
Over the past few years, as the group moved fully onto the cloud, Alibaba Group has completed the overall migration and evolution of its scheduling domain from the ASI control plane to Alibaba Cloud Container Service ACK. Alibaba Group's complex, rich, and large-scale business scenarios will continue to export, strengthen, and temper cloud technology capabilities in the future.
This article was created by [Alibaba cloud native]. Please include a link to the original when reposting. Thank you.