Introduction: Dubbo is a distributed microservice framework on which many companies have built their distributed system architectures. Since the project returned to active open-source development, we have seen the release of the forward-looking Dubbo 3.0 roadmap, and we have seen Alibaba begin to merge Dubbo with its internal HSF framework in its own e-commerce business and start using Dubbo 3.0 during Double 11. This article is based on a talk by Industrial and Commercial Bank of China (ICBC) on building a financial microservice architecture on Dubbo. It mainly covers ICBC's strategies and results in service discovery. Later articles will cover ICBC's practice in large-scale service monitoring and governance, and how an enterprise can approach secondary development of Dubbo. Stay tuned.
Author | Zhang Yuanzheng
Source | Alibaba Cloud official account
Background and Overview
ICBC's traditional business systems were generally monolithic architectures based on JEE. Faced with financial business moving online and becoming more diverse, the traditional architecture could no longer meet business needs. Starting in 2014, ICBC therefore selected one business system as a pilot for service-oriented transformation, and verified, evaluated, and compared several distributed service frameworks available at the time. It finally chose Dubbo, which was relatively mature and already used by many companies in China. ICBC also customized Dubbo for enterprise use to help this business system complete its service-oriented rollout, and the results after going live were very good.
In 2015, ICBC began to expand the scope of its service-oriented architecture. On the one hand, it helped traditional business systems transform their architectures; on the other, it gradually built up a large-scale service cluster, similar to a middle platform, that lets business systems compose and reuse services quickly. As experience accumulated, ICBC kept iterating on and optimizing Dubbo and customizing it for the enterprise, and gradually built a complete service ecosystem around its services.
In 2019, ICBC's microservice system was formally upgraded to one of the key capabilities of the bank's open-platform core banking system, helping ICBC's IT architecture achieve a truly distributed transformation.
ICBC's microservice system is made up of the following components:
- Infrastructure: both the service nodes of business systems and the working nodes of the microservice platform itself are deployed on ICBC's cloud platform.
- Service registration and discovery: in addition to the regular service registry, a metadata center is deployed alongside it to support service registration and discovery by node.
- Service configuration: an external distributed configuration center provides unified management and distribution of all kinds of dynamic parameters.
- Service monitoring: runtime metrics of services are collected and stored in a unified way and fed into the enterprise monitoring platform.
- Service tracing: the full call chain of a service is traced in real time, helping business systems locate the point of failure quickly and assess the scope of a fault accurately.
- Service gateway: to let traditional business systems access services, the gateway builds on Dubbo's service subscription and RPC capabilities and adds automatic discovery of new services and new versions, automatic subscription, and protocol conversion (HTTP to RPC), with 7×24 uninterrupted operation.
- Service governance platform: a one-stop platform for management, monitoring, and query, provided to operations staff and to development and test staff, which improves the efficiency of day-to-day service governance.
The biggest challenges
After years of practice, ICBC has summarized its two biggest challenges:
- Performance and capacity: the number of online services (that is, service interfaces in Dubbo terms) has exceeded 20,000, and the number of provider entries in each registry (that is, the sum of all providers across all services) has exceeded 700,000. According to ICBC's assessment, the platform will need to support on the order of 100,000 services and 5 million provider entries per registry.
- High availability: ICBC's goal is that the failure of any node of the microservice platform must not affect online transactions. The bank's business systems run 7×24, and even within a release window the go-live times of the business systems are staggered. The question is how to upgrade the platform itself, and in particular the registry, without affecting online transactions.
This article starts with service discovery and shares ICBC's strategies and results.
Service discovery difficulties and optimization
In Dubbo, service registration, subscription, and invocation follow a standard pattern. A service provider registers its services when it initializes. A service consumer subscribes to the services it needs when it initializes and receives the full list of providers. At runtime, whenever the providers of a service change, consumers receive the latest provider list. Consumers and providers make RPC calls point to point, and the call path does not go through the registry.
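The register, subscribe, and notify pattern described above can be sketched with a toy in-memory registry. This is only an illustration: `ToyRegistry`, `ToyConsumer`, the service name, and the addresses are hypothetical and are not Dubbo or ZooKeeper APIs.

```python
from collections import defaultdict


class ToyRegistry:
    """In-memory stand-in for a service registry (not ZooKeeper, not Dubbo)."""

    def __init__(self):
        self._providers = defaultdict(list)  # service name -> provider addresses
        self._listeners = defaultdict(list)  # service name -> consumer callbacks

    def register(self, service, address):
        """Provider side: announce an address, then notify every subscriber."""
        self._providers[service].append(address)
        self._notify(service)

    def unregister(self, service, address):
        """Provider side: withdraw an address, then notify every subscriber."""
        self._providers[service].remove(address)
        self._notify(service)

    def subscribe(self, service, callback):
        """Consumer side: get the full list now and again on every change."""
        self._listeners[service].append(callback)
        callback(list(self._providers[service]))

    def _notify(self, service):
        for callback in self._listeners[service]:
            callback(list(self._providers[service]))


class ToyConsumer:
    def __init__(self, registry, service):
        self.providers = []
        registry.subscribe(service, self._on_change)

    def _on_change(self, providers):
        self.providers = providers  # refresh the local provider cache

    def pick(self):
        # The call itself would go straight to this address, not via the registry.
        return self.providers[0] if self.providers else None


registry = ToyRegistry()
registry.register("DemoService", "10.0.0.1:20880")
consumer = ToyConsumer(registry, "DemoService")
registry.register("DemoService", "10.0.0.2:20880")
print(consumer.providers)  # ['10.0.0.1:20880', '10.0.0.2:20880']
```

Note that every change pushes the full provider list to each subscriber, which is the behaviour the later sections set out to tame.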
For the registry, ICBC chose ZooKeeper in 2014. ZooKeeper is used at scale in many scenarios across the industry, supports clustered deployment, and guarantees data consistency between nodes through its CP model.
Inside ZooKeeper, Dubbo creates a separate node for each service, and each service node has four child nodes, providers, consumers, configurators, and routers:
- providers (ephemeral nodes): records the list of service providers. When a provider goes offline its child node is deleted automatically, and through ZooKeeper's watch mechanism consumers learn immediately that the provider list has changed.
- consumers (ephemeral nodes): records the list of consumers, mainly so that consumers can be queried during service governance.
- configurators (persistent nodes): mainly stores the service parameters that need to be adjusted during service governance.
- routers: its child nodes are persistent nodes, mainly used to configure dynamic routing policies for the service.
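To make the node layout concrete, here is a small sketch that builds the paths for one service. It assumes the root path `/dubbo` and the category name `configurators` used by Dubbo's ZooKeeper registry; the service name and provider URL are made-up examples, and the helper functions are hypothetical.

```python
from urllib.parse import quote

ROOT = "/dubbo"  # assumed root path for this sketch
CATEGORIES = ("providers", "consumers", "configurators", "routers")


def category_paths(service):
    """Return the four child paths that hang under one service node."""
    return {c: f"{ROOT}/{service}/{c}" for c in CATEGORIES}


def provider_path(service, provider_url):
    """Path of one provider entry: the URL-encoded provider URL is the node name."""
    base = category_paths(service)["providers"]
    return f"{base}/{quote(provider_url, safe='')}"


paths = category_paths("com.example.DemoService")
print(paths["providers"])  # /dubbo/com.example.DemoService/providers
print(provider_path("com.example.DemoService",
                    "dubbo://10.0.0.1:20880/com.example.DemoService"))
```

Because every provider of every service gets its own entry, the number of nodes grows with services multiplied by instances, which matters for the capacity figures discussed in this article.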
In production, multiple ZooKeeper clusters are deployed per data center. Each cluster has 5 voting nodes and several Observer nodes. Observer is a node type introduced in ZooKeeper 3.3.3: it does not take part in voting and only learns the outcome of votes, and in every other respect it behaves like a Follower. Observer nodes bring the following benefits:
- Offloading network pressure: as the number of service nodes grows, if every client connected to the voting nodes, those nodes would spend a great deal of CPU handling network connections and requests. Voting nodes cannot be scaled out arbitrarily, because the more voting nodes there are, the longer the transaction voting process takes, which hurts highly concurrent write performance.
- Less cross-city, cross-DC registration and subscription traffic: when 100 consumers need to subscribe to the same service across cities, an Observer can handle this cross-city traffic in one place and relieve pressure on inter-city network bandwidth.
- Client isolation: a few Observer nodes can be dedicated to a key application to isolate its network traffic.
2. Problem analysis
From its painful experience of running ZooKeeper in production over the past few years, ICBC has summarized the following problems with ZooKeeper as a service registry:
- As the number of services and provider nodes grows, the volume of data pushed for services grows explosively. For example, suppose a service has 100 providers. While the providers start up, because of ZooKeeper's CP characteristics, every time a provider comes online the consumers receive an event notification, read the current full provider list of the service from ZooKeeper, and refresh their local cache. In theory, each consumer receives 100 event notifications and reads the provider list from ZooKeeper 100 times, for 1+2+3+...+100 = 5,050 provider records in total. The problem is most pronounced during peak release periods for business systems: it can easily saturate the network of the ZooKeeper cluster, make service subscription extremely inefficient, and in turn degrade service registration performance.
- As the number of nodes stored in ZooKeeper grows, ZooKeeper's snapshot files keep getting larger, and every write of a snapshot to disk causes a spike in disk I/O. During peak release periods the transaction volume is high, so snapshots are written very frequently, which poses a significant risk to the infrastructure. A larger snapshot file also means a longer recovery time after a ZooKeeper node fails.
- After the ZooKeeper voting nodes elect a new leader, Observer nodes must synchronize the full set of transactions from the new Leader. If this stage takes too long, the sessions of clients connected to the Observer nodes can easily time out, and all ephemeral nodes under the corresponding providers nodes are deleted. From the registry's point of view these services have gone offline, and consumers report no-provider errors. The providers then reconnect to ZooKeeper and re-register their services. Such a mass re-registration of services within a short time often causes even more serious performance problems in registration pushes.
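The "5,050 records" figure in the first problem above can be reproduced with a few lines of arithmetic. The function name is hypothetical; the model simply assumes that providers come online one at a time and that the consumer re-reads the whole list after each event.

```python
def records_read_during_startup(provider_count):
    """Return (events, records) seen by one consumer while providers start one by one."""
    events = 0
    records = 0
    online = 0
    for _ in range(provider_count):
        online += 1        # one more provider registers
        events += 1        # the consumer is notified of the change
        records += online  # and re-reads the whole current list
    return events, records


print(records_read_during_startup(100))  # (100, 5050)
```

The record count equals n(n+1)/2, so it grows quadratically with the number of providers, and that cost is paid by every consumer of the service.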
In summary, ZooKeeper is on the whole a competent registry, but it needs further optimization for larger-scale service scenarios.
3. Optimization plan
ICBC's main optimization measures are: delayed subscription updates, the multiple registry mode, and the move to registration by node.
1) Delayed subscription updates
ICBC optimized zkclient, the ZooKeeper client component, so that a consumer waits for a short delay after receiving an event notification before fetching the provider list.
When zkclient receives the one-time childchange event, installWatch() restores the watch on the node through the EventThread and at the same time uses getChildren() to read all child nodes under the node, obtaining the provider list and refreshing the local provider cache. This is the root cause of the "5,050 records" problem described earlier.
ICBC added a wait after zkclient receives the childchange event, and only then lets installWatch() do its usual work. If the providers change during the wait, no further childchange event is produced.
Some may ask whether this violates ZooKeeper's CP model. It does not: the data on the ZooKeeper server side is still strongly consistent, and the consumer has still received the event notification. It simply reads the provider list later, and when getChildren() eventually runs, it reads the latest data in ZooKeeper, so there is no problem.
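The effect of the delay can be illustrated with a small simulation. This is a sketch and not zkclient code: `simulate`, its parameters, the 100 ms spacing between provider start-ups, and the rule that a change arriving exactly at the re-arm moment is absorbed are all assumptions made for illustration. Times are integer milliseconds to keep the arithmetic exact.

```python
def simulate(change_times, delay):
    """Return (events, records_read) for one consumer.

    change_times: sorted times (ms) at which one more provider registers.
    delay: how long the consumer waits after an event before it re-arms the
           watch and reads the list; 0 reproduces the undelayed behaviour.
    """
    events = 0
    records = 0
    rearm_at = float("-inf")  # the one-shot watch is armed again after this time
    for t in change_times:
        if t <= rearm_at:
            continue          # watch not re-armed yet: no event for this change
        events += 1
        rearm_at = t + delay  # wait, then re-arm the watch and read the list
        records += sum(1 for c in change_times if c <= rearm_at)
    return events, records


startup = list(range(0, 10_000, 100))  # 100 providers, one every 100 ms
print(simulate(startup, delay=0))      # (100, 5050)
print(simulate(startup, delay=1_000))  # (10, 595)
```

In this toy run a 1-second delay cuts both the events and the records read to roughly a tenth, and the final read still returns all 100 providers, so the consumer's cache ends up complete. The real-world ratio depends on how tightly provider start-ups are clustered.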
Internal stress tests show that when providers go online at scale, each consumer received a total of 4.22 million provider-node records before the optimization; with a 1-second delay this dropped to 260,000. The number of childchange events and the network traffic both fell to about 5% of their previous levels. With this optimization in place, ICBC can comfortably handle large numbers of services going online and offline during peak release periods.
2) Multiple registry mode
ICBC adopted and optimized the registry-multiple SPI implementation from newer Dubbo versions to optimize service subscription in multi-registry scenarios.
The original processing logic of a Dubbo consumer is as follows: when several registries exist, the consumer filters providers using the invoker cache that corresponds to each registry, and if it finds no cache for the first registry it looks in the cache for the second. If the first registry has an availability problem at that moment and the data it pushes to consumers is incomplete or even empty, the consumer's filtering is affected, leading to no-provider exceptions, unbalanced call load, and similar problems.
The multiple registry instead merges the data pushed by all registries and then updates the cache. Even if a single registry fails and pushes incomplete or empty data, the final merged data is unaffected as long as any other registry holds complete data.
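The difference between the two behaviours can be sketched as follows. This is a simplified model, not Dubbo's implementation: both function names are hypothetical, a push is modelled as a plain list of addresses, and `None` stands for a registry that has no cache entry for the service.

```python
def pick_from_first_cache(pushes):
    """Original behaviour as described above: use the first registry that has a cache.

    pushes: one provider list per registry, in order; None means no cache entry.
    """
    for providers in pushes:
        if providers is not None:
            return list(providers)
    return []


def merge_all(pushes):
    """multiple-style behaviour: order-preserving union of everything pushed."""
    merged = []
    for providers in pushes:
        for p in providers or []:
            if p not in merged:
                merged.append(p)
    return merged


healthy = ["10.0.0.1:20880", "10.0.0.2:20880", "10.0.0.3:20880"]
degraded = ["10.0.0.1:20880"]  # the first registry pushes incomplete data

print(pick_from_first_cache([degraded, healthy]))  # ['10.0.0.1:20880']
print(merge_all([degraded, healthy]))              # all three providers
```

With the first-cache logic all traffic lands on the single provider that the degraded registry still reports, and an empty push would leave the consumer with no provider at all. The merged view is complete as long as one registry is healthy.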
The multiple registry mechanism is also useful with heterogeneous registries: if a problem occurs, a registry can be taken offline at any time, and the process is completely transparent to service calls on the service nodes, which makes it well suited to grayscale pilots and emergency switchover.
There is a further benefit. Reference objects on the consumer side take up a fair amount of JVM memory, and the multiple registry mode can save the consumer half of the invoker object overhead. The multiple mode is therefore highly recommended.
3) Registration by node
ICBC backported the service discovery logic of Dubbo 2.7 and Dubbo 3.0 and adopted the "registration by node" model for service registration and discovery. This model rests on an "iron triangle" of configuration center, metadata center, and registry:
- Configuration center: mainly stores node-level dynamic parameters, together with the data that services used to write to the persistent configurators and routers nodes in ZooKeeper.
- Metadata center: stores node metadata, that is, the mapping between each service node name (the application-name) and the services it provides, plus the class definition of each service, such as the input and output parameters of each method.
- Registry: in this model the registry only needs to store the mapping between service provider node names and their actual ip:port addresses.
This change of model has no effect on the consumer's service calls. The consumer combines the mapping between service node names and services from the metadata center with the mapping between service node names and actual ip:port addresses from the registry, and generates a service provider invoker cache that is compatible with the existing model.
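The consumer-side join described above can be sketched with two dictionaries. The data, the application and service names, and the function names are all hypothetical; real metadata also carries class definitions, which are omitted here.

```python
# Metadata center: service node name (application name) -> services it exposes.
node_services = {
    "order-app": ["OrderService", "RefundService"],
    "user-app": ["UserService"],
}

# Registry: service node name -> ip:port of each running instance.
node_addresses = {
    "order-app": ["10.0.0.1:20880", "10.0.0.2:20880"],
    "user-app": ["10.0.0.3:20880"],
}


def build_invoker_cache(node_services, node_addresses):
    """Join the two mappings into the familiar service -> provider addresses view."""
    cache = {}
    for app, services in node_services.items():
        for service in services:
            cache.setdefault(service, []).extend(node_addresses.get(app, []))
    return cache


def entries_by_service(node_services, node_addresses):
    """Registry entries needed if every service of every instance is registered."""
    return sum(len(s) * len(node_addresses.get(app, []))
               for app, s in node_services.items())


def entries_by_node(node_addresses):
    """Registry entries needed when only node name -> ip:port is stored."""
    return sum(len(a) for a in node_addresses.values())


cache = build_invoker_cache(node_services, node_addresses)
print(cache["RefundService"])  # ['10.0.0.1:20880', '10.0.0.2:20880']
print(entries_by_service(node_services, node_addresses), entries_by_node(node_addresses))  # 5 3
```

The saving in registry entries grows with the number of services each node exposes, since a node contributes one entry per instance regardless of how many interfaces it publishes.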
Stress tests show that registration by node reduces the amount of data in the registry to 1.68% of the original. This puts no pressure on the ZooKeeper clusters currently in production, which can easily support on the order of 100,000 services and 100,000 nodes.
In the future, ICBC also hopes to have the opportunity to take a deeper part in the community and contribute the good features it has built on Dubbo, the ZooKeeper server, and zkclient. Beyond the optimizations above, for example, ICBC has also added fine-grained identification of RPC results, PaaS adaptation, multiple protocols on the same port, and self-isolation to Dubbo, has added a registration circuit-breaker mechanism to ZooKeeper, and is studying an Observer synchronization mechanism that avoids the problems caused by full data synchronization.
In addition, looking at how microservices are evolving, service mesh is already one of the current hot topics. ICBC's main pain point is upgrading the service SDK version. Istio has not lived up to expectations and MCP is still an open question, so how to move the existing Dubbo services smoothly to a mesh architecture remains unresolved. There is a preliminary plan, but many technical hurdles are still to be overcome.
Anyone with hands-on Dubbo experience is welcome to discuss the problems and lessons of large-scale scenarios with us, and to work together to make Dubbo better!
Link to the original article: https://developer.aliyun.com/article/779740?