编程知识 cdmana.com

Case sharing | online fault caused by Dubbo 2.7.12 bug

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" This article has been included https://github.com/lkxiaolou/lkxiaolou welcome star. WeChat official account \" Master bug catcher \", Back end technology sharing , Architecture design 、 performance optimization 、 Source code reading 、 Troubleshoot problems 、 Step on the pit practice .","attrs":{}}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":" background ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Late one night recently , Just after taking a bath, I received a call from the business side , Talk about them dubbo The service is out of order , You want me to help you check .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The phone , Asked them what time ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Is it a damaged fault on the line ?—— yes ","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Did you stop the loss ?—— Stop loss ","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Is there a reserved scene ?—— No, ","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" So I turned on the computer , Even on VPN Look at the problem . For the sake of understanding , The architecture is simplified as follows ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c6/c61e50c9f785c762bcdd153daadd6243.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Just pay attention A、B、C Three services , The calls between them are dubbo call .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" In case of failure B The service has several machines completely rammed to death , Can't handle the request , The remaining normal machine requests surge , Time consuming increase , Here's the picture ( Figure 1 request volume 、 Figure 2 )","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/60/60480cc33ebdc296dfc0de649b300299.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/23/2354b5cd5860ea6980468efa9c9acf5e.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":" Troubleshoot problems ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" As the site has been damaged , You can only look at the monitoring and log first ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" monitor ","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" In addition to the above monitoring , Looking at the B service CPU And memory , The memory of several machines that found faults increased more , All have reached 80% Horizontal line of , And CPU Consumption also becomes more ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/1a/1aeef03289909f8d05c5dce409600b97.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" At this time, I doubt the memory problem , So I took a look JVM Of fullGC monitor ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4e/4e8818135e2afcffa8ed5ec23c14ae80.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Sure enough fullGC Time goes up a lot , Basically, it can be concluded that the service is unavailable due to a memory leak . But why is there a memory leak , There's no clue yet .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" journal ","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Apply for machine permission , Check the log , Found a very strange WARN journal ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"[dubbo-future-timeout-thread-1] WARN org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout\n(HashedWheelTimer.java:651) \n- [DUBBO] An exception was thrown by TimerTask., dubbo version: 2.7.12, current host: xxx.xxx.xxx.xxx\njava.util.concurrent.RejectedExecutionException: \nTask org.apache.dubbo.remoting.exchange.support.DefaultFuture$TimeoutCheckTask$$Lambda$674/[email protected] \nrejected from [email protected][Terminated, pool size = 0, \nactive threads = 0, queued tasks = 0, completed tasks = 21]\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" It can be seen that the business party uses ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2.7.12","attrs":{}},{"type":"text","text":" Version of dubbo","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Take this log dubbo Of github The warehouse searched , Found the following issue:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://github.com/apache/dubbo/issues/6820","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a8/a85504d7cdf2169b6f082cfdccbc8a78.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" But the problem was quickly ruled out , Because in 2.7.12 There is already fixed code in the version .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Continue to find these two issue:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://github.com/apache/dubbo/issues/8172","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://github.com/apache/dubbo/pull/8188","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Judging from the error report and version , It's completely in line , But there is no mention of memory problems , Forget the memory problem first , See if you can follow ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"#8188","attrs":{}},{"type":"text","text":" This issue Reappear ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/db/db20a20636bb7b77f3dee7a432df9cc8.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"issue It is also clear how to reproduce , So I took these three services to reproduce , It didn't reappear at first . Push back by fixing the code ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b5/b56f28e29b0b3fad44e8c85643074a4a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" There is a problem deleting the code part , But it's hard for us to get into this , How can I get in ?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Here is a feature On behalf of a request , Only when the request is not completed will it enter , That's easy , Give Way provider Never return , It can certainly be achieved , So in provider Add end test code ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"Thread.sleep(Integer.MAX_VALUE);\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" After the test, it reappeared , Such as issue said , When kill -9 Drop the first provider when , The overall situation of consumers ExecutorService Shut down , When kill -9 the second provider when ,SHARED_EXECUTOR Also closed .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" So what is this thread pool used for ?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" It's in ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"HashedWheelTimer","attrs":{}},{"type":"text","text":" Is used to detect consumer Whether the request timed out .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HashedWheelTimer yes dubbo Implementation of a time wheel to detect whether the request times out , The details will not be expanded here , You can write a detailed article another day dubbo Middle time wheel algorithm .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" When the request is sent , If you can return normally, it's ok , But if it exceeds the set timeout, it has not returned , You need the task of this thread pool to detect , Interrupt a task that has timed out .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The following code is to submit the task , When the thread pool is closed , When you submit a task, you throw an exception , Timeout cannot be detected .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"java"},"content":[{"type":"text","text":"public void expire() {\n if (!compareAndSetState(ST_INIT, ST_EXPIRED)) {\n return;\n }\n try {\n task.run(this);\n } catch (Throwable t) {\n if (logger.isWarnEnabled()) {\n logger.warn(\"An exception was thrown by \" + TimerTask.class.getSimpleName() + '.', t);\n }\n }\n}\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Here, I suddenly realized : If the request keeps sending , No timeout , Is it possible to burst the memory ? So I simulated it again , And opened 3 A thread has been requesting provider, Sure enough, the memory burst scene reappears , And when it doesn't trigger the problem , Memory is always stable at a low level .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7b/7b1338b84f531580d70be8ac6fe6de71.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" I used it here arthas Look at the memory changes , Very convenient ","attrs":{}}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":" Come to the conclusion ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" After local reproduction , So check with the business side , The recurrence of this problem is still relatively harsh , First of all ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" Asynchronous call ","attrs":{}},{"type":"text","text":", secondly ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"provider Abnormal offline is required ","attrs":{}},{"type":"text","text":", Last ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"provider There needs to be a blockage ","attrs":{}},{"type":"text","text":", That is, the request never returns .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" The asynchronous call is confirmed by the business party ,provider Abnormal offline , This is more common , This happens when the container drifts due to the failure of the physical machine , Last provider This has been confirmed by the business party , exactly C The service had a machine that froze near that point in time , Unable to process request , But the process is alive .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" So the question is dubbo 2.7.12 Of bug Lead to . Look at this bug yes 2.7.10 introduce , 2.7.13 Repair .","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":" replay ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" It's almost spent 1 Days to locate and reproduce , Fairly smooth , Good luck , No detours , But there are also some areas that need attention .","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" It's best to keep the scene while stopping the loss , If this time before restart dump Remove the memory or remove the flow to keep the machine on site , May help speed up locating problems . Such as configuration OOM Automatically dump Memory and other means . This is also the deficiency of this accident ","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Observability of services is very important , Whether it's a log 、 Monitoring or other , Everything should be complete . Basic, such as log 、 exit 、 Import request monitoring 、 Machine index ( Memory 、CPU、 Network, etc )、JVM monitor ( Thread pool 、GC etc. ). This is OK , There are basically everything that should be ","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" Open source products , You can search the network from the Key log , The problems you encounter with great probability have also been encountered by everyone . This is also the lucky point this time , A lot of detours ","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule","attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" WeChat official account \" Master bug catcher \", Back end technology sharing , Architecture design 、 performance optimization 、 Source code reading 、 Troubleshoot problems 、 Step on the pit practice .","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ab/ab569246c22ca2250ad5695e903299fb.jpeg","alt":null,"title":"","style":[{"key":"width","value":"25%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}

版权声明
本文为[InfoQ]所创,转载请带上原文链接,感谢
https://cdmana.com/2021/10/20211002145414077o.html

Scroll to Top