编程知识 cdmana.com

The love and hate between Tomcat and netty, my colleagues clapped their hands at this series of magic operations

 When Tomcat run into sb. Netty, My series of divine operations , My colleagues clapped their hands and exclaimed

The story background

di ~ di ~ di ~, Production accident , Memory leak !

Yesterday afternoon , Suddenly received the operation and maintenance message , The production environment memory leak of a system in the branch , Help to find out .

The investigation process is as follows :

First step , To log

The exception log given by the division is like this ( In view of the company's regulations, it is forbidden to take pictures and to transmit any information , The following is a similar error I found on the Internet ):

LEAK: ByteBuf.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.

Recent access records: 

#1: 
  io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:273)
  io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
  io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
  io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
	io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
	io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
	io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
	io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
	io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
	java.lang.Thread.run(Thread.java:748)

Here's a look , It's amazing ,ByteBuf There is no release , Causing a memory leak .

The second step , Look at memory metrics

Now that I know it's a memory leak , Let the operation and maintenance see the memory usage , Especially out of heap memory usage ( Because in the Netty), According to the operation and maintenance feedback , In heap memory usage is normal , Out of heap memory is high .

OK, At this point, it is clear that : Out of heap memory leak .

here , A two-step , One step is to put gateway Switch to zuul Piezometric observation , The first step is to check the memory leak .

The third step , To code

Ask the person in charge of this project to give me the code , I'll open it up , Silly eyes , Just a simple one Spring Cloud Gateway project , It also contains two classes , One is AuthFilter It's used to check the permissions , One is XssFilter To defend against attack .

Spring Cloud Gateway It uses Netty,zuul 1.x It uses Tomcat, This article comes from the source code of industrial waste tongge .

Step four , Preliminary doubt

Quickly scan the code of each class , stay XssFilter There's a trace inside ByteBuf Related code , however , It's not obvious ByteBuf There is no release of information , It's simple , Block this class first , See if there are any memory leaks .

however , How to detect whether there is a memory leak ? You can't delete this class , Run in production .

Step five , Parameter and monitoring transformation

Actually , It's simple , seen Netty Source code students , It should be clear ,Netty By default, it uses pooled direct memory ByteBuf, namely PooledDirectByteBuf, therefore , For debugging , First , To turn pooling off .

Direct memory , Out of heap memory .

Why turn off pooling ?

Because pooling is a cache of memory , It allocates 16M Memory is not immediately released , It is not convenient to observe after opening the pool , Unless you debug it slowly .

that , How to turn off pooling ?

stay Netty in , be-all ByteBuf It's all through a name called ByteBufAllocator To create , At the interface ByteBufAllocator There is a default allocator in , Find the default allocator , And find where it was created , You can see the relevant code .

public interface ByteBufAllocator {

    ByteBufAllocator DEFAULT = ByteBufUtil.DEFAULT_ALLOCATOR;
}
public final class ByteBufUtil {

    static final ByteBufAllocator DEFAULT_ALLOCATOR;

    static {
        //  This article comes from the source code of industrial waste tongge 
        String allocType = SystemPropertyUtil.get(
                "io.netty.allocator.type", PlatformDependent.isAndroid() ? "unpooled" : "pooled");
        allocType = allocType.toLowerCase(Locale.US).trim();

        ByteBufAllocator alloc;
        if ("unpooled".equals(allocType)) {
            alloc = UnpooledByteBufAllocator.DEFAULT;
            logger.debug("-Dio.netty.allocator.type: {}", allocType);
        } else if ("pooled".equals(allocType)) {
            alloc = PooledByteBufAllocator.DEFAULT;
            logger.debug("-Dio.netty.allocator.type: {}", allocType);
        } else {
            alloc = PooledByteBufAllocator.DEFAULT;
            logger.debug("-Dio.netty.allocator.type: pooled (unknown: {})", allocType);
        }

        DEFAULT_ALLOCATOR = alloc;
    }

}

You can see , It's through io.netty.allocator.type This parameter controls .

OK, stay JVM Add this parameter to the startup parameter , And assign it as unpooled.

 When Tomcat run into sb. Netty, My series of divine operations , My colleagues clapped their hands and exclaimed

After turning off pooling , We also need to be able to observe whether there is a real memory leak in real time , How to do this ?

Actually , This is very simple ,Netty Of PlatformDependent This class counts all direct memory usage .

I've been studying Netty Source code , therefore , I am right. Netty We know all kinds of details clearly , This article comes from the source code of industrial waste tongge , I'm still preparing for , When it's done later , Start Netty Column creation .

therefore , We just need to write a timer , Print out the statistics regularly , here , I'll just give the code :

@Component
public class Metrics {

    @PostConstruct
    public void init() {
        ScheduledExecutorService scheduledExecutorService = Executors.newSingleThreadScheduledExecutor();
        scheduledExecutorService.scheduleAtFixedRate(()->{
            System.out.println("used direct memory: " + PlatformDependent.usedDirectMemory());
        }, 1, 1, TimeUnit.SECONDS);
    }
}

Just drop it into a directory at the same level or below the startup class .

Come here , Pooling and monitoring are done , Here's the debugging .

Step six , Preliminary commissioning

Run the startup class directly , Watch the log .

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0

In limine , Direct memory is normal , Has always been a 0.

Send a random request , newspaper 404 了 , And observe that direct memory doesn't change , still 0, explain , It's not enough to simulate a request , This is directly by spring It's intercepted , Haven't come yet. Netty.

Step seven , Modify the configuration

Any request won't work , That can only simulate normal request forwarding , I quickly launched a SpringBoot project , And defined a request in it , modify gateway Configuration of , So that it can be forwarded :

spring:
  cloud:
    gateway:
      routes:
      - id: test
        uri: http://localhost:8899/test
        predicates:
        - Path=/test

Step eight , Debug again

After modifying the configuration , Start two projects at the same time , One gateway, One springboot, Request to send , Observe direct memory usage :

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 1031
used direct memory: 1031
used direct memory: 1031

Sure enough , Memory is not free .

Step nine , Delete XssFilter

In order to verify what was initially suspected XssFilter, Cut it out. , Restart the project , Send a request , Observe the use of direct memory .

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 1031
used direct memory: 1031
used direct memory: 1031

Problems still exist , and , It's the same size as the previous leak .

This is like this ,Netty It's by guessing (guess) To determine the size of each memory allocation , The initial value of this guess is 1024.

@Override
public ByteBuf allocate(ByteBufAllocator alloc) {
    return alloc.ioBuffer(guess());
}

Didn't you think Netty There is such a lovely side ^^, Cough , Off topic , Pull back by force !

then , Here's another one 7B It stores line feed, carriage return and so on , this 7B It will not be released , Add it up to make it 1031.

private static final byte[] ZERO_CRLF_CRLF = { '0', CR, LF, CR, LF };
// 2B
private static final ByteBuf CRLF_BUF = unreleasableBuffer(directBuffer(2).writeByte(CR).writeByte(LF));
// 5B
private static final ByteBuf ZERO_CRLF_CRLF_BUF = unreleasableBuffer(directBuffer(ZERO_CRLF_CRLF.length)
        .writeBytes(ZERO_CRLF_CRLF));

Um. , I little interesting , Since it's not XssFilter The problem of , that , Could it be AuthFilter The question of ?

Step 10 , kill AuthFilter

Do as you say , kill AuthFiler, Restart project , Send a request , Look at direct memory :

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 1031
used direct memory: 1031
used direct memory: 1031

The problem is still there , Or the familiar memory size .

here , My thinking is not smooth any more , Here's the way out of the way .

Step 11 , reflection

In the XssFilter and AuthFilter After deleting one after another , There is only one startup class left , Of course , There is also a new monitoring class .

Is it Spring Cloud Gateway There's something wrong with it , Why , I seem to have found a new continent , If you find out Spring Cloud Gateway There is a problem , I can boast about it again ( inner YY).

since , Memory allocation was not released , Then we'll find the memory allocation , Make a break .

Through the previous analysis , We already know that the memory allocator used is UnpooledByteBufAllocator 了 , That's in it newDirectBuffer() Method , Because we have a direct memory leak .

The twelfth step , Step by step debugging

Follow the train of thought in step 11 , stay UnpooledByteBufAllocator Of newDirectBuffer() Method , Step by step debugging , Last , Here comes the method :

// io.netty.handler.codec.ByteToMessageDecoder.channelRead
@Override
public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    if (msg instanceof ByteBuf) {
        CodecOutputList out = CodecOutputList.newInstance();
        try {
            first = cumulation == null;
            // 1.  The return is msg In itself ,msg It's a ByteBuf
            cumulation = cumulator.cumulate(ctx.alloc(),
                    first ? Unpooled.EMPTY_BUFFER : cumulation, (ByteBuf) msg);
            // 2.  decode , This article comes from the source code of industrial waste tongge 
            callDecode(ctx, cumulation, out);
        } catch (DecoderException e) {
            throw e;
        } catch (Exception e) {
            throw new DecoderException(e);
        } finally {
            if (cumulation != null && !cumulation.isReadable()) {
                numReads = 0;
                // 3.  Free memory 
                cumulation.release();
                cumulation = null;
            } else if (++ numReads >= discardAfterReads) {
                // We did enough reads already try to discard some bytes so we not risk to see a OOME.
                // See https://github.com/netty/netty/issues/4275
                numReads = 0;
                discardSomeReadBytes();
            }

            int size = out.size();
            firedChannelRead |= out.insertSinceRecycled();
            // 4.  Finished reading out The remaining value in 
            fireChannelRead(ctx, out, size);
            // 5.  Recycling out
            out.recycle();
        }
    } else {
        ctx.fireChannelRead(msg);
    }
}

It took hours , especially ChannelPipeLine It's just a jump inside , We have to start again , Really, it can only be done step by step .

This method is mainly used to put ByteBuf convert to Message,Message It's the news , It can be understood as simple Java object , The main logic is shown in the code above .

You can see , Here is a cumulation.release();, It's where the memory is released , however , It didn't release , Before calling this line of code ,msg(=cumulation) The reference count for is 4, After release, it was 2, therefore , And counting , It can't be recycled .

After walking down 4、5 Two steps ,out It's all recycled ,msg Still not recycled , The problem must be in this area .

I've been tangled here , Include decode The code in it has been read many times , There's no release here msg The object after the content conversion is DefaultHttpContent, It means Http Requested body, But here it is Http Request the return value of body.

It's also a bit confusing to me , I tried ,Http Requested body I don't seem to have reached this logic , I've been looking for it again and again Http Requested Body, It's been a long time , There has been no progress .

until night 9 When it's too much , There are no more people in the office , The lights are off ( During the epidemic , Each department can only go to a few people a day ), I packed up and went home .

Thirteenth Step , Take a taxi home

In the car , I've been thinking about it , Recall the whole process , Could it be that I was in the wrong direction ?

Spring Cloud Gateway It's been a long time since I came out , I haven't heard of a memory leak , here , I began to doubt myself .

no way , I have to write my own project when I go home , Use Spring Cloud Gateway Run and try .

The fourteenth step , Write a use Spring Cloud Gateway Project

Got home , Turn on the computer , I started to write a usage Spring Cloud Gateway Project and a SpringBoot Project , Turn the surveillance on , Remove the pooling function , Start project , Send a request , Look at direct memory .

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0

what , Asiba , Come here , It's clear , No Spring Cloud Gateway The problem of , What's the problem ?

It must be the wrong posture , But the company project , There's nothing else , I've deleted all the classes , Only the startup class is left .

Ah, no ,pom file .

Turn on the springboard , Log in to the company computer , see pom.xml, I found it was full of SpringBoot perhaps SpringCloud Some of its own quotations .

Um. , incorrect , There is one common package , It was written by the division itself common package , Click in , There are three quotations in it jar package , among , One of them is particularly eye-catching ,tomcat!!!!

Oh, my second Olympics , here , I really want to curse my mother , It's nothing ~~

Actually , In I delete AuthFilter You should think of pom Problem. , at that time , Just focus on YY Spring Cloud Gateway There may be bug The problem. , It's just one end of it .

We know ,Spring Cloud Gateway It uses Netty As the server receives the request , And then forward it to the downstream system , Here quote tomcat What will happen? ? It's really an interesting thing .

The fifteenth step , kill tomcat

stay pom In file , hold tomcat Of jar Remove the bag , Restart project , Send a request , Look at direct memory :

used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0
used direct memory: 0

Oh , There is no problem , Namely tomcat The devil of it .

that ,tomcat How to make trouble ? added tomcat Can also respond to requests normally , Requests can also be forwarded normally , Return to the client , and , It's even scarier , It is also used internally Netty Make a read-write response to the request , It's really a little magical .

The sixteenth step , Discover the new world

To verify this problem , Let's get out of the springboard first , Back to my own computer , stay pom Add tomcat, Start project , Why , You can do it , Have a good time ~~

Is it tomcat and Netty Listening to the same port at the same time , Both are up ?

Take a look at the project startup log :

Connected to the target VM, address: '127.0.0.1:52162', transport: 'socket'

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v2.2.6.RELEASE)

2020-05-19 08:50:04.448  INFO 7896 --- [           main] com.alan.test.Application                : No active profile set, falling back to default profiles: default
2020-05-19 08:50:04.829  INFO 7896 --- [           main] o.s.cloud.context.scope.GenericScope     : BeanFactory id=082e67ca-d4c7-3a8c-b051-e806722fd225
2020-05-19 08:50:04.998  INFO 7896 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat initialized with port(s): 8080 (http)
2020-05-19 08:50:05.006  INFO 7896 --- [           main] o.apache.catalina.core.StandardService   : Starting service [Tomcat]
2020-05-19 08:50:05.006  INFO 7896 --- [           main] org.apache.catalina.core.StandardEngine  : Starting Servlet engine: [Apache Tomcat/9.0.33]
2020-05-19 08:50:05.859  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [After]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Before]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Between]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Cookie]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Header]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Host]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Method]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Path]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [Query]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [ReadBodyPredicateFactory]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [RemoteAddr]
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    :  This article comes from the source code of industrial waste tongge 
2020-05-19 08:50:05.860  INFO 7896 --- [           main] o.s.c.g.r.RouteDefinitionRouteLocator    : Loaded RoutePredicateFactory [CloudFoundryRouteService]
2020-05-19 08:50:06.731  INFO 7896 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
2020-05-19 08:50:07.304  INFO 7896 --- [           main] com.alan.test.Application                : Started Application in 4.271 seconds (JVM running for 5.0)

It turns out that it's only activated tomcat, How does it hand over the request to Netty To deal with ?

The seventeenth step ,tomcat -> Netty

Have studied NIO Students with relevant knowledge should know ,NIO take SocketChannel There are two kinds of , One is ServerSocketChannel, One is SocketChannel, among ,ServerSocketChannel It was created when the service was started , Used to monitor the arrival of client connections , and SocketChannel It means the connection between the client and the server .

seen NIO Source code students also know ,SocketChannel It's through ServerSocketChannel created .

seen Netty Source code students also know ,Netty According to different agreements, we put these Channel Divided into NioXxxChannel、EpollXxxChannel wait , For each protocol Channel Divide into the same NioServerSocketChannel、NioSocketChannel etc. .

And in the Windows Under the platform , The default is NioXxxChannel, And from the above we can see that ,NioSocketChannel It should be through NioServerSocketChannel created , If it's normal use Netty, And it is .

The figure below shows normal use Netty when NioSocketChannel Thread stack at creation time :

 When Tomcat run into sb. Netty, My series of divine operations , My colleagues clapped their hands and exclaimed

however , Our current scene is tomcat + Netty, So what is it ?

here , stay NioSocketChannel A breakpoint is made in the construction method of , Send a request , I found that the breakpoint has reached NioSocketChannel In the construction method of , Observe the thread stack ( Look from the bottom up ):

 When Tomcat run into sb. Netty, My series of divine operations , My colleagues clapped their hands and exclaimed

 When Tomcat run into sb. Netty, My series of divine operations , My colleagues clapped their hands and exclaimed

 When Tomcat run into sb. Netty, My series of divine operations , My colleagues clapped their hands and exclaimed

 When Tomcat run into sb. Netty, My series of divine operations , My colleagues clapped their hands and exclaimed

 When Tomcat run into sb. Netty, My series of divine operations , My colleagues clapped their hands and exclaimed

You can see , after tomcat->spring->reactor->reactor-netty->netty, After a thousand to one hundred turns , Finally created NioSocketChannel.

The situation here is a little more complicated , There's time later , Let's analyze in detail .

The eighteenth step , Memory leak

As can be seen from the above ,Tomcat Finally, the processing of the request was handed over to Netty, But why is there a memory leak ? It's still a question .

After my comparative test , The problem still lies in the code in step 12 , In normal use Netty When asked , stay fireChannelRead() There's going to be NioEventLoop Add a task to , called MonoSendMany.SendManyInner.AsyncFlush:

final class AsyncFlush implements Runnable {
    @Override
    public void run() {
        if (pending != 0) {
            ctx.flush();
        }
    }
}

This is used to write out the data in the write buffer ( Read it and write it out ), meanwhile , It also cleans up the data in the write buffer , In other words, the client can only receive the response result after calling this method , While using tomcat + Netty When , Didn't perform the task , The data is sent to the client ( The guess may be through tomcat And not NioSocketChannel Sent by itself ), This is a legacy , We'll study it later , Now the brain is a little messy .

summary

This production event , Although the whole code is relatively simple , But it's been going on for a long time , I would like to sum up a few points :

  1. Don't easily doubt the open source framework , Especially like Spring There are many people who use it , It is easy to doubt that it is biased , But it's not that don't doubt it ;
  2. When you can't find the cause of the problem , Think about taking a break 、 Take it easy , Change the way of thinking ;
  3. Spring Cloud Gateway Why can China tomcat and Netty Can coexist , That's a problem , It should be mentioned to the official issue, When both are detected , It's better not to let the program work directly ;

Official account number :Java Architects Alliance , Update the technology every day

版权声明
本文为[Java architects Alliance]所创,转载请带上原文链接,感谢

Scroll to Top