
In-depth understanding of Netty: a look at Netty flow control from an occasional process crash

Author: vivo Internet Server Team - Zhang Lin


1. Business background


Message push is heavily used in mobile scenarios. Push messages help operators reach their goals more efficiently (for example, pushing marketing campaigns or alerting users to new APP features).


A push system needs the following two properties:

  • Messages are delivered to users within seconds, with no delay; the system supports one million pushes per second and one million long connections on a single machine.

  • Multiple presentation forms are supported: notifications, text, transparent pass-through of custom messages, and so on.

These requirements are what make the system challenging to develop and maintain. The figure below is a simplified view of the push system (API -> push module -> mobile phone).



2. Problem background


During the stability and stress testing phases of the push system's long-connection cluster, a process would occasionally hang after running for a while. The probability was low (roughly once a month), but it affected the timeliness of some client messages.


The long-connection node (the Broker system) in the push system is built on Netty; it maintains the long connections between the server and mobile clients. After the problem appeared online, we added Netty's memory-leak detection parameters and observed for many days, but found nothing.
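The article does not list the exact parameters. For reference, Netty 4.x's built-in ByteBuf leak detector is typically switched on with a JVM system property like the following (levels: DISABLED, SIMPLE, ADVANCED, PARANOID):

-Dio.netty.leakDetection.level=PARANOID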


Since the long-connection node is built on Netty, here is a brief introduction to Netty for the reader's convenience.


3. Introduction to Netty


Netty is a high-performance, asynchronous, event-driven NIO framework implemented on top of the APIs provided by Java NIO. It supports TCP, UDP, and file transfer. As the most popular NIO framework, Netty is widely used in the Internet field, big-data distributed computing, the game industry, and the communications industry; open-source components such as HBase, Hadoop, Bees, and Dubbo are also built on Netty's NIO framework.


4. Problem analysis


4.1 Initial conjecture


The initial conjecture was that the number of long connections was to blame, but after checking the logs and analyzing the code, we found this was not the cause.


Number of long connections: 390,000, as shown below:



Each channel occupies about 1456 bytes; at 400,000 long connections that is roughly 1456 B × 400,000 ≈ 580 MB, which cannot account for excessive memory usage.


4.2 Checking the GC logs


The GC logs showed frequent full GCs (about once every 5 minutes) in the period before the process hung, yet memory usage did not drop after collection, so an off-heap memory leak was suspected.
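The article omits the tooling for this step and the next; a typical JDK 8 workflow for this kind of diagnosis looks roughly like the following (<pid> and file names are placeholders):

# GC logging flags to capture the full GC frequency
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/gc.log
# confirm GC behavior on a live process, sampling once per second
jstat -gcutil <pid> 1000
# dump the heap for offline analysis (e.g., in Eclipse MAT), used in 4.3
jmap -dump:format=b,file=heap.hprof <pid>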


4.3 Analyzing heap memory


ChannelOutboundBuffer objects occupied nearly 5 GB of memory, which basically pinpointed the leak: too many entries had accumulated in ChannelOutboundBuffer. Reading the ChannelOutboundBuffer source shows why: data queued in ChannelOutboundBuffer was never written out to the socket, so it kept piling up. Internally, ChannelOutboundBuffer is a linked list.
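For orientation, here is an abridged sketch of the structure in question, based on the Netty 4.x ChannelOutboundBuffer source (most members omitted; only the leak-relevant fields are kept):

// Abridged from io.netty.channel.ChannelOutboundBuffer (Netty 4.x); not the full class.
public final class ChannelOutboundBuffer {
    private Entry flushedEntry;    // head of the flushed (in-flight) part of the list
    private Entry unflushedEntry;  // head of the written-but-not-yet-flushed part
    private Entry tailEntry;       // tail of the whole list

    static final class Entry {
        Entry next;       // singly linked: every write appends one Entry
        Object msg;       // the buffered message, often a DirectByteBuf
        int pendingSize;  // bytes accounted against the water marks
    }
}

Every unsent message keeps one Entry (and its message buffer) alive, which is exactly what showed up as the ~5 GB of retained memory.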



4.4 Why was the data in the figure above never written out?


The code does check whether a connection is available (Channel.isActive), and it closes connections that time out. From past experience, this situation mostly arises with half-open connections (the client terminated unexpectedly): as long as neither side tries to exchange data, nothing looks wrong.


Based on the conjecture above, we reproduced the problem in a test environment:

1) Simulate a client cluster and establish connections to the long-connection server, then configure a firewall on the client nodes to simulate a network exception between server and client (i.e., Channel.isActive still returns true, but data can no longer actually be sent out).

2) Reduce the off-heap (direct) memory limit and keep sending test messages (about 1 KB each) to the clients behind the firewall; see the snippet after this list.

3) With direct memory capped at 128 MB, the problem reproduces after roughly 90,000 sends (90,000 × ~1 KB ≈ 90 MB of payload, which together with per-message overhead approaches the 128 MB cap).
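The direct-memory cap used in step 2) maps to a standard HotSpot flag; with the value from the article it would look like:

-XX:MaxDirectMemorySize=128m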



5. Problem solving


5.1 Toggling the autoRead mechanism


When the channel is not writable, turn autoRead off:

@Override
public void channelReadComplete(ChannelHandlerContext ctx) throws Exception {
    if (!ctx.channel().isWritable()) {
        Channel channel = ctx.channel();
        ChannelInfo channelInfo = ChannelManager.CHANNEL_CHANNELINFO.get(channel);
        String clientId = "";
        if (channelInfo != null) {
            clientId = channelInfo.getClientId();
        }
        LOGGER.info("channel is unwritable, turn off autoread, clientId:{}", clientId);
        channel.config().setAutoRead(false);
    }
}


When the channel becomes writable again, turn autoRead back on:

@Override
public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
    Channel channel = ctx.channel();
    ChannelInfo channelInfo = ChannelManager.CHANNEL_CHANNELINFO.get(channel);
    String clientId = "";
    if (channelInfo != null) {
        clientId = channelInfo.getClientId();
    }
    if (channel.isWritable()) {
        LOGGER.info("channel is writable again, turn on autoread, clientId:{}", clientId);
        channel.config().setAutoRead(true);
    }
}


Explanation:


autoRead provides fine-grained rate control. When autoRead is on, Netty registers the read event for us; once the read event is registered, Netty reads data from the channel whenever the socket becomes readable. When autoRead is turned off, Netty does not register the read event.


As a result, even when the peer sends data, no read event fires and no data is read from the channel. Once the TCP receive buffer (recv_buffer) fills up, no more data is accepted, and the backpressure propagates to the sender.
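Putting section 5.1 together: both callbacks live in a single inbound handler installed in the channel pipeline. Below is a minimal sketch under that assumption; FlowControlHandler is an illustrative name, not the push system's real class, and the clientId logging is omitted.

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Toggles autoRead based on outbound-buffer writability (see 5.1).
public class FlowControlHandler extends ChannelInboundHandlerAdapter {

    @Override
    public void channelReadComplete(ChannelHandlerContext ctx) throws Exception {
        if (!ctx.channel().isWritable()) {
            // outbound buffer is above the high water mark: stop reading from this peer
            ctx.channel().config().setAutoRead(false);
        }
        ctx.fireChannelReadComplete();
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
        if (ctx.channel().isWritable()) {
            // outbound buffer drained below the low water mark: resume reading
            ctx.channel().config().setAutoRead(true);
        }
        ctx.fireChannelWritabilityChanged();
    }
}

// registration during bootstrap, e.g.:
// pipeline.addLast("flowControl", new FlowControlHandler());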


5.2 Setting high and low water marks

// low water mark: 1 MB, high water mark: 8 MB
// (note: for accepted client channels this is normally configured via childOption;
//  option(...) applies to the server channel itself)
serverBootstrap.option(ChannelOption.WRITE_BUFFER_WATER_MARK,
        new WriteBufferWaterMark(1024 * 1024, 8 * 1024 * 1024));

Note: the high and low water marks work together with the isWritable check in 5.3 below.


5.3 Adding a channel.isWritable() check


Besides verifying channel.isActive(), the availability check for a channel also needs channel.isWritable(): isActive only guarantees that the connection is alive, while whether it can be written to is determined by isWritable.

private void writeBackMessage(ChannelHandlerContext ctx, MqttMessage message) {
    Channel channel = ctx.channel();
    // add the channel.isWritable() check
    if (channel.isActive() && channel.isWritable()) {
        ChannelFuture cf = channel.writeAndFlush(message);
        // note: this only catches failures that complete synchronously;
        // asynchronous failures would require a ChannelFutureListener
        if (cf.isDone() && cf.cause() != null) {
            LOGGER.error("channelWrite error!", cf.cause());
            ctx.close();
        }
    }
}

Note: isWritable keeps ChannelOutboundBuffer under control so it cannot grow without bound; the mechanism relies on the channel's configured high and low water marks.


5.4 Verifying the fix


After the modification, a test run sent 270,000 messages without a single error.



6. Solution analysis


The typical Netty data-processing flow is: read data, hand it to business threads for processing, and send the result out when processing finishes (the whole flow is asynchronous). To improve network throughput, Netty adds a ChannelOutboundBuffer between the business layer and the socket.


When channel.write is called, the data is not actually written to the socket; it is first appended to the ChannelOutboundBuffer. Only when channel.flush is called is it really written to the socket. Because there is a buffer in the middle, the two sides' rates must be matched, and this buffer is unbounded (a linked list). So if the rate of channel.write is not controlled, a large amount of data piles up in the buffer whenever the socket cannot write data out (at which point the isActive check is useless) or writes it slowly.
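A tiny sketch of that distinction (illustrative only; sendBatch is not from the article):

import io.netty.channel.Channel;
import java.util.List;

// write() only queues into ChannelOutboundBuffer; flush() pushes to the socket.
final class WriteFlushSketch {
    static void sendBatch(Channel channel, List<Object> messages) {
        for (Object msg : messages) {
            channel.write(msg);  // appends an Entry to ChannelOutboundBuffer, no socket I/O yet
        }
        channel.flush();         // the queued entries are now actually written to the socket
        // channel.writeAndFlush(msg) combines both steps for a single message
    }
}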


The most likely outcome is resource exhaustion; and if what the ChannelOutboundBuffer holds is DirectByteBuf, the problem becomes even harder to troubleshoot.


The process can be abstracted as follows :



From the analysis above, problems arise either when step one writes too fast (faster than downstream can handle) or when the downstream cannot send data out. This is essentially a rate-matching problem.


7. Netty source code notes


Above the high water mark

When the capacity of ChannelOutboundBuffer exceeds the configured high water mark, isWritable() returns false, the channel is marked unwritable (setUnwritable), and fireChannelWritabilityChanged() is triggered.

private void incrementPendingOutboundBytes(long size, boolean invokeLater) {
    if (size == 0) {
        return;
    }

    long newWriteBufferSize = TOTAL_PENDING_SIZE_UPDATER.addAndGet(this, size);
    if (newWriteBufferSize > channel.config().getWriteBufferHighWaterMark()) {
        setUnwritable(invokeLater);
    }
}

private void setUnwritable(boolean invokeLater) {
    for (;;) {
        final int oldValue = unwritable;
        final int newValue = oldValue | 1;
        if (UNWRITABLE_UPDATER.compareAndSet(this, oldValue, newValue)) {
            if (oldValue == 0 && newValue != 0) {
                fireChannelWritabilityChanged(invokeLater);
            }
            break;
        }
    }
}


Below the low water mark

When the capacity of ChannelOutboundBuffer drops below the configured low water mark, isWritable() returns true, the channel is marked writable (setWritable), and fireChannelWritabilityChanged() is triggered.

private void decrementPendingOutboundBytes(long size, boolean invokeLater, boolean notifyWritability) {
    if (size == 0) {
        return;
    }

    long newWriteBufferSize = TOTAL_PENDING_SIZE_UPDATER.addAndGet(this, -size);
    if (notifyWritability && newWriteBufferSize < channel.config().getWriteBufferLowWaterMark()) {
        setWritable(invokeLater);
    }
}

private void setWritable(boolean invokeLater) {
    for (;;) {
        final int oldValue = unwritable;
        final int newValue = oldValue & ~1;
        if (UNWRITABLE_UPDATER.compareAndSet(this, oldValue, newValue)) {
            if (oldValue != 0 && newValue == 0) {
                fireChannelWritabilityChanged(invokeLater);
            }
            break;
        }
    }
}


8. Summary


When the capacity of ChannelOutboundBuffer exceeds the configured high water mark, isWritable() returns false, indicating a backlog of messages: the write rate needs to be reduced.


When the capacity of ChannelOutboundBuffer drops below the configured low water mark, isWritable() returns true, indicating the backlog has drained: writing can be resumed or sped up. After the three changes above were applied, the system was observed online for half a year with no recurrence of the problem.

