In the mobile Internet era, most mobile services rely on data exchange between the app and the server, so most of an app's business functions depend on network requests. If a request is slow or fails, users cannot use those functions smoothly, which severely degrades the user experience.
In addition, the APM product that EMAS offers externally did not previously include network monitoring. Network performance monitoring is an important part of mobile performance monitoring, so we urgently needed to add this capability to round out APM's feature set and better meet customer needs.
“Alibaba's application R&D platform EMAS is a leading cloud-native application R&D platform in China (for mobile apps, H5 applications, mini programs, web applications, and so on). Built on a broad range of cloud-native technologies (Backend as a Service, Serverless, DevOps, low-code, and so on), it provides enterprises and developers with one-stop application R&D management services covering the full application lifecycle: development, testing, operations and maintenance, and business operation.”
Problems and challenges
Network performance monitoring consists mainly of data collection and data reporting. We want to collect as much useful information as possible to help customers discover, locate, and resolve network performance problems. We face the following problems and challenges:
• First, we need to determine which phases of a network request affect its performance and, when a performance problem is found, which data must be collected to help users locate and resolve it.
• The mainstream Android network frameworks are okhttp2, okhttp3, okhttp4, volley, retrofit, httpclient, and the system-provided httpurlconnection. How do we collect as much useful information as possible when we do not know which network library, or which version, the customer is using?
• Data is collected separately at each phase of a network request. How do we ensure that all the monitoring data of a single request is linked together and not mixed with the monitoring data of other requests?
• Logs of network requests made in weak network environments are often the most valuable, so abnormal request logs must be reported to the server as reliably as possible.
• When network requests run concurrently, uploading logs must not interfere with the customer's normal business traffic.
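One way to keep the discrete per-phase data of a single request together, sketched here under assumptions, is to assign each request a unique id when it starts and to key every later event by that id. The class and method names below (`RequestTraceRegistry`, `begin`, `record`, `finish`) are hypothetical and are not taken from the EMAS SDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical registry that groups discrete phase events by request id. */
final class RequestTraceRegistry {
    private final AtomicLong nextId = new AtomicLong(1);
    private final Map<Long, List<String>> traces = new ConcurrentHashMap<>();

    /** Called when a request starts; returns the id used to tag later events. */
    long begin() {
        long id = nextId.getAndIncrement();
        traces.put(id, new ArrayList<>());
        return id;
    }

    /** Records one phase event for the given request; unknown ids are ignored. */
    void record(long requestId, String event) {
        List<String> events = traces.get(requestId);
        if (events != null) {
            synchronized (events) {
                events.add(event);
            }
        }
    }

    /** Called when the request ends; removes and returns its events for reporting. */
    List<String> finish(long requestId) {
        List<String> events = traces.remove(requestId);
        if (events == null) {
            return new ArrayList<>();
        }
        synchronized (events) {
            return new ArrayList<>(events);
        }
    }
}
```

Because each request only ever touches its own list, events from concurrent requests cannot be mixed, and removing the entry in `finish` keeps the map from growing without bound.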
On the client side, network performance monitoring consists of two main modules:
• Data collection
• Data reporting
Of these, data collection is the core of the entire SDK framework.
Overview of the overall architecture:
Access layer:
Network monitoring is part of the high-availability product and is integrated through its unified access mechanism.
Plug-in layer:
The high-availability framework currently integrates each business capability as a plug-in. Network monitoring is implemented as a networkmonitor plugin and integrated into APM, adding the network monitoring part to APM.
Logic layer:
Mainly responsible for collection control, data management, cache management, and data reporting.
Interceptor layer:
This is the core of the whole network monitoring feature. To gather more information, we use bytecode injection to monitor network requests. We implement a separate Interceptor for OkHttp, HttpClient, and HttpUrlConnection to collect the data for each phase of a network request in the respective network library; when the request ends, collection is complete and the data is reported. In addition, through a custom gradle plugin, we implement an Injector and a switch for each network library. During the application build, the Injector injects calls to each Interceptor collection method at the instrumentation points in the corresponding network library's bytecode, so that the data needed at each phase of a network request is collected at runtime.
What data to collect
First, we need to define the scope of the collected data so that we can detect performance problems and exceptions in network requests promptly; we also need additional data to assist troubleshooting. The data we collect therefore falls into four parts:
• Basic data
• Performance data
• Exception information
• Event sequence data
Basic data
• Request url: used to aggregate requests.
• Target IP address: for customers with multiple egress IPs, supports data analysis along the IP address dimension.
• dns resolution results: the list of IPs resolved for the domain name of the request url, used to analyze whether domain name hijacking has occurred.
• http code: used to determine the request status.
• Uplink traffic: the total traffic of the request's uplink header and body, including uplink traffic from retries and redirects. Used to monitor uplink traffic overhead.
• Downlink traffic: the total traffic of the request's downlink header and body, including downlink traffic from retries and redirects. Used to monitor downlink traffic overhead.
• Type and version of the network library: when a customer switches network libraries or upgrades the library version, this makes it possible to compare network data before and after the change.
Performance data
Request performance is reflected mainly in the time spent in each phase of the request. The figure below shows the phases that an http request may go through.
The performance data part therefore needs to collect the time spent in the following phases:
- Total time of the entire network request
- dns resolution time
- Connection establishment time
- TLS connection establishment time
- Data uplink time
- header uplink time
- body uplink time
- Data downlink time
- header downlink time
- body downlink time
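The phase timings listed above can be derived from start and end timestamps recorded at each phase boundary. The following is a minimal sketch under assumptions: the `Phase` enum and the `PhaseTimings` class are illustrative names, and timestamps are passed in explicitly (for example from `System.nanoTime()`) rather than read inside the class.

```java
import java.util.EnumMap;
import java.util.Map;

/** Phases taken from the list above; the enum itself is illustrative. */
enum Phase { TOTAL, DNS, CONNECT, TLS, REQUEST_HEADERS, REQUEST_BODY, RESPONSE_HEADERS, RESPONSE_BODY }

/** Hypothetical holder for the per-phase timings of one request. */
final class PhaseTimings {
    private final Map<Phase, Long> startNanos = new EnumMap<>(Phase.class);
    private final Map<Phase, Long> durationNanos = new EnumMap<>(Phase.class);

    void start(Phase phase, long nowNanos) {
        startNanos.put(phase, nowNanos);
    }

    void end(Phase phase, long nowNanos) {
        Long started = startNanos.remove(phase);
        if (started != null) {
            // Accumulate, so a phase repeated by a retry or redirect is summed.
            durationNanos.merge(phase, nowNanos - started, Long::sum);
        }
    }

    /** Elapsed milliseconds, or -1 if the phase never completed (e.g. no DNS lookup on a reused connection). */
    long millis(Phase phase) {
        Long d = durationNanos.get(phase);
        return d == null ? -1 : d / 1_000_000;
    }

    /** Data uplink time: header uplink plus body uplink; missing phases count as zero. */
    long uplinkMillis() {
        return (nanosOrZero(Phase.REQUEST_HEADERS) + nanosOrZero(Phase.REQUEST_BODY)) / 1_000_000;
    }

    /** Data downlink time: header downlink plus body downlink; missing phases count as zero. */
    long downlinkMillis() {
        return (nanosOrZero(Phase.RESPONSE_HEADERS) + nanosOrZero(Phase.RESPONSE_BODY)) / 1_000_000;
    }

    private long nanosOrZero(Phase phase) {
        Long d = durationNanos.get(phase);
        return d == null ? 0L : d;
    }
}
```

Returning -1 for a phase that never ran keeps "skipped" distinguishable from "took 0 ms", which matters when connections are reused.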
Exception information
Exception information mainly consists of the exception stack trace captured when a network request fails at any phase, for example the common java.net.UnknownHostException and java.net.SocketTimeoutException.
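As a rough illustration of capturing this information, the sketch below turns a `Throwable` into a reportable record using only standard JDK calls. The `ExceptionInfo` class and its fields are hypothetical; the actual log format used by the SDK is not described in this article.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical record of an exception raised during one phase of a request. */
final class ExceptionInfo {
    final String phase;
    final String type;
    final String stackTrace;

    private ExceptionInfo(String phase, String type, String stackTrace) {
        this.phase = phase;
        this.type = type;
        this.stackTrace = stackTrace;
    }

    /** Captures the exception class name and the full stack trace as text. */
    static ExceptionInfo capture(String phase, Throwable error) {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer);
        error.printStackTrace(writer);
        writer.flush();
        return new ExceptionInfo(phase, error.getClass().getName(), buffer.toString());
    }
}
```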
Event sequence data
Event sequence data records the monitoring events from each phase of a network request, plus events specific to particular network libraries. For example, okhttp mechanisms such as connection reuse, automatic redirection, and retry on failure all affect request time. Finally, these events are arranged in chronological order.
Take dns hijacking with okhttp as an example. We detect dns hijacking from the target IP address in the basic data, and that address is collected when the connection is established. If hijacking occurs on the first request, we can identify it normally. If a subsequent request reuses the connection, however, no new connection is established, so the basic data contains no target IP address. In that case, we rely on the connection reuse event in the event sequence data, using the url and target IP address recorded in that event to determine whether the request was hijacked.
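The fallback described above can be sketched as follows. This is an illustration only: the `NetEvent` and `HijackCheck` classes, the event name `connectionReused`, and the idea of comparing against a set of trusted IPs are all assumptions, since the article does not say how the final hijacking decision is made.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical entry in the event sequence of one request. */
final class NetEvent {
    final String name;
    final String url;
    final String targetIp; // may be null for events that carry no address

    NetEvent(String name, String url, String targetIp) {
        this.name = name;
        this.url = url;
        this.targetIp = targetIp;
    }
}

final class HijackCheck {
    private HijackCheck() {}

    /** Prefers the target IP from the basic data; falls back to the connection reuse event. */
    static Optional<String> effectiveTargetIp(String basicTargetIp, List<NetEvent> events) {
        if (basicTargetIp != null) {
            return Optional.of(basicTargetIp);
        }
        for (NetEvent event : events) {
            if ("connectionReused".equals(event.name) && event.targetIp != null) {
                return Optional.of(event.targetIp);
            }
        }
        return Optional.empty();
    }

    /** Assumption: a request is suspicious when its target IP is not in a trusted set. */
    static boolean looksHijacked(String basicTargetIp, List<NetEvent> events, Set<String> trustedIps) {
        return effectiveTargetIp(basicTargetIp, events)
                .map(ip -> !trustedIps.contains(ip))
                .orElse(false);
    }
}
```

Without the fallback, every request on a reused connection would report no target IP and a hijacked connection would go unnoticed after its first request.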
How to collect data
Bytecode instrumentation principle
Bytecode instrumentation hooks into the Android packaging and build process. First, let's look at the packaging process for Android applications, shown in the figure below:
As the figure above shows, all we need to do is traverse all the bytecode files after javac and before dex, then filter and modify them according to certain rules to achieve bytecode instrumentation.
Starting with Android Gradle Plugin 1.5.0, Google officially provides the Transform API. It allows third parties, in the form of plug-ins, to manipulate .class files during compilation, before the Android application is packaged into dex files.
In the Android build, TaskManager chains the Transforms together. The first Transform receives the javac compilation output, the third-party SDKs (jar, aar) that have been pulled down locally, and the resource files. These intermediate build products flow along the Transform chain; each Transform node can process the classes and pass them on to the next Transform. Common features such as obfuscation and Desugar are each packaged as a Transform. A custom Transform is inserted at the front of this chain, so when obfuscation is enabled, bytecode modified through a custom Transform is modified first and obfuscated afterwards.
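The ordering property of the chain can be illustrated with a small model. This is not the real Android Transform API: `ClassTransform`, `TransformChain`, and `MarkerTransform` are stand-ins invented for this sketch, and "modifying bytecode" is simulated by appending a marker byte.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-in for one node of the chain; NOT the real Android Transform API. */
interface ClassTransform {
    byte[] apply(String className, byte[] classBytes);
}

final class TransformChain {
    private final List<ClassTransform> nodes = new ArrayList<>();

    /** Built-in steps (for example obfuscation) are appended in build order. */
    void addBuiltIn(ClassTransform transform) {
        nodes.add(transform);
    }

    /** A custom transform goes to the front, so it sees bytecode before obfuscation. */
    void addCustom(ClassTransform transform) {
        nodes.add(0, transform);
    }

    /** Passes one class through every node in order. */
    byte[] run(String className, byte[] classBytes) {
        byte[] current = classBytes;
        for (ClassTransform node : nodes) {
            current = node.apply(className, current);
        }
        return current;
    }
}

/** Simulates instrumentation: appends a marker byte to classes matching a prefix rule. */
final class MarkerTransform implements ClassTransform {
    private final String classPrefix;
    private final byte marker;

    MarkerTransform(String classPrefix, byte marker) {
        this.classPrefix = classPrefix;
        this.marker = marker;
    }

    @Override
    public byte[] apply(String className, byte[] classBytes) {
        if (!className.startsWith(classPrefix)) {
            return classBytes; // filter rule: leave unrelated classes untouched
        }
        byte[] out = Arrays.copyOf(classBytes, classBytes.length + 1);
        out[classBytes.length] = marker;
        return out;
    }
}
```

Running before obfuscation means the injection rules can match the network library's original class and method names rather than obfuscated ones.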
Network library research
Besides the system-provided HttpUrlConnection, the Android platform has many excellent third-party network libraries, and most apps use a third-party network library to issue network requests.
Judging from the underlying implementations of the mainstream network libraries in the table above, supporting OkHttp, HttpUrlConnection, and HttpClient is enough to meet the performance monitoring requirements of the mainstream network libraries.
We analyzed the Top 1000 apps in the application market. Ranked by number of integrations, the order is okhttp3 & okhttp4, volley (HttpUrlConnection), okhttp2, and httpclient. The okhttp family accounts for close to 80%, so we gave priority to implementing monitoring for the okhttp network library.
Monitoring implementation for the okhttp network library
The okhttp family of network libraries mainly includes okhttp2, okhttp3, and okhttp4. okhttp3 has the most versions and the most changes in its underlying implementation; the underlying implementation of okhttp2 is similar to that of early okhttp3 versions; okhttp4 is the kotlin implementation of okhttp3. We therefore focus on how monitoring is implemented for okhttp3.
The figure above shows the implementation framework of okhttp 3.12.0. We inject code into specific logic of the network library to collect the required data.
okhttp3 has many versions: from 3.0.0 to 3.14.9 there have been more than 40 releases, and every code injection point must be verified to work on all of them. Implementing non-intrusive instrumentation for okhttp3 therefore requires a great deal of version adaptation work.
Data reporting
Besides encryption, authentication, compression, and so on, data reporting must lose as few logs as possible while also limiting resource usage to reduce the impact on the business layer. The implementation has two main aspects:
• Cache: two levels of caching are supported, memory cache and disk cache. Businesses must be isolated so that multiple businesses using the cache do not affect one another.
• Reporting: because APM generates a large volume of logs, we use a shared thread pool and a scheduling queue to control concurrency and memory. The scheduling queue caches at most 10 batches of logs; anything beyond 10 batches is written to the disk cache immediately. In addition, an open interface for log preprocessing before reporting lets the business layer process logs, for example by sampling or aggregation.
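The bounded scheduling queue described above might look like the following sketch. Only the limit of 10 batches comes from the text; the `ReportQueue` class, the `DiskCache` interface, and the method names are hypothetical, and a real implementation would probably avoid writing to disk while holding the lock.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical abstraction over the disk cache level. */
interface DiskCache {
    void write(List<String> batch);
}

/** Holds at most 10 log batches in memory and spills the rest to disk. */
final class ReportQueue {
    static final int MAX_BATCHES_IN_MEMORY = 10; // limit stated in the text

    private final Deque<List<String>> pending = new ArrayDeque<>();
    private final DiskCache diskCache;

    ReportQueue(DiskCache diskCache) {
        this.diskCache = diskCache;
    }

    /** Returns true if the batch was queued in memory, false if it went to disk. */
    synchronized boolean enqueue(List<String> batch) {
        if (pending.size() >= MAX_BATCHES_IN_MEMORY) {
            // Simplification: the disk write happens inside the lock.
            diskCache.write(batch);
            return false;
        }
        pending.addLast(batch);
        return true;
    }

    /** Next batch for an upload worker, or null when the queue is empty. */
    synchronized List<String> poll() {
        return pending.pollFirst();
    }

    synchronized int size() {
        return pending.size();
    }
}
```

Capping the in-memory queue bounds memory use under bursts of concurrent requests, while the disk spill keeps the overflow from being dropped.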