
Illustration of Linux network packet receiving process


By Zhang Yanfei (allen)

From the author's column on developer "internal skills" (开发内功修炼)


Because they serve millions, tens of millions, or even hundreds of millions of users, one of the key requirements that front-line Internet companies place on back-end developers is the ability to support high concurrency, to understand where performance overhead comes from, and to do performance tuning. And very often, if your understanding of the Linux internals is not deep enough, you will feel at a complete loss when you run into an online performance bottleneck and not know where to start.

Today we will take a graphical, in-depth look at how Linux receives a network packet. As usual, let's start the discussion from the simplest piece of code. For simplicity we use UDP as the example, as follows:

int main(){
    int serverSocketFd = socket(AF_INET, SOCK_DGRAM, 0);
    bind(serverSocketFd, ...);

    char buff[BUFFSIZE];
    int readCount = recvfrom(serverSocketFd, buff, BUFFSIZE, 0, ...);
    buff[readCount] = '\0';
    printf("Receive from client:%s\n", buff);
}

The code above is the receive logic of a UDP server. Looking at it from a development point of view, as long as a client sends data, the server calls recvfrom and gets it, then prints it out. What we want to figure out now is: from the moment the network packet arrives at the network card until our recvfrom returns the data, what exactly happens in between?
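
If you want to run something while following along, here is a fuller, compilable version of the snippet above. It is my own fleshed-out sketch: the port 8888 and the 1024-byte buffer are arbitrary choices for illustration, not anything from the article or the kernel.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#define BUFFSIZE 1024

int main(void){
    int serverSocketFd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in serverAddr;
    memset(&serverAddr, 0, sizeof(serverAddr));
    serverAddr.sin_family = AF_INET;
    serverAddr.sin_addr.s_addr = htonl(INADDR_ANY);
    serverAddr.sin_port = htons(8888);              /* arbitrary example port */
    bind(serverSocketFd, (struct sockaddr *)&serverAddr, sizeof(serverAddr));

    char buff[BUFFSIZE];
    struct sockaddr_in clientAddr;
    socklen_t clientLen = sizeof(clientAddr);
    /* blocks here until a datagram arrives -- everything this article describes
       happens between the packet hitting the NIC and this call returning */
    int readCount = recvfrom(serverSocketFd, buff, BUFFSIZE - 1, 0,
                             (struct sockaddr *)&clientAddr, &clientLen);
    if (readCount >= 0) {
        buff[readCount] = '\0';
        printf("Receive from client:%s\n", buff);
    }

    close(serverSocketFd);
    return 0;
}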

Through this article you will gain a deeper understanding of how the Linux network subsystem is implemented internally and how its parts interact with each other. I believe this will be of great help in your work. This article is based on Linux 3.10 (source code at https://mirrors.edge.kernel.org/pub/linux/kernel/v3.x/), and uses Intel's igb network card driver as the example.

A friendly reminder: this article is a bit long, so feel free to bookmark it first and come back to it later!

Part One

Overview of how Linux receives packets

In the TCP/IP layered network model, the protocol stack is divided into the physical layer, the link layer, the network layer, the transport layer and the application layer. The physical layer corresponds to the network card and the cable; the application layer corresponds to familiar things such as Nginx and FTP. What Linux implements is the link layer, the network layer and the transport layer.

In the Linux kernel implementation, the link layer protocol is implemented by the network card driver, while the kernel protocol stack implements the network layer and the transport layer. The kernel exposes the socket interface to the upper application layer for user processes to use. From a Linux perspective, the TCP/IP layered model looks like this:

Figure 1: The network protocol stack from Linux's point of view

In the Linux source tree, the network device driver code lives under drivers/net/ethernet, and the drivers for Intel series network cards are under the drivers/net/ethernet/intel directory. The protocol stack module code lives mainly in the kernel and net directories.

The kernel and the network device driver communicate by means of interrupts. When data arrives at the device, it triggers a voltage change on the related CPU pin to notify the CPU to process the data. For the network module, because the processing is complex and time-consuming, doing all of it in the interrupt handler would keep the handler (which runs at a very high priority) occupying the CPU for too long, and the CPU would then be unable to respond to other devices, such as mouse and keyboard events. Therefore Linux splits interrupt handling into a top half and a bottom half. The top half does only the simplest work, handles it quickly and releases the CPU, so that the CPU can accept other interrupts. Most of the remaining work is left to the bottom half, which can take its time. Since kernel 2.4 the bottom half has been implemented as softirqs, handled entirely by the ksoftirqd kernel threads. Unlike a hard interrupt, which works by applying a voltage change to a physical CPU pin, a softirq notifies the softirq handler by setting a bit of a variable in memory.

Okay, now that we have a general idea of the NIC driver, hard interrupts, softirqs and the ksoftirqd thread, we can sketch, based on these concepts, the path a packet takes when the kernel receives it:

Figure 2: Overview of packet reception in the Linux kernel

When data arrives at the network card, the first module in Linux to work is the network driver. The driver writes the frames received on the NIC into memory via DMA, then raises an interrupt to notify the CPU that data has arrived. Second, when the CPU receives the interrupt request, it calls the interrupt handler registered by the network driver. The NIC's interrupt handler does not do much work: it issues a softirq request and releases the CPU as soon as possible. ksoftirqd notices the softirq request and calls poll to start polling and receiving packets; after receiving them it hands them to the protocol stack, layer by layer, for processing. For UDP, the packet ends up in the receive queue of the user's socket.

From the picture above we can see how Linux handles a packet at a high level. But to really understand how the network module works, we have to keep digging.

Part Two

Linux startup

Before the driver, the kernel protocol stack and the other modules are able to receive packets from the network card, a lot of preparation has to be done. For example, the ksoftirqd kernel threads have to be created in advance, the handler of each protocol has to be registered, the network device subsystem has to be initialized, and the network card has to be brought up. Only after all of this is ready can packets actually start to be received. So let's first look at how this preparation work is done.

2.1 Creating the ksoftirqd kernel threads

Linux softirqs are all handled in dedicated kernel threads (ksoftirqd), so it is important to see how these threads are initialized; this will help us understand the packet receiving process more precisely later on. The number of such threads is not 1 but N, where N equals the number of cores on your machine.

During system initialization, spawn_ksoftirqd (located in kernel/softirq.c) is executed via early_initcall; it calls smpboot_register_percpu_thread in kernel/smpboot.c, which creates the softirqd threads.

Figure 3: Creating the ksoftirqd kernel threads

The relevant code is as follows :

//file: kernel/softirq.c
static struct smp_hotplug_thread softirq_threads = {
    .store          = &ksoftirqd,
    .thread_should_run  = ksoftirqd_should_run,
    .thread_fn      = run_ksoftirqd,
    .thread_comm        = "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void){
    register_cpu_notifier(&cpu_nfb);

    BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
    return 0;
}
early_initcall(spawn_ksoftirqd);

Once ksoftirqd is created, it enters its own thread loop functions, ksoftirqd_should_run and run_ksoftirqd, and keeps checking whether there are softirqs pending to be handled. One thing to note here is that there is not only the network softirq; there are other types as well:

//file: include/linux/interrupt.h
enum{
    HI_SOFTIRQ=0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,  
};

2.2 Network subsystem initialization

Figure 4: Network subsystem initialization

The Linux kernel initializes its subsystems by calling subsys_initcall; you can grep many calls to this function in the source tree. Here we are concerned with the initialization of the network subsystem, which executes the net_dev_init function.

//file: net/core/dev.c
static int __init net_dev_init(void){
    ......

    for_each_possible_cpu(i) {
        struct softnet_data *sd = &per_cpu(softnet_data, i);

        memset(sd, 0, sizeof(*sd));
        skb_queue_head_init(&sd->input_pkt_queue);
        skb_queue_head_init(&sd->process_queue);
        sd->completion_queue = NULL;
        INIT_LIST_HEAD(&sd->poll_list);
        ......
    }
    ......
    open_softirq(NET_TX_SOFTIRQ, net_tx_action);
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);
}
subsys_initcall(net_dev_init);

In this function, a softnet_data structure is allocated for every CPU. The poll_list inside this structure is where drivers will register their poll functions; we will see this later when the network card driver is initialized.

In addition, open_softirq registers a handler for each softirq: the handler of NET_TX_SOFTIRQ is net_tx_action, and that of NET_RX_SOFTIRQ is net_rx_action. Tracing into open_softirq we find that the registration is recorded in the softirq_vec variable. Later, when the ksoftirqd thread receives a softirq, it also uses this variable to find the handler for each type of softirq.

//file: kernel/softirq.c
void open_softirq(int nr, void (*action)(struct softirq_action *)){
    softirq_vec[nr].action = action;
}

2.3 Protocol stack registration

The kernel implements the IP protocol at the network layer, and the TCP and UDP protocols at the transport layer. The corresponding implementation functions of these protocols are ip_rcv(), tcp_v4_rcv() and udp_rcv(). Unlike the way we normally write code, the kernel wires them up through registration. fs_initcall in the Linux kernel, similar to subsys_initcall, is also an entry point for module initialization. fs_initcall calls inet_init, which starts the registration of the network protocol stack: through inet_init, these functions are registered into the inet_protos and ptype_base data structures, as shown below:

Figure 5: AF_INET protocol stack registration

The relevant code is as follows

//file: net/ipv4/af_inet.c
static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
};

static const struct net_protocol udp_protocol = {
    .handler =  udp_rcv,
    .err_handler =  udp_err,
    .no_policy =    1,
    .netns_ok = 1,
};

static const struct net_protocol tcp_protocol = {
    .early_demux    =   tcp_v4_early_demux,
    .handler    =   tcp_v4_rcv,
    .err_handler    =   tcp_v4_err,
    .no_policy  =   1,
    .netns_ok   =   1,
};

static int __init inet_init(void){
    ......
    if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
        pr_crit("%s: Cannot add ICMP protocol\n", __func__);
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        pr_crit("%s: Cannot add UDP protocol\n", __func__);
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);
    ......
    dev_add_pack(&ip_packet_type);
}

In the code above we can see that the handler in the udp_protocol structure is udp_rcv and the handler in the tcp_protocol structure is tcp_v4_rcv; they are brought in through inet_add_protocol.

int inet_add_protocol(const struct net_protocol *prot, unsigned char protocol){
    if (!prot->netns_ok) {
        pr_err("Protocol %u is not namespace aware, cannot register.\n",
            protocol);
        return -EINVAL;
    }

    return !cmpxchg((const struct net_protocol **)&inet_protos[protocol],
            NULL, prot) ? 0 : -1;
}

The inet_add_protocol function registers the handler functions of TCP and UDP into the inet_protos array. Now look at the dev_add_pack(&ip_packet_type); line: type in the ip_packet_type structure is the protocol name and func is the ip_rcv function; in dev_add_pack this gets registered into the ptype_base hash table.

//file: net/core/dev.c
void dev_add_pack(struct packet_type *pt){
    struct list_head *head = ptype_head(pt);
    ......
}

static inline struct list_head *ptype_head(const struct packet_type *pt){
    if (pt->type == htons(ETH_P_ALL))
        return &ptype_all;
    else
        return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}

What we need to remember here is that inet_protos records the addresses of the UDP and TCP handlers, and ptype_base stores the address of the ip_rcv() handler. Later we will see that the softirq finds the ip_rcv function address through ptype_base, so the IP packet is correctly handed to ip_rcv() for processing. Inside ip_rcv, the TCP or UDP handler is found through inet_protos, and the packet is then forwarded to udp_rcv() or tcp_v4_rcv().
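
To make this registration-and-dispatch idea concrete, here is a deliberately tiny user-space model of it -- my own illustration, not kernel code. One table plays the role of inet_protos (transport handlers indexed by the IP protocol number), and the my_ip_rcv stub plays the role of ip_rcv doing the second-level dispatch; the handler names with a my_ prefix are made up for this sketch.

#include <stdio.h>

typedef void (*pkt_handler)(const char *payload);

/* plays the role of inet_protos: transport handlers indexed by IP protocol number */
static pkt_handler inet_protos_model[256];

static void my_udp_rcv(const char *payload){ printf("udp handler: %s\n", payload); }
static void my_tcp_v4_rcv(const char *payload){ printf("tcp handler: %s\n", payload); }

/* plays the role of ip_rcv: look up the transport handler and hand the packet over */
static void my_ip_rcv(int ip_protocol, const char *payload){
    if (inet_protos_model[ip_protocol])
        inet_protos_model[ip_protocol](payload);   /* second-level dispatch */
}

int main(void){
    /* what inet_init does once at boot: register handlers by protocol number */
    inet_protos_model[17] = my_udp_rcv;        /* IPPROTO_UDP == 17 */
    inet_protos_model[6]  = my_tcp_v4_rcv;     /* IPPROTO_TCP == 6  */

    /* what the softirq path does per packet: dispatch by the protocol field */
    my_ip_rcv(17, "hello");                    /* an imaginary UDP packet */
    return 0;
}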

To extend a little: if you look at the code of ip_rcv and udp_rcv you can see the handling of many other protocols and features. For example, ip_rcv handles netfilter and iptables filtering; if you have many or very complex netfilter/iptables rules, those rules are executed in softirq context and will increase network latency. Another example: udp_rcv checks whether the socket receive queue is full; the related kernel parameters are net.core.rmem_max and net.core.rmem_default. If you are interested, I suggest reading the code of the inet_init function carefully.

2.4 Network card driver initialization

Every driver (not only the network card driver) uses module_init to register an initialization function with the kernel, and the kernel calls this function when the driver is loaded. For example, the igb network card driver code lives in drivers/net/ethernet/intel/igb/igb_main.c:

//file: drivers/net/ethernet/intel/igb/igb_main.c
static struct pci_driver igb_driver = {
    .name     = igb_driver_name,
    .id_table = igb_pci_tbl,
    .probe    = igb_probe,
    .remove   = igb_remove,
    ......
};
static int __init igb_init_module(void){
    ......
    ret = pci_register_driver(&igb_driver);
    return ret;
}

After the driver's pci_register_driver call completes, the Linux kernel knows about the driver, for example the igb driver's igb_driver_name and the address of its igb_probe function. When the network card device is recognized, the kernel calls its driver's probe method (igb_driver's probe method is igb_probe). The purpose of executing probe is to make the device ready. For the igb card, its igb_probe is located in drivers/net/ethernet/intel/igb/igb_main.c. The main operations are as follows:

Figure 6: Network card driver initialization

In step 5 we see that the network card driver implements the interfaces required by ethtool; it is here that the callback function addresses get registered. When ethtool issues a system call, the kernel finds the callback for the corresponding operation. For the igb card, the implementation functions are all under drivers/net/ethernet/intel/igb/igb_ethtool.c. Now you should fully understand how ethtool works: the reason this command can display the NIC's receive/transmit statistics, change the NIC's adaptive mode, and adjust the number and size of RX queues is that the ethtool command ultimately calls into the corresponding methods of the network card driver, not that ethtool itself has some superpower.
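
As a concrete illustration of this point: the ethtool command is, at bottom, just an ioctl(SIOCETHTOOL) on an ordinary socket, which the kernel routes to the driver's registered ethtool callbacks. Below is a minimal sketch (my own, not from the article) that asks the driver for its name much like `ethtool -i` does; "eth0" is a placeholder interface name.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void){
    int fd = socket(AF_INET, SOCK_DGRAM, 0);         /* any socket works as an ioctl handle */

    struct ethtool_drvinfo drvinfo = { .cmd = ETHTOOL_GDRVINFO };
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);     /* placeholder NIC name */
    ifr.ifr_data = (char *)&drvinfo;

    /* this ioctl ends up in the driver's registered ethtool callbacks */
    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("driver: %s, version: %s\n", drvinfo.driver, drvinfo.version);
    else
        perror("SIOCETHTOOL");

    close(fd);
    return 0;
}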

Step 6 registers igb_netdev_ops, which contains functions such as igb_open; igb_open is called when the network card is brought up.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static const struct net_device_ops igb_netdev_ops = {
  .ndo_open               = igb_open,
  .ndo_stop               = igb_close,
  .ndo_start_xmit         = igb_xmit_frame,
  .ndo_get_stats64        = igb_get_stats64,
  .ndo_set_rx_mode        = igb_set_rx_mode,
  .ndo_set_mac_address    = igb_set_mac,
  .ndo_change_mtu         = igb_change_mtu,
  .ndo_do_ioctl           = igb_ioctl,
 ......
};

In step 7, during the igb_probe initialization process, igb_alloc_q_vector is also called. It registers the poll function required by the NAPI mechanism; for the igb network card driver, that function is igb_poll, as shown in the code below.

static int igb_alloc_q_vector(struct igb_adapter *adapter,
                  int v_count, int v_idx,
                  int txr_count, int txr_idx,
                  int rxr_count, int rxr_idx){
    ......
    /* initialize NAPI */
    netif_napi_add(adapter->netdev, &q_vector->napi,
                   igb_poll, 64);
}

2.5 Bringing up the network card

After the above initialization is all done, the network card can be brought up. Recall from the NIC driver initialization that the driver registered a net_device_ops structure with the kernel, which contains callbacks (function pointers) for enabling the card, transmitting packets, setting the MAC address and so on. When a network card is enabled (for example, via ifconfig eth0 up), the igb_open method in net_device_ops is called. It usually does the following things:

Figure 7: Bringing up the network card

//file: drivers/net/ethernet/intel/igb/igb_main.c
static int __igb_open(struct net_device *netdev, bool resuming){
    /* allocate transmit descriptors */
    err = igb_setup_all_tx_resources(adapter);

    /* allocate receive descriptors */
    err = igb_setup_all_rx_resources(adapter);

    /*  Register interrupt handler  */
    err = igb_request_irq(adapter);
    if (err)
        goto err_req_irq;

    /*  Enable NAPI */
    for (i = 0; i < adapter->num_q_vectors; i++)
        napi_enable(&(adapter->q_vector[i]->napi));
    ......
}

The __igb_open function above calls igb_setup_all_tx_resources and igb_setup_all_rx_resources. In the igb_setup_all_rx_resources step, the RingBuffer is allocated and the mapping between memory and the Rx queues is set up. (The number and size of the Rx/Tx queues can be configured with ethtool.) Let's continue to the interrupt registration, igb_request_irq:

static int igb_request_irq(struct igb_adapter *adapter){
    if (adapter->msix_entries) {
        err = igb_request_msix(adapter);
        if (!err)
            goto request_done;
        ......
    }
}

static int igb_request_msix(struct igb_adapter *adapter){
    ......
    for (i = 0; i < adapter->num_q_vectors; i++) {
        ...
        err = request_irq(adapter->msix_entries[vector].vector,
                  igb_msix_ring, 0, q_vector->name,
                  q_vector);
    }
}

Tracing the function calls in the code above: __igb_open => igb_request_irq => igb_request_msix. In igb_request_msix we can see that, for a multi-queue network card, an interrupt is registered for every queue, and the corresponding interrupt handler is igb_msix_ring (this function is also in drivers/net/ethernet/intel/igb/igb_main.c). We can also see that in MSI-X mode each RX queue has its own independent MSI-X interrupt, so at the hardware interrupt level, packets received on different queues can be directed to different CPUs. (The CPU binding behavior can be changed via irqbalance, or by modifying /proc/irq/IRQ_NUMBER/smp_affinity.)

With the above preparation done, we can finally open the door and welcome the guests (data packets)!

Part Three

Welcoming the arrival of data

3.1 Hard interrupt handling

First, when a data frame arrives at the network card from the network cable, its first stop is the NIC's receive queue. The NIC finds an available memory location in the RingBuffer assigned to it, and then the DMA engine DMAs the data into that NIC-associated memory; at this point the CPU is completely unaware. When the DMA is done, the NIC raises a hard interrupt to the CPU to notify it that data has arrived.

Figure 8: Hard interrupt handling of network card data

Note: when the RingBuffer is full, new incoming packets are dropped. When you check a NIC with ifconfig, it shows an overruns field, which indicates packets dropped because the ring queue was full. If you see this kind of packet loss, you may need to use the ethtool command to enlarge the ring queue.
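
If you want to watch these drop counters without parsing ifconfig output, the kernel also exposes them under /sys/class/net/<dev>/statistics/. Below is a minimal sketch of my own (not from the article) that reads two of them; "eth0" is a placeholder device name, and on most drivers the rx_fifo_errors counter is what ifconfig reports as overruns.

#include <stdio.h>

static long read_counter(const char *path){
    long v = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        fscanf(f, "%ld", &v);
        fclose(f);
    }
    return v;
}

int main(void){
    printf("rx_dropped     : %ld\n",
           read_counter("/sys/class/net/eth0/statistics/rx_dropped"));
    printf("rx_fifo_errors : %ld\n",
           read_counter("/sys/class/net/eth0/statistics/rx_fifo_errors"));
    return 0;
}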

In the section on bringing up the network card, we mentioned that the hard interrupt handler registered by the NIC is igb_msix_ring.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static irqreturn_t igb_msix_ring(int irq, void *data){
    struct igb_q_vector *q_vector = data;

    /* Write the ITR value calculated from the previous interrupt. */
    igb_write_itr(q_vector);

    napi_schedule(&q_vector->napi);
    return IRQ_HANDLED;
}

igb_write_itr merely records the hardware interrupt frequency (reportedly used when reducing the interrupt rate to the CPU). Following the napi_schedule call all the way down: __napi_schedule => ____napi_schedule.

/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
                     struct napi_struct *napi){
    list_add_tail(&napi->poll_list, &sd->poll_list);
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

Here we see that list_add_tail modifies the poll_list in the per-CPU variable softnet_data, adding the napi_struct passed down by the driver to it. poll_list in softnet_data is a doubly linked list of devices that have input frames waiting to be processed. Then __raise_softirq_irqoff triggers the softirq NET_RX_SOFTIRQ; this so-called "trigger" is nothing more than an OR operation on a variable.

void __raise_softirq_irqoff(unsigned int nr){
    trace_softirq_raise(nr);
    or_softirq_pending(1UL << nr);
}
//file: include/linux/irq_cpustat.h
#define or_softirq_pending(x)  (local_softirq_pending() |= (x))

As we said, Linux does only the simple, necessary work in the hard interrupt and hands most of the processing over to the softirq. From the code above you can see that hard interrupt handling is really very short: it just records a register, modifies the CPU's poll_list, and then raises a softirq. That's it, the hard interrupt's work is done.
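
To make the mechanism concrete -- "raising" a softirq is just an OR on a pending mask, and ksoftirqd later looks the handler up in softirq_vec -- here is a deliberately simplified user-space model of my own (not kernel code): handlers are registered in an array indexed by softirq number, raise sets a bit, and a later loop dispatches every pending bit.

#include <stdio.h>

#define NR_SOFTIRQS    10
#define NET_RX_SOFTIRQ 3                        /* same slot as in the kernel enum above */

typedef void (*softirq_action_fn)(void);

static softirq_action_fn softirq_vec[NR_SOFTIRQS];   /* what open_softirq() fills in       */
static unsigned long pending;                         /* what or_softirq_pending() ORs into */

static void open_softirq(int nr, softirq_action_fn action){ softirq_vec[nr] = action; }
static void raise_softirq(int nr){ pending |= 1UL << nr; }   /* what the hard IRQ does */

static void net_rx_action(void){ printf("polling the drivers on poll_list...\n"); }

/* what run_ksoftirqd/__do_softirq boil down to: walk the pending bits and call each action */
static void do_softirq(void){
    unsigned long p = pending;
    pending = 0;
    for (int nr = 0; p; nr++, p >>= 1)
        if ((p & 1) && softirq_vec[nr])
            softirq_vec[nr]();
}

int main(void){
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);  /* done once at init, like net_dev_init  */
    raise_softirq(NET_RX_SOFTIRQ);                /* done in the hard interrupt handler    */
    do_softirq();                                 /* done later in ksoftirqd context       */
    return 0;
}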

3.2 The ksoftirqd kernel thread handles the softirq

Figure 9: The ksoftirqd kernel thread

When we introduced kernel thread initialization, we mentioned ksoftirqd's two thread functions, ksoftirqd_should_run and run_ksoftirqd. The code of ksoftirqd_should_run is as follows:

static int ksoftirqd_should_run(unsigned int cpu){
    return local_softirq_pending();
}

#define local_softirq_pending() \
    __IRQ_STAT(smp_processor_id(), __softirq_pending)

Here we see the same function that was called in the hard interrupt, local_softirq_pending. The difference is that in the hard interrupt it was used to write the flag, while here it only reads. If the hard interrupt set NET_RX_SOFTIRQ, it will naturally be read here. Next comes the real processing in the thread function run_ksoftirqd:

static void run_ksoftirqd(unsigned int cpu){
    local_irq_disable();
    if (local_softirq_pending()) {
        __do_softirq();
        rcu_note_context_switch(cpu);
        local_irq_enable();
        cond_resched();
        return;
    }
    local_irq_enable();
}

In __do_softirq, the registered action method is called according to the softirq types pending on the current CPU.

asmlinkage void __do_softirq(void){
    do {
        if (pending & 1) {
            unsigned int vec_nr = h - softirq_vec;
            int prev_count = preempt_count();
            ...
            trace_softirq_entry(vec_nr);
            h->action(h);
            trace_softirq_exit(vec_nr);
            ...
        }
        h++;
        pending >>= 1;
    } while (pending);
}

In the network subsystem initialization section we saw that the handler registered for NET_RX_SOFTIRQ is net_rx_action, so the net_rx_action function gets executed.

Note one detail here: both setting the softirq flag in the hard interrupt and ksoftirqd checking whether a softirq has arrived are based on smp_processor_id(). This means that whichever CPU the hard interrupt was handled on, the softirq is also handled on that same CPU. So if you find that softirq CPU consumption on your Linux machine is concentrated on one core, the approach is to adjust the CPU affinity of the hard interrupts, spreading the hard interrupts across different CPU cores.
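
A quick way to check whether NET_RX softirqs are piling up on one core is /proc/softirqs, which shows per-CPU counters for each softirq type. The small sketch below (my own illustration) just prints the header line and the NET_RX line of that file.

#include <stdio.h>
#include <string.h>

int main(void){
    char line[1024];
    FILE *f = fopen("/proc/softirqs", "r");
    if (!f) {
        perror("/proc/softirqs");
        return 1;
    }
    int lineno = 0;
    while (fgets(line, sizeof(line), f)) {
        /* the first line lists the CPUs; the NET_RX line holds the per-CPU counts */
        if (lineno++ == 0 || strstr(line, "NET_RX"))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}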

Let's now focus on the core function net_rx_action.

static void net_rx_action(struct softirq_action *h){
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget;
    void *have;

    local_irq_disable();
    while (!list_empty(&sd->poll_list)) {
        ......
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

        work = 0;
        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
            work = n->poll(n, weight);
            trace_napi_poll(n);
        }
        budget -= work;
    }
}

The time_limit and budget at the beginning of the function are used to make net_rx_action exit on its own; the purpose is to make sure that receiving network packets does not monopolize the CPU. The remaining packets will be processed the next time the NIC raises a hard interrupt. budget can be adjusted with a kernel parameter (net.core.netdev_budget). The rest of the core logic in this function is to get the current CPU's softnet_data variable, traverse its poll_list, and execute the poll function that the network card driver registered. For the igb card, that is the igb driver's igb_poll function.

static int igb_poll(struct napi_struct *napi, int budget){
    ...
    if (q_vector->tx.ring)
        clean_complete = igb_clean_tx_irq(q_vector);

    if (q_vector->rx.ring)
        clean_complete &= igb_clean_rx_irq(q_vector, budget);
    ...
}

On the read path, the key work of igb_poll is the call to igb_clean_rx_irq.

static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget){
    ...
    do {
        /* retrieve a buffer from the ring */
        skb = igb_fetch_rx_buffer(rx_ring, rx_desc, skb);

        /* fetch next buffer in frame if non-eop */
        if (igb_is_non_eop(rx_ring, rx_desc))
            continue;

        /* verify the packet layout is correct */
        if (igb_cleanup_headers(rx_ring, rx_desc, skb)) {
            skb = NULL;
            continue;
        }

        /* populate checksum, timestamp, VLAN, and protocol */
        igb_process_skb_fields(rx_ring, rx_desc, skb);

        napi_gro_receive(&q_vector->napi, skb);
}

The functions igb_fetch_rx_buffer and igb_is_non_eop take a data frame off the RingBuffer. Why are two functions needed? Because a frame may occupy more than one RingBuffer entry, so they are called in a loop until the end of the frame is reached. A received data frame is represented by an sk_buff. After the data is collected, some checks are performed, and then fields of the skb such as timestamp, VLAN id and protocol are set. Next, it goes into napi_gro_receive:

//file: net/core/dev.c
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb){
    skb_gro_reset_offset(skb);
    return napi_skb_finish(dev_gro_receive(napi, skb), skb);
}

dev_gro_receive implements the network card's GRO feature. It can be roughly understood as merging related small packets into one large packet, the goal being to reduce the number of packets passed up to the network stack, which helps reduce CPU usage. Let's ignore it for now and look directly at napi_skb_finish, which mainly calls netif_receive_skb.

//file: net/core/dev.c
static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb){
    switch (ret) {
    case GRO_NORMAL:
        if (netif_receive_skb(skb))
            ret = GRO_DROP;
        break;
    ......
}

In netif_receive_skb, the packet is sent into the protocol stack. Note that sections 3.3, 3.4 and 3.5 below also belong to the softirq processing path; they are split out into their own sections only because the text would otherwise be too long.

3.3 Network protocol stack processing

The netif_receive_skb function dispatches according to the packet's protocol. For a UDP packet, the packet is handed to the ip_rcv() and then udp_rcv() protocol handlers for processing.

Figure 10: Network protocol stack processing

//file: net/core/dev.c
int netif_receive_skb(struct sk_buff *skb){
    // RPS processing logic, ignore it for now
    ......
    return __netif_receive_skb(skb);
}

static int __netif_receive_skb(struct sk_buff *skb){
    ......
    ret = __netif_receive_skb_core(skb, false);
}

static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc){
    ......

    // pcap logic: data is delivered to the capture point here; this is where tcpdump gets its packets
    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
    ......
    list_for_each_entry_rcu(ptype,
            &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
}

In __netif_receive_skb_core I was excited to see the packet-capture point used by my old friend tcpdump -- time spent reading the source really is not wasted. Next, __netif_receive_skb_core takes the protocol out of the packet, then traverses the list of callbacks registered for this protocol. ptype_base is a hash table, which we mentioned in the protocol registration section; the address of the ip_rcv function is stored in that hash table.

//file: net/core/dev.c
static inline int deliver_skb(struct sk_buff *skb,
                  struct packet_type *pt_prev,
                  struct net_device *orig_dev){
    ......
    return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

The pt_prev->func line calls the handler registered by the protocol layer. For an IP packet, it enters ip_rcv (for an ARP packet, it would enter arp_rcv).

3.4 IP Protocol layer processing

Let's take a rough look at what Linux does at the IP protocol layer, and how the packet is passed further on to the UDP or TCP protocol handler.

//file: net/ipv4/ip_input.c
int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev){
    ......
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
               ip_rcv_finish);
}

Here NF_HOOK is a hook function. After the registered hooks have executed, it executes the function pointed to by the last argument, ip_rcv_finish.

static int ip_rcv_finish(struct sk_buff *skb){
    ......
    if (!skb_dst(skb)) {
        int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
                           iph->tos, skb->dev);
        ...
    }
    ......
    return dst_input(skb);
}

Tracing ip_route_input_noref, we see it then calls ip_route_input_mc. In ip_route_input_mc, the function ip_local_deliver is assigned to dst.input, as follows:

//file: net/ipv4/route.c
static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
                 u8 tos, struct net_device *dev, int our){
    if (our) {
        rth->dst.input= ip_local_deliver;
        rth->rt_flags |= RTCF_LOCAL;
    }
}

So we come back to the return dst_input(skb); in ip_rcv_finish.

/* Input packet from network to transport.  */
static inline int dst_input(struct sk_buff *skb){
    return skb_dst(skb)->input(skb);
}

The input method called by skb_dst(skb)->input is the ip_local_deliver assigned by the routing subsystem.

//file: net/ipv4/ip_input.c
int ip_local_deliver(struct sk_buff *skb){
    /*
     *  Reassemble IP fragments.
     */
    if (ip_is_fragment(ip_hdr(skb))) {
        if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
            return 0;
    }

    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
               ip_local_deliver_finish);
}

static int ip_local_deliver_finish(struct sk_buff *skb){
    ......
    int protocol = ip_hdr(skb)->protocol;
    const struct net_protocol *ipprot;

    ipprot = rcu_dereference(inet_protos[protocol]);
    if (ipprot != NULL) {
        ret = ipprot->handler(skb);
    }
}

As shown in the protocol registration section, inet_protos stores the function addresses of tcp_v4_rcv() and udp_rcv(). Here the dispatch is chosen according to the protocol type in the packet, and the skb is handed further up to the upper-layer protocol, UDP or TCP.

3.5 UDP Protocol layer processing

As we said in the protocol registration section, the handler of the UDP protocol is udp_rcv.

//file: net/ipv4/udp.c
int udp_rcv(struct sk_buff *skb){
    return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
}
int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
           int proto){
    sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);

    if (sk != NULL) {
        int ret = udp_queue_rcv_skb(sk, skb);
    }
    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
}

__udp4_lib_lookup_skb looks up the corresponding socket based on the skb; when one is found, the packet is placed into that socket's receive queue. If none is found, an ICMP destination-unreachable packet is sent back.

//file: net/ipv4/udp.c
int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb){  
    ......
    if (sk_rcvqueues_full(sk, skb, sk->sk_rcvbuf))
        goto drop;

    rc = 0;

    ipv4_pktinfo_prepare(skb);
    bh_lock_sock(sk);
    if (!sock_owned_by_user(sk))
        rc = __udp_queue_rcv_skb(sk, skb);
    else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
        bh_unlock_sock(sk);
        goto drop;
    }
    bh_unlock_sock(sk);
    return rc;
}

sock_owned_by_user checks whether a user process is currently in a system call on this socket (i.e., the socket is occupied). If not, the packet can be placed directly into the socket's receive queue. If it is, the packet is added to the backlog queue via sk_add_backlog. When the user process releases the socket, the kernel checks the backlog queue and, if there is data, moves it to the receive queue.

If sk_rcvqueues_full finds that the receive queue is full, the packet is dropped directly. The receive queue size is influenced by the kernel parameters net.core.rmem_max and net.core.rmem_default.
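
From the application side, the per-socket receive buffer can also be requested with setsockopt(SO_RCVBUF); the value the kernel actually grants is capped by net.core.rmem_max (and the kernel stores roughly double what you ask for, to account for bookkeeping overhead). A minimal sketch of my own, not from the article:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void){
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    int requested = 4 * 1024 * 1024;   /* ask for a 4 MB receive buffer (arbitrary example) */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));

    int granted = 0;
    socklen_t len = sizeof(granted);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len);

    /* if granted is much smaller than expected, net.core.rmem_max is the ceiling to raise */
    printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);

    close(fd);
    return 0;
}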

Part Four

The recvfrom system call

Two flowers bloom, each told in its turn. We have now walked through the entire process of the Linux kernel receiving and processing a packet, which ends with the packet being placed into the socket's receive queue. Let's now come back to what happens after the user process calls recvfrom. The recvfrom we call in our code is a glibc library function; when executed, it traps the user process into kernel mode and enters the system call sys_recvfrom implemented by Linux. Before looking at Linux's sys_recvfrom, let's first take a brief look at the core data structure socket. This data structure is so large that we draw only the parts relevant to today's topic, as follows:

Figure 11: How the socket data structure is organized in the kernel

The const struct proto_ops in the socket data structure corresponds to the protocol's set of methods. Each protocol implements its own set; for the AF_INET (IPv4) protocol family, each protocol has its corresponding handlers, as shown below. For UDP they are defined in inet_dgram_ops, where the inet_recvmsg method is registered.

//file: net/ipv4/af_inet.c
const struct proto_ops inet_stream_ops = {
    ......
    .recvmsg       = inet_recvmsg,
    .mmap          = sock_no_mmap,
    ......
};

const struct proto_ops inet_dgram_ops = {
    ......
    .sendmsg       = inet_sendmsg,
    .recvmsg       = inet_recvmsg,
    ......
};

Another member of the socket data structure, struct sock *sk, is a huge and very important substructure. Its sk_prot defines the second-level set of handlers. For the UDP protocol, it is set to udp_prot, the method set of the UDP protocol implementation.

//file: net/ipv4/udp.c
struct proto udp_prot = {
    .name          = "UDP",
    .owner         = THIS_MODULE,
    .close         = udp_lib_close,
    .connect       = ip4_datagram_connect,
    ......
    .sendmsg       = udp_sendmsg,
    .recvmsg       = udp_recvmsg,
    .sendpage      = udp_sendpage,
    ......
};

Having looked at the socket variable, let's look at the implementation flow of sys_recvfrom.

Figure 12: Internal flow of the recvfrom system call

In inet_recvmsg, sk->sk_prot->recvmsg is called.

//file: net/ipv4/af_inet.c
int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,size_t size, int flags){  
    ......
    err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
                   flags & ~MSG_DONTWAIT, &addr_len);
    if (err >= 0)
        msg->msg_namelen = addr_len;
    return err;
}

As we said above, for a socket of the UDP protocol this sk_prot is the struct proto udp_prot in net/ipv4/udp.c, and so we arrive at the udp_recvmsg method. The core receive work of udp_recvmsg is done by __skb_recv_datagram (in net/core/datagram.c):

//file:net/core/datagram.c:EXPORT_SYMBOL(__skb_recv_datagram);
struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,int *peeked, int *off, int *err){
    ......
    do {
        struct sk_buff_head *queue = &sk->sk_receive_queue;
        skb_queue_walk(queue, skb) {
            ......
        }

        /* User doesn't want to wait */
        error = -EAGAIN;
        if (!timeo)
            goto no_packet;
    } while (!wait_for_more_packets(sk, err, &timeo, last));
}

Finally we find the point we wanted to see: the so-called "read" above is simply accessing sk->sk_receive_queue. If there is no data there and the user is allowed to wait, wait_for_more_packets() is called to wait, which puts the user process to sleep.
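
Whether the process is allowed to wait here is exactly what the user side controls through timeo: with MSG_DONTWAIT (or a socket receive timeout), recvfrom does not sleep forever and instead returns EAGAIN/EWOULDBLOCK. A minimal sketch of both knobs, my own illustration with an arbitrary port number:

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void){
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8888);                   /* arbitrary example port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    /* knob 1: a receive timeout -- the wait sleeps at most this long */
    struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    char buf[1024];
    ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
    if (n < 0)
        printf("timed out: %s\n", strerror(errno));

    /* knob 2: MSG_DONTWAIT -- timeo is 0, so the call returns at once if the queue is empty */
    n = recvfrom(fd, buf, sizeof(buf), MSG_DONTWAIT, NULL, NULL);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("no data queued, not going to sleep\n");
    return 0;
}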

Part Five

Summary

The network module is the most complex module in the Linux kernel. A seemingly simple packet-receiving process involves the interaction of many kernel components, such as the network card driver, the protocol stack, and the kernel ksoftirqd threads. It looks complicated, but this article has tried, by way of diagrams, to explain the kernel's packet receiving process in an easy-to-understand way. Now let's string the whole process together.

After the user calls recvfrom, the user process enters kernel mode through a system call. If the receive queue has no data, the process goes to sleep and is suspended by the operating system. This part is relatively simple; most of the rest of the show is performed by the other modules of the Linux kernel.

First of all, before packets can be received, Linux has to do a lot of preparation:

  • 1. Create the ksoftirqd threads and set up their thread functions, so that they can handle softirqs later

  • 2. Register the protocol stack. Linux implements many protocols, such as ARP, ICMP, IP, UDP and TCP; each protocol registers its own handler so that the right one can be found quickly when a packet arrives

  • 3. Initialize the network card driver. Every driver has an initialization function, which the kernel calls. During this initialization, the DMA is prepared and the address of the NAPI poll function is made known to the kernel

  • 4. Bring up the network card, allocate the RX and TX queues, and register the corresponding interrupt handlers

The above is the important work the kernel does before it is ready to receive packets. Only when all of this is ready can the hard interrupts be enabled and the arrival of packets awaited.

When the data arrives, the first thing to greet it is the network card (well, isn't that stating the obvious):

  • 1. The network card DMAs the data frame into the RingBuffer in memory, then raises an interrupt to notify the CPU

  • 2. The CPU responds to the interrupt request and calls the interrupt handler that was registered when the network card was brought up

  • 3. The interrupt handler does almost nothing; it just raises a softirq request

  • 4. The kernel thread ksoftirqd notices that a softirq request has arrived and first disables hard interrupts

  • 5. The ksoftirqd thread then calls the driver's poll function to receive packets

  • 6. The poll function hands the received packets to the ip_rcv function

  • 7. The ip_rcv function passes the packet on to udp_rcv (for TCP packets, to tcp_v4_rcv)

Now we can return to the opening question: behind the single recvfrom line we see at the user level, the Linux kernel has to do all of this work before the data ends up in our hands. And this is just plain UDP; for TCP the kernel has even more to do. One cannot help but marvel at how much thought the kernel developers have put in.

After understanding the whole receive process, we can also clearly see where the CPU cost of receiving a packet in Linux goes. The first part is the overhead of the system call trapping into kernel mode. The second is the CPU overhead of servicing the packet's hard interrupt. The third is the overhead of the softirq context in the ksoftirqd kernel thread. In a later article we will actually measure these costs.

Also, there are many details of sending and receiving that we did not expand on, for example no-NAPI mode, GRO, RPS and so on. Because I feel covering too much would hurt your grasp of the overall flow, we tried to keep only the main skeleton. Less is more!
