Programming Knowledge — cdmana.com

The Linux file system and file cache



From: luozhiyun
Link: https://www.cnblogs.com/luozhiyun/p/13061199.html

Linux File system

File system features

  1. A file system needs a strict organizational structure, so that files can be stored in blocks.

  2. It also needs an index area, used to find where the blocks of a file are stored.

  3. If some files are hot files that have recently been read and written often, the file system should have a cache layer.

  4. Files should be organized into folders for easy management and lookup.

  5. The Linux kernel maintains a set of data structures in its own memory to track which files are opened and used by which processes.

In short, these are the main functions a file system provides:

The ext family of file system formats

inodes and block storage

The hard disk is divided into units of equal size called blocks (Block). The block size is an integral multiple of the sector size and defaults to 4KB; it can be set when the disk is formatted.

Dividing a big disk into small blocks, each holding part of a file's data, means that when we store a file we do not have to allocate it one contiguous region: we can store it piece by piece. This is much more flexible and makes it easier to add, delete, and insert data.

An inode is a file index, and every file has one. A folder is itself a file, so it also has a corresponding inode.

The inode data structure looks like this:

struct ext4_inode {
  __le16  i_mode;    /* File mode */
  __le16  i_uid;    /* Low 16 bits of Owner Uid */
  __le32  i_size_lo;  /* Size in bytes */
  __le32  i_atime;  /* Access time */
  __le32  i_ctime;  /* Inode Change time */
  __le32  i_mtime;  /* Modification time */
  __le32  i_dtime;  /* Deletion Time */
  __le16  i_gid;    /* Low 16 bits of Group Id */
  __le16  i_links_count;  /* Links count */
  __le32  i_blocks_lo;  /* Blocks count */
  __le32  i_flags;  /* File flags */
......
  __le32  i_block[EXT4_N_BLOCKS];/* Pointers to blocks */
  __le32  i_generation;  /* File version (for NFS) */
  __le32  i_file_acl_lo;  /* File ACL */
  __le32  i_size_high;
......
};

The inode records the file's read/write permissions (i_mode), which user it belongs to (i_uid), which group (i_gid), its size (i_size_lo), and how many blocks it occupies (i_blocks_lo). i_atime is the access time, the last time the file was accessed; i_ctime is the change time, the last time the inode itself was changed; i_mtime is the modify time, the last time the file's contents were modified.
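From user space, most of these fields surface through stat(2): st_mode, st_uid, st_gid, st_size, st_blocks, and st_atime/st_ctime/st_mtime map onto the inode fields above. A minimal sketch; the path, content, and function name are arbitrary examples, not kernel code:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a small file, then print the inode fields that stat(2) exposes.
 * Returns the file size reported by the inode, or -1 on error. */
long inode_size_of(const char *path, const char *content)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, content, strlen(content)) < 0) {  /* file data */
        close(fd);
        return -1;
    }
    close(fd);

    struct stat st;                  /* filled from the file's inode */
    if (stat(path, &st) < 0)
        return -1;
    printf("mode=%o uid=%d size=%lld blocks=%lld\n",
           st.st_mode & 0777, (int)st.st_uid,
           (long long)st.st_size, (long long)st.st_blocks);
    return (long)st.st_size;         /* i_size as seen from user space */
}
```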

The file's data is all reached through i_block. How i_block is interpreted is determined by EXT4_N_BLOCKS, which is defined as follows:

#define  EXT4_NDIR_BLOCKS    12
#define  EXT4_IND_BLOCK      EXT4_NDIR_BLOCKS
#define  EXT4_DIND_BLOCK      (EXT4_IND_BLOCK + 1)
#define  EXT4_TIND_BLOCK      (EXT4_DIND_BLOCK + 1)
#define  EXT4_N_BLOCKS      (EXT4_TIND_BLOCK + 1)

In ext2 and ext3, the first 12 entries store block locations directly: through i_block[0-11] we can directly reach the blocks holding the file's contents.

However, if a file is large, 12 blocks are not enough. From i_block[12] onward we cannot keep pointing at data blocks directly, or i_block would run out immediately.

Instead, i_block[12] points to a block that holds not data but the locations of data blocks; this is called an indirect block. If the file is bigger still, i_block[13] points to a double indirect block: it stores the locations of indirect blocks, the indirect blocks store the locations of data blocks, and the data blocks hold the real data. For an even bigger file, i_block[14] works the same way with one more level of indirection.

This has a very noticeable problem: for big files we have to read the disk several times just to find the right block, which makes access slower.
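The limit of this scheme is easy to work out. A sketch of the arithmetic, assuming 4KB blocks and 4-byte block numbers (so one indirect block holds 1024 pointers); the function name is ours, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* Maximum file size addressable through the ext2/ext3 i_block layout:
 * 12 direct blocks, one indirect, one double indirect, one triple indirect. */
uint64_t ext2_max_file_bytes(uint64_t block_size)
{
    uint64_t ptrs = block_size / 4;          /* pointers per indirect block */
    uint64_t blocks = 12                     /* i_block[0..11]: direct      */
                    + ptrs                   /* i_block[12]: indirect       */
                    + ptrs * ptrs            /* i_block[13]: double ind.    */
                    + ptrs * ptrs * ptrs;    /* i_block[14]: triple ind.    */
    return blocks * block_size;
}
```

With 4KB blocks this comes out to just over 4TB, dominated by the triple indirect level — but reaching any of those far blocks costs several extra disk reads.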

To solve this problem, ext4 made some changes and introduced a new concept called extents. Suppose a file of 128MB is stored in 4KB blocks: it needs 32k blocks. Listing them all, ext2/ext3 style, takes far too many entries; an extent instead describes a run of contiguous blocks, so the whole 128MB can be recorded in a single extent. This improves read/write performance for large files and also reduces fragmentation.

Extents form a tree structure:

Each node starts with a header; ext4_extent_header describes a node.

struct ext4_extent_header {
  __le16  eh_magic;  /* probably will support different formats */
  __le16  eh_entries;  /* number of valid entries */
  __le16  eh_max;    /* capacity of store in entries */
  __le16  eh_depth;  /* has tree real underlying blocks? */
  __le32  eh_generation;  /* generation of the tree */
};

eh_entries is the number of valid entries in this node. Entries come in two kinds. In a leaf node, each entry points directly to the address of contiguous blocks on disk; these are data entries, ext4_extent. In a branch node, each entry points to a branch or leaf node at the next level; these are index entries, ext4_extent_idx. Both kinds of entry are 12 bytes.

/*
 * This is the extent on-disk structure.
 * It's used at the bottom of the tree.
 */
struct ext4_extent {
  __le32  ee_block;  /* first logical block extent covers */
  __le16  ee_len;    /* number of blocks covered by extent */
  __le16  ee_start_hi;  /* high 16 bits of physical block */
  __le32  ee_start_lo;  /* low 32 bits of physical block */
};
/*
 * This is index on-disk structure.
 * It's used at all the levels except the bottom.
 */
struct ext4_extent_idx {
  __le32  ei_block;  /* index covers logical blocks from 'block' */
  __le32  ei_leaf_lo;  /* pointer to the physical block of the next *
         * level. leaf or next index could be there */
  __le16  ei_leaf_hi;  /* high 16 bits of physical block */
  __u16  ei_unused;
};

If the file is small, the inode's i_block area can directly hold one ext4_extent_header plus 4 ext4_extent entries. In that case eh_depth is 0: the node inside the inode is a leaf, and the tree has height 0.

If the file is large and 4 extents are not enough, the tree splits. Nodes with eh_depth > 0 are index nodes; the root node, which lives in the inode, has the largest depth, and the leaf nodes at the bottom have eh_depth = 0.

Apart from the root, each node is stored in a 4KB block. After subtracting the 12-byte ext4_extent_header, a block can hold 340 entries. Since each extent can describe at most 128MB of data, 340 extents let a single leaf node cover 42.5GB of a file.
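These numbers can be checked directly. A sketch, hard-coding the 12-byte header and entry sizes from the structs above and the 2^15-block cap per extent; the helper names are ours:

```c
#include <assert.h>
#include <stdint.h>

#define EXT_HEADER_BYTES 12U   /* sizeof(struct ext4_extent_header) */
#define EXT_ENTRY_BYTES  12U   /* sizeof(struct ext4_extent)        */

/* Entries that fit in one 4KB non-root extent node. */
uint32_t entries_per_node(uint32_t block_size)
{
    return (block_size - EXT_HEADER_BYTES) / EXT_ENTRY_BYTES;
}

/* One extent covers at most 2^15 = 32768 blocks (128MB with 4KB blocks);
 * a full leaf node therefore covers entries_per_node * 128MB. */
uint64_t max_bytes_per_leaf(uint32_t block_size)
{
    uint64_t per_extent = 32768ULL * block_size;
    return (uint64_t)entries_per_node(block_size) * per_extent;
}
```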

inode Bitmaps and block bitmaps

The inode bitmap is 4KB, and each bit corresponds to one inode: 1 means the inode is in use, 0 means it is free. The block bitmap works the same way.

On Linux, creating a new file goes through the open function with the O_CREAT flag, which means: if the file cannot be found, create one. The open call roughly proceeds like this: to open a file, first walk the path to find its folder; if the file does not exist under that folder and O_CREAT is set, a new file is created there.

Creating a file means allocating an inode: the file system reads the inode bitmap and finds the next bit that is 0, i.e. a free inode. The same process happens with the block bitmap when writing to a file.
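The bitmap scan itself is straightforward. A simplified sketch of finding and claiming a free slot (real ext4 does this per block group, with locking; the function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* Return the index of the first 0 bit in the bitmap (a free inode/block),
 * or -1 if every bit is set. */
int find_first_free(const uint8_t *bitmap, int nbits)
{
    for (int i = 0; i < nbits; i++)
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            return i;
    return -1;
}

/* Claim a slot by setting its bit to 1. */
void mark_used(uint8_t *bitmap, int bit)
{
    bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
}
```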

File system format

The bitmap for data blocks occupies one block of 4KB. Each bit represents a data block, so it can describe at most $4 \times 1024 \times 8 = 2^{15}$ data blocks. With the default block size of 4KB, the representable space is at most $2^{15} \times 4 \times 1024 = 2^{27}$ bytes, i.e. 128MB, which is clearly not enough.

This is where block groups come in. The data structure is ext4_group_desc, which has member variables for one block group's inode bitmap (bg_inode_bitmap_lo), block bitmap (bg_block_bitmap_lo), and inode table (bg_inode_table_lo).

Block group after block group, this essentially makes up the structure of the whole file system. Because there are many block groups, the block group descriptors themselves form a list, called the block group descriptor table.

We also need a data structure that describes the file system as a whole: the superblock, ext4_super_block. It holds global information such as how many inodes the file system has (s_inodes_count), how many blocks (s_blocks_count_lo), how many inodes per block group (s_inodes_per_group), and how many blocks per block group (s_blocks_per_group).

Finally, the overall file system layout looks like this.

By default, a copy of the superblock and the block group descriptor table is stored in every block group, so that losing this data cannot make the entire file system unopenable.

But keeping a complete block group descriptor table in every block group wastes space. Worse, since a block group is at most 128MB, the number of entries the descriptor table can hold limits how many block groups there can be, and 128MB times the number of block groups caps the size of the whole file system.

So the Meta Block Groups feature was introduced.

First, the block group descriptor table no longer holds descriptors for all block groups. Instead, the block groups are divided into groups of groups, called meta block groups (Meta Block Group). The descriptor table inside each meta block group covers only its own block groups. A meta block group contains 64 block groups, so its descriptor table has at most 64 entries.

Suppose there are 256 block groups. Before, the descriptor table was one big list of 256 entries, and backing it up meant backing up all of them. Now it is split into 4 meta block groups, each with its own 64-entry descriptor table, which is much smaller, and each of the four backs up its own.

As the figure shows, each meta block group contains 64 block groups, and its 64-entry descriptor table is kept in three copies: at the start of the first, the second, and the last block group in the meta block group.

If the sparse_super feature is enabled, copies of the superblock and block group descriptor table are kept only in block groups whose index is 0 or an integer power of 3, 5, or 7. So in the figure above, the superblock copies appear only at those indexes.
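This placement rule can be expressed as a small predicate. A sketch (block group 1 also keeps a copy, since 1 is the zeroth power of 3, 5, and 7; the function names are ours, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if n is an exact power of base (expects base >= 2, n >= 2). */
static bool is_power_of(uint64_t n, uint64_t base)
{
    while (n % base == 0)
        n /= base;
    return n == 1;
}

/* With sparse_super, superblock/descriptor-table backups live only in
 * block groups 0 and 1, and in groups whose index is a power of 3, 5, or 7. */
bool group_has_super(uint64_t group)
{
    if (group <= 1)
        return true;
    return is_power_of(group, 3) || is_power_of(group, 5) || is_power_of(group, 7);
}
```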

The storage format of the directory

In fact, a directory is itself a file and also has an inode, and the inode likewise points to blocks. What differs from an ordinary file is the content of those blocks: an ordinary file's blocks store file data, while a directory file's blocks store an entry for each item in the directory. This information is called ext4_dir_entry.

In a directory file's blocks, the simplest storage format is a list: each entry holds the name of a file directly under this directory and its inode number, through which the real file can be found. The first entry is ".", the current directory; the second is "..", the parent directory; after those come the file names and inode numbers.
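The linear format can be sketched as a packed list of variable-length records, each carrying the offset to the next. This struct is a simplified stand-in for ext4's directory entry, not the on-disk layout byte for byte, and the helper names are ours:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified directory entry: packed back to back in the directory block,
 * each record storing its own length so a reader can hop entry to entry. */
struct dir_entry {
    uint32_t inode;     /* inode number of the entry         */
    uint16_t rec_len;   /* bytes from this entry to the next */
    uint8_t  name_len;  /* length of the name                */
    char     name[];    /* name, not NUL-terminated          */
};

/* Append an entry at offset `off`; returns the next free offset.
 * Records are kept 4-byte aligned, as ext4 does. */
uint16_t dir_add(uint8_t *block, uint16_t off, uint32_t ino, const char *name)
{
    struct dir_entry *de = (struct dir_entry *)(block + off);
    de->inode = ino;
    de->name_len = (uint8_t)strlen(name);
    de->rec_len = (uint16_t)((8 + de->name_len + 3) & ~3u);
    memcpy(de->name, name, de->name_len);
    return off + de->rec_len;
}

/* Linear scan of one directory block; returns the inode for `name`, or 0. */
uint32_t dir_lookup(const uint8_t *block, uint16_t block_size, const char *name)
{
    uint16_t off = 0;
    size_t want = strlen(name);
    while (off < block_size) {
        const struct dir_entry *de = (const struct dir_entry *)(block + off);
        if (de->rec_len == 0)
            break;                          /* end of used entries */
        if (de->inode && de->name_len == want &&
            memcmp(de->name, name, want) == 0)
            return de->inode;
        off += de->rec_len;
    }
    return 0;
}
```

An entry whose inode is 0 is treated as unused, which is why the lookup skips it.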

If the EXT4_INDEX_FL flag is set in the inode, files are instead looked up through an index. Each index entry maintains a mapping from the hash of a file name to a data block.

To look up a file name under such a directory, hash the name. If the hash matches an index entry, the file's information is in the corresponding block. Open that block: if it is no longer an index but a leaf node of the index tree, it again holds a list of ext4_dir_entry, and we just search the names one by one. Through the index tree, the many files under one directory can be scattered across many blocks and still be found quickly.

The file cache in Linux

ext4 File system layer

For the ext4 file system, the kernel defines an ext4_file_operations.

const struct file_operations ext4_file_operations = {
......
  .read_iter  = ext4_file_read_iter,
  .write_iter  = ext4_file_write_iter,
......
};

ext4_file_read_iter calls generic_file_read_iter, and ext4_file_write_iter calls __generic_file_write_iter.

ssize_t
generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
......
    if (iocb->ki_flags & IOCB_DIRECT) {
......
        struct address_space *mapping = file->f_mapping;
......
        retval = mapping->a_ops->direct_IO(iocb, iter);
    }
......
    retval = generic_file_buffered_read(iocb, iter, retval);
}




ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
......
    if (iocb->ki_flags & IOCB_DIRECT) {
......
        written = generic_file_direct_write(iocb, from);
......
    } else {
......
    written = generic_perform_write(file, from, iocb->ki_pos);
......
    }
}

generic_file_read_iter and __generic_file_write_iter follow similar logic: both distinguish whether the cache is used. So, depending on whether memory is used as a cache, file I/O operations can be divided into two types.

The first type is buffered I/O, the default for most file systems. For a read, the operating system checks whether the kernel buffer already holds the required data: if it is cached, it is returned directly from the cache; otherwise it is read from disk and then cached. For a write, the operating system first copies the data from user space into the kernel cache; at that point, as far as the user program is concerned, the write is complete. It is up to the operating system to decide when to actually write to disk, unless the sync command is called explicitly.

The second type is direct I/O: the application accesses the disk data directly, bypassing the kernel buffer, which avoids copying data between the kernel cache and the user program.

In the write path, if __generic_file_write_iter finds IOCB_DIRECT set, it calls generic_file_direct_write, which in turn calls the direct_IO function of the address_space to write the data straight to disk.

Write operations with cache

Let's first look at the buffered write function, generic_perform_write.

ssize_t generic_perform_write(struct file *file,
        struct iov_iter *i, loff_t pos)
{
  struct address_space *mapping = file->f_mapping;
  const struct address_space_operations *a_ops = mapping->a_ops;
  do {
    struct page *page;
    unsigned long offset;  /* Offset into pagecache page */
    unsigned long bytes;  /* Bytes to write to page */
    status = a_ops->write_begin(file, mapping, pos, bytes, flags,
            &page, &fsdata);
    copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
    flush_dcache_page(page);
    status = a_ops->write_end(file, mapping, pos, bytes, copied,
            page, fsdata);
    pos += copied;
    written += copied;




    balance_dirty_pages_ratelimited(mapping);
  } while (iov_iter_count(i));
}

The loop mainly does the following:

  • For each page, first call the address_space's write_begin to do some preparation;

  • Call iov_iter_copy_from_user_atomic to copy the content being written from user space into the kernel page;

  • Call the address_space's write_end to complete the write;

  • Call balance_dirty_pages_ratelimited to check whether there are too many dirty pages that need writing back to disk. A dirty page is one that has been written in the cache but not yet written to disk.

The first step calls ext4_write_begin, which mainly does two things:

First, journal-related work.

ext4 is a journaling file system. To prevent data loss on a sudden power failure, it introduces journal modes. A journaling file system has one more area than a non-journaling one: the journal (Journal). A file in ext4 is stored in two parts, the file's metadata and its data, and the operation logs for metadata and data are managed separately in the journal. You choose the journal mode when mounting ext4. In journal mode, before data is written into the file system, both the metadata journal and the data journal must reach the disk first. This performs worst but is the safest.

Another mode is ordered. It journals only metadata, not data, but before writing the metadata journal it must make sure the data itself is already on disk. This compromise is the default mode.

There is one more mode, writeback, which journals only metadata and does not even guarantee that the data reaches the disk before the metadata does. This performs best but is the least safe.

The second thing it does is call grab_cache_page_write_begin to get the cache page that should be written.

struct page *grab_cache_page_write_begin(struct address_space *mapping,
          pgoff_t index, unsigned flags)
{
  struct page *page;
  int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT;
  page = pagecache_get_page(mapping, index, fgp_flags,
      mapping_gfp_mask(mapping));
  if (page)
    wait_for_stable_page(page);
  return page;
}

In the kernel, the cache is kept in memory page by page. Every open file has a struct file, and every struct file has a struct address_space used to associate the file with memory; inside this structure is a tree used to hold all the cache pages associated with the file.

The second step calls iov_iter_copy_from_user_atomic. The allocated page is first mapped to a kernel virtual address with kmap_atomic, the user-space data is then copied to that kernel address, and kunmap_atomic removes the mapping.

size_t iov_iter_copy_from_user_atomic(struct page *page,
    struct iov_iter *i, unsigned long offset, size_t bytes)
{
  char *kaddr = kmap_atomic(page), *p = kaddr + offset;
  iterate_all_kinds(i, bytes, v,
    copyin((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
    memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
         v.bv_offset, v.bv_len),
    memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
  )
  kunmap_atomic(kaddr);
  return bytes;
}

The third step calls ext4_write_end to complete the write. It calls ext4_journal_stop to finish writing the journal, and then block_write_end -> __block_commit_write -> mark_buffer_dirty to mark the modified cache page dirty. As you can see, "completing" the write does not actually write anything to disk: the data is only written into the cache and the page is marked dirty.

The fourth step calls balance_dirty_pages_ratelimited to deal with dirty pages.

/**
 * balance_dirty_pages_ratelimited - balance dirty memory state
 * @mapping: address_space which was dirtied
 *
 * Processes which are dirtying memory should call in here once for each page
 * which was newly dirtied.  The function will periodically check the system's
 * dirty state and will initiate writeback if needed.
  */
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
  struct inode *inode = mapping->host;
  struct backing_dev_info *bdi = inode_to_bdi(inode);
  struct bdi_writeback *wb = NULL;
  int ratelimit;
......
  if (unlikely(current->nr_dirtied >= ratelimit))
    balance_dirty_pages(mapping, wb, current->nr_dirtied);
......
}

Inside balance_dirty_pages_ratelimited, if the number of dirty pages is found to exceed the configured limit, balance_dirty_pages -> wb_start_background_writeback is called, which starts a background thread to write pages back.

Several other scenarios also trigger writeback:

  • The user calls sync to flush the cache to disk, which eventually calls wakeup_flusher_threads to synchronize the dirty pages;

  • When memory is very tight and pages cannot be allocated, free_more_memory is called, which eventually calls wakeup_flusher_threads to release dirty pages;

  • Dirty pages have stayed dirty longer than the configured time and need to be written back promptly to keep memory and disk consistent.
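The first trigger, an explicit sync, is also available per file from user space: fsync forces a file's dirty pages (and metadata) to disk instead of waiting for background writeback. A minimal sketch; the path and function name are arbitrary examples:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Buffered write followed by an explicit flush: write() completes once the
 * data sits in the page cache; fsync() forces the dirty pages to disk.
 * Returns 0 on success, -1 on error. */
int write_and_flush(const char *path, const char *data)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, data, strlen(data));  /* lands in the page cache */
    if (n != (ssize_t)strlen(data) || fsync(fd) < 0) {  /* force writeback */
        close(fd);
        return -1;
    }
    return close(fd);
}
```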

Read operations with cache

Reading with the cache goes through the function generic_file_buffered_read.

static ssize_t generic_file_buffered_read(struct kiocb *iocb,
    struct iov_iter *iter, ssize_t written)
{
  struct file *filp = iocb->ki_filp;
  struct address_space *mapping = filp->f_mapping;
  struct inode *inode = mapping->host;
  for (;;) {
    struct page *page;
    pgoff_t end_index;
    loff_t isize;
    page = find_get_page(mapping, index);
    if (!page) {
      if (iocb->ki_flags & IOCB_NOWAIT)
        goto would_block;
      page_cache_sync_readahead(mapping,
          ra, filp,
          index, last_index - index);
      page = find_get_page(mapping, index);
      if (unlikely(page == NULL))
        goto no_cached_page;
    }
    if (PageReadahead(page)) {
      page_cache_async_readahead(mapping,
          ra, filp, page,
          index, last_index - index);
    }
    /*
     * Ok, we have the page, and it's up-to-date, so
     * now we can copy it to user space...
     */
    ret = copy_page_to_iter(page, offset, nr, iter);
    }
}

In generic_file_buffered_read, we first look in the page cache for a cached page. If none is found, we not only read that page but also read ahead, which happens in page_cache_sync_readahead. After the readahead, we look for the cache page again.

If the cache page is found on the first try, we still have to judge whether readahead should continue; if so, page_cache_async_readahead is called to start an asynchronous readahead.

Finally, copy_page_to_iter copies the content from the kernel cache page into user-space memory.


Copyright notice
This article was posted by [osc_mpdsal]; please include a link to the original when reposting. Thanks.
