编程知识 cdmana.com

Today, I got an elasticsearch question in my interview, which made me confused

1、 Real interview questions

screwing Elasticsearch Technology group friends leave a message :

2、 The interviewer is asking Elasticsearch range When filtering , What's on his mind ?

The most basic : I see you don't understand range Inquire about ?

secondly : I see you don't understand range Inquire about What data types are supported ?

Again : See you do not understand the corresponding data type underlying storage data structure or algorithm ?

Last : You don't know the corresponding data structure range Query principle ?

The most fundamental , I want to see if your underlying principles are not solid ?

When we are interviewed by the interviewer , It's actually our own interview “ interviewer ”.

Think from the interviewer's point of view , Ask a few more questions about what ? You can basically figure out his inner thoughts .

At least , Can you ask this question , Interviewers are not simple CRUD Business interviewers , Or the underlying principles , Side confirmed that the interviewers team technology has a certain technical content .

3、Elasticsearch range Filter Problem disassembly

3.1 range Inquire about Introduce

Think about it in your mind : I used range Inquire about .

  • Query data for a given period of time (date);

  • Query the numerical type data of a given interval range (integer,long etc. );

  • Query the string type data of a given interval range (text/keyword).

Because it's the second round, summary and explanation , I'll make a list Demo Read about the :

DELETE my_index_0002
PUT my_index_0002
{
  "mappings": {
    "properties": {
      "post_date": {
        "type": "date"
      },
      "user": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}

PUT my_index_0002/_bulk
{"index":{"_id":1}}
{"post_date":"2019-08-05T15:03:02", "user":"lucy", "name":"lucy",  "age":18}
{"index":{"_id":2}}
{"post_date":"2017-03-02T01:08:35", "user":"haimeimei", "name":"hanmeimei", "age":25}
{"index":{"_id":3}}
{"post_date":"2020-08-05T15:03:02", "user":"hank", "name":"hank", "age":7}

POST my_index_0002/_search
{"profile": "true",
  "query": {
    "range": {
      "post_date": {
        "gte": "2017-01-01T00:00:00",
        "lte": "2019-12-31T23:59:59"
      }
    }
  }
}


POST my_index_0002/_search
{    "profile": "true",
  "query": {
    "range": {
      "user": {
        "gte": "aaaa",
        "lte": "zzzz"
      }
    }
  }
}

POST my_index_0002/_search
{
    "profile": "true",
  "query": {
    "range": {
      "age": {
        "gte": 1,
        "lte": 20
      }
    }
  }
}

POST my_index_0002/_search
{
  "profile": "true",
  "query": {
    "range": {
      "name": {
        "gte": "aaaa",
        "lte": "zzzz"
      }
    }
  }
}

I used it in actual combat Elasticsearch Of , They've used the above range Inquire about .

but , What the interviewer wants to hear is not the rudimentary content .

3.2 range Query supported data types

The above example has been listed .

  • value type give an example :integer、long etc.

  • The date type give an example :date etc.

  • String type give an example :keyword、 text etc.

3.3 range Query the underlying storage data structure of the corresponding data type

In the two most typical types of disassembly .

3.3.1 Numerical types correspond to data structures —— Block K-d Tree

Wood Uncle's related articles have a detailed interpretation :

Elasticsearch from 5.0 Start introducing Block K-d Tree This new numeric type index data structure , No more inverted indexes , This structure is more suitable for range lookup .

Block K-d Tree The basic idea :

Will a N Dimensional numerical space , Constantly select the dimension that contains the most values 2 Cut it apart , Repeated iteration , Until we cut out the space unit (cell) Contains less than a certain number of values .

For single dimensional data , It's actually a simple sort of all the values , And then cut it from the middle again and again , Generate a similar to B-tree Such a structure .

And traditional B-tree The difference is , His leaf nodes don't store single values , It's a set of values , It's the so-called one Block . Every Block The number of internally contained values is controlled within 512- 1024 individual , The quantity of guarantee value is in block Try to distribute evenly between them .

https://elasticsearch.cn/article/446

The data structure is shown in the figure below ( roughly ):

picture source :wood uncle

3.3.2 String type text Corresponding data structure ——LSM Trees

Elasticsearch,HBase, Cassandra, RockDB And so on LSM Tree To build the index .

LSM tree It's a kind of layering 、 Orderly 、 Disk oriented data structures , Its core idea is that it makes full use of the disk batch of sequential write performance is much higher than the random write performance .

LSM tree Core features of :

  • First of all : Divide the index into two parts: memory and disk , And start tree merging when the memory reaches the threshold (Merge Trees);

  • second : Replace random write with batch write , And write the log in advance WAL technology (Elasticsearch In Chinese, it means translog Transaction log ) Guarantee memory data , Can be recovered after a system crash ;

  • Third : Data is written in a way similar to log append write (Log Structured) disk , In order to improve the efficiency of writing .

3.4 range principle

in the light of 3.1 demo Several of the range Inquire about , Add profile: true You can see the underlying logic :

  • plastic age as well as The date type post_date Of range Inquire about Used in the background Lucene Of IndexOrDocValuesQuery .

    post_date date The type is actually converted to a timestamp :

"description" : "post_date:[1483228800000 TO 1577836799999]",

Elasticsearch 5.4 New version of indexOrDocValuesQuery take Range The sequential access task in the query process is thrown to Block k-d Tree Indexes , Give the random visit to doc values .

  • String type user And name Of range Inquire about The backstage uses Lucene Of MultiTermQueryConstantScoreWrapper.

    MultiTermQueryConstantScoreWrapper The essence is : Multiple term query The combination of .

3.4.1 Of numerical type range principle

Lucene After segmenting the single dimension data repeatedly from the middle , B-tree The non leaf node part of is stored in memory , The leaf nodes are stored next to each other on the disk .

treat as range When querying , In memory B-tree The location of the leaf block on the disk can satisfy the query condition quickly , After that, the reading of leaf node blocks is almost sequential .

The above sentence is actually what the interviewer wants to hear .

This is like B+ Trees (Mysql Index data structure ) It's very similar ,B+ Tree's index and data separation mechanism make it suitable for range query .

If you want to B+ Tree making Range Inquire about , For example, query (a,b) Between all the elements , The actual operation is as follows :

  • step 1: Find by binary search  a Elements ;

  • step 2: Traverse the list elements corresponding to leaf nodes backward , Until the number > b stop it .

In this way, we can get (a, b) Between the values of all elements .

3.4.2 String type text Of range principle

in the light of LSM-Tree , The core data structure is SSTable , The full name is Sorted String Table ,SSTable In fact, the concept of "is also derived from Google Of Bigtable The paper .

SSTable It's about having persistence , An ordered and immutable key value storage structure , its key and value It's an arbitrary array of bytes , And provided as specified key Find and specify the scope of key The function of interval iteration traversal .

SSTable Contains a series of configurable sizes Block block , The typical size is 64 KB, About these Block The index of the block is stored in SSTable Tail of , Used to help quickly find specific Block.

  • The principle of query : When one SSTable When it's opened , The index will be loaded into memory , And then according to key Do a binary search in the memory index , We found that key Of the corresponding disk offset after , Then go to the disk and read the response block data .

  • range Principle : On the basis of query , be based on SStable file ( There is a preface ) Traverse backward until it finds a value greater than the right interval and stops traversing .

Of course, if the memory is big enough , You can directly SSTable Directly through MMap Technology mapping to memory , To provide faster search .

4、 Summary

A small problem , truth .“ The underlying technology is never out of date “.

It involves the underlying logic of the algorithm , Behind any knowledge point are top papers of top journals in the industry , For more details, you need to look at more information .

This paper does not go into the details of the algorithm , It's hard to avoid mistakes , Welcome to correct and exchange feedback .

Be with you , screwing Elastic!


Reference resources :https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf 

https://elasticsearch.cn/article/446 

https://blog.yugabyte.com/a-busy-developers-guide-to-database-storage-engines-the-basics/

https://cloud.tencent.com/developer/article/1441835 

《 Search technology core 20 speak 》


More recommendations :

The first 100000 in the whole network + Reading Elasticsearch The column was born !


Join China to get Elastic The number of officially certified engineers is the largest (> 11 people ) Circle !

版权声明
本文为[Mingyi world]所创,转载请带上原文链接,感谢

Scroll to Top