
A deep dive into Elasticsearch's internal data structures


Recently, several members of the Knowledge Planet community asked about the concepts of doc values, stored fields, and fielddata.

Question 1: "Does the group leader have a good article introducing doc values, fielddata, and stored fields? They always feel a little fuzzy to me."

Question 2: "A question about storage in Elasticsearch. A document may be stored in the following places:

  • Inverted index

  • _source field

  • Stored fields (store, if enabled)

  • doc_values

Is my understanding correct?

If all of these places store the data, does that mean the data roughly expands to 4 times its size?"

(Both questions come from the "Dig Deep into Elasticsearch" Knowledge Planet community, http://t.cn/RmwM3N9)

These concepts are well worth sorting out, hence this article.

Understanding Elasticsearch's data structures and using them sensibly goes a long way toward understanding Elasticsearch itself.

0、 Background: how Elasticsearch stores data

As the official Elastic documentation puts it:

One of Elasticsearch's defining characteristics is distributed document storage.

Instead of storing information as rows of columnar data, as a database does, Elasticsearch stores complex data structures that have been serialized as JSON documents.

When a cluster has multiple Elasticsearch nodes, stored documents are distributed across the cluster and can be accessed immediately from any node.

Once a document is stored, it is indexed and becomes searchable in near real time, within 1 second (the default refresh interval is 1s).

How does Elasticsearch achieve fast indexing and full-text search?

Elasticsearch uses a data structure called an inverted index, which supports very fast full-text search.

An inverted index lists every unique word that appears in any document and identifies all of the documents in which each word occurs.

An index can be thought of as an optimized collection of documents. Each document is a collection of fields, and fields are the key-value pairs that contain the data.

By default, Elasticsearch indexes all data in every field, and each indexed field has a dedicated, optimized data structure.

For example, text fields are stored in inverted indices, while numeric and geo fields are stored in BKD trees.

Data type      | Data structure
text / keyword | Inverted index
Numeric / geo  | BKD tree

Because each field type has its own optimized data structure, Elasticsearch can return search results quickly, and this is what makes its search so fast.

1、Inverted index

1.1 Inverted index definition

When faced with massive amounts of content, how do you quickly find the content that contains the user's query terms? This is where the inverted index plays a key role.

An inverted index is the best-suited implementation of the word-to-document mapping.

A familiar analogy is the index at the back of a book, which maps key terms to the page numbers on which they appear.

Imagine how slow it would be to look up a keyword in a book that has no index pages, and the value of an index becomes obvious.

1.2 Example of inverted index

Take an example from the official documentation:

Suppose we have two documents, and the content field of each contains the following:

- 1、The quick brown fox jumped over the lazy dog
- 2、Quick brown foxes leap over lazy dogs in summer

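Before indexing, the text goes through tokenization and normalization, a process known as analysis.

In other words, a prerequisite for indexing data is choosing the analyzer that performs the tokenization.

The resulting inverted index is shown below. For illustration, terms are split on whitespace only and are not lowercased, which is why Quick/quick and The/the appear as separate entries; the default standard analyzer would also lowercase terms and merge them: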

Term   | Doc_1 | Doc_2
Quick  |       | X
The    | X     |
brown  | X     | X
dog    | X     |
dogs   |       | X
fox    | X     |
foxes  |       | X
in     |       | X
jumped | X     |
lazy   | X     | X
leap   |       | X
over   | X     | X
quick  | X     |
summer |       | X
the    | X     |

As shown above, every term that occurs in the documents has a list of the documents that contain it.
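The structure above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores postings on disk. The names `docs`, `build_inverted_index` and `search` are made up for this example, and the text is split on whitespace without lowercasing so that the result matches the table as printed.

```python
from collections import defaultdict

# The two example documents from section 1.2.
docs = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace split only, no lowercasing or stemming.
        for term in text.split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


inverted_index = build_inverted_index(docs)


def search(term):
    """Answer "which documents contain this term?" with a single lookup."""
    return inverted_index.get(term, [])


for term, doc_ids in inverted_index.items():
    print(f"{term:<8}{', '.join(doc_ids)}")
```

A search is a single dictionary lookup regardless of how many documents exist, which is why this structure suits full-text search. Sorting or aggregating by document, on the other hand, would require walking every term.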

1.3 Inverted index features

  • Created at index time

  • Serialized to disk

  • Very fast for full-text search

  • Not suitable for sorting

  • Enabled by default

1.4 Inverted index application scenarios

  • Queries

  • Full text search

2、Doc Values (forward index)

2.1 Doc Values Definition

In Elasticsearch, Doc Values are a columnar storage structure. They are enabled by default for every field type except text, and they are created at index time: when a field is indexed, Elasticsearch adds the field's values to the inverted index for fast retrieval and also stores them in Doc Values.

In contrast to the inverted index, Doc Values are described as a "forward index".

2.2 Doc Values Example

Using the documents from section 1.2 again, the Doc Values structure looks like this (for illustration only):

Doc   | Terms
Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer

By transposing the relationship between terms and documents, doc values solve the problem that aggregating over an inverted index is inefficient and hard to scale.

The comparison makes this clear: the inverted index maps terms to the documents that contain them, while doc values map documents to the terms they contain.
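To make the transposition concrete, here is a minimal Python sketch. It is a conceptual model only: real doc values are a compressed on-disk columnar format, and the names `transpose` and `term_counts` as well as the small sample index are assumptions made for this example.

```python
from collections import Counter, defaultdict

# A small inverted index (term -> documents), reduced for brevity.
inverted_index = {
    "brown": ["Doc_1", "Doc_2"],
    "dog": ["Doc_1"],
    "dogs": ["Doc_2"],
    "lazy": ["Doc_1", "Doc_2"],
    "quick": ["Doc_1", "Doc_2"],
}


def transpose(inverted_index):
    """Turn term -> documents into document -> terms."""
    forward = defaultdict(list)
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            forward[doc_id].append(term)
    return dict(forward)


forward_index = transpose(inverted_index)


def term_counts(forward_index, doc_ids):
    """Aggregate over a set of matching documents with one lookup per document."""
    counts = Counter()
    for doc_id in doc_ids:
        counts.update(forward_index[doc_id])
    return counts


print(forward_index)
print(term_counts(forward_index, ["Doc_1", "Doc_2"]))
```

With the forward mapping, the cost of an aggregation grows with the number of matching documents rather than with the number of distinct terms in the index.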

2.3 Doc Values characteristics

  • Created at index time

  • Serialized to disk

  • Suitable for sorting

  • All values of a single field are stored together in one data column

  • Enabled by default for all field types except text

2.4 Doc Values use cases

Doc Values in Elasticsearch are commonly used in the following scenarios:

  • Sorting on a field

  • Aggregating on a field

  • Certain filters, such as geo-location filters

  • Scripts that reference field values

Note:

Because doc values are serialized to disk, we can rely on the operating system to make access fast.

  • When the working set is much smaller than the node's available memory, the operating system naturally keeps all doc values resident in memory, which makes access very fast;

  • When the working set is much larger than available memory, the operating system pages doc values in and out of its page cache as needed. This is slower than keeping everything in memory, but it avoids JVM heap out-of-memory errors.

2.5 Doc Values usage notes

For business scenarios that do not need sorting, aggregation, script calculations, or geo-location filtering on a field, consider disabling Doc Values for that field to save storage.

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword",
        "doc_values": false
      }
    }
  }
}

3、fielddata

3.1 fielddata Definition

To recap sections 1 and 2:

  • Search needs to answer the question "Which documents contain this term?". This is implemented with the inverted index.

  • Sorting and aggregation need to answer a different question: "What is the value of this field for this document?". This is implemented with the forward index.

text fields do not support the Doc Values forward index. Instead, text fields use fielddata, an in-memory data structure created at query time (query-time in-memory data structure).

This data structure is built on demand the first time a text field is used for aggregation, sorting, or in a script.

How it is built: Elasticsearch reads the entire inverted index of each segment from disk, inverts the term ↔ document relationship, and stores the result in memory on the JVM heap.

3.2 fielddata Example

Strictly speaking, the illustration in section 2.2 fits better here, because it shows the terms of a text field.

DELETE test_001
PUT test_001
{
  "mappings": {
    "properties": {
      "body":{
        "type":"text",
        "analyzer": "standard",
        "fielddata": true
      }
    }
  }
}

POST test_001/_bulk
{"index":{"_id":1}}
{"body":"The quick brown fox jumped over the lazy dog"}
{"index":{"_id":2}}
{"body":"Quick brown foxes leap over lazy dogs in summer"}

GET test_001/_search
{
  "size": 0,
  "query": {
    "match": {
      "body": "brown"
    }
  },
  "aggs": {
    "popular_terms": {
      "terms": {
        "field": "body"
      }
    }
  }
}
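As a rough mental model of what happens when this request runs, the Python sketch below builds fielddata lazily on first use and then counts terms over the documents matching "brown". Everything here is hypothetical: the `Segment` class, `make_segment`, `terms_agg` and the `analyze` helper (lowercasing plus whitespace split, standing in for the standard analyzer on these two punctuation-free sentences) are illustration names, not Elasticsearch internals.

```python
from collections import Counter


def analyze(text):
    # Assumption: a simplified stand-in for the standard analyzer.
    return text.lower().split()


class Segment:
    """Hypothetical stand-in for one index segment."""

    def __init__(self, inverted_index):
        self.inverted_index = inverted_index  # term -> set of doc ids
        self._fielddata = None                # built lazily, kept in memory
        self.builds = 0                       # how often fielddata was built

    def fielddata(self):
        """Uninvert the whole inverted index on first use, then reuse it."""
        if self._fielddata is None:
            self.builds += 1
            uninverted = {}
            for term, doc_ids in self.inverted_index.items():
                for doc_id in doc_ids:
                    uninverted.setdefault(doc_id, []).append(term)
            self._fielddata = uninverted
        return self._fielddata


def make_segment(docs):
    inverted = {}
    for doc_id, text in docs.items():
        for term in set(analyze(text)):
            inverted.setdefault(term, set()).add(doc_id)
    return Segment(inverted)


def terms_agg(segment, query_term):
    """Match with the inverted index, then aggregate with fielddata."""
    matched = segment.inverted_index.get(query_term, set())
    fielddata = segment.fielddata()
    counts = Counter()
    for doc_id in matched:
        counts.update(fielddata[doc_id])
    return counts


segment = make_segment({
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
})
buckets = terms_agg(segment, "brown")
terms_agg(segment, "brown")  # second call reuses the cached structure
print(buckets.most_common(4), segment.builds)
```

The expensive step is the first call to `fielddata()`, which touches every term of the segment and keeps the result in memory. That cost is the reason fielddata is off by default.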

3.3 fielddata characteristics

  • Supports the same kind of operations as Doc Values (looking up a document's terms)

  • Available only for the text field type

  • Created at query time

  • In-memory data structure

  • Not serialized to disk

  • Disabled by default (it is expensive to build and is held on the JVM heap)

3.4 fielddata use cases

  • Word-frequency statistics over full text

  • Generating word clouds from full text

  • Aggregation, sorting, and script calculations on text fields

3.5 fielddata usage notes

  • Before enabling fielddata, consider why you are using a text field for aggregation, sorting, or scripts.

  • Enabling fielddata usually does not make sense, because it is very memory intensive.

  • Applications that only need full-text search do not need to enable fielddata.

4、The _source field

4.1 _source Definition

The _source field contains the original JSON document body that was passed at index time.

The _source field itself is not indexed (and is therefore not searchable), but it is stored so that it can be returned by fetch requests such as get or search.

4.2 _source usage notes

First: although very convenient, the _source field does incur storage overhead within the index. For this reason, it can be disabled.

PUT my-index-000001
{
  "mappings": {
    "_source": {
      "enabled": false
    }
  }
}

Second: think carefully before disabling it. Once _source is disabled, the following operations are no longer available:

  1. The update, update_by_query, and reindex APIs

  2. Highlighting

Therefore, weigh the storage savings against the needs of your business scenario before deciding.

5、The store field option

5.1 store Definition

By default, field values are indexed to make them searchable (see the inverted index in section 1), but they are not stored.

This means the field can be queried, but the original field value cannot be retrieved.

Usually this does not matter: the field value is already part of the _source field, which is stored by default.

If you only want to retrieve the value of one field or a few fields rather than the whole _source, source filtering can do that.

In certain situations, however, it makes sense to store a field separately, and this is where store comes in handy.

5.2 store Example

DELETE news-000001
PUT news-000001
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "title": {
        "type": "text",
        "store": true
      },
      "date": {
        "type": "date",
        "store": true
      },
      "content": {
        "type": "text"
      }
    }
  }
}

PUT news-000001/_doc/1
{
  "title":   "Some short title",
  "date":    "2021-01-01",
  "content": "A very long content field..."
}

GET news-000001/_search

GET news-000001/_search
{
  "stored_fields": [ "title", "date" ] 
}

5.3 store use cases

As in the example in 5.2, storing fields makes sense in some cases. For example, suppose the collected news data consists of documents with a title, a date, and a very large content field.

You may want to retrieve only the title and date, without having to extract them from a large _source field.

6、 Summary

Back to the two questions at the beginning of the article:

  • Question 1: after reading this article, doc values, fielddata, and stored fields should be much clearer.

  • Question 2: storage differs by field type. By default, the inverted index is enabled for all fields, and the Doc Values forward index is enabled for all non-text field types. Whether to enable _source (which stores the original document, that is, the JSON data of all fields) and store (which stores the values of the specified fields) should be decided according to actual business needs. If the forward index, inverted index, _source, and store are all enabled, storage will certainly grow, but not by a linear factor of 4.

For anything you do not understand, read the official documentation repeatedly and copy the examples into Kibana Dev Tools to practice until it makes sense.

This article follows the official documentation as closely as possible. Even so, mistakes in wording are hard to avoid, and corrections and discussion are welcome.

Let's keep digging into Elasticsearch together!

References:

  1. https://t.zsxq.com/Baq3nmE 

  2. https://t.zsxq.com/meAyrzN 

  3. https://t.zsxq.com/IaunyrZ 

  4. https://t.zsxq.com/AIYJiE6 

  5. https://medium.com/datadriveninvestor/elastic-search-what-is-inside-5d61f1a681df 

  6. http://alexander.holbreich.org/elasticsearch-datastructures/ 

  7. Elastic official documentation


Copyright notice
This article was written by [Mingyi world]. Please include a link to the original when reposting. Thank you.
