
Elasticsearch aggregation results are not accurate — what can we do about it?

1. A real-world aggregation problem

A question from a reader: there is a strange phenomenon with ES aggregations. Setting size to 10 versus setting it greater than 10 leads to different aggregation counts. Isn't size just the number of buckets returned? Why would it affect the statistics? The DSL, in outline (typed on a phone, so double quotes are omitted):

aggs:{topcount:{terms:{field:xx,size:10}}}

This size, set to 10 versus greater than 10, produces different aggregation results. Is this a bug in es5.x?

This is a real problem from production, and it is the reason for this article.

The aggregation discussed here is mainly the terms bucket aggregation. The figure below illustrates a terms bucket aggregation:

aggregating the TOP 3 product categories and their counts from a pile of products in many categories. TOP 3 result:

 Product Y: 4
 Product X: 3
 Product Z: 2

2. Background: Elasticsearch terms bucket aggregations are not exact

2.1 Elasticsearch shards and replicas

An Elasticsearch index consists of one or more primary shards and zero or more replica shards.

When the size of an index exceeds the hardware limits of a single node, sharding solves the problem.

A shard contains a subset of the index's data, yet is fully functional and independent; you can think of a shard as an "independent index".

The core ideas behind sharding:

  • Sharding splits the data so the index can scale out.

    If the data volume keeps growing, a single node hits a storage bottleneck. Example: you have 1 TB of data but only two nodes (512 GB of storage each). Neither node can hold it alone; with sharding, the problem is easily solved.

  • Operations can be distributed across multiple nodes and parallelized, improving performance.

Primary shards: during a write, the primary shard is written first, and replica shards are written after the primary write succeeds; recovery is also driven by the primary shard.

Purpose of replica shards:

  • Provide high availability in case a node or shard fails.

    A replica shard is never allocated to the same node as its primary shard.

  • Improve search query performance,

    because searches and aggregations can run in parallel across all primaries and replicas.

2.2 The shard allocation mechanism

How does Elasticsearch know which shard to store a new document on, and how does it find that document by ID at retrieval time?

By default, documents are distributed evenly across shards, so no shard ends up with many more documents than the others.

The mechanism that determines which shard a given document is stored in is called routing.

To keep Elasticsearch as easy to use as possible, routing is handled automatically by default, and most users never need to deal with it manually.

Elasticsearch uses the simple formula below to determine the appropriate shard:

shard = hash(routing) % total_primary_shards

  • routing: the document id, either specified explicitly or a system-generated UUID.

  • total_primary_shards: the number of primary shards.
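As a rough illustration, the routing formula can be sketched in Python. Note that Elasticsearch actually hashes the routing value with Murmur3; the MD5 below is just a stable stand-in so the example is reproducible, and the function name is made up for this sketch.

```python
import hashlib

def route_to_shard(routing: str, total_primary_shards: int) -> int:
    """Sketch of shard = hash(routing) % total_primary_shards."""
    # MD5 stands in for Elasticsearch's Murmur3-based hash.
    h = int(hashlib.md5(routing.encode("utf-8")).hexdigest(), 16)
    return h % total_primary_shards

# The same document id always lands on the same shard:
shard = route_to_shard("product-0001", 5)
assert 0 <= shard < 5
assert route_to_shard("product-0001", 5) == shard
```

The key property is determinism: as long as the shard count is fixed, the same id always maps to the same shard, so both indexing and GET-by-id agree on where the document lives.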

Here is a classic interview question: once an index is created, why can't the number of primary shards be changed?

Looking at the routing formula above, we can find the answer.

If we change the number of shards, the routing formula produces different results for existing documents.

Suppose a document was stored on shard A when the index had 5 primary shards, because that is what the routing formula yielded at the time.

Later we change the number of primary shards to 7. If we now try to look the document up by ID, the routing formula may produce a different result.

Even though the document is actually stored on shard A, the formula may route the lookup to shard B, which means the document will never be found.

Hence the conclusion: the number of primary shards cannot be changed after the index is created.
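The failure mode described above can be demonstrated with a small sketch: change the shard count and most ids now route to a different shard (MD5 stands in for Elasticsearch's Murmur3 hash here; the numbers are illustrative only).

```python
import hashlib

def route_to_shard(routing: str, total_primary_shards: int) -> int:
    # Stand-in for Elasticsearch's Murmur3-based routing.
    h = int(hashlib.md5(routing.encode("utf-8")).hexdigest(), 16)
    return h % total_primary_shards

doc_ids = [f"doc-{i}" for i in range(1000)]
moved = sum(
    1 for d in doc_ids if route_to_shard(d, 5) != route_to_shard(d, 7)
)
# Going from 5 to 7 shards, the vast majority of ids map to a different
# shard, so a GET by id would look in the wrong place.
print(f"{moved} of {len(doc_ids)} ids would be routed differently")
```

For two coprime shard counts like 5 and 7, only about 1 in 7 ids happens to land on the same shard number, which is why a blind shard-count change effectively "loses" most documents.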

A careful reader might ask: aren't there the Split and Shrink APIs for changing the shard count?

There are, but Split and Shrink come with preconditions (for example, the index must first be made read-only).

So my long-term advice is: plan the number of primary shards in advance, based on expected data volume and growth.

2.3 How does Elasticsearch retrieve/aggregate data?

The node that receives the client request is the coordinating node (node 1 in the figure below).

On the coordinating node, the search task is broken into two phases: query and fetch.

The nodes that actually execute the search or aggregation are the data nodes (nodes 2, 3, and 4 in the figure below).

Aggregation steps:

  • The client sends the request to the coordinating node.

  • The coordinating node forwards the request to each data node.

  • Each data node performs its share of the data collection work.

  • The coordinating node gathers and merges the results.
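The steps above amount to a scatter/gather. Here is a toy Python model of that flow (real Elasticsearch splits the work into query and fetch phases and ships serialized partial results; the function names here are hypothetical):

```python
from collections import Counter

def data_node_aggregate(local_docs):
    # Step 3: each data node counts terms over its own shard's documents.
    return Counter(doc["name"] for doc in local_docs)

def coordinating_node(data_nodes):
    # Step 2: fan the request out to every data node;
    # Step 4: merge the partial results into one Counter.
    partials = [data_node_aggregate(docs) for docs in data_nodes]
    return sum(partials, Counter())

# Step 1: the client request arrives; three data nodes hold the documents.
nodes = [
    [{"name": "X"}, {"name": "Y"}],
    [{"name": "X"}],
    [{"name": "Y"}, {"name": "X"}],
]
print(coordinating_node(nodes))  # Counter({'X': 3, 'Y': 2})
```

When every shard returns its full counts, the merged result is exact; the trouble described next starts when each shard returns only a truncated top list.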

2.4 Example: an inaccurate aggregation result

Cluster: 3 nodes, 3 primary shards, 5 product records on each shard. The user expects the returned Top 3 to be:

 Product X: 40
 Product A: 40
 Product Y: 35

The user runs the following terms aggregation, expecting the Top 3 over the cluster's products index:

POST products/_search
{
  "size":0,
  "aggs": {
    "product_aggs": {
      "terms": {
        "field":"name.keyword",
        "size":3
      }
    }
  }
}

What actually happens is shown in the figure below: each node's shard returns its own Top 3 to the coordinating node. After the coordinating node merges them, the result is:

 Product Y: 35
 Product X: 35
 Product A: 30

The actual aggregation result therefore differs from the expected one; in other words, the aggregation result is inaccurate.

Why aggregation results are inaccurate:

  • Accuracy factor: each shard returns its own Top X, which is not necessarily the global Top X.

  • Performance factor: ES could skip the per-shard Top X and aggregate everything instead, but that would inevitably cause serious performance problems.
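The inaccuracy can be reproduced with a small simulation. The per-shard counts below are invented so that the global totals match the example in section 2.4; each shard contributes only its local top 3, and the merged result differs from the exact one.

```python
from collections import Counter

SIZE = 3  # the `size` of the terms aggregation

# Hypothetical per-shard doc counts, chosen to reproduce section 2.4.
shards = [
    Counter({"X": 20, "Y": 15, "B": 12, "A": 10}),
    Counter({"A": 20, "Y": 10, "B": 8,  "X": 5}),
    Counter({"X": 15, "A": 10, "Y": 10}),
]

# What Elasticsearch does: each shard returns only its local top SIZE,
# so A (10 docs on shard 1) and X (5 docs on shard 2) get dropped.
merged = Counter()
for shard in shards:
    for term, count in shard.most_common(SIZE):
        merged[term] += count

# Exact global counts, for comparison.
exact = sum(shards, Counter())

print(dict(merged))  # X: 35, Y: 35, A: 30, B: 20 -- undercounted
print(dict(exact))   # X: 40, A: 40, Y: 35, B: 20 -- what the user expects
```

The coordinating node's final Top 3 (X: 35, Y: 35, A: 30) both undercounts A and X and can even reorder the ranking relative to the true Top 3 (X: 40, A: 40, Y: 35).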

3. How to improve aggregation accuracy?

A warm-up question: in a terms aggregation, what is the difference between size and shard_size?

  • size: the number of buckets returned in the aggregation result. If the user wants the top three, size is 3.

  • shard_size: the number of buckets each shard returns for merging. In principle shard_size should be greater than or equal to size (setting it smaller than size is meaningless, and Elasticsearch will silently raise it to size).

The higher the requested size, the more accurate the results, but also the more expensive it is to compute the final result.

So how can we improve aggregation accuracy? Here are four options for reference:

Option 1: set the number of primary shards to 1

Note: since 7.x this is already the default.

Applicable scenario: small data volumes and small clusters.

Option 2: increase shard_size

Set shard_size to a larger value; the official recommendation is size * 1.5 + 10.

Applicable scenario: large data volumes and clusters with many shards.

The larger the shard_size, the closer the result gets to the exact aggregation.

In addition, the show_term_doc_count_error parameter displays the worst-case error per bucket, which helps in choosing an appropriate shard_size.
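As an illustration, here is the request body from section 2.4 with shard_size raised to the recommended size * 1.5 + 10, expressed as a Python dict (the index and field names follow the earlier example):

```python
size = 3
shard_size = int(size * 1.5 + 10)  # recommended heuristic -> 14

request_body = {
    "size": 0,
    "aggs": {
        "product_aggs": {
            "terms": {
                "field": "name.keyword",
                "size": size,
                # Each shard now returns its top 14 terms instead of 3,
                # so the coordinating node merges a wider candidate set.
                "shard_size": shard_size,
                # Report the worst-case doc_count error per bucket.
                "show_term_doc_count_error": True,
            }
        }
    },
}
print(request_body["aggs"]["product_aggs"]["terms"]["shard_size"])  # 14
```

With show_term_doc_count_error enabled, each bucket in the response carries a doc_count_error_upper_bound; if it stays at 0 for your data, the chosen shard_size is wide enough.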

Option 3: set size to the maximum value to get exact results

Set size to the maximum value the parameter supports (2^31 - 1, i.e. Integer.MAX_VALUE), larger than the total number of distinct terms, so the aggregation becomes exact.

Background: in 1.x, size = 0 meant "return everything"; later versions removed the 0 value, so instead you set a maximum (any value greater than the total number of distinct terms in the data).

The downside of aggregating everything: if the shards hold huge amounts of data, sorting consumes a lot of CPU, and shipping the results may saturate the network.

Applicable scenario: businesses that demand exact aggregation. Because of the performance cost, this is not recommended.

Option 4: use ClickHouse for exact aggregation

In the planet's WeChat group, Zhang Chao pointed out: running a full group by in an analytics system is a reasonable requirement, and ClickHouse is very good at this kind of thing. If ES is not strengthened in this area, many analytics scenarios will be taken over by ClickHouse.

An engineer from Tencent pointed out: it depends on the scenario. Some of our businesses do aggregations, that is, OLAP-style multidimensional analysis, which ES is not really built for. If you have rich multidimensional analysis scenarios with high performance requirements, I suggest evaluating ClickHouse. We have evaluated both the open-source and internal versions: in most scenarios ClickHouse handles billions of rows and returns in seconds or even milliseconds.

Besides ClickHouse, Spark also offers similar aggregation functionality.

Applicable scenario: very large data volumes, high accuracy requirements, and fast response times.

4. Summary

Back to the opening question: setting size to 10 versus greater than 10 produces different aggregation results because of how Elasticsearch implements aggregations; it is not a bug. Elasticsearch does not provide exact bucket aggregations out of the box. To improve accuracy, refer to the options discussed above.

If you have a better way to improve accuracy, feel free to leave a comment.


References:

https://codingexplained.com/coding/elasticsearch/understanding-sharding-in-elasticsearch

https://codingexplained.com/coding/elasticsearch/understanding-replication-in-elasticsearch

https://medium.com/swlh/does-elasticsearch-lie-how-does-elasticsearch-work-f2d4e2bf92c9

https://t.zsxq.com/v7i27ma

《Elasticsearch in Action》

《Elasticsearch Source Code Analysis and Optimization in Practice》


Copyright notice: this article was written by [Mingyi world]; please include a link to the original when reposting. Thank you.
