
Eight Common Elasticsearch Errors and Best Practices


The Elasticsearch community is full of questions about errors and exceptions.

Digging into the root causes of these errors, and distilling common mistakes into practical experience or even tooling, not only saves development and operations time, but also helps ensure the long-term health of an Elasticsearch cluster.

Common exceptions, their causes, and general best practices are broken down below. These best practices can help us identify, locate, and handle exceptions more efficiently.

1、 Mapper_parsing_exception

Elasticsearch processes data according to the data types defined in its mapping.

The mapping defines the fields in a document and specifies each field's data type, such as date (date), long integer (long), and string (text).

If an indexed document contains a new field with no defined data type, Elasticsearch uses dynamic mapping to estimate the field's type and, if necessary, converts it from one type to another.

If Elasticsearch cannot perform this conversion, it throws a mapper_parsing_exception ("failed to parse") exception.

Too many of these exceptions will reduce indexing throughput.

A hands-on example (the second PUT fails, because name was already mapped as a string but is now sent as an object):

DELETE mytest_0001
PUT mytest_0001/_doc/1
{
  "name":"John"
}

PUT mytest_0001/_doc/2
{
  "name": {
    "firstname": "John",
    "lastname": "doe"
  }
}

To avoid this problem, you can define the mapping explicitly when creating the index, specifying the type of each field. Alternatively, you can use the _mapping API to add new field mappings dynamically.
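For the first approach, the mapping can be defined explicitly at index creation time. A minimal sketch (assuming the index does not exist yet, using the object field from the example above):

PUT mytest_0001
{
  "mappings": {
    "properties": {
      "name": {
        "properties": {
          "firstname": { "type": "text" },
          "lastname": { "type": "text" }
        }
      }
    }
  }
}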

Dynamically updating the mapping in practice:

PUT mytest_0001/_mapping
{
  "properties": {
    "title": {
      "type": "text"
    }
  }
}

Note: although you can add new fields dynamically with the command above, you cannot change the mapping of an existing field.

To change a field's type, you need to define a new mapping and combine reindex with an alias.
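A sketch of that flow (the new index name and alias below are made up for illustration):

PUT mytest_0002
{
  "mappings": {
    "properties": {
      "name": { "type": "keyword" }
    }
  }
}

POST _reindex
{
  "source": { "index": "mytest_0001" },
  "dest": { "index": "mytest_0002" }
}

POST _aliases
{
  "actions": [
    { "add": { "index": "mytest_0002", "alias": "mytest" } }
  ]
}

Queries can then point at the alias, so the underlying index can be swapped again later without touching clients.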

2、BulkIndexError

It is often more efficient to index large datasets in bulk .

For example, you can execute a single bulk operation to index 1,000 documents instead of issuing 1,000 separate index operations.

Bulk operations are performed through the bulk API.

A bulk operation in practice:

PUT my_index_0003/_bulk
{"index":{"_id":1}}
{"myid":"c12345"}
{"index":{"_id":2}}
{"myid":"C12456"}
{"index":{"_id":3}}
{"myid":"C31268"}

However, this process is error-prone. During a bulk operation, you need to check carefully for data-type mismatches and null values.

Be extra vigilant with the bulk API: even when there are hundreds of successful responses, some of the index requests in the batch may still have failed.

Capturing errors from a bulk operation:

@Override
public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
    if (response.hasFailures()) {
        for (int i = 0; i < response.getItems().length; i++) {
            BulkItemResponse item = response.getItems()[i];
            if (item.isFailed()) {
                IndexRequest ireq = (IndexRequest) request.requests().get(i);
                logger.error("Failed while indexing to " + item.getIndex() + " type " + item.getType() + " " +
                        "request: [" + ireq + "]: [" + item.getFailureMessage() + "]");
            }
        }
    }
}

Besides configuring the bulk API appropriately in advance, also walk through every item in the response to make sure all data was indexed as expected.

3、 Search timeout errors: ConnectionTimeout, ReadTimeoutError, RequestTimeout, etc.

If no response is received within the specified search time, the request fails with an error message. This is called a search timeout.

Search timeouts are common and can have many causes, for example large datasets or memory-intensive queries.

Search timeouts can be addressed with the following measures:

3.1 Increase elasticsearch.requestTimeout

Note: this timeout is specified in the HTTP client, not in Elasticsearch itself; Elasticsearch has no server-side request timeout parameter.

If Kibana requests time out, the tuning option is as follows:

Kibana's default request wait time is 30 seconds; this value can be adjusted in kibana.yml:

elasticsearch.requestTimeout: 90000

3.2 Reduce the number of documents returned per request

Do not set the requested size too large; page through results with from and size instead.
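For example, fetching results one page at a time (a sketch; note that from + size cannot exceed index.max_result_window, which defaults to 10,000):

GET mytest_0001/_search
{
  "from": 0,
  "size": 20,
  "query": { "match_all": {} }
}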

For a full traversal of an index, use scroll.
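A minimal scroll sketch: the first request opens a search context kept alive for one minute, and each follow-up request passes back the scroll_id from the previous response:

GET mytest_0001/_search?scroll=1m
{
  "size": 1000,
  "query": { "match_all": {} }
}

POST _search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id returned by the previous response>"
}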

3.3 Narrow the time range

The longer the requested time range (for example, spanning more than a year of data), the larger the amount of data requested, and the more likely the request is to time out.
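For example, restricting a query to the last 24 hours with a range filter rather than scanning the whole history (a sketch; the timestamp field name is hypothetical):

GET mytest_0001/_search
{
  "query": {
    "range": {
      "timestamp": { "gte": "now-1d" }
    }
  }
}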

3.4 Adjust memory settings

Limit the memory usage of a single query by configuring the request circuit breaker.

For example, limit indices.breaker.request.limit to 40% (the default is 60%).

Setting the request circuit breaker at the cluster level:

PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "40%"
  }
}

Limit the number of buckets used for aggregations by setting search.max_buckets to 5000 (the default is 10000).

PUT _cluster/settings
{
  "transient": {
    "search.max_buckets": 5000
  }
}

3.5 Optimize queries, indices, and shards

3.6 Enable the slow search log

Monitor search runtimes, scan for heavy searches, and so on.

Enabling the slow log in practice:

PUT /_settings
{
  "index.search.slowlog.threshold.query.debug": "30s",
  "index.search.slowlog.threshold.fetch.debug": "30s",
  "index.indexing.slowlog.threshold.index.debug": "30s"
}

4、 All Shards Failed

When searching in Elasticsearch, you may encounter an "all shards failed" error message.

Situations in which "all shards failed" occurs:

  • When a read request fails to get a response from a shard

  • When data cannot be searched because the cluster or a node is still starting up

  • When a shard is missing or in recovery and the cluster is red

Probable causes of "all shards failed":

  • A node may have disconnected or reconnected

  • The shard being queried may be recovering and therefore unavailable

  • A disk may be corrupted

  • There may be a problem with the search query, for example referencing a field with the wrong field type

  • A configuration error may be causing the operation to fail

Troubleshooting in practice:

GET /_cat/health
GET /_cat/indices?v
GET _cluster/health/?level=shards
GET _cluster/allocation/explain

5、 Process memory lock failure: "memory locking requested for elasticsearch process but memory is not locked"

To keep nodes healthy, you must make sure JVM memory is never swapped out to disk.

When the system starts swapping, Elasticsearch node performance degrades badly, and node stability suffers as well.

So avoid swapping at all costs. Swapping can stretch Java GC pauses from milliseconds to minutes and, worse, can make nodes respond slowly or even drop out of the cluster.

To limit Elasticsearch's memory usage, make the system swap as little as possible. Enabling bootstrap.memory_lock is one of the three ways to limit swapping (the other two are disabling swap entirely and minimizing swappiness).

Enabling memory_lock in elasticsearch.yml:

bootstrap.memory_lock: true

The error reproduced:

[,260][INFO ][o.e.n.Node               ] [node-1] starting ...
[,529][INFO ][o.e.t.TransportService   ] [node-1] publish_address {172.17.0.5:9300}, bound_addresses {172.17.0.5:9300}
[,537][INFO ][o.e.b.BootstrapChecks    ] [node-1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[,565][ERROR][o.e.b.Bootstrap          ] [node-1] node validation exception
[1] bootstrap checks failed
[1]: memory locking requested for elasticsearch process but memory is not locked
[,575][INFO ][o.e.n.Node               ] [node-1] stopping ...
[,596][INFO ][o.e.n.Node               ] [node-1] stopped
[,597][INFO ][o.e.n.Node               ] [node-1] closing ...
[,615][INFO ][o.e.n.Node               ] [node-1] closed

Solution on CentOS 7.x: add the following lines to /etc/security/limits.conf, save the file, and restart Elasticsearch.

elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited

Verify that the node started successfully with memory locking:

GET _nodes?filter_path=**.mlockall

A correct response looks like this:

{
  "nodes" : {
    "gJUT-E48u_nUw" : {
      "process" : {
        "mlockall" : true
      }
    }
  }
}

6、 Bootstrap Checks Failed

Bootstrap checks inspect various settings and configurations before Elasticsearch starts, to ensure it can operate safely.

If bootstrap checks fail, they prevent Elasticsearch from starting (in production mode) or issue warning logs (in development mode).

It is recommended that you familiarize yourself with the settings enforced by bootstrap checks, and note that they differ between development and production modes. Bootstrap checks can be enforced by setting the system property es.enforce.bootstrap.checks to true.

The main checks include, but are not limited to:

  • Heap size check

  • File descriptors

  • Maximum number of threads

  • File size limit

  • Maximum virtual memory

  • Maximum map count

  • Client JVM check

  • Garbage collection check

  • OnError and OnOutOfMemoryError checks ...

Best practice: add the following configuration to jvm.options, then restart Elasticsearch.

-Des.enforce.bootstrap.checks=true

7、TransportError

In Elasticsearch, the core function of the transport module is communication between the nodes of a cluster.

Transport errors appear frequently, and failures may be caused by:

  • Missing shards

  • Conflicting settings

  • Unreasonable data modeling

  • Network failures

  • ...

A common transport error looks like this:

TransportError(403, u'cluster_block_exception', u'blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')

Cause analysis :

This can happen when there is not enough available disk space for Elasticsearch to allocate shards across the nodes.

Solutions:

  • Increase disk space

  • Delete old data to free up space

  • Update the index's read-only mode

Note: once disk usage reaches 95% or more, the index.blocks.read_only_allow_delete setting is the last resort to keep nodes from running out of disk space: writes are no longer allowed, only deletes.

The following command resets the read-only block on all indices:

PUT /_all/_settings
{
  "index.blocks.read_only_allow_delete": null
}
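Before resetting the block, it is worth confirming that disk space really was the trigger; per-node disk usage can be checked with:

GET _cat/allocation?v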

Another transport error can occur when you try to use a newly created index before all of its shards have been allocated.

In that case, the error looks like this:

TransportError(503, u”). 

Transport errors can also be related to mapping problems.

For example, when you try to index a field whose data type conflicts with its mapping, you may see an error like:

TransportError(400, u'mapper_parsing_exception')

8、 Initialization/Startup Failures

Occasionally, shard problems can prevent Elasticsearch from starting.

For example, when conflicting Elasticsearch versions are used, you may see errors such as:

“ Elasticsearch java client initialization fails” 

or

 “\Common was unexpected at this time.”

Best practice:

Check versions carefully and make sure the client jar version matches the deployed server version.

9、 How to minimize errors and exceptions? The underlying logic of errors and solutions

If you don't want to handle error messages one at a time, then as you deal with more problems you will find that many errors and exceptions trace back to three deeper issues:

  • Installation and configuration issues

  • Indexing new data

  • Cluster slowdowns

Breaking these down in depth:

9.1 Installation and configuration issues

Installing Elasticsearch quickly is easy, but making sure it runs at production grade requires careful review of the configuration.

That helps avoid all kinds of errors and exceptions, such as the bootstrap checks failure problem.

9.2 Indexing new data

In Elasticsearch, you have to be very careful with field naming, correct use of templates, and standardized data modeling.

Checking these configurations carefully can help you avoid problems such as mapping exceptions and bulk index errors.
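For example, an index template can pin field types down before any data arrives, so dynamic mapping never has to guess (a sketch; the template name, index pattern, and fields are hypothetical):

PUT _index_template/my_logs_template
{
  "index_patterns": ["my_logs-*"],
  "template": {
    "mappings": {
      "properties": {
        "timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}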

9.3 Cluster slowdowns

As data volume grows, the cluster scales out, and operations become more frequent, Elasticsearch sometimes misbehaves: search responses slow down and errors may pop up.

Therefore, you must continuously monitor the following aspects of the cluster:

  • Error rates, with visualization tools such as Kibana or cerebro

  • Error logs

  • Rejection metrics

so that potential errors are nipped in the bud and the cluster stays healthy.

10、 Conclusion

Operating or developing against Elasticsearch, you will inevitably run into errors or exceptions.

Although we cannot avoid them entirely, some best practices help reduce errors and exceptions, and help solve problems more effectively when they do arise.

Quickly and effectively solving complex problems such as cluster slowdowns depends on three things:

First: pay close attention to settings and configuration;

Second: be careful when indexing new data;

Third: make sure the cluster's metrics can be monitored and visualized.

In short, treat errors and exceptions as opportunities to optimize your Elasticsearch cluster infrastructure, and don't worry too much when they show up.


References:

https://opster.com/blogs/common-elasticsearch-errors-and-exceptions/

Elasticsearch official documentation

https://discuss.elastic.co/t/how-to-identify-message-causing-error-in-bulk-request/42885/5


Copyright notice
This article was created by [Mingyi world]. Please include a link to the original when reposting. Thanks.
