
Hands-on | A Checklist of Common Elasticsearch Operations and Maintenance Commands

Background

Real questions reported by readers:

Several problems around Elasticsearch operations and maintenance:

  • Question 1: cluster migration. We currently need to migrate 20 million records. Migrating the data files requires too much downtime; apart from re-ingesting the data from scratch, is there a better way?

  • Question 2: our Elasticsearch cluster is both read-heavy and write-heavy. How do we keep reads and writes from degrading each other's performance? Right now they clearly interfere.

  • Question 3: after a version upgrade, some shards became unavailable, and we don't know what caused it.

  • Finally: data scaling, backup, high availability... One real scaling problem is how your mappings were designed in the first place; if that wasn't planned up front, simply adding nodes doesn't help much.

  • And: when the cluster status is yellow or red, we have no systematic way to locate the problem.

Such questions tend to get answered one-off, so the accumulated knowledge stays fragmented.

Indeed, similar questions come up all the time; it is time to consolidate the answers.

1. Checklist for a non-green cluster status

1.1 What the cluster states mean

  • Red: at least one primary shard is unassigned;

  • Yellow: at least one replica shard is unassigned;

  • Green: all primary and replica shards are assigned.

1.2 Troubleshooting in practice

1.2.1 Check the cluster status

GET _cluster/health

Example return value: "status" : "red", i.e. at least one primary shard is unassigned.
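A trimmed response looks like the following (the field values here are illustrative only):

{
  "cluster_name" : "my-cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "active_primary_shards" : 12,
  "active_shards" : 24,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,
  "active_shards_percent_as_number" : 92.3
}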

1.2.2 Which indices are red or yellow?

GET _cluster/health?level=indices

The following is more direct:

GET /_cat/indices?v&health=yellow
GET /_cat/indices?v&health=red

This locates the affected indices.

1.2.3 Which shards of those indices are red or yellow?

GET _cluster/health?level=shards

1.2.4 What caused the cluster to turn red or yellow?

GET _cluster/allocation/explain

Core fields of the response, with annotations:

"current_state" : "unassigned",—— Not allocated 
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",—— reason , Index creation phase 
    "at" : "2020-01-29T07:32:39.041Z",
    "last_allocation_status" : "no"
  },
  "explanation" : """node does not match index setting [index.routing.allocation.require] filters [box_type:"hot"]"""
        }

Here the root cause is found: the shard's allocation filter requires nodes tagged box_type:"hot", and no node matches. With the root cause identified, the corresponding fix follows.

1.3 Going further: besides INDEX_CREATED, for what other reasons can a shard end up "unassigned"?

In practice:

GET _cat/shards?h=index,shard,prirep,state,unassigned.reason
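Illustrative output (index name and reason are examples only):

index     shard prirep state      unassigned.reason
my-index  0     p      STARTED
my-index  0     r      UNASSIGNED NODE_LEFT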

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/cat-shards.html

Unassigned states and their reasons:

(1)INDEX_CREATED
Unassigned as a result of an API creation of an index.
(2)CLUSTER_RECOVERED
Unassigned as a result of a full cluster recovery.
(3)INDEX_REOPENED
Unassigned as a result of opening a closed index.
(4)DANGLING_INDEX_IMPORTED
Unassigned as a result of importing a dangling index.
(5)NEW_INDEX_RESTORED
Unassigned as a result of restoring into a new index.
(6)EXISTING_INDEX_RESTORED
Unassigned as a result of restoring into a closed index.
(7)REPLICA_ADDED
Unassigned as a result of explicit addition of a replica.
(8)ALLOCATION_FAILED
Unassigned as a result of a failed allocation of the shard.
(9)NODE_LEFT
Unassigned as a result of the node hosting it leaving the cluster.
(10)REROUTE_CANCELLED
Unassigned as a result of explicit cancel reroute command.
(11)REINITIALIZED
When a shard moves from started back to initializing, for example, with shadow replicas.
(12)REALLOCATED_REPLICA
A better replica location is identified and causes the existing replica allocation to be cancelled.
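For ALLOCATION_FAILED in particular, once the underlying cause has been fixed (a full disk, a mapping error, and so on), allocations that have exhausted their automatic retries can be retried explicitly:

POST /_cluster/reroute?retry_failed=true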

2. Move shards between nodes

Applicable scenario: manual shard relocation; moves a started shard from one node to another.

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "indexname",
        "shard": 1,
        "from_node": "nodename",
        "to_node": "nodename"
      }
    }
  ]
} 
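The same API can also cancel an in-progress recovery or relocation; a sketch with placeholder index and node names:

POST /_cluster/reroute
{
  "commands": [
    {
      "cancel": {
        "index": "indexname",
        "shard": 1,
        "node": "nodename"
      }
    }
  ]
}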

3. Gracefully take a cluster node offline

Applicable scenario: with the cluster green, take a node offline gracefully by excluding its IP so that its shards are migrated away first.

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "122.5.3.55"
  }
}
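Watch the shards drain off that node (for example with GET _cat/shards?v). Once none remain there, the node can be shut down safely; afterwards, remove the exclusion by setting it back to null:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": null
  }
}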

4. Force a flush

Applicable scenario: flushing an index ensures that any data currently held only in the transaction log is also durably persisted in the Lucene index.

POST /_flush

Note: before version 7.6 there was also a synced flush for this purpose (since deprecated; 8.x removes it):

POST /_flush/synced

5. Change the number of concurrent shard rebalances

Applicable scenario :

Controls how many concurrent shard rebalances are allowed across the cluster. The default is 2.

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2
  }
}
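To see the value currently in effect (including the default when nothing has been set explicitly), one option is:

GET _cluster/settings?include_defaults=true&filter_path=**.cluster_concurrent_rebalance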

6. Change the number of concurrent shard recoveries per node

Applicable scenario :

If a node leaves the cluster, all of its shards become unassigned; after a certain delay they are allocated elsewhere. How many shard recoveries a single node performs concurrently is governed by this setting.

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 6
  }
}
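The "certain delay" above is itself tunable. If a node is expected back shortly (say, after a rolling restart), lengthening the delay avoids needless recoveries; a sketch applied to all indices, with 5m as an example value:

PUT /_all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}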

7. Adjust the recovery rate

Applicable scenario :

To avoid overloading the cluster, Elasticsearch limits the bandwidth allocated to recovery. The setting can be raised, carefully, to speed recovery up.

If the value is set too high, ongoing recoveries may consume too much bandwidth and other resources, destabilizing the cluster.

PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "80mb"
  }
}
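While tuning, in-flight recoveries and their throughput can be watched with:

GET _cat/recovery?v&active_only=true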

8. Clear caches on a node

Applicable scenario: if a node's JVM heap usage is running high, the cache-clearing API can be called at the node level to make Elasticsearch drop its caches.

This costs some performance, but it can get you out of OOM (out-of-memory) trouble.

POST /_cache/clear
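The call can also be scoped to a single index and cache type, which is gentler than clearing everything (my-index-000001 is a placeholder):

POST /my-index-000001/_cache/clear?fielddata=true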

9. Adjust the circuit breakers

Applicable scenario: to keep Elasticsearch from hitting OOM, the circuit-breaker settings can be tightened. This caps the memory available to searches and rejects any search estimated to need more memory than the limit allows.

Note: this is a sensitive setting that needs careful calibration.

PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "40%"
  }
}
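Each node's current breaker limits, estimated usage, and trip counts can be inspected to guide that calibration:

GET _nodes/stats/breaker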

10. Cluster migration

Applicable scenario: migrating cluster data, index data, and so on.

Option 1: reindex part or all of an index's data

POST _reindex
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}
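reindex can also pull from another cluster, which addresses the cross-cluster migration question at the start of the article. A sketch, assuming a source cluster at http://oldhost:9200 (a placeholder; the host must also be whitelisted in reindex.remote.whitelist on the destination cluster):

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://oldhost:9200"
    },
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}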

Option 2: migrate indices or clusters with third-party tools

  • elasticdump

  • elasticsearch-migration

Under the hood, these tools are implemented with scroll + bulk.
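As an illustration, a typical elasticdump invocation looks roughly like this (hosts and index names are placeholders; consult the tool's README for the exact flags):

elasticdump \
  --input=http://oldhost:9200/my-index \
  --output=http://newhost:9200/my-index \
  --type=data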

11. Cluster data backup and restore

Applicable scenario: high-availability businesses; periodic incremental and full backups, as insurance against emergencies.
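Prerequisite: a snapshot repository must be registered before the first snapshot. A minimal sketch for a shared-filesystem repository named my_backup (the location must be listed under path.repo in elasticsearch.yml):

PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mount/backups/my_backup"
  }
}

With the repository in place, take a snapshot: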

PUT /_snapshot/my_backup/snapshot_hamlet_index?wait_for_completion=true
{
  "indices": "hamlet_*",
  "ignore_unavailable": true,
  "include_global_state": false,
  "metadata": {
    "taken_by": "mingyi",
    "taken_because": "backup before upgrading"
  }
}

Restore from the snapshot:

POST /_snapshot/my_backup/snapshot_hamlet_index/_restore
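Restoring into indices that already exist and are open will fail, so a common pattern is to restore under new names (the rename values below are illustrative):

POST /_snapshot/my_backup/snapshot_hamlet_index/_restore
{
  "indices": "hamlet_*",
  "rename_pattern": "hamlet_(.+)",
  "rename_replacement": "restored_hamlet_$1"
}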

Summary

The operations questions raised at the start of the article have all been addressed; other performance-related issues will be sorted out in a follow-up post.

Operations work covers a huge surface; this article only scratches it and is meant to start the conversation.

Serious cluster operations should be paired with visualization and monitoring tools (e.g. Kibana, cerebro, ElasticHD, Prometheus + Grafana, plus in-house tooling such as Alibaba Cloud's Eyou), which greatly improves efficiency.

Your own Elasticsearch operations experience and lessons learned are welcome in the comments; let's complete this checklist together.


References:

Elasticsearch official documentation

https://logz.io/blog/elasticsearch-cheat-sheet/ 

