
Elasticsearch learning notes: monitoring core Elasticsearch metrics with Prometheus

0x00 Summary

Prometheus monitors an Elasticsearch cluster through an exporter, which gives deeper visibility than traditional APM tools or Zabbix: with the exporter in place, detailed information about the ES cluster can be collected.

This article focuses on using Prometheus to monitor ES. It sorts out the core metrics and builds a dashboard from them, so that when the cluster misbehaves or a node fails, problems can be diagnosed efficiently from the performance charts. Alerts are then added for a selected subset of the core metrics.

According to the article "How to monitor Elasticsearch performance":

Elasticsearch exposes a large number of metrics that help detect signs of trouble early and take action when problems such as unavailable nodes, a JVM OutOfMemoryError, or excessively long garbage collection occur. The key areas that usually need monitoring are:

  1. Query and index (indexing) performance
  2. Memory allocation and garbage collection
  3. Host level system and network metrics
  4. Cluster health status and node availability
  5. Resource saturation and related errors

The following sections go through the core metrics exposed by the Elasticsearch exporter for Prometheus, organized by these key areas.

0x01 Cluster health and node availability

The cluster health API returns the health status of the cluster. This status is an important signal of whether the cluster is running smoothly, and any change in it deserves attention. The main fields returned by the API and the corresponding Prometheus metrics are as follows:

| Returned parameter | Remarks | Metric name |
| --- | --- | --- |
| status | Cluster status: green (all primary and replica shards are running normally), yellow (all primary shards are running normally, but not all replica shards are), red (some primary shards are not running normally) | elasticsearch_cluster_health_status |
| number_of_nodes / number_of_data_nodes | Number of nodes in the cluster / number of data nodes | elasticsearch_cluster_health_number_of_nodes / elasticsearch_cluster_health_number_of_data_nodes |
| active_primary_shards | Total number of active primary shards | elasticsearch_cluster_health_active_primary_shards |
| active_shards | Total number of active shards (including replica shards) | elasticsearch_cluster_health_active_shards |
| relocating_shards | Number of shards currently being moved from one node to another. Usually 0; it rises when a node joins or leaves the cluster | elasticsearch_cluster_health_relocating_shards |
| initializing_shards | Number of shards that are initializing | elasticsearch_cluster_health_initializing_shards |
| unassigned_shards | Number of unassigned shards. Usually 0; it rises when a node's replica shards are lost | elasticsearch_cluster_health_unassigned_shards |
| number_of_pending_tasks | Only the master node can handle cluster-level metadata changes (creating indices, updating mappings, allocating shards, and so on). The pending tasks API shows the tasks waiting in the queue; in most cases the queue of metadata changes stays close to zero | elasticsearch_cluster_health_number_of_pending_tasks |

Based on the metrics above, configure Singlestat panels for the cluster state so that the health of the cluster is visible at a glance.
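As a minimal sketch of how the fields above might feed an alert decision, the Python function below inspects a dictionary shaped like the cluster health API response. The function name `assess_cluster_health` and the warning rules are illustrative assumptions, not part of Elasticsearch or the exporter.

```python
from typing import Dict, List


def assess_cluster_health(health: Dict[str, object]) -> List[str]:
    """Return human-readable warnings for a cluster health response.

    `health` is assumed to carry the fields listed in the table above.
    The rules are illustrative; tune them for a real cluster.
    """
    warnings: List[str] = []
    status = health.get("status")
    if status == "red":
        warnings.append("status is red: some primary shards are not running")
    elif status == "yellow":
        warnings.append("status is yellow: some replica shards are not running")

    # These counters normally stay at 0 on a stable cluster.
    for field in ("unassigned_shards", "initializing_shards",
                  "relocating_shards", "number_of_pending_tasks"):
        value = int(health.get(field, 0))
        if value > 0:
            warnings.append(f"{field} = {value}")
    return warnings


# Hypothetical response values, for illustration only.
sample = {
    "status": "yellow",
    "number_of_nodes": 3,
    "unassigned_shards": 2,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "number_of_pending_tasks": 0,
}
for line in assess_cluster_health(sample):
    print(line)
```

An empty list means none of the illustrative rules fired, which corresponds to a green cluster with no shards in transition.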

0x02 Host level system and network metrics

| Metric name | Description |
| --- | --- |
| elasticsearch_process_cpu_percent | Percent CPU used by process (CPU usage) |
| elasticsearch_filesystem_data_free_bytes | Free space on block device in bytes (free disk space) |
| elasticsearch_process_open_files_count | Open file descriptors (file descriptors opened by the ES process) |
| elasticsearch_transport_rx_packets_total | Count of packets received (network traffic between ES nodes) |
| elasticsearch_transport_tx_packets_total | Count of packets sent (network traffic between ES nodes) |

If CPU usage keeps growing, the cause is usually load from heavy search or indexing work. More nodes may be needed to redistribute the load.

File descriptors are used for communication between nodes, client connections, and file operations. If the number of open file descriptors reaches the system limit (Linux typically allows each process 1024 file descriptors by default; raising the limit to 65535 is recommended in production), new connections and file operations are unavailable until old ones are closed.

If the ES cluster is write-heavy, SSDs are recommended, and disk space usage deserves close attention. Elasticsearch reads from and writes to disk heavily when segments are created, queried, and merged.

Communication between nodes is one of the key indicators of whether the cluster is balanced. The rate at which bytes are sent and received shows how much traffic the cluster's network is handling.
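The transport metrics are cumulative counters, so a traffic rate has to be derived from two samples. The sketch below shows one way to compute a per-second rate in Python; the helper name `counter_rate` and the sample values are hypothetical, and the reset handling is a simplification of what a monitoring system would do.

```python
from typing import Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_rate(previous: Sample, current: Sample) -> float:
    """Per-second increase of a monotonically increasing counter.

    If the counter went backwards (for example after a node restart),
    the current value is treated as the increase since the reset.
    """
    (t0, v0), (t1, v1) = previous, current
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    increase = v1 - v0 if v1 >= v0 else v1
    return increase / elapsed


# Two hypothetical scrapes of a transport counter, 60 seconds apart.
print(counter_rate((0.0, 1000.0), (60.0, 7000.0)))  # 100.0
```

Comparing this rate across nodes gives a rough view of whether traffic is spread evenly over the cluster.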

0x03 JVM Memory and garbage collection

| Metric name | Description |
| --- | --- |
| elasticsearch_jvm_gc_collection_seconds_count | Count of JVM GC runs (number of garbage collections) |
| elasticsearch_jvm_gc_collection_seconds_sum | GC run time in seconds (time spent in garbage collection) |
| elasticsearch_jvm_memory_committed_bytes | JVM memory currently committed, by area |
| elasticsearch_jvm_memory_used_bytes | JVM memory currently used, by area (memory usage) |

The main things to watch are JVM heap usage and the proportion of time spent in JVM GC, which together show whether there is a GC problem. Elasticsearch relies on garbage collection to free heap memory; by default, garbage collection starts when JVM heap usage reaches 75%. Alerting on heap usage shows whether garbage collection is keeping up with the rate at which garbage is produced. If it cannot keep up, adjust the heap size or add nodes.
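The two quantities discussed here can be derived with simple arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes a maximum heap size is available from some source (the table above lists only the used and committed values).

```python
def heap_used_percent(used_bytes: float, max_bytes: float) -> float:
    """Heap usage as a percentage of the maximum heap size."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return 100.0 * used_bytes / max_bytes


def gc_time_fraction(gc_seconds_start: float,
                     gc_seconds_end: float,
                     window_seconds: float) -> float:
    """Fraction of a time window spent in garbage collection.

    Inputs are two readings of a cumulative GC time counter such as
    elasticsearch_jvm_gc_collection_seconds_sum, taken `window_seconds` apart.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (gc_seconds_end - gc_seconds_start) / window_seconds


# Hypothetical readings: 6 GiB used of an 8 GiB heap, 3 s of GC in one minute.
print(heap_used_percent(6 * 1024**3, 8 * 1024**3))  # 75.0
print(gc_time_fraction(120.0, 123.0, 60.0))
```

A heap percentage that stays above the collection threshold while the GC time fraction climbs suggests that collection is not freeing enough memory.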

0x04 Search and index performance

Search requests

| Metric name | Description |
| --- | --- |
| elasticsearch_indices_search_query_total | Total number of queries |
| elasticsearch_indices_search_query_time_seconds | Cumulative query time |
| elasticsearch_indices_search_fetch_total | Total number of fetches |
| elasticsearch_indices_search_fetch_time_seconds | Cumulative fetch time |

Indexing requests

| Metric name | Description |
| --- | --- |
| elasticsearch_indices_indexing_index_total | Total index calls (number of index operations) |
| elasticsearch_indices_indexing_index_time_seconds_total | Cumulative index time in seconds |
| elasticsearch_indices_refresh_total | Total refreshes (number of refresh operations) |
| elasticsearch_indices_refresh_time_seconds_total | Total time spent refreshing in seconds |
| elasticsearch_indices_flush_total | Total flushes (number of flush operations) |
| elasticsearch_indices_flush_time_seconds | Cumulative flush time in seconds |

Plot time and operation counts on the same graph, with time on the left y-axis and the corresponding operation count on the right y-axis. Dividing the time by the operation count gives the average time per operation, which shows whether performance is abnormal. The average indexing latency is obtained the same way; if it keeps increasing, a single bulk request may contain too many documents.

Elasticsearch persists data to disk through flush operations. If flush latency keeps increasing, disk I/O capacity may be insufficient, and if this continues the cluster will eventually be unable to index data.
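The average latency calculation can be sketched as follows. Both inputs are cumulative counters, so the average over a window is the change in time divided by the change in count. The function name `average_latency_ms` and the sample numbers are hypothetical.

```python
from typing import Optional


def average_latency_ms(count_start: float, time_start_s: float,
                       count_end: float, time_end_s: float) -> Optional[float]:
    """Average time per operation, in milliseconds, between two scrapes.

    Returns None when no operations completed in the window, since the
    average is undefined in that case.
    """
    operations = count_end - count_start
    if operations <= 0:
        return None
    return 1000.0 * (time_end_s - time_start_s) / operations


# Hypothetical scrapes of elasticsearch_indices_indexing_index_total and
# elasticsearch_indices_indexing_index_time_seconds_total:
# 500 index operations took 2.5 seconds in total.
print(average_latency_ms(1000, 50.0, 1500, 52.5))  # 5.0
```

The same calculation applies to the query, fetch, refresh, and flush counter pairs.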

0x05 Resource saturation

| Metric name | Description |
| --- | --- |
| elasticsearch_thread_pool_queue_count | Thread pool operations queued (number of operations waiting in the thread pool queue) |
| elasticsearch_thread_pool_rejected_count | Thread pool operations rejected (number of operations rejected by the thread pool) |
| elasticsearch_indices_fielddata_memory_size_bytes | Field data cache memory usage in bytes (fielddata cache size) |
| elasticsearch_indices_fielddata_evictions | Evictions from field data (number of fielddata cache evictions) |
| elasticsearch_indices_filter_cache_memory_size_bytes | Filter cache memory usage in bytes (filter cache size) |
| elasticsearch_indices_filter_cache_evictions | Evictions from filter cache (number of filter cache evictions) |
| elasticsearch_cluster_health_number_of_pending_tasks | Cluster level changes which have not yet been executed (number of pending tasks) |
| elasticsearch_indices_get_missing_total | Total get missing (number of get requests for missing documents) |
| elasticsearch_indices_get_missing_time_seconds | Total time of get missing in seconds (time spent on get requests for missing documents) |

Configure dashboard views from the metrics above. Elasticsearch nodes use thread pools to manage how threads consume memory and CPU, so the request queues and the number of rejected requests indicate whether the cluster has enough nodes.

Every Elasticsearch node maintains many types of thread pools. In general, the most important ones are search, index, merge, and bulk.

The size of each thread pool queue shows how many requests are waiting to be served on the current node. Once a thread pool reaches its maximum queue size (the default differs by thread pool type), all subsequent requests are rejected by that pool.
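One possible way to turn queue and rejection readings into a saturation check is sketched below in Python. The `PoolSample` type, the `saturated_pools` function, and the queue threshold of 100 are all illustrative assumptions; real queue limits depend on the pool type and the Elasticsearch version.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PoolSample:
    queued: int          # current queue length for the pool
    rejected_total: int  # cumulative number of rejected operations


def saturated_pools(previous: Dict[str, PoolSample],
                    current: Dict[str, PoolSample],
                    queue_threshold: int = 100) -> Dict[str, List[str]]:
    """Map pool name to the reasons it looks saturated.

    A pool is flagged when its rejection counter grew since the previous
    sample or when its queue length reached `queue_threshold`.
    """
    findings: Dict[str, List[str]] = {}
    for name, now in current.items():
        reasons: List[str] = []
        before = previous.get(name)
        if before is not None and now.rejected_total > before.rejected_total:
            reasons.append(f"{now.rejected_total - before.rejected_total} new rejections")
        if now.queued >= queue_threshold:
            reasons.append(f"queue length {now.queued}")
        if reasons:
            findings[name] = reasons
    return findings


# Hypothetical readings from two consecutive scrapes.
previous = {"search": PoolSample(0, 10), "bulk": PoolSample(5, 0)}
current = {"search": PoolSample(3, 10), "bulk": PoolSample(180, 4)}
print(saturated_pools(previous, current))
```

Rejections are compared between samples because the counter is cumulative: a large but unchanged total only says that rejections happened at some point in the past.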


Copyright notice
This article was written by [Jetpropelledsnake21]. Please include a link to the original when reposting. Thank you.
https://cdmana.com/2020/12/20201225111921907e.html
