ES deployment:
5 nodes (spec: 8 cores, 64 GB RAM, 1 TB disk each), 320 GB RAM and 5 TB disk in total.
About 10+ indexes with 5 shards each; roughly 2 GB / 40 million rows of new data per day, retained for 30 days.
# Filesystem cache

Data you write to es is ultimately written to disk files, and at query time the operating system automatically caches the contents of those disk files in the filesystem cache.

es search relies heavily on the underlying filesystem cache: the more memory you give the filesystem cache — ideally enough to hold all the index segment files — the more your searches run purely against memory, and the higher the performance.

How big can the gap be? We have done a lot of testing and load testing. If a query goes to disk, it takes seconds — search performance is on the order of seconds: 1 s, 5 s, 10 s. If it is served from the filesystem cache, it is pure memory, and performance is generally an order of magnitude better than disk — basically milliseconds, from a few milliseconds to a few hundred milliseconds.
Here is a real case. One company's es cluster had 3 machines that seemed to have plenty of memory — 64 GB each, 64 * 3 = 192 GB in total. The es jvm heap on each machine was 32 GB, which leaves 32 GB per machine — 32 * 3 = 96 GB cluster-wide — for the filesystem cache. Meanwhile, the index data files occupied 1 TB of disk across the 3 machines; with 1 TB of es data, each machine held about 300 GB. How does this perform? With roughly 100 GB of filesystem cache, only about one tenth of the data fits in memory; everything else sits on disk, so most search operations go to disk, and performance is bound to be poor.
In the final analysis, for es to perform well, in the best case your machines' memory can hold at least half of your total data volume. In our own production experience, the ideal setup keeps only a small amount of data in es — just the indexes you actually search on. If the memory left for the filesystem cache is 100 GB, then keep the index data within 100 GB; that way almost all of your searches run in memory, performance is very high, and queries generally complete within 1 second.
For example, say each row has 30 fields: id, name, age, and so on. But you only ever search by three of them: id, name, age. If you naively write all 30 fields of each row into es, then 90% of the data is never searched, yet it still occupies filesystem cache space on the es machines — and the larger each individual document is, the less data the filesystem cache can hold. In fact, you only need to write the few fields used for retrieval into es — for example, just id, name, age — and keep the other fields in mysql/hbase. We usually recommend an es + hbase architecture.
hbase's defining trait is that it suits online storage of massive data: you can write huge volumes into hbase, but you should not run complex searches against it — only very simple lookups by id, or range queries. So: search es by name and age, get back, say, 20 doc ids; then, for each doc id, go to hbase to fetch the complete row, and return the results to the front end.

The data written to es should preferably be less than or equal to — or only slightly larger than — the memory capacity of es's filesystem cache. Then retrieving from es might cost 20 ms, and fetching the 20 rows from hbase by the ids es returned might cost another 30 ms. If instead you had put the whole 1 TB into es, every query might take 5~10 s; this way, every query takes around 50 ms.
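A minimal sketch of this es + hbase pattern, with plain dicts standing in for both stores (all names and data here are illustrative, not the article's actual schema):

```python
# "es": only the fields used for retrieval, keyed by doc id
es_index = {
    "1": {"name": "zhangsan", "age": 25},
    "2": {"name": "lisi", "age": 30},
    "3": {"name": "zhangsan", "age": 30},
}

# "hbase": the complete rows (imagine all 30 fields), keyed by the same doc id
hbase_table = {
    "1": {"name": "zhangsan", "age": 25, "address": "...", "phone": "..."},
    "2": {"name": "lisi", "age": 30, "address": "...", "phone": "..."},
    "3": {"name": "zhangsan", "age": 30, "address": "...", "phone": "..."},
}

def search(name, age):
    # step 1: the "es" search returns only doc ids
    doc_ids = [doc_id for doc_id, doc in es_index.items()
               if doc["name"] == name and doc["age"] == age]
    # step 2: fetch the complete rows from "hbase" by doc id
    return [hbase_table[doc_id] for doc_id in doc_ids]

print(search("zhangsan", 30))  # the full row for doc id "3"
```

The point of the design: es only caches the small searchable subset, so it fits in the filesystem cache, while the bulky full rows live in a store built for cheap id lookups.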
# Data preheating
Even if you follow the scheme above, each machine in the es cluster may still hold more data than its filesystem cache — for example, 60 GB written to a machine with only 30 GB of filesystem cache, leaving 30 GB of data on disk. In that case, you can do data preheating.

Take Weibo as an example: for big-V accounts whose data lots of people view, build a background system ahead of time that, at regular intervals, searches for that hot data and flushes it into the filesystem cache; when real users later look at this hot data, it is served straight from memory — fast. The same goes for e-commerce: for the hottest products, say the iPhone 8, have a background job access them every minute or so, keeping them flushed into the filesystem cache.

For data you believe is hot and frequently accessed, it is best to build a dedicated cache-preheating subsystem: touch the hot data in advance at regular intervals so it enters the filesystem cache, and the next time someone accesses it, performance will be much better.
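A minimal sketch of such a preheating subsystem: a background loop that periodically re-issues queries for known-hot keys, so the reads themselves keep the relevant index files in the OS's filesystem cache. The hot-query list and the `run_query` callback are assumptions — in production `run_query` would call your es cluster:

```python
import time

HOT_QUERIES = ["iphone 8", "big-V user timeline"]  # hypothetical hot data

def warm_cache(run_query, queries, interval_s=60, rounds=1):
    """Touch each hot query once per round; sleep `interval_s` between rounds."""
    touched = []
    for _ in range(rounds):
        for q in queries:
            run_query(q)       # the read itself pulls data into the os cache
            touched.append(q)
        time.sleep(0)          # stand-in for time.sleep(interval_s) in a real job
    return touched

# usage with a dummy query function
touched = warm_cache(lambda q: None, HOT_QUERIES, rounds=2)
print(touched)
```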
# Hot/cold separation

es supports something similar to a mysql horizontal split: write the rarely accessed, low-frequency data into one index, and the frequently accessed hot data into a separate index. It is best to write cold data into one index and hot data into another: this ensures that once the hot data has been preheated, it stays in the filesystem os cache as much as possible and is not evicted by cold data.

Suppose you have 6 machines and 2 indexes — one for cold data, one for hot — each with 3 shards: 3 machines host the hot-data index, the other 3 host the cold-data index. Most of the time you are accessing the hot-data index; hot data might be only 10% of the total, so the volume is small and almost all of it stays in the filesystem cache, guaranteeing fast hot-data access. Cold data lives in a different index on different machines, with no overlap between the two. When someone accesses cold data, much of it is on disk and performance is poor — but if only 10% of users access cold data while 90% access hot data, that is acceptable.
# document model design

With MySQL we often have complex association (join) queries. How does this play out in es? Try not to use complex association queries in es; once you do, performance is generally poor. It is better to complete the join in the Java layer first and write the already-associated data directly into es; then at search time you don't need es search syntax to perform joins.

document model design is very important. Don't plan on performing all kinds of complicated operations at search time: es only supports so much, so don't try to make es do things it handles badly. If such an operation really is needed, try to complete it at write time, when designing the document model. Also avoid operations that are too complex, such as join / nested / parent-child searches, wherever possible — their performance is very poor.
# Paging performance optimization

Pagination in es is a notorious pitfall. Why? An example: if each page shows 10 rows and you want page 100, es actually has to fetch the top 1000 rows stored on every shard onto a coordinating node. With 5 shards, that's 5000 rows; the coordinating node then merges and processes those 5000 rows to arrive at the final 10 rows of page 100.

It is a distributed system: to get the 10 rows of page 100, it is impossible for each of the 5 shards to just return 2 rows and have the coordinating node merge them into 10. Every shard must return all 1000 candidate rows; the coordinating node then sorts and filters them according to your request, and finally pages through them to extract page 100. The deeper you page, the more rows each shard returns, and the longer the coordinating node takes to process them — very painful. So when you paginate with es, you will find that the further back you page, the slower it gets.

We hit exactly this problem: paginating with es, the first few pages took tens of milliseconds, but by page 10 or a few dozen pages in, it took 5~10 seconds to produce a single page of data.
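The cost described above is easy to put in numbers: for page `n` with `page_size` rows, every shard must return its top `n * page_size` candidates, and the coordinating node merges all of them:

```python
def coordinator_load(page, page_size, shards):
    """Return (docs each shard must return, total docs the coordinator merges)."""
    per_shard = page * page_size          # each shard's top-N candidates
    return per_shard, per_shard * shards  # merged centrally for one page

# page 100 with 10 docs/page over 5 shards:
per_shard, merged = coordinator_load(100, 10, 5)
print(per_shard, merged)  # 1000 docs per shard, 5000 docs merged — for 10 results
```

This is why the work grows linearly with page depth even though the final page is always 10 rows.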
Are there solutions?

# Don't allow deep paging (deep-paging performance is poor by default)

Tell the product manager that the system simply does not allow paging that deep: by default, the deeper the page, the worse the performance.
# Like an app's recommendation feed: pull pages out one after another

For Weibo-style pull-to-load, where pages are fetched one after another as you scroll, you can use the scroll api (search online for the usage details).

scroll takes a one-time snapshot of all the data, and then each time you scroll on to the next page, it moves through the data via the scroll_id cursor to fetch what that page looks like. Performance is far better than the paging described above — basically milliseconds.

The one caveat: this suits Weibo-style pull-down, page-by-page consumption, not jumping to an arbitrary page. In other words, you can't go to page 10, then page 120, then back to page 58 — no random jumps. That's why many products don't let you page at will: apps, and some websites, only let you pull down and load page by page.

At initialization you must specify the scroll parameter, telling es how long to keep the context of this search alive. You need to be sure users won't keep paging for hours on end, otherwise the request may fail due to timeout.
Besides the scroll api, you can also do this with search_after. The idea of search_after is to use the results of the previous page to help retrieve the next page. Obviously this approach also doesn't let you jump to an arbitrary page; you can only page forward one page at a time. At initialization, you need a field with unique values as the sort field.
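The search_after idea can be sketched over an in-memory sorted list: each page is "everything with a sort value greater than the last value of the previous page", so no shard ever has to materialize the earlier pages. (In real es, you pass the previous page's last sort values in the `search_after` field of the query body; this is just the logic in miniature.)

```python
docs = [{"id": i} for i in range(1, 26)]  # already sorted by the unique "id" field

def page_after(last_id, size=10):
    """Return the next page after `last_id`, plus the cursor for the page after."""
    hits = [d for d in docs if last_id is None or d["id"] > last_id][:size]
    next_after = hits[-1]["id"] if hits else None
    return hits, next_after

page1, after1 = page_after(None)   # ids 1..10
page2, after2 = page_after(after1)
print([d["id"] for d in page2])    # ids 11..20 — forward-only paging
```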
1.1 Design-phase tuning
(1) Create indexes from date-based templates according to the business's incremental needs, and roll the indexes with the rollover API;
(2) Use aliases for index management;
(3) Run a force_merge on the indexes in the small hours each day to free space;
(4) Adopt hot/cold separation: store hot data on SSD to improve retrieval efficiency, and periodically shrink cold data to reduce storage;
(5) Use curator for index lifecycle management;
(6) Configure analyzers only for the fields that need tokenization, and choose them sensibly;
(7) At the Mapping stage, set each field's properties according to how it is used — whether it needs to be retrievable, stored, and so on.
1.2 Write tuning
(1) Set the number of replicas to 0 before bulk loading;
(2) Before writing, set refresh_interval to -1 to disable the refresh mechanism;
(3) During writes, use bulk batch writes;
(4) After writing, restore the replica count and refresh interval;
(5) Prefer auto-generated ids.
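The bulk-load settings dance above can be sketched as the request bodies involved (index name and values are illustrative; in production the settings dicts would go to the `_settings` endpoint and the NDJSON body to `_bulk`):

```python
import json

# (1)+(2): before loading — no replicas, refresh disabled
before_load = {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}
# (4): after loading — restore replicas and the refresh interval
after_load = {"index": {"number_of_replicas": 1, "refresh_interval": "1s"}}

def bulk_body(index, docs):
    """Build an NDJSON _bulk body; omitting _id lets es auto-generate ids (5)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # source line
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

body = bulk_body("blog", [{"title": "a"}, {"title": "b"}])
print(body)
```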
1.3 Query tuning
(1) Avoid wildcard queries;
(2) Avoid terms queries with huge value lists (hundreds of values);
(3) Make full use of the inverted index: use the keyword type wherever possible;
(4) With large data volumes, narrow down the target indexes by time before searching;
(5) Set up a sensible routing mechanism.
1.4 Other tuning
Deployment tuning, business-level coordination, and so on.
Covering part of the above is usually enough; the interviewer is mostly gauging your prior hands-on or operations experience.
# How it works

# The es write process

(1) The client sends a request to some node, which becomes the coordinating node.
(2) The coordinating node routes the document and forwards the request to the corresponding node (the one holding the primary shard).
(3) The primary shard on that node processes the request, then synchronizes the data to the replica shards.
(4) When the coordinating node finds that the primary shard and all the replica shards are done, it returns the response to the client.
# The es read process

To query by doc id: es hashes the doc id to determine which shard the doc was assigned to at write time, and queries that shard.

(1) The client sends a request to any node, which becomes the coordinating node.
(2) The coordinating node hash-routes by doc id and forwards the request to the corresponding node; a round-robin random polling algorithm picks one shard among the primary shard and all its replicas, load-balancing read requests.
(3) The node that receives the request returns the document to the coordinating node.
(4) The coordinating node returns the document to the client.
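The hash routing in step (2) boils down to `shard = hash(doc_id) % number_of_primary_shards`. A sketch — es actually uses murmur3 on the routing string; `zlib.crc32` here is just a stand-in for a deterministic hash:

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # fixed at index creation — which is why it can't change

def route(doc_id):
    """Map a doc id to a primary shard number, deterministically."""
    return zlib.crc32(doc_id.encode()) % NUM_PRIMARY_SHARDS

shard = route("doc-42")
print(shard)
assert route("doc-42") == shard  # reads always land where the write went
```

The same formula explains why the number of primary shards is immutable: changing it would re-route every existing doc id to a different shard.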
# The es search process

What es is best at is full-text search. Say you have three rows of data; searching for the keyword java returns the documents that contain it — es might give you back "java is fun" and "java is hard to learn".

(1) The client sends the request to a coordinating node.
(2) The coordinating node forwards the search request to all the shards — the corresponding primary shard or any replica shard, either is fine.
(3) query phase: each shard returns its own search results (really just doc ids) to the coordinating node, which merges, sorts, and pages the data to produce the final result.
(4) fetch phase: the coordinating node then goes to the nodes by doc id to pull the actual document data, and finally returns it to the client.

Write requests go to the primary shard and are then synchronized to all replica shards; read requests can be served from the primary shard or any replica shard, using the random polling algorithm.
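The two-phase search above can be simulated in miniature — in the query phase only (doc id, score) pairs travel; documents are pulled only in the fetch phase (data and scores are made up):

```python
# two "shards", each mapping doc_id -> (text, score)
shards = [
    {"a": ("java is fun", 1.2), "b": ("python", 0.3)},
    {"c": ("java is hard to learn", 0.9)},
]

def query_phase(term):
    hits = []
    for shard in shards:                      # each shard searches locally
        for doc_id, (text, score) in shard.items():
            if term in text:
                hits.append((score, doc_id))  # only ids + scores travel
    hits.sort(reverse=True)                   # coordinator merges and sorts
    return [doc_id for _, doc_id in hits]

def fetch_phase(doc_ids):
    docs = {k: v[0] for shard in shards for k, v in shard.items()}
    return [docs[d] for d in doc_ids]         # pull the actual documents by id

print(fetch_phase(query_phase("java")))
```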
# The underlying write path

Data is first written to an in-memory buffer; while it sits in the buffer it cannot be searched. At the same time, the data is written to the translog log file.

When the buffer is nearly full, or at a fixed interval, the buffer's contents are refreshed into a new segment file — but the data does not go straight into the on-disk segment file; it enters the os cache first. This happens every 1 second: es writes the buffer's data into a new segment file, so every second produces a new disk file, a segment file, containing whatever was written to the buffer during the last second. If the buffer happens to be empty at that moment, the refresh is of course skipped; if it has data, a refresh runs once per second by default, flushing it into a new segment file.
In the operating system, disk files have a layer in front of them called the os cache — the operating system cache: before data is written to a disk file, it first enters this OS-level memory cache. As soon as the buffer's data has been refreshed into the os cache, it becomes searchable.
Why is es called near real-time? NRT stands for near real-time. By default a refresh runs every 1 second, so es is near real-time: written data becomes visible after about 1 second. Through the es restful api or java api you can also trigger a manual refresh, flushing the buffer's data into the os cache so that it becomes searchable immediately. Once the data is in the os cache, the buffer is emptied — there is no need to keep the data in the buffer any longer, because it has already been persisted to disk in the translog.
As these steps repeat, new data keeps entering the buffer and translog, and buffer contents keep being written into one segment file after another; each refresh empties the buffer while the translog is retained. As this goes on, the translog grows larger and larger, and when it reaches a certain size, a commit operation is triggered.
The first step of the commit refreshes whatever is in the buffer into the os cache and empties the buffer. Then a commit point is written to a disk file, identifying all the segment files that this commit point covers, and everything currently in the os cache is forcibly fsync'ed to the disk files. Finally, the existing translog is emptied and a new translog is started; at this point the commit is complete.
This commit operation is called a flush. By default a flush runs automatically every 30 minutes, and it is also triggered when the translog grows too big. The flush operation corresponds to the whole commit process; through the es api we can also trigger a manual flush, forcibly fsync'ing the os cache's data to disk.
What is the translog log file for? Before a commit runs, data sits either in the buffer or in the os cache; both are memory, so if the machine dies, all the data in memory is lost. That is why each operation is also written to a dedicated log file, the translog: if the machine goes down, then on restart es automatically reads the translog and restores the data into the in-memory buffer and os cache.

The translog itself is actually written to the os cache first and fsync'ed to disk every 5 seconds by default. So by default, up to 5 seconds of data may exist only in the buffer or in the os cache of the translog or segment files; if the machine crashes at that moment, those 5 seconds of data are lost. The trade-off is better performance — you lose at most 5 seconds of data. You can also configure the translog to fsync to disk on every single write operation, but performance drops considerably.
Actually, if the interviewer hasn't asked about es losing data, you can shine here by pointing it out yourself: es is, first, near real-time — data becomes searchable about 1 second after it is written; and second, it can lose data — up to 5 seconds' worth sitting in the buffer, the translog os cache, and the segment file os cache rather than on disk. If the machine goes down at that moment, those 5 seconds of data are lost.
To summarize: data is first written to the in-memory buffer, and every 1 s it is refreshed to the os cache, at which point it becomes searchable (hence the 1 s delay between writing and searchability we mentioned). Every 5 s the data is fsync'ed to the translog file on disk (so if the machine goes down and the in-memory data is lost, at most about 5 s of data is gone). When the translog reaches a certain size, or every 30 min by default, a commit is triggered, flushing all buffered data into segment files on disk.

Once data is written into a segment file, the inverted index for it is built at the same time.
# The underlying delete / update path

For a delete, the commit generates a .del file in which certain docs are marked as deleted; a search then consults the .del file to know whether a doc has been deleted. For an update, the original doc is marked as deleted and a new row of data is written.

Each refresh of the buffer produces a new segment file — one per second by default — so segment files accumulate, and a periodic merge runs. Each merge combines multiple segment files into one, and along the way physically deletes the docs marked as deleted; the new segment file is then written to disk, a commit point is written identifying all the new segment files, the new segment files are opened for search, and the old segment files are deleted.
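The merge step can be sketched as combining several small segments into one while dropping docs a `.del` set marks as deleted (segments and contents are made up; the `.del` set here tags (segment, doc_id) pairs):

```python
segments = [
    {"1": "doc one", "2": "doc two"},
    {"3": "doc three"},
    {"2": "doc two v2"},        # an update wrote a new version into a newer segment
]
deleted = {("0", "2")}          # ".del": old version of doc 2 in segment 0

def merge(segments, deleted):
    merged = {}
    for seg_no, seg in enumerate(segments):   # newer segments win on id clashes
        for doc_id, doc in seg.items():
            if (str(seg_no), doc_id) in deleted:
                continue                       # physically drop deleted docs
            merged[doc_id] = doc
    return merged

print(merge(segments, deleted))  # doc 2 survives only in its updated version
```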
# Lucene underneath

Simply put, lucene is just a jar package containing various packaged algorithm code for building inverted indexes. When we develop in Java, we pull in the lucene jar and develop against the lucene api. Through lucene we can index our existing data, and lucene organizes the index's data structures for us on the local disk.
# Inverted index

In a search engine, each document has a corresponding document ID, and the document's content is represented as a set of keywords. For example, after tokenization, document 1 yields 20 keywords, each of which records how many times and where it appears in the document.

The inverted index, then, is the mapping from keywords to document IDs: each keyword corresponds to a series of documents — the documents in which that keyword appears.

An example.
Given the following documents:

| DocID | Content |
| 1 | The father of Google Maps jumps ship to Facebook |
| 2 | The father of Google Maps joined Facebook |
| 3 | Google Maps founder Lars left Google to join Facebook |
| 4 | The father of Google Maps jumps to Facebook; related to the cancellation of the Wave project |
| 5 | Lars, the father of Google Maps, joins social networking site Facebook |
After tokenizing the documents, we get the following inverted index:

| WordID | Term | DocIDs |
| 1 | Google | 1, 2, 3, 4, 5 |
| 2 | Maps | 1, 2, 3, 4, 5 |
| 3 | father of | 1, 2, 4, 5 |
| 5 | Facebook | 1, 2, 3, 4, 5 |
| 6 | join | 2, 3, 5 |
In addition, a practical inverted index records extra information, such as document frequency: how many documents in the collection contain each word.

With an inverted index, the search engine can easily respond to user queries: it looks each query term up in the dictionary and immediately reads off the list of documents containing it.

Note two important details of the inverted index:

- every term in the inverted index corresponds to one or more documents;
- the terms in the inverted index are sorted in ascending dictionary order.

(The above is only a simple example, and is not strictly in ascending dictionary order.)
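Building such an inverted index from the raw documents is a few lines: tokenize each document and map every term to the sorted list of doc IDs containing it (the texts below are lower-cased, lightly simplified versions of the five example documents):

```python
docs = {
    1: "the father of google maps job hopping to facebook",
    2: "the father of google maps joined facebook",
    3: "google maps founder lars left google to join facebook",
    4: "the father of google maps job hopping to facebook and wave",
    5: "lars the father of google maps joins facebook",
}

def build_inverted_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):               # unique terms per doc
            index.setdefault(term, []).append(doc_id)
    # sort postings, and sort terms in ascending dictionary order
    return {t: sorted(ids) for t, ids in sorted(index.items())}

inv = build_inverted_index(docs)
print(inv["google"])    # [1, 2, 3, 4, 5]
print(inv["facebook"])  # [1, 2, 3, 4, 5]
print(inv["lars"])      # [3, 5]
```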
# What is elasticsearch's inverted index

Interviewer's intent: check your understanding of the basic concepts.

Answer: a plain-language explanation is fine.

Traditional retrieval scans through the articles one by one to find the positions of the given keywords. An inverted index instead tokenizes the text to form a mapping table from words to articles; this dictionary + mapping table is the inverted index. With an inverted index, an article can be retrieved with O(1)-like time complexity, greatly improving retrieval efficiency.

The academic formulation: an inverted index inverts "which words does an article contain" — it starts from the word, recording which documents that word has appeared in. It consists of two parts: the dictionary and the postings list.

Bonus points: the inverted index's underlying implementation is based on the FST (Finite State Transducer) data structure. Since lucene 4+, FST has been used extensively. FST has two advantages:
(1) small space footprint — by sharing the prefixes and suffixes of words in the dictionary, it compresses the storage space;
(2) fast lookups — O(len(str)) query time complexity.
# 3. What to do when elasticsearch index data grows too large; how to tune and deploy

Interviewer's intent: gauge your operations ability at large data volumes.

Answer: index data should be planned in advance — as the saying goes, "design first, code later" — so that a sudden data explosion never leaves the cluster short of processing power and impacts online retrieval or other business.

How to tune is as described in question 1; zooming in here:

3.1 Dynamic index level

Create indexes by rolling a template + timestamp with the rollover api. For example, define at design time that the blog index template takes the form blogindex_timestamp, with data growing daily. The benefit: the data volume of a single index is never allowed to grow without bound toward the online per-shard limit of 2^31 - 1 docs, with index storage reaching TB+ or more.

Once a single index gets that large, storage and other risks come with it — so think ahead and avoid the problem early.

3.2 Storage level

Store hot and cold data separately: hot data (say the most recent 3 days' or week's worth), everything else cold. Cold data receives no new writes, so consider periodic force_merge plus shrink compaction, which saves storage space and improves retrieval efficiency.

3.3 Deployment level

If no planning was done, this is the emergency strategy. ES supports dynamic scaling, so machines can be added dynamically to relieve cluster pressure. Note: if the master nodes were planned sensibly beforehand, machines can be added dynamically without restarting the cluster.
# 4. How elasticsearch elects a master

Interviewer's intent: probe the ES cluster internals, no longer the business side.

Premises:
(1) only master-eligible nodes (master: true) can become the master node;
(2) the minimum master nodes setting (min_master_nodes) exists to prevent split-brain.

Looking at the code, the core entry point is findMaster, which returns the elected Master, or null if none is chosen. The election roughly goes:

Step 1: confirm the number of master-eligible candidates reaches the threshold set in elasticsearch.yml;
Step 2: compare: first check master eligibility, preferring nodes with candidate-master qualification; if both nodes are candidate master nodes, the one with the smaller id becomes the master. Note that the id is of string type.

Aside: look into how a node's id is obtained.
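The "smaller id wins" comparison can be sketched as a sort over the master-eligible nodes. Node ids are strings, so the order is lexicographic — exactly the point the text makes about the id being of string type (node data below is illustrative):

```python
def find_master(nodes, min_master_nodes):
    """Pick the master-eligible node with the smallest id, or None if quorum fails."""
    candidates = sorted(n["id"] for n in nodes if n["master_eligible"])
    if len(candidates) < min_master_nodes:
        return None                 # not enough candidates: no election
    return candidates[0]            # smallest (lexicographic) id becomes master

nodes = [
    {"id": "node-b", "master_eligible": True},
    {"id": "node-a", "master_eligible": True},
    {"id": "node-c", "master_eligible": False},  # data-only node, never master
]
print(find_master(nodes, 2))  # node-a
```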
# Describe in detail the Elasticsearch document indexing process

Interviewer's intent: probe the ES internals, no longer the business side.

"Indexing documents" here should be understood as writing a document into ES — the process of creating its index entry.

Document writes include single-document writes and batch bulk writes; here we only explain the single-document write flow.

Recall the diagram in the official documentation:

Step 1: the client sends a write request to a node of the cluster. (If no routing / coordinating node is specified, the node that receives the request acts as the routing node.)
Step 2: after node 1 receives the request, it uses the document _id to determine that the document belongs to shard 0, and forwards the request to another node — say node 3 — because shard 0's primary shard happens to be allocated on node 3.
Step 3: node 3 performs the write on the primary shard; if it succeeds, it forwards the request in parallel to the replica shards on node 1 and node 2 and waits for the results. Once all replicas report success, node 3 reports success to the coordinating node (node 1), and node 1 reports the successful write to the requesting client.

If the interviewer asks further — how is the document's shard determined in step 2? Answer: by the routing algorithm, which computes the target shard id from the routing value and the document id.
# Describe in detail the Elasticsearch search process

Interviewer's intent: probe the ES search internals, no longer the business side.

Search breaks down into two phases: "query then fetch".

The purpose of the query phase: locate which documents match — where they are, not what they contain.

The steps:
(1) suppose an index has 10 shards in total — 5 primaries + 1 replica each; a request hits one of them (a primary or a replica shard);
(2) each shard runs the query locally, putting its results into a local sorted priority queue;
(3) the results from step (2) are sent to the coordinating node, which produces a globally sorted list.

The purpose of the fetch phase: fetch the data. The coordinating node retrieves all the matching documents and returns them to the client.
# What Linux settings can be optimized when deploying Elasticsearch

Interviewer's intent: gauge ES cluster operations ability.

(1) disable swap;
(2) set the heap memory to min(node memory / 2, 32 GB);
(3) raise the maximum number of file handles;
(4) tune thread pool and queue sizes according to the workload;
(5) for disk storage, use RAID where conditions allow — RAID10 improves single-node performance and guards against single-disk storage failures.
# What is lucene's internal structure?

Interviewer's intent: gauge the breadth and depth of your knowledge.

Lucene has two processes, indexing and searching, covering three main topics: index creation, the index structure itself, and search. You can answer along that thread.
# How does Elasticsearch elect a Master?

(1) Master election in Elasticsearch is handled by the ZenDiscovery module, which mainly comprises Ping (the RPC through which nodes discover each other) and Unicast (the unicast module holds a host list that controls which nodes need to be pinged);
(2) all master-eligible nodes (node.master: true) are sorted by nodeId in dictionary order; in each election round, every node ranks the nodes it knows of and votes for the first node (position 0), treating it as the master;
(3) if the votes for some node reach a threshold (configurable as master-eligible node count n/2 + 1) and that node also votes for itself, it becomes the master; otherwise a new election round starts, until the above conditions are met;
(4) addendum: the master node's responsibilities mainly cover cluster, node, and index management, not document-level management; a data node can have its http function disabled.
# 10. Among the Elasticsearch nodes (say 20), 10 elected one master and the other 10 elected another — what now?

(1) when there are no fewer than 3 master candidates, solve the split-brain problem by setting the minimum number of passing votes (discovery.zen.minimum_master_nodes) to more than half of all candidate nodes;
(2) when there are only two candidates, make one of them the sole master candidate and the other a data node, avoiding the split-brain problem.
# When a client connects to the cluster, how does it select the node that executes the request?

TransportClient uses the transport module to connect remotely to an elasticsearch cluster. It does not join the cluster; it merely obtains one or more initial transport addresses and communicates with them by polling.
# Describe in detail the Elasticsearch document indexing process.

By default the coordinating node uses the document ID in the calculation (passing a routing value is also supported) to pick the appropriate shard for the request.

(1) when the node hosting the shard receives the coordinating node's request, it writes the request to the Memory Buffer, then periodically (every 1 second by default) writes it to the Filesystem Cache; this move from Memory Buffer to Filesystem Cache is called refresh;
(2) of course, in some cases data in the Memory Buffer and Filesystem Cache can be lost, so ES uses the translog to guarantee data reliability: the request is also written to the translog when it is received, and only once the data in the Filesystem Cache has been written to disk is the translog cleared — this process is called flush;
(3) during a flush, the in-memory buffer is cleared and its content written into a new segment; the segments are fsync'ed, a new commit point is created and flushed to disk, and the old translog is deleted and a new one started;
(4) a flush is triggered on a timer (every 30 minutes by default) or when the translog gets too big (512 MB by default).

Addendum on Lucene segments:
(1) a Lucene index consists of multiple segments, and a segment is itself a fully functional inverted index;
(2) segments are immutable, which allows Lucene to add new documents to the index incrementally rather than re-indexing from scratch;
(3) every search request searches all segments in the index, and each segment consumes CPU cycles, file handles, and memory — so the more segments there are, the lower the search performance;
(4) to counter this, Elasticsearch merges small segments into one larger segment, commits the newly merged segment to disk, and deletes the old segments.
# Describe in detail how Elasticsearch updates and deletes documents.
（1）Deletes and updates are also write operations, but documents in Elasticsearch are immutable, so they cannot be deleted or modified in place;
（2）Every segment on disk has a corresponding .del file. When a delete request is issued, the document is not actually deleted; it is only marked as deleted in the .del file. The document still matches queries but is filtered out of the results. When segments are merged, documents marked as deleted in the .del file are not written into the new segment.
（3）When a new document is created, Elasticsearch assigns it a version number. When an update is performed, the old version of the document is marked as deleted in the .del file and the new version is indexed into a new segment. The old version still matches queries but is filtered out of the results.
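The .del mechanics can be modeled with a toy sketch (hypothetical classes, not Elasticsearch internals):

```python
class Segment:
    """Toy model of an immutable segment plus its .del tombstone set."""
    def __init__(self):
        self.docs = {}        # doc_id -> (version, source)
        self.deleted = set()  # stands in for the .del file

def update(old_seg, new_seg, doc_id, version, source):
    # An update marks the old version as deleted and indexes the
    # new version into a fresh segment; nothing is rewritten in place.
    if doc_id in old_seg.docs:
        old_seg.deleted.add(doc_id)
    new_seg.docs[doc_id] = (version, source)

def search(segments, doc_id):
    # Tombstoned docs still "match" inside their segment but are
    # filtered out of the results.
    return [seg.docs[doc_id] for seg in segments
            if doc_id in seg.docs and doc_id not in seg.deleted]
```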
# Describe in detail the Elasticsearch search process.
（1）A search runs as a two-phase process known as query then fetch;
（2）In the initial query phase, the query is broadcast to a copy of every shard in the index (either a primary or a replica). Each shard executes the search locally and builds a priority queue of matching documents of size from + size.
PS: A search reads from the filesystem cache, but some data may still be in the memory buffer, so search is near real time.
（3）Each shard returns the IDs and sort values of all documents in its priority queue to the coordinating node, which merges them into its own priority queue to produce a globally sorted result list.
（4）Next comes the fetch phase. The coordinating node identifies which documents need to be retrieved and issues multi-GET requests to the relevant shards. Each shard loads and enriches the documents, if necessary, then returns them to the coordinating node. Once all documents have been retrieved, the coordinating node returns the results to the client.
（5）Addendum: the query-then-fetch search type scores document relevance using only the data of each individual shard, which may not be accurate enough when the document count is small. DFS query then fetch adds a pre-query phase that gathers term and document frequencies first, which makes the scores more accurate but the performance worse.
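The coordinating node's work in the query phase is essentially a k-way merge of per-shard sorted lists; a minimal sketch:

```python
import heapq

def merge_query_phase(shard_results, size):
    """Toy query-phase merge: each shard returns (score, doc_id) pairs
    already sorted by descending score; the coordinating node merges
    them into one globally sorted top-`size` list."""
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(merged)[:size]
```

Only IDs and sort values travel in this phase; the full documents are pulled afterwards in the fetch phase.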
# In Elasticsearch, how is the inverted index for a given term located?
（1）Lucene's indexing process writes an inverted index into its file format, following the basic workflow of full-text retrieval.
（2）Lucene's search process reads the indexed information back from that file format and then computes a score for each document.
# When deploying Elasticsearch, what Linux settings should be optimized?
（1）Machines with 64 GB of memory are ideal, though 32 GB and 16 GB machines are also common. Less than 8 GB tends to be counterproductive.
（2）If you must choose between faster CPUs and more cores, choose more cores. The extra concurrency from multiple cores is far more valuable than a slightly faster clock rate.
（3）If you can afford SSDs, they far outperform any spinning media. SSD-backed nodes see improvements in both query and indexing performance. If you can afford it, an SSD is a good choice.
（4）Even if the data centers are close together, avoid clusters that span multiple data centers. Absolutely avoid clusters that span large geographic distances.
（5）Make sure the JVM running your application is exactly the same as the server's JVM. In a few places Elasticsearch uses Java's native serialization.
（6）Setting gateway.recover_after_nodes, gateway.expected_nodes, and gateway.recover_after_time avoids excessive shard shuffling when the cluster restarts, which can reduce data recovery from hours to seconds.
（7）Elasticsearch is configured to use unicast discovery by default to prevent nodes from inadvertently joining a cluster. Only nodes running on the same machine form a cluster automatically. It is better to use unicast than multicast.
（8）Do not casually modify the garbage collector (CMS) or the sizes of the thread pools.
（9）Give no more than half of your memory to the Elasticsearch heap (and never exceed 32 GB), set via the ES_HEAP_SIZE environment variable; the rest is left to Lucene via the filesystem cache.
（10）Swapping memory to disk is fatal to server performance. If memory is swapped out, a 100-microsecond operation can become one that takes 10 milliseconds. Now add up all those extra latencies and it is easy to see how terrible swapping is for performance.
（11）Lucene uses a large number of files, and Elasticsearch also uses a large number of sockets to communicate between nodes and with HTTP clients. All of this requires enough file descriptors. You should increase your file descriptor limit to something large, such as 64,000.
Addendum: ways to improve indexing performance
（1）Use bulk requests and size them appropriately: 5-15 MB of data per batch is a good starting point.
（2）Storage: use SSDs.
（3）Segments and merging: Elasticsearch defaults to a merge throttle of 20 MB/s, which should be a good setting for spinning disks. If you are on SSDs, consider raising it to 100-200 MB/s. If you are doing a bulk import and do not care about search at all, you can disable merge throttling entirely. You can also raise index.translog.flush_threshold_size from the default 512 MB to something larger, such as 1 GB, which lets larger segments accumulate in the translog before a flush is triggered.
（4）If your search results do not need near-real-time accuracy, consider raising index.refresh_interval to 30s.
（5）If you are doing a bulk import, consider disabling replicas by setting index.number_of_replicas: 0.
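A minimal sketch of assembling a _bulk request body (the index name and documents are made up; the NDJSON action-line/source-line format is the real one):

```python
import json

def build_bulk_body(index, docs):
    """Build an NDJSON _bulk body: one action line plus one source
    line per document. Aim for roughly 5-15 MB per batch."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

body = build_bulk_body("logs-2024", [("1", {"msg": "hello"}), ("2", {"msg": "world"})])
```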
# Regarding GC, what should you watch out for when using Elasticsearch?
（1）The inverted-index dictionary must stay resident in memory and cannot be GCed, so monitor the growth trend of segment memory on the data nodes.
（2）Set reasonable sizes for the various caches (field cache, filter cache, indexing cache, bulk queue, and so on), and ask whether the heap is sufficient in the worst case: when all the caches are full, is there still heap left for other tasks? Avoid self-deceiving tricks such as clear cache to free memory.
（3）Avoid searches and aggregations that return very large result sets. For scenarios that genuinely need to pull large amounts of data, use the scan & scroll API.
（4）cluster stats resident memory does not scale horizontally; very large clusters should be split into multiple clusters connected via tribe nodes.
（5）To know whether the heap is sufficient, you must consider the actual application scenario and continuously monitor the cluster's heap usage.
（6）Understand the memory requirements from monitoring data and configure the various circuit breakers sensibly to minimize the risk of memory overflow.
# 18. How does Elasticsearch implement aggregations over very large data volumes (hundreds of millions of records)?
The first approximate aggregation Elasticsearch provides is the cardinality metric. It gives the cardinality of a field, that is, the number of distinct or unique values, and is based on the HLL (HyperLogLog) algorithm: HLL first hashes the input, then makes a probabilistic estimate of the cardinality from the bits of the hash. Its characteristics: configurable precision to control memory usage (more precision = more memory); excellent accuracy on small data sets; and a fixed, configurable memory footprint for the deduplication. Whether there are thousands or billions of unique values, memory usage depends only on the precision you configure.
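A hypothetical search body using the cardinality aggregation; precision_threshold is the real knob that trades memory for accuracy, while the field name is made up:

```python
# Hypothetical request body for a cardinality (HLL-based) aggregation.
cardinality_request = {
    "size": 0,  # we only want the aggregation, not the hits
    "aggs": {
        "unique_users": {
            "cardinality": {
                "field": "user_id",          # made-up field name
                "precision_threshold": 3000  # below this count, results are near-exact
            }
        }
    }
}
```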
# 19. Under concurrency, how does Elasticsearch keep reads and writes consistent?
（1）Optimistic concurrency control via version numbers ensures that a newer version is not overwritten by an older one; specific conflicts are left to the application layer to handle;
（2）For write operations, the consistency level supports quorum/one/all (in older Elasticsearch versions), defaulting to quorum, i.e. a write is only allowed when a majority of the shards are available. Even when a majority is available, a write to a replica may still fail for reasons such as network problems; in that case the replica is considered faulty and the shard is rebuilt on a different node.
（3）For read operations, replication can be set to sync (the default), so that the operation returns only after both the primary and replica shards have completed; if replication is set to async, you can also set the search request parameter _preference to primary to query the primary shard and make sure the document is the latest version.
# 20. How do you monitor the state of an Elasticsearch cluster?
Marvel lets you monitor Elasticsearch easily through Kibana. You can view your cluster's health and performance in real time, and also analyze historical cluster, index, and node metrics.
# 21. Describe the overall technical architecture of your e-commerce search.
# Describe your personalized search solution.
Personalized search based on word2vec and Elasticsearch:
（1）Based on word2vec, Elasticsearch, and a custom script plugin, we implemented a personalized search service; compared with the original implementation, the new version's click-through rate and conversion rate improved significantly;
（2）Because word2vec gives us product vectors, we can also use them to recommend similar products;
（3）Using word2vec for personalized search or recommendation has its limitations: it can only handle time-series data such as a user's click history and cannot model user preferences comprehensively, so there is still plenty of room for improvement;
# Do you know the dictionary tree (trie)?
The common dictionary data structures are as follows:
The core idea of a trie is to trade space for time, using the common prefixes of strings to reduce query time and improve efficiency. It has three basic properties:
1）The root node contains no character; every other node contains exactly one character.
2）Concatenating the characters on the path from the root to a given node yields the string corresponding to that node.
3）The children of each node all contain different characters.
（1）As you can see, the number of nodes at level i of a trie can be on the order of 26^i. To save space, we can use dynamic lists, or arrays that simulate dynamic allocation; the space cost will not exceed number of words × word length.
（2）Implementation: open an array of alphabet size for each node, give each node a linked list, and use the left-child right-sibling representation to record the tree;
（3）For a Chinese-language trie, store each node's children in a hash table: no space is wasted, and lookups keep the O(1) complexity of hashing.
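A minimal hash-map trie along the lines of point (3) (a sketch, not production code):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode; a dict gives O(1) child lookup
        self.is_word = False  # True if a word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()  # the root holds no character

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None

    def _walk(self, s):
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node
```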
# How is spelling correction implemented?
（1）Spelling correction is based on edit distance, a standard measure of the minimum number of insert, delete, and replace operations needed to transform one string into another;
（2）Edit distance computation: for example, to compute the edit distance between batyu and beauty, first create a 7×8 table (batyu has length 5, beauty has length 6, and 2 is added to each for the character header and the boundary row/column), then fill in the boundary numbers. Every other cell is the minimum of the following three values:
If the character at the top of the current column equals the character at the left of the current row, take the number diagonally up-left; otherwise take that number +1. (For cell (3,3) this gives 0.)
The number to the left +1 (for cell (3,3), this is 2.)
The number above +1 (for cell (3,3), this is 2.)
The value in the bottom-right corner is then the edit distance: 3.
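The table-filling procedure above is the classic dynamic program; a direct sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic DP edit distance: the minimum number of insert,
    delete, and replace operations turning a into b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # replace or match
    return dp[m][n]

# edit_distance("batyu", "beauty") gives 3, matching the example above.
```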
For spelling correction, consider constructing a metric space (Metric Space), in which every distance satisfies the following three basic conditions:
d(x,y) = 0 -- if the distance between x and y is 0, then x = y
d(x,y) = d(y,x) -- the distance from x to y equals the distance from y to x
d(x,y) + d(y,z) >= d(x,z) -- the triangle inequality
（1）By the triangle inequality: if the distance between the query and some word A is d, then any word B within distance n of the query satisfies d-n <= d(A,B) <= d+n.
（2）A BK-tree is built as follows: each node can have any number of children, and each edge is labeled with an edit distance. All children on the edge labeled n of a parent node are at edit distance exactly n from it. For example, consider a tree whose parent node is "book" with two children "cake" and "books": the edge from "book" to "books" is labeled 1, and the edge from "book" to "cake" is labeled 4. After building the tree from a dictionary, whenever you want to insert a new word, compute the edit distance between the word and the root node, and look for the edge labeled d(newword, root). Recursively compare against each child until there is no matching child left, then create a new child node and store the new word there. For example, to insert "boo" into the tree above: first check the root, d("book", "boo") = 1, so follow the edge labeled 1 to the child "books"; then compute d("books", "boo") = 2 and attach the new word under "books" on an edge labeled 2.
（3）Querying for similar words works as follows: compute the edit distance d between the word and the root node, then recursively follow every child edge labeled d-n to d+n (inclusive). If the distance between a visited node and the search word is at most n, return that node and continue the search. For example, with input cape and maximum tolerated distance 1: first compute the edit distance to the root, d("book", "cape") = 4, then follow the root's edges labeled 3 to 5, which finds the node cake; compute d("cake", "cape") = 1, which satisfies the condition, so return cake; then follow cake's edges labeled 0 to 2, finding the nodes cape and cart, which yields cape as a satisfying result.
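The insert and query procedures above can be sketched as a small BK-tree (an illustrative implementation, with a compact edit distance as the metric):

```python
def edit_distance(a, b):
    # Row-by-row DP edit distance, used as the tree's metric.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, root_word):
        self.root = (root_word, {})  # node = (word, {edge_distance: child_node})

    def insert(self, word):
        node, children = self.root
        while True:
            d = edit_distance(word, node)
            if d == 0:
                return  # word already present
            if d in children:
                node, children = children[d]  # follow the matching edge
            else:
                children[d] = (word, {})      # attach a new leaf here
                return

    def query(self, word, n):
        """Return all stored words within edit distance n of `word`."""
        results, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = edit_distance(word, node)
            if d <= n:
                results.append(node)
            # By the triangle inequality, only edges labeled in
            # [d-n, d+n] can lead to matches.
            for dist, child in children.items():
                if d - n <= dist <= d + n:
                    stack.append(child)
        return results
```

Rebuilding the example from the text: a root "book" with children "books" (edge 1) and "cake" (edge 4); querying "cape" with tolerance 1 visits only the "cake" branch.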