I have been studying Elasticsearch for a while. This article organizes the core knowledge and principles of Elasticsearch from a beginner's point of view, covering the following 9 aspects in detail. Discussion is welcome.
0. Starting with a question: how did ES come about?
(1) Think: how do we search large-scale data?
For example, when the data in a system grows to 1 billion or 10 billion records, we usually consider the following questions when designing the system architecture:
1) Which database should we use? (mysql, sybase, oracle, Dameng, Shentong, mongodb, hbase...)
2) How do we eliminate single points of failure? (lvs, F5, A10, ZooKeeper, MQ)
3) How do we keep the data safe? (hot standby, cold standby, geo-distributed active-active)
4) How do we solve the retrieval problem? (database proxy middleware: mysql-proxy, Cobar, MaxScale, etc.)
5) How do we solve the statistical analysis problem? (offline, near real time)
(2) The traditional database solution
For relational databases, we usually use the following or a similar architecture to relieve query and write bottlenecks:
Key points:
1) Data safety is handled by master-slave backup;
2) Single points of failure are handled by heartbeat monitoring in the database proxy middleware;
3) The proxy middleware distributes query statements to the slave nodes and aggregates the results.
(3) The non-relational database solution
For NoSQL databases, take mongodb as an example; the others work on similar principles:
Key points:
1) Data safety is ensured by replica backups;
2) Single points of failure are handled by a node election mechanism;
3) Shard information is first looked up in the config database, the request is then distributed to each node, and finally the routing node merges and aggregates the results.
A different approach: what about putting all the data in memory?
We know that keeping data entirely in memory is unreliable, and it is not very realistic either. Suppose our data reaches the PB level and each node has 96 GB of memory. If the data fills the memory completely, how many machines do we need? 1 PB = 1024 TB = 1,048,576 GB
Number of nodes = 1,048,576 / 96 ≈ 10,922
In practice, once data backup is taken into account, the number of nodes is usually around 25,000. The enormous cost makes this approach unrealistic.
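The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the text; the constant and function names are made up for illustration. Note that rounding up to whole machines gives 10,923, one more than the truncated figure quoted above.

```python
import math

GB_PER_PB = 1024 * 1024      # 1 PB = 1024 TB = 1,048,576 GB
MEMORY_PER_NODE_GB = 96      # per-node memory assumed in the text


def nodes_needed(data_gb: int, memory_per_node_gb: int) -> int:
    """Smallest whole number of nodes whose combined memory can hold data_gb."""
    return math.ceil(data_gb / memory_per_node_gb)


base_nodes = nodes_needed(GB_PER_PB, MEMORY_PER_NODE_GB)
print(base_nodes)
```

Backup copies multiply this base figure, which is how the estimate grows to tens of thousands of machines.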
From the discussion above we can see that neither keeping the data in memory nor keeping it out of memory fully solves the problem.
Putting everything in memory solves the speed problem, but the cost goes up.
To solve these problems at the source, we usually look at the following approaches:
1. Store data in order;
2. Separate data from the index;
3. Compress the data.
This leads to Elasticsearch.
1. ES basics at a glance
1.1 ES definition
ES is short for Elasticsearch. Elasticsearch is an open source, highly scalable, distributed full-text search engine. It stores and retrieves data in near real time, and it scales well: it can be extended to hundreds of servers and handle PB-level data.
Elasticsearch is written in Java and uses Lucene at its core to implement all indexing and search functions, but its goal is to hide Lucene's complexity behind a simple RESTful API so that full-text search becomes easy.
1.2 How are Lucene and ES related?
1) Lucene is just a library. To use it, you must develop in Java and integrate it directly into your application. Worse, Lucene is very complex, and you need a solid understanding of information retrieval to understand how it works.
2) Elasticsearch is also written in Java and builds on Lucene for all of its indexing and search functions, but it wraps Lucene in a simple RESTful API so that you do not have to deal with that complexity yourself.
1.3 What ES mainly solves:
1) Retrieving relevant data;
2) Returning statistics;
3) Doing both fast.
1.4 How ES works
When an ElasticSearch node starts, it uses multicast (or unicast, if the user has changed the configuration) to find the other nodes in the cluster and connect to them. This process is shown in the figure below:
1.5 ES core concepts
1) Cluster.
ES can run as a single standalone search server. However, to handle large data sets and to achieve fault tolerance and high availability, ES can run on many cooperating servers. Such a collection of servers is called a cluster.
2) Node.
Each server that is part of a cluster is called a node.
3) Shard.
When there are a large number of documents, a single node may not be enough because of memory limits, insufficient disk capacity, or the inability to respond to client requests quickly enough. In that case the data can be split into smaller pieces called shards, and each shard is placed on a different server.
When the index you query is spread over multiple shards, ES sends the query to every relevant shard and merges the results. The application does not need to know that shards exist; the process is transparent to the user.
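The scatter-and-merge behaviour described above can be sketched in plain Python. This is a toy model and not how ES is implemented: the shard count, the sample documents, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib
from typing import Dict, List

NUM_SHARDS = 3  # hypothetical number of primary shards


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard by hashing the document id (CRC32 is only a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index_documents(docs: Dict[str, str]) -> List[Dict[str, str]]:
    """Distribute documents over the shards according to their id."""
    shards: List[Dict[str, str]] = [{} for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id)][doc_id] = text
    return shards


def search_shards(shards: List[Dict[str, str]], keyword: str) -> List[str]:
    """Send the query to every shard, then merge the partial results."""
    hits: List[str] = []
    for shard in shards:  # scatter: each shard only searches its own data
        hits.extend(doc_id for doc_id, text in shard.items()
                    if keyword in text.lower().split())
    return sorted(hits)   # gather: merge into one result list


docs = {
    "1": "distributed search engine",
    "2": "relational database",
    "3": "full text search",
}
shards = index_documents(docs)
print(search_shards(shards, "search"))
```

The caller only sees the merged list, which is the sense in which sharding is transparent to the application.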
4) Replica.
To improve query throughput or achieve high availability, you can use shard replicas.
A replica is an exact copy of a shard, and each shard can have zero or more replicas. ES can hold several copies of the same shard; one of them is chosen to handle operations that change the index, and this special shard is called the primary shard.
When the primary shard is lost, for example because the data of that shard becomes unavailable, the cluster promotes a replica to be the new primary shard.
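A minimal sketch of the promotion step, assuming a hypothetical in-memory description of one shard group (the node names and the dictionary layout are invented here). A real cluster decides which copy to promote based on its own cluster state; this sketch simply picks the first surviving replica.

```python
from typing import Dict, List, Union

ShardGroup = Dict[str, Union[str, List[str]]]


def promote_on_failure(group: ShardGroup, failed_node: str) -> ShardGroup:
    """Return the shard group after failed_node is lost.

    If the primary was on the failed node, the first remaining replica
    becomes the new primary (a simplification for illustration).
    """
    replicas = [n for n in group["replicas"] if n != failed_node]
    if group["primary"] != failed_node:
        return {"primary": group["primary"], "replicas": replicas}
    if not replicas:
        raise RuntimeError("no replica available to promote")
    return {"primary": replicas[0], "replicas": replicas[1:]}


group = {"primary": "node-1", "replicas": ["node-2", "node-3"]}
print(promote_on_failure(group, "node-1"))
```

With zero replicas there is nothing to promote, which is why a shard without replicas offers no protection against losing its node.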
5) Full-text search.
Full-text search means indexing the text of an article so that it can be searched by keyword, similar to the LIKE statement in mysql.
A full-text index splits the content into words according to their meaning and then indexes each word separately. For example, the sentence "What is your passion for" may be segmented into tokens such as "you", "passion", "what thing", and "come", so a search for "you" or "passion" will find this sentence.
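The idea of indexing each token separately can be illustrated with a tiny inverted index. This is a sketch under simple assumptions: the whitespace tokenizer and the sample sentences are made up, and real analyzers (for example Chinese word segmentation) are far more involved.

```python
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split on whitespace, dropping basic punctuation."""
    tokens = (t.strip(".,!?").lower() for t in text.split())
    return [t for t in tokens if t]


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every token to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def lookup(index: Dict[str, Set[int]], term: str) -> List[int]:
    """Return the ids of all documents containing the term."""
    return sorted(index.get(term.lower(), set()))


docs = {1: "What is your passion?", 2: "Search your data"}
index = build_inverted_index(docs)
print(lookup(index, "passion"))
print(lookup(index, "your"))
```

Unlike a LIKE scan, a lookup here touches only the entry for the search term rather than every document.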
1.6 Main concepts of the ES data architecture (compared with the relational database Mysql)
(1) A database (DataBase) in a relational database corresponds to an index (Index) in ES.
(2) A database contains N tables (Table), which corresponds to an index (Index) containing N types (Type).
(3) The data in a database table (Table) consists of multiple rows (Row) and multiple columns (Column, attribute), which corresponds to a Type consisting of multiple documents (Document) and multiple fields (Field).
(4) In a relational database, the schema defines the tables, the fields of each table, and the relationships between tables and fields. Correspondingly, in ES a Mapping defines how the fields of the Types under an index are handled: how the index is built, the index type, whether the original JSON document is stored, whether the original JSON document is compressed, whether word segmentation is needed, and how word segmentation is performed.
(5) The database operations insert, delete, update, and search correspond in ES to PUT/POST (add), DELETE (delete), _update (change), and GET (search).
1.7 What is ELK?
elasticsearch: distributed storage and full-text search in the back end.
logstash: log processing, the "porter" that moves data.
kibana: data visualization.
The ELK architecture builds a powerful management chain for distributed data storage, visual querying, and log parsing. The three components cooperate and complement one another to complete distributed big data processing together.
2. ES features and advantages
1) Distributed real-time document storage in which every field can be indexed and made searchable.
2) A distributed search engine with real-time analytics.
Distributed: an index is split into multiple shards, and each shard can have zero or more replicas. Each data node in the cluster can host one or more shards and coordinates and handles the various operations;
in most cases, rerouting and load balancing are done automatically.
3) It can scale to hundreds of servers and handle PB-level structured or unstructured data. It can also run on a single PC (tested).
4) It supports a plugin mechanism: word segmentation plugins, synchronization plugins, Hadoop plugins, visualization plugins, and so on.
3.1 Performance results
(1) Hardware configuration:
CPU: 16 cores, AuthenticAMD
Memory: 32 GB total
Hard disk: 500 GB total, not SSD
(2) Test results on the hardware above:
1) Average indexing throughput: 12,307 docs/s (document size: 40 B per document)
2) Average CPU usage: 887.7% (16 cores, average per core: 55.48%)
3) Size of the built index: 3.30111 GB
4) Total bytes written: 20.2123 GB
5) Total test time: 28m 54s.
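The per-core figure can be checked from the numbers above, and the same numbers give a rough upper estimate of how many documents were indexed. The document count is my own estimate and not a reported result: it assumes the average throughput held for the entire test duration, which may not be the case if the run included non-indexing phases.

```python
CPU_TOTAL_PERCENT = 887.7       # reported average CPU usage across all cores
CORES = 16
DOCS_PER_SECOND = 12307         # reported average indexing throughput
TEST_SECONDS = 28 * 60 + 54     # reported total test time: 28m 54s

per_core = CPU_TOTAL_PERCENT / CORES
# Assumption: throughput was constant for the whole test run.
estimated_docs = DOCS_PER_SECOND * TEST_SECONDS

print(f"{per_core:.2f}% per core")
print(f"about {estimated_docs:,} documents indexed")
```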
3.2 Performance testing tool esrally (recommended)
Usage reference: http://blog.csdn.net/laoyang360/article/details/52155481
4. Why use ES?
4.1 Notable ES use cases in China and abroad
1) In early 2013, GitHub abandoned Solr and adopted ElasticSearch for PB-level search. "GitHub uses ElasticSearch to search 20 TB of data, including 1.3 billion files and 130 billion lines of code."
2) Wikipedia: launched a core search architecture based on elasticsearch.
3) SoundCloud: "SoundCloud uses ElasticSearch to provide instant and accurate music search for 180 million users."
4) Baidu: Baidu currently uses ElasticSearch widely for text data analysis. It collects various metric data and user-defined data from all Baidu servers and performs multi-dimensional analysis and display of that data to help locate and analyze instance-level or business-level anomalies. It currently covers more than 20 internal business lines (including casio, cloud analysis, Net alliance, forecast, library, Baidu Zhida, wallet, risk control, etc.). The largest single cluster has 100 machines and 200 ES nodes, and imports 30 TB+ of data per day.
4.2 We need it too
In real project development, almost every system has a search function. Once search grows to a certain scale, it becomes harder and harder to maintain and extend, so many companies split search into a separate module implemented with ElasticSearch or similar tools.
In recent years ElasticSearch has developed rapidly and has gone beyond its original role as a pure search engine; data aggregation analysis (aggregation) and visualization have been added. If you have millions of documents that need to be located by keyword, ElasticSearch is surely the best choice. And if your documents are JSON, you can also treat ElasticSearch as a kind of "NoSQL database" and use its data aggregation analysis (aggregation) features for multi-dimensional analysis of your data.
[Zhihu: Pan Fei, architect at Reku] ES replaces a traditional DB in some scenarios
Personally, I think Elasticsearch works well as internal storage; its efficiency is basically sufficient, and replacing a traditional DB in some respects is fine too, provided your business has no special requirements for transactional operations and does not need fine-grained permission management, because ES is not mature in these areas.
Because our use of ES is limited to aggregating data over a certain period of time, without large numbers of single-document requests (such as looking up a user's document by userid, similar to NoSQL use cases), whether it can replace NoSQL is something you need to test yourself.
If I had the choice, I would try using ES to replace traditional NoSQL, because its horizontal scaling mechanism is so convenient.
5. What are the application scenarios of ES?
We usually face two situations:
1) A newly developed system tries to use ES as its storage and retrieval server;
2) An existing system is being upgraded to support full-text search and needs ES for that.
The use of these two architectures is described in detail in the links below.
ES use cases at top-tier companies:
1) Sina: how ES analyzes and processes 3.2 billion real-time logs http://dockone.io/article/505
2) Alibaba: building its own log collection and analysis system with ES http://afoo.me/columns/tec/logging-platform-spec.html
3) Youzan: business log processing with ES http://tech.youzan.com/you-zan-tong-ri-zhi-ping-tai-chu-tan/
4) Implementing site search with ES http://www.wtoutiao.com/p/13bkqiZ.html
6. How to deploy ES?
6.1 ES deployment (no installation required)
1) Zero configuration, works out of the box
2) No cumbersome installation or configuration
3) Java version requirement: 1.7 at minimum
I use 1.8
[root@laoyang config_lhy]# echo $JAVA_HOME
4) Download address:
bin/elasticsearch -d (run in the background)
6.2 Essential ES plugins
Detailed installation and usage of the essential Head, kibana, IK (Chinese word segmentation), and graph plugins.
6.3 One-click installation of ES on Windows
A self-written bat script implements one-click installation on Windows.
1) Installs ES and the essential plugins (head, kibana, IK, logstash, etc.) in one step.
2) Runs ES as a service after installation.
3) Saves at least 2 hours compared with installing everything by hand.
Script description:
7. ES external interfaces (developer focus)
1) JAVA API interface
2) RESTful API interface
Implementation of the common add, delete, update, and search operations:
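As a rough illustration of the RESTful interface, the sketch below only builds the HTTP requests for the operations listed in section 1.6; it does not send them. Everything specific here is an assumption: the index "blog", the type "article", the document body, and a node listening on the default HTTP port 9200. The paths follow the index/type/id layout used in this article; newer ES releases no longer use types, so check the documentation for your version.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:9200"  # assumption: local node on the default HTTP port


def build_request(method: str, path: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an HTTP request for the ES REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


# Hypothetical index "blog", type "article", document id 1.
create = build_request("PUT", "/blog/article/1", {"title": "Hello ES"})
read = build_request("GET", "/blog/article/1")
update = build_request("POST", "/blog/article/1/_update",
                       {"doc": {"title": "Hello again"}})
search = build_request("POST", "/blog/article/_search",
                       {"query": {"match": {"title": "hello"}}})
delete = build_request("DELETE", "/blog/article/1")

for req in (create, read, update, search, delete):
    print(req.get_method(), req.full_url)
```

To actually run one of these against a running node, pass the request to urllib.request.urlopen and read the JSON response.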
8. What to do when you run into problems with ES?
References:
《Elasticsearch Server》
《Elasticsearch, Logstash, Kibana in practice》
《Elasticsearch In Action》
《Slides from an ES expert》
9. Anything else?
《Persisting with Elasticsearch: a methodology》: 10 key techniques for ordinary programmers to improve efficiently (free full version)
2016-08-18 21:10, written at home
Author: Mingyi world
When reprinting, please credit the source. Original address: