编程知识 cdmana.com

MySQL's join function is weak?

Hello everyone , I'm Li Xiaobing , Today we will learn and make complaints about it MySQL Of Join function .

About MySQL Of join, You must have known a lot about it “ Anecdotes and anecdotes ”, Like two watches join We need small tables to drive big ones , Alibaba developer specifications prohibit more than three tables join operation ,MySQL Of join It's weak, it's exploding, and so on . These norms or statements are true or false , The time is right and the time is wrong , It's up to you to do it yourself join You can understand clearly only when you have a deep understanding .

below , Let's have a comprehensive understanding of MySQL Of join operation .

Text

In daily database queries , We often need to join multiple tables to obtain the merged data of multiple tables at one time , This is about to use the database join grammar .join It is very common in the data field to merge two data sets , If you know more , Will find MySQL,Oracle,PostgreSQL and Spark All support this operation . The protagonist of this article is MySQL, If not specified below , That is to say MySQL Of join As a subject . and Oracle ,PostgreSQL and Spark It can be regarded as the big hanging boss, the join The optimization and implementation of the algorithm are better than MySQL.

MySQL Of join There are many rules , Maybe a little careless , Maybe a bad one join Statement will not only result in a full table query for a table , also May affect the cache of the database , As a result, most of the hot data has been replaced , Drag down the entire database performance .

therefore , The industry is targeting MySQL Of join Summed up a lot of norms or principles , For example, small tables drive large tables and prohibit more than three tables join operation . Next we will introduce MySQL join The algorithm of , and Oracle and Spark Of join To achieve contrast , And interspersed with the answer why the above-mentioned norms or principles are formed .

about join Implementation of operations , There are about Nested Loop Join ( Loop Nested connections ),Hash Join( Hash connection ) and Sort Merge Join( Sort merge join ) Three common algorithms , They have their own advantages and disadvantages and applicable conditions , Next, we will introduce .

MySQL Medium Nested Loop Join Realization

Nested Loop Join It's a scan driver table , Every time you read a record , According to join The index on the associated field of is driven to query the corresponding data in the driven table . It is suitable for scenarios with smaller subsets of connected data , So is it MySQL join The only algorithm to implement , We'll talk about it in detail next .

MySQL There are two Nested Loop Join Variations of the algorithm , Namely Index Nested-Loop Join and Block Nested-Loop Join.

Index Nested-Loop Join Algorithm

below , Let's initialize the relevant table structure and data

CREATE TABLE `t1` (
  `id` int(11) NOT NULL,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `a` (`a`)
) ENGINE=InnoDB;

delimiter ;;
#  Define stored procedures to initialize t1
create procedure init_data()
begin
  declare i int;
  set i=1;
  while(i<=10000)do
    insert into t1 values(i, i, i);
    set i=i+1;
  end while;
end;;
delimiter ;
#  Call the store to initialize t1
call init_data();
#  Create and initialize t2
create table t2 like t1;
insert into t2 (select * from t1 where id<=500)

According to the above order , Both tables have a primary key index id And an index a, Field b No index on . stored procedure init_data To the table t1 There's something in it 10000 Row data , In the table t2 It's inserted in 500 Row data .

for fear of MySQL The optimizer chooses the table as the driver table itself , The impact analysis SQL Statement execution , We use it directly straight_join To make the MySQL Use a fixed join table order to query , In the following sentence ,t1 It's the drive meter ,t2 It's driven watch .

select * from t2 straight_join t1 on (t2.a=t1.a);

Before using us article To introduce the explain Command to see the execution plan of the statement .

You can see from the above picture that ,t1 On the table a Fields are indexed by ,join The index is used in the procedure , So it's time to SQL The execution flow of the statement is as follows :

  • from t2 Read a row of data from the table L1;
  • Use L1 Of a Field , Go to t1 Table as a condition to query ;
  • Take out t1 The line that satisfies the condition in , Follow L1 Form the corresponding lines , Become part of the result set ;
  • repeat , Until the scan is done t2 surface .

We call this process Index Nested-Loop Join, abbreviation NLJ, Its corresponding flow chart is shown below .

It should be noted that , In the second step , according to a Field to table t1 When querying in , Index is used , So each scan only scans one line ( from explain It turns out that , It varies according to different case scenarios ).

Suppose the number of rows driving the table is N, The number of rows in the driven table is M. Because here join Statement execution , The driving table is to scan the whole table , The driven table uses the index , And each row in the table is driven by the query , So the whole thing join The approximate complexity of the process is N2log2M. obviously ,N It has a greater impact on the number of scan lines , Therefore, in this case, the small table should be used as the driving table .

Of course , The premise of all this is join The associated field of is a, also t1 Tabular a There is an index on the field .

If there is no index , When using the execution process shown in the figure above , Every time I come to t1 To match , It's going to be a full table scan . This also leads to the time complexity of the whole process programming N * M, This is unacceptable . therefore , When there is no index ,MySQL Use Block Nested-Loop Join Algorithm .

Block Nested-Loop Join

Block Nested-Loop Join The algorithm of , abbreviation BNL, It is MySQL Used when no index is available on the driven table join Algorithm , The specific process is as follows :

  • Keep watch t2 Of the current thread join_buffer in , Examples in this article SQL Not in the t2 Do any conditional filtering on , So that means t2 The whole table Put it in memory ;
  • Scan table t1, Each row of data is taken out , Just follow join_buffer Compare the data in , Satisfy join Conditions of the , Put it in the result set .

For example, the following one SQL

select * from t2 straight_join t1 on (t2.b=t1.b);

In this sentence explain The results are shown below . It can be seen that

It can be seen that , This time, join Process right t1 and t2 We did a full scan , And will table t2 Medium 500 All the data are put into memory join_buffer in , And for tables t1 Every row of data in , All going join_buffer Go through it again , Do it all 500 Second comparison , So it's a total of 500 * 10000 Secondary memory comparison operation , The specific process is shown in the figure below .

The main thing is , In the first step , It's not going to watch t2 Put all the data in join_buffer, It's based on specific SQL sentence , And put in different rows of data and different fields . For example, the following one join Statement will only change the table t2 In line with b >= 100 Data. b Fields are stored in join_buffer.

select t2.b,t1.b from t2 straight_join t1 on (t2.b=t1.b) where t2.b >= 100;

join_buffer It's not infinite , from join_buffer_size control , The default value is 256K. When the data to be stored is too large , It's just segmented storage , The whole execution process becomes :

  • Scan table t2, Save the eligible data lines into join_buffer, Because it's limited in size , Deposit in 100 The line is full of , Then go to step two ;
  • Scan table t1, Each row of data is taken out , Just follow join_buffer Compare the data in , Satisfy join Conditions of the , Put it in the result set ;
  • Empty join_buffer;
  • Take the first step again , Until all the data is scanned , because t2 Table has 500 Row data , So it's repeated 5 Time

This process reflects the name of the algorithm Block The origin of , Do it in chunks join operation . Because of the watch t2 The data is divided into 5 Deposit in join_buffer, Cause table t1 To be scanned by the full table 5 Time .

All deposited in branch 5 Deposit in
Memory operations 10000 * 500 10000 * (100 + 100 + 100 + 100 + 100)
Number of scanning lines 10000 + 500 10000 * 5 + 500

As shown above , And table data can be stored in join_buffer comparison , The number of memory judgments does not change , It's the product of the number of rows in two tables , That is to say 10000 * 500, But the driven table is scanned many times , Every time you deposit , The driven table needs to be scanned once , It affects the final efficiency of execution .

Based on the above two algorithms , We can draw the following conclusion , It's also most online about MySQL join The specification of the statement .

  • There is an index on the driven table , That is to say, you can use Index Nested-Loop Join When the algorithm , have access to join operation .

  • Whether it's Index Nested-Loop Join Algorithm or Block Nested-Loop Join We should use the small table as the driving table .

Because of the above two join The time complexity of the algorithm At least It also has a first-order relationship with the number of rows involved in the table , And it costs a lot of memory space , Therefore, the Alibaba developer specification says that it is strictly forbidden to use more than three tables join The operation is understandable .

But these two algorithms are just join One of the algorithms of , also More efficient join Algorithm , such as Hash Join and Sorted Merged join. Unfortunately, these two algorithms MySQL Currently, none of the mainstream versions of , and Oracle ,PostgreSQL and Spark They all support , Make complaints about online Tucao MySQL The reason for the weak explosion (MySQL 8.0 Version supports Hash join, however 8.0 It's not the mainstream version yet ).

In fact, Alibaba developer specifications are also from Oracle Migrate to MySQL when , because MySQL Of join Operation performance is too poor and decided to prohibit more than three tables join Operating regulations .

Hash Join Algorithm

Hash Join It's a scan driver table , utilize join Creates a hash table in memory , Then scan the driven table , Each read line of data , And find the corresponding data from the hash table . It is a common way to connect large data sets , Small amount of data is suitable for driving table , Scenes that can be put into memory , For it Large tables without indexes And parallel query scenarios can provide the best performance . Unfortunately, it is only applicable to the scenario of equivalent connection , such as on a.id = where b.a_id.

Again, the above two tables join The sentence of , The implementation process is as follows

  • Will drive the table t2 Get the data that meets the conditions in , For each line join The field value is carried out hash operation , It is then stored in a hash table in memory ;
  • Traversing the driven table t1, Each row of eligible data is taken out , It's also about it join The field value is carried out hash operation , Take the results to a hash table in memory to find a match , If you find , The result set becomes part of .

It can be seen that , The algorithm and Block Nested-Loop Join There are similarities , It's just going to be disorderly Join Buffer It's a hash table hash table, So that data matching doesn't need to change join buffer Go through all the data in , But directly through hash, To approach O(1) The time complexity of getting the matching line , These two tables are greatly improved join Speed .

But due to the hash Characteristics of , The algorithm can only be applied to the scene of equivalent connection , Other scenarios cannot be connected with this algorithm .

Sorted Merge Join Algorithm

Sort Merge Join It is based on join The associated fields of the two tables sort the two tables ( If it's already sorted , For example, if there is an index on a field, you don't need to sort again ), Then merge the two tables once . If the two tables have been sorted , There is no need to sort when performing sort merge connections , At this time Merge Join The performance will be better than Hash Join.Merge Join It can be applied to non equivalence Join(>,<,>=,<=, But it doesn't include !=, That is to say <>).

It should be noted that , If the linked field already has an index , That is to say, if it has been ordered , You can merge directly , But if the linked field has no index , The execution process is shown in the figure below .

  • Traversal table t2, Read the data that meets the conditions , According to the connection field a Sort by ;
  • Traversal table t1, Read the data that meets the conditions , Also follow the join field a Sort by ;
  • Merge two sorted data , Get the result set .

Sorted Merge Join The main time consumption of the algorithm lies in sorting the two tables , So if the two tables have already been sorted by join field , The algorithm is even better than Hash Join The algorithm is faster . In one case , The algorithm is better than Nested Loop Join The algorithm should be fast .

below , Let's summarize the differences, advantages and disadvantages of the above three algorithms .

Nested Loop Join Hash Join Sorted Merge Join
Connection condition Apply to any condition Only for equivalent connections (=) Equivalent or non equivalent connection (>,<,=,>=,<=),‘<>’ With the exception of
The main consumption of resources CPU、 disk I/O Memory 、 Temporary space Memory 、 Temporary space
characteristic When there is a highly selective index or restricted search, it is more efficient , Can quickly return to the first search results When there is no index or index condition is fuzzy ,Hash Join Than Nested Loop It works . Often than Merge Join fast . In the data warehouse environment , If there are many records in the table , Efficient When there is no index or index condition is fuzzy ,Sort Merge Join Than Nested Loop It works . When fields are indexed ahead of time , Than hash join fast , And support more connection conditions
shortcoming No index or table records are inefficient Building a hash table requires a lot of memory , The first time the results return slowly All tables need to be sorted . It's designed for optimal throughput , And do not return data until all results are found
Need index yes ( It's inefficient without indexing ) no no

about Join Understanding of operation

Finished. Join Related algorithms , Let's also talk about join Operational business understanding .

When the business is not complex , majority join It's not irreplaceable . For example, in order records, only the order user's user_id, You need to get the user's name when returning information , There are several possible implementations :

  1. A database operation , Use join operation , Order table and user table join, Return with user name ;
  2. Two database operations , Two queries , Get order information for the first time and user_id, The second is based on user_id Take your name , Using code programs to merge information ;
  3. Use redundant user names or from ES Reading relational databases in China and Africa .

The above solutions can solve the problem of data aggregation , And based on the program code to deal with , Than the database join Easier to debug and optimize , For example, the user name is not taken from the database , Instead, look for it in the cache first .

Of course , join The operation is not worthless , So every technology has its own usage scenarios , The above solutions or rules are summed up by the Internet development team , For high concurrency 、 Write lightly and read more 、 Distributed 、 Simple business logic , These scenarios generally do not require high data consistency , Even dirty reading is allowed .

however , In financial bank or financial and other enterprise application scenarios ,join Operation is indispensable , These applications are generally low concurrency 、 Frequent complex data writes 、CPU Dense, not IO Concentrated , The main business logic is processed through the database and even contains a large number of stored procedures 、 Systems with high requirements for consistency and integrity .

Personal blog , Welcome to play

版权声明
本文为[Programmer Li Xiaobing]所创,转载请带上原文链接,感谢

Scroll to Top