Sort the word , My first feeling is that almost all App There are places to sort , Taobao products are sorted according to purchase time 、B The comments of the station are sorted according to the heat ..., Of course, what we are talking about today is not how to sort gracefully under big data , How to improve sorting performance , Let's say MySQL The sorting .
about MySQL, When it comes to sorting , What did you first think of ？ keyword order by？order by It's best to have an index on the field ？ Leaf nodes are already sequential ？ Or try not to MySQL Internal sorting ？
The cause of the matter
Now suppose there is a list of users' friends ：
CREATE TABLE `user` ( `id` int(10) AUTO_INCREMENT, `user_id` int(10), `friend_addr` varchar(1000), `friend_name` varchar(100), PRIMARY KEY (`id`), KEY `user_id` (`user_id`) ) ENGINE=InnoDB;
At present, there are two points in the table that need attention ：
User user_id , Friend's name friend_name、 Friend's address friend_addr
user_id Yes, there is Indexes Of
one day , There's a junior development engineer, ape , Received a demand from Xiao Wang, the primary product manager ：
Xiao Wang ： Comrade ape , Now you need to add a function in the background , This function should be supported according to the user id Can find the names and addresses of all his friends , And ask friends' names to be sorted according to the dictionary .
Ape ： well , This function is simple , I'll be online soon .
So the little ape wrote this sql：
select friend_name,friend_addr from user where user_id=? order by name
In the moment of lightning , The little ape toe went on the line in high spirits , It's all going well , Until one day, an operation classmate led to such a query ：
select friend_name,friend_addr from user where user_id=10086 order by name
However , This query is much slower than usual , The database reported a slow query , The little ape panicked at this time b： What's going on ？user_id There is an index , And wisely, I only used select friend_name,friend_addr, It's no use select * ah . The little ape kept comforting himself , Be calm, be calm , Then it suddenly occurred to me that there was explain command , use explain Let's check that sql Let's go ahead with our plan , When the little ape used explain after , Find out extra There is a dangerous word in the field ：using filesort.
“ This query actually uses the legendary file sorting , But if a person doesn't have many friends , Even if you use file sorting , It should be soon ”, Unless this user_id=10086 I have many friends , Later, the little ape went to check , This user's friend has 10w Multiple ～.
Lost in thought, the little ape thought ： This pot seems to be fixed ,10w The data is a little big , And this one. using filesort What sort principle is it ？
Anatomical file sorting
One might say that the above problem is 10w The data is too big , It's slow even if you don't sort , This actually makes sense ,10w The data can be found at one time , Whether it's MySQL Occupation of memory buffer , Or the consumption of network bandwidth is very large , Then if I add limit 1000 Well ？ The problem of network bandwidth must have been solved , Because the packet is smaller as a whole , however using filesort The problem is still unsolved , Seeing this, you may have questions ,using filesort Is it sorted in the file ？ How is it sorted in the file ？ Or I ask ： What would you do if you were to design a ranking ？ With these questions and thoughts, let's take a look using filesort What technical difficulties will be involved and how to solve them ？
First of all, our user_id It's indexed , So I'll be there first user_id Retrieve our target data from the index tree , namely user_id=10086 The data of , But what we're looking for is friend_name and friend_addr Field , Unfortunately , The light on user_id The index cannot find these two field values
So you need to go back to the table , adopt user_id Find the corresponding primary key in the primary key index tree ,ok, We found the first user_id=10086 Of friend_name and friend_addr Field
What to do at this time ？ It must be wrong to go straight back , Because I need to know friend_name Sort , How to arrange ？ I haven't found all the data yet , Then you have to put the data in one place first , This place is sort_buffer, When you see the name, I think you should guess , you 're right ,sort_buffer This is the buffer used for sorting in this case , It should be noted here that each thread will have a separate sort_buffer, The main purpose of this is to avoid lock contention caused by multiple threads operating on the same block of memory .
When the first piece of data friend_name and friend_addr Has been put in sort_buffer in , Of course it's not over , The synchronization steps will be repeated all the time , Until all user_id=10086 Of friend_name and friend_addr Put them in sort_buffer It only ends in
sort_buffer The data in has been put in , It's time to sort , here MySQL Would be right friend_name Make a quick schedule , After a quick row ,sort_buffer in friend_name It's orderly
Finally back to sort_buffer In front of 1000 strip , end .
Everything looks smooth , however sort_buffer It takes up memory space , This is awkward , Memory itself is not infinite , It must have an upper limit , Of course sort_buffer It can't be too small , Too small , It doesn't make much sense . stay InnoDB In the storage engine , This value defaults to 256K.
mysql> show variables like 'sort_buffer_size'; +------------------+--------+ | Variable_name | Value | +------------------+--------+ | sort_buffer_size | 262144 | +------------------+--------+
in other words , If you want to put sort_buffer The data in is greater than 256K Words , Then use in sort_buffer The way of medium and fast platoon will certainly not work , Now , You may ask ：MySQL Can't it expand automatically according to the data size ？ forehead ,MySQL It's a multithreading model , If each thread expands , Then give it to other functions buffer It's small （ such as change buffer etc. ）, It will affect the quality of other functions .
Then you have to sort in another way , you 're right , This is the real file sorting , That is, the temporary file on the disk ,MySQL Will adopt the idea of merging and sorting , Divide the data to be sorted into several parts , After each piece of data is sorted in memory, it will be put into a temporary file , Finally, merge and sort the data of these sorted temporary files again ok 了 , The typical divide and rule principle , Its specific steps are as follows ：
First divide the data to be sorted , Each piece of data can be divided into sort_buffer in
For each piece of data in sort_buffer In order , After sorting , Write to a temporary file
When all the data is written to the temporary file , At this time, for each temporary file , The interior is orderly , But they are not a whole , The whole is not orderly , So the next step is to merge the data
Let's assume that there is tmpX and tmpY Two temporary files , It's going to start from tmpX Read some data into memory , And then from tmpY Read part of the data into memory , Here you may wonder why it is a part rather than a whole or a single ？ Because first, the disk is slow , So try to read more data into memory every time , But you can't read too much , Because there's more buffer The limits of space .
about tmpX Suppose what comes in is tmpX[0-5] , about tmpY Suppose what comes in is tmpY[0-5], So just compare ： If tmpX < tmpY, that tmpX It must be the smallest , then tmpX and tmpY Compare , If tmpX > tmpY, that tmpY It must be the second smallest ..., In this way, pairwise comparison can finally put tmpX and tmpY Merge into an ordered file tmpZ, Many of them tmpZ Merge again ..., Finally, all the data can be merged into an orderly large file .
File sorting is slow , Is there any other way
Through the sorting process above, we know , If the data to be sorted is large , exceed sort_buffer Size , Then you need to sort the files , File sorting involves batch sorting and merging , Very time consuming , The root cause of this problem is sort_buffer Not enough use , I don't know if you found out there's no us friend_name Need to sort , But they put friend_addr Also stuffed sort_buffer in , such The size of a single row of data is equal to friend_name The length of + friend_addr The length of , Can you make sort_buffer The only known friend_name Field , In this case , The overall utilization space is large , You don't have to use temporary files . you 're right , This is another sort optimization to be discussed next rowid Sort .
rowid The idea of sorting is not to put unnecessary data into sort_buffer in , Give Way sort_buffer Only the necessary data is retained in , So what do you think is the necessary data ？ Only put friend_name？ It's not going to work , After sorting ,friend_addr What do I do ？ Therefore, the primary key id Put it in , After this line , adopt id Go back to the table again , Get friend_addr that will do , Therefore, its general process is as follows ：
according to user_id Indexes , Find the target data , Then go back to the table , Only id and friend_name In the sort_buffer in
repeat 1 step , Until all the target data is sort_buffer in
Yes sort_buffer According to the data in friend_name Field to sort
Sort according to id Go back to the table again and find friend_addr return , Until we return to 1000 Data , end .
In fact, there are several points to pay attention to ：
In this way, you need to go back to the table twice
sort_buffer Although small , But if the amount of data itself is still large , It should still be sorted by temporary files
So here comes the question , Two ways ,MySQL How to choose ？ We have to judge which way to go according to certain conditions , This condition is to enter sort_buffer The length of a single line , If the length is too large （friend_name + friend_addr The length of ）, Will adopt rowid This way, , Otherwise, the first , The standard of length is based on max_length_for_sort_data To the , The default value is 1024 byte ：
mysql> show variables like 'max_length_for_sort_data'; +--------------------------+-------+ | Variable_name | Value | +--------------------------+-------+ | max_length_for_sort_data | 1024 | +--------------------------+-------+
I don't want to go back to my watch , Don't want to sort again
In fact, no matter which of the above methods , They all need Back to the table + Sort , The reason for returning to the table is that there is no target field on the secondary index , Sorting is because the data is not ordered , If the secondary index has a target field and is already sorted , That's the best of both worlds .
you 're right , It's a joint index , We just need to build a （user_id,friend_name,friend_addr） Joint index of , So I can get the target data through this index , also friend_name It's already sorted , As well as friend_addr Field , One move , There is no need to return the form , No need to sort again . Therefore, for the above sql, Its general process is as follows ：
Find... Through the federated index user_id=10086 The data of , Then read the corresponding friend_name and friend_addr Field directly returns , because friend_name It's already sorted , No need for extra treatment
Repeat the first step , Follow the leaf node and look back , Until you find the first one that is not 10086 The data of , end .
Although joint index can solve this problem , However, in practical application, we must not blindly establish , It is necessary to judge whether it is necessary to establish... According to the actual business logic , If you don't often have similar queries , You don't have to create , Because the federated index will take up more storage space and maintenance overhead .
about order by When the index is not used , At this time explain in Extra The field will probably appear using filesort The word
appear using filesort You don't have to be too flustered , If the amount of data is small , For example, there are dozens of data , So in sort buffer It is also fast to use fast platoon in
If the amount of data is large , More than the sort buffer Size , So it is necessary to sort temporary files , That is, merge sort , This part is made up of MySQL The optimizer decides
If there are many fields to query , Want to try to avoid using temporary file sorting , You can try setting max_length_for_sort_data The size of the field , Make it less than the sum of the length of all query fields , This may avoid , But there will be one more operation to return to the table
In real business , We can also create a joint index for the combination of fields that we often query , In this way, there is no need to go back to the table or sort separately , But federated indexing takes up more storage and overhead
When querying a large amount of data , Try to divide into batches , advance explain To observe sql Your execution plan is a good choice .
It's not easy to create , Everyone 「 Three even 」 Is the greatest support for the author , It is also the author's greatest creative power , See you next time .
Search on wechat 【 Pretend to know programming 】, Continue to share technical dry goods
Past highlights ：
本文为[Pretend to know programming]所创，转载请带上原文链接，感谢