Joining two data sets is a very common scenario. In Spark's physical planning phase, the JoinSelection class chooses the final join strategy based on join hints, the size of the tables, whether the join is an equi-join, whether the join keys are sortable, and other conditions. Spark then uses the chosen strategy to perform the final computation. Currently Spark has five join strategies:
Broadcast hash join (BHJ)
Shuffle hash join（SHJ）
Shuffle sort merge join (SMJ)
Shuffle-and-replicate nested loop join, also called Cartesian product join
Broadcast nested loop join (BNLJ)
Of these, BHJ and SMJ are the two join strategies we encounter most often when running Spark jobs.
JoinSelection chooses one of Broadcast hash join, Shuffle hash join, or Shuffle sort merge join when the join keys form an equi-join; if the join is a non-equi-join, or no join condition is specified, it chooses Broadcast nested loop join or Shuffle-and-replicate nested loop join. The execution efficiency of these strategies differs greatly, so it is well worth understanding the implementation and the applicable conditions of each one.
1. Broadcast Hash Join
Broadcast Hash Join is implemented by broadcasting the data of the small table to every Executor; this broadcasting process is no different from broadcasting a variable ourselves:
Use the collect operator to pull the small table's data from the Executors to the Driver
On the Driver, call sparkContext.broadcast to broadcast it to every Executor
On each Executor, use the broadcast data to perform the join with the large table (which is effectively a map operation)
This join strategy avoids a Shuffle. Generally speaking, Broadcast Hash Join executes faster than the other join strategies.
Using this join strategy requires the following conditions to be met:
The small table must be very small; the size threshold is configured via the spark.sql.autoBroadcastJoinThreshold parameter and defaults to 10MB. If memory is plentiful, the threshold can be raised appropriately
Setting spark.sql.autoBroadcastJoinThreshold to -1 disables this join strategy entirely
Only equi-joins are supported, but the join keys are not required to be sortable
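The broadcast-then-probe idea described above can be sketched in plain Python (this is an illustration of the algorithm, not Spark's actual implementation; the function name and tables are hypothetical):

```python
def broadcast_hash_join(small, large, key_small, key_large):
    """Inner join: build a hash map of the small side, probe with the large side."""
    # "Broadcast": the small table is materialized once as a hash map,
    # which every worker could then probe locally -- no shuffle of the
    # large side is needed.
    hash_map = {}
    for row in small:
        hash_map.setdefault(row[key_small], []).append(row)
    # Map-side probe: each large-table row looks up its matches.
    result = []
    for row in large:
        for match in hash_map.get(row[key_large], []):
            result.append({**match, **row})
    return result

users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]        # small table
orders = [{"uid": 1, "amt": 10}, {"uid": 1, "amt": 20},
          {"uid": 3, "amt": 5}]                                  # large table
joined = broadcast_hash_join(users, orders, "id", "uid")
# Inner-join semantics: only orders whose uid matches some user id survive.
```

Note that only a hash lookup is performed on the keys, which is why the keys need not be sortable for this strategy.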
2. Shuffle Hash Join
When the smaller table is still too large to broadcast, Shuffle Hash Join can be considered.
Shuffle Hash Join is also a strategy chosen when joining a large table with a small table. Its idea is: partition both tables with the same partitioning algorithm and the same number of partitions (partitioning on the join keys), which guarantees that records with the same hash value are distributed to the same partition; then, within the same Executor, the pair of partitions with the same hash value can be joined locally with a hash join. Before the join, a hash map is also built for each partition of the small table.
Shuffle Hash Join applies the idea of partitioning to break one large problem into many small ones.
To enable Shuffle Hash Join, the following conditions must be met:
Only equi-joins are supported, but the join keys are not required to be sortable
The spark.sql.join.preferSortMergeJoin parameter must be set to false (this parameter was introduced in Spark 2.0.0 and defaults to true, i.e. Sort Merge Join is preferred by default)
The size of the small table (plan.stats.sizeInBytes) must be less than spark.sql.autoBroadcastJoinThreshold * spark.sql.shuffle.partitions (the latter defaults to 200)
The small table must be at least three times smaller than the large table, i.e. a.stats.sizeInBytes * 3 <= b.stats.sizeInBytes
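The partition-then-join idea can be sketched in plain Python (again an illustration of the algorithm only; the partitioner and function name are hypothetical simplifications of what Spark does across Executors):

```python
def shuffle_hash_join(left, right, key_left, key_right, num_partitions=4):
    """Inner join: co-partition both sides, then hash-join each partition pair."""
    # "Shuffle": both tables use the SAME partitioner and partition count,
    # so rows with equal join keys always land in the same partition index.
    def partition(rows, key):
        parts = [[] for _ in range(num_partitions)]
        for row in rows:
            parts[hash(row[key]) % num_partitions].append(row)
        return parts

    left_parts = partition(left, key_left)
    right_parts = partition(right, key_right)

    result = []
    # Each pair of same-index partitions is joined locally with a hash join;
    # in Spark the smaller side of each pair becomes the build side.
    for lp, rp in zip(left_parts, right_parts):
        hash_map = {}
        for row in lp:
            hash_map.setdefault(row[key_left], []).append(row)
        for row in rp:
            for match in hash_map.get(row[key_right], []):
                result.append({**match, **row})
    return result

users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
orders = [{"uid": 2, "amt": 7}, {"uid": 3, "amt": 9}, {"uid": 4, "amt": 1}]
joined = shuffle_hash_join(users, orders, "id", "uid")
```

Because equal keys are guaranteed to share a partition, each partition pair can be joined independently, which is the "divide and conquer" point made above.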
3. Shuffle Sort Merge Join
The previous two join strategies both place conditions on table size; if all the tables participating in the join are large, Shuffle Sort Merge Join has to be considered.
The idea behind Shuffle Sort Merge Join is: shuffle both tables by the join key, so that records with the same join key value end up in the corresponding partitions; sort the data within each partition; then merge the records of corresponding partitions. No matter how large a partition is, Sort Merge Join does not have to load all the data of one side into memory; it can discard records as soon as they have been used, because both sequences are sorted: traverse from the head, output when the keys match, and otherwise advance whichever side has the smaller key. This greatly improves the stability of SQL joins over large data volumes.
To enable Shuffle Sort Merge Join, the following condition must be met:
Only equi-joins are supported, and the join keys are required to be sortable
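The merge phase described above can be sketched in plain Python (an illustration only; Spark performs this per partition after the shuffle, while here we sort whole lists for simplicity):

```python
def sort_merge_join(left, right, key_left, key_right):
    """Inner join of two row lists by merging them in sorted key order."""
    # Sort both sides by the join key -- this is why the keys must be sortable.
    left = sorted(left, key=lambda r: r[key_left])
    right = sorted(right, key=lambda r: r[key_right])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key_left], right[j][key_right]
        if lk < rk:
            i += 1        # advance the side with the smaller key
        elif lk > rk:
            j += 1
        else:
            # Keys match: find the full group of equal keys on the right,
            # then emit the cross product of the two equal-key groups.
            j_end = j
            while j_end < len(right) and right[j_end][key_right] == lk:
                j_end += 1
            while i < len(left) and left[i][key_left] == lk:
                for r in right[j:j_end]:
                    result.append({**left[i], **r})
                i += 1
            j = j_end     # rows before j_end can now be discarded
    return result

left = [{"k": 2, "l": "y"}, {"k": 1, "l": "x"}, {"k": 2, "l": "z"}]
right = [{"k": 3, "r": "q"}, {"k": 2, "r": "p"}]
joined = sort_merge_join(left, right, "k", "k")
```

Each pointer only ever moves forward, so neither side needs to be held in memory in full -- exactly the "use and discard" property that makes this strategy stable on large inputs.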
4. Cartesian product join
If the two tables participating in a Spark join have no join condition specified, a Cartesian product join is produced. The number of rows in the result is the product of the row counts of the two tables.
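The row-count claim is easy to see with a small sketch in plain Python (illustrative data):

```python
from itertools import product

# A Cartesian product join pairs every row of one table with every row of
# the other; with no join condition, the result has len(a) * len(b) rows.
a = [{"x": i} for i in range(3)]
b = [{"y": j} for j in range(4)]
cartesian = [{**ra, **rb} for ra, rb in product(a, b)]  # 3 * 4 = 12 rows
```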
5. Broadcast nested loop join
The execution of Broadcast nested loop join can be viewed as the following computation:
for record_1 in relation_1:
    for record_2 in relation_2:
        # evaluate the join condition
As you can see, Broadcast nested loop join may scan a table repeatedly in some cases, which is very inefficient. As the name suggests, this join broadcasts the small table when the relevant conditions are met, in order to reduce the number of table scans.
Broadcast nested loop join supports both equi-joins and non-equi-joins, and supports all join types.
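The pseudocode above can be made concrete in plain Python, showing why an arbitrary predicate (and therefore a non-equi-join) is supported (the function name, range table, and predicate are hypothetical illustrations):

```python
def broadcast_nested_loop_join(small, large, condition):
    """Nested loop join with an arbitrary predicate."""
    # The small side is "broadcast" (kept fully in memory) while the large
    # side is scanned; every pair of rows is tested against the predicate,
    # so any condition works -- not just key equality.
    return [{**s, **l} for l in large for s in small if condition(s, l)]

# Non-equi-join example: tag each point with the range that contains it.
ranges = [{"lo": 0, "hi": 10, "tag": "low"}, {"lo": 10, "hi": 20, "tag": "mid"}]
points = [{"v": 3}, {"v": 12}, {"v": 25}]
tagged = broadcast_nested_loop_join(
    ranges, points, lambda s, l: s["lo"] <= l["v"] < s["hi"])
```

Every large-side row is compared against every small-side row, which is exactly the repeated scanning that makes this the strategy of last resort.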
This article was written by [Hoult_ Wu Xie]; when reprinting, please include a link to the original. Thanks.