
ON and WHERE in Hive joins

This article is by the Blog Garden (cnblogs) author An inch HUI. Personal blog: https://www.cnblogs.com/zsql/

Many people who come from MySQL know its logical execution order (from -> on -> join -> where -> group by -> having -> select -> distinct -> order by -> limit) and may therefore carry some misconceptions about ON and WHERE in Hive. There are blogs online that cover this, but I still wanted to test it myself. This article examines the difference between putting conditions in ON versus WHERE for inner join and left/right join, and the effect of adding partition filter conditions. Understanding these points really helps when writing HQL; if you want your queries more concise and better performing, it is time to practice.

Version: this article was tested on CDH 6.3.2, with hive 2.1.1+cdh6.3.2.

1. Test tables and data

1.1 Create the tables

create table `user` (  -- user table, partitioned
department_id int,
age int,
sex string,
name string
)
PARTITIONED BY (`date` string)
row format delimited
fields terminated by ','
STORED AS TEXTFILE;

create table department (  -- department table
id int,
name string,
count int
)
row format delimited
fields terminated by ','
STORED AS TEXTFILE;

1.2 Data files

-- /data/hive/user1.txt
1,34,male,zhangsan
1,31,female,lili
3,14,female,liushen
3,24,female,sasa
4,54,male,liubei
4,36,female,yuji
4,25,male,zhaoyun
8,18,male,zhangfei
-- /data/hive/user2.txt
3,37,male,wangwu
4,38,female,lisi
3,19,female,caocao
2,22,female,guanyu
1,51,male,wzj
6,31,female,zhenji
6,25,male,sunwukong
6,17,male,tangsz
-- /data/hive/department.txt
1,dashuju,8
2,kaifa,9
3,ui,10
4,hr,3
5,shouxiao,12
6,zongjian,3

1.3 Load the data

load data local inpath '/data/hive/user1.txt' into table `user` partition (`date`='2020-12-24');
load data local inpath '/data/hive/user2.txt' into table `user` partition (`date`='2020-12-25');
load data local inpath '/data/hive/department.txt' into table `department`;

1.4 Query the data

SELECT  * from `user`;

 SELECT  * from department ;

 

1.5 Analyze the tables

ANALYZE TABLE `user` partition(`date`='2020-12-24') COMPUTE STATISTICS;
ANALYZE TABLE `user` partition(`date`='2020-12-25') COMPUTE STATISTICS;
ANALYZE TABLE department COMPUTE STATISTICS;

If you skip this step, the experiments below run into problems: the statistics are wrong and the comparisons meaningless. So run the analysis before using explain; the numbers are then much more accurate. (I just stepped in this pit: every explain reported only 1 row because the statistics were stale.)

2. ON and WHERE under inner join

2.1 Filtering without a partition

1. First, look at the result of an unconditional inner join:

SELECT  * from `user` u  inner join  department d  on d.id=u.department_id;

Review the execution plan:

explain SELECT  * from `user` u  inner join  department d  on d.id=u.department_id;
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            filterExpr: id is not null (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: department_id is not null (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: department_id is not null (type: boolean)
              Statistics: Num rows: 16 Data size: 2944 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The statement above does no conditional filtering and uses no partition:

  • First, the d table (department) gets a full table scan, reading 6 rows; Hive then applies a default `id is not null` filter to those 6 rows (an automatic optimization: if you want a join to keep NULL keys, they need special treatment in advance, otherwise this default optimization silently drops them and data is lost). After filtering, all 6 rows still take part in the inner join.
  • Then the u table (user) gets a full table scan, reading 16 rows, with the analogous implicit filter `department_id is not null`; in the end all 16 rows take part in the inner join.
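Since an inner join can never match a NULL key anyway, the "special treatment" mentioned above usually means choosing a join type that keeps those rows. A minimal sketch on this article's tables (not from the original post):

```sql
-- Sketch: keep user rows whose department_id is NULL.
-- Under inner join (plus Hive's implicit "is not null" pushdown) they are dropped;
-- a left join preserves them, with the department columns set to NULL.
SELECT u.*, d.*
FROM `user` u
LEFT JOIN department d
  ON d.id = u.department_id;
```

If the NULL-key rows must instead join against some real row, the key has to be mapped to a valid value (for example with coalesce) before the join.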

 

2. Now look at one with both WHERE conditions and ON conditions:

SELECT * from `user` u
inner join department d
on d.id = u.department_id
and d.count > 9
and u.age > 20
where u.age < 30;

Next, look at the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            filterExpr: ((count > 9) and id is not null) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((count > 9) and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
              Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

From the plan above:

  • First the d (department) table is fully scanned (6 rows) and filtered with ((count > 9) and id is not null), leaving 2 rows for the inner join.
  • Then the u (user) table is fully scanned (16 rows) and filtered with ((age > 20) and department_id is not null and (age < 30)). The plan claims 1 row remains, which is wrong: in fact 4 rows remain. The statistics in a Hive execution plan describe the general flow and are not exact. Those rows then take part in the inner join.

A short summary: for an inner join without partition filtering, both tables are fully scanned, Hive automatically adds an is not null filter on the join keys (on d.id=u.department_id), and then the ON and WHERE conditions are applied together. Under inner join, WHERE and ON filter at the same stage.
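To make the equivalence concrete, the following two statements (a sketch on this article's tables) return the same rows under inner join, because every predicate is pushed below the join either way:

```sql
-- Variant 1: extra predicates in ON
SELECT * FROM `user` u
INNER JOIN department d
  ON d.id = u.department_id AND d.count > 9 AND u.age > 20;

-- Variant 2: the same predicates in WHERE
SELECT * FROM `user` u
INNER JOIN department d
  ON d.id = u.department_id
WHERE d.count > 9 AND u.age > 20;
```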

2.2 Filtering with a partition

1. First, a case with only a partition filter:

SELECT * from `user` u
inner join department d
on d.id = u.department_id
where u.`date` = '2020-12-25';

Review the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            filterExpr: id is not null (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: (department_id is not null and (date = '2020-12-25')) (type: boolean)
            Statistics: Num rows: 8 Data size: 134 Basic stats: COMPLETE Column stats: NONE  # only 8 rows in this partition
            Filter Operator
              predicate: department_id is not null (type: boolean)
              Statistics: Num rows: 8 Data size: 134 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-25' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Compare this with the non-partitioned case above: with the partition filter, table u (user) is no longer fully scanned, which improves efficiency. A partition is stored as a folder, so once the partition filter is applied Hive scans only the specified partition rather than the whole table. In other words, for a partitioned table Hive first prunes to the partition, scans everything in it, then applies the ON and WHERE conditions plus the automatically added is not null filters, and finally inner joins the filtered data.

2. Now the result with both ordinary condition filters and a partition filter:

SELECT * from `user` u
inner join department d
on d.id = u.department_id
and d.count > 9
and u.age > 20
where u.age < 30 and u.`date` = '2020-12-24';

Look at the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            filterExpr: ((count > 9) and id is not null) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((count > 9) and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
            Statistics: Num rows: 8 Data size: 136 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
              Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-24' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Comparing this with the earlier case (no partition filter, conditions in ON and WHERE), the only difference the partition filter makes is the scope of the table scan: without a partition filter the whole table is scanned, otherwise only the specified partition is scanned.

2.3 Summary of ON and WHERE under inner join

Under inner join, if WHERE contains a partition filter, only the specified partition is scanned. The data is then filtered by the WHERE and ON conditions together, plus the automatically added is not null filter on the join keys (on d.id=u.department_id). (Note again: if rows with a NULL join key must be kept, the join key needs special handling beforehand, otherwise those rows are filtered out and data is lost.) Here ON and WHERE filter at the same stage; there is no difference between them.

3. ON and WHERE under left/right join

Because left join and right join are symmetric, this article experiments only with left join.

3.1 Non-primary table conditions in ON and WHERE

First, the result of a statement where the filter condition on the non-primary table is added in ON (convention: u (primary table) left join d (non-primary table)):

SELECT * from `user` u
left join department d
on d.id=u.department_id
and d.count > 9

View the result:

Then look at the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            filterExpr: (count > 9) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (count > 9) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

From the plan: under left join (convention: u (primary table) left join d (non-primary table)), the ON condition d.count > 9 on the non-primary table takes effect before the join. All 6 rows of d are scanned, the filter leaves 2, and only those 2 take part in the left join, while the primary table is fully scanned into the join. Note that under left join, no is not null filter is added on the join keys (on d.id=u.department_id), so you need to check yourself whether the join keys can be NULL; if they can, there is room for optimization.

 

Now the result of a statement where the filter condition on the non-primary table is added in WHERE (same convention: u (primary table) left join d (non-primary table)):

SELECT * from `user` u
left join department d
on d.id = u.department_id
where d.count > 9;

The result is different from the previous one (condition on the non-primary table added in ON):

Look at the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Filter Operator
                predicate: (_col10 > 9) (type: boolean)
                Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

From this plan, both u (user) and d (department) are fully scanned without any filtering and left joined first; only after the left join is the WHERE condition on the non-primary table, d.count > 9, applied to the result.

A short summary (left join): filtering the non-primary table in ON and in WHERE give different results. A condition added in ON filters that table's data first and then left joins; a condition added in WHERE left joins first and then filters the join result. Also, under left join, no is not null filter is pre-added on the join keys.
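The difference can be put side by side (a sketch on this article's tables). A WHERE condition on the non-primary table effectively turns the left join into an inner join, because the NULL-extended rows can never satisfy d.count > 9:

```sql
-- Condition in ON: every user row survives; users whose department
-- has count <= 9 (or no matching department at all) get NULLs in the d columns.
SELECT * FROM `user` u
LEFT JOIN department d
  ON d.id = u.department_id AND d.count > 9;

-- Condition in WHERE: filtering happens after the join, so rows where
-- d.count is NULL or <= 9 are dropped from the result.
SELECT * FROM `user` u
LEFT JOIN department d
  ON d.id = u.department_id
WHERE d.count > 9;
```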

3.2 Primary table conditions in ON and WHERE

First, the result of a statement where the filter condition on the primary table is added in ON (convention: u (primary table) left join d (non-primary table)):

SELECT * from `user` u
left join department d
on d.id = u.department_id
and u.age > 20;

Looking at the result, all 16 rows are still kept (everything in the user table), but the rows with age <= 20 appear not to participate in the left join: all their columns from d are NULL.

Look at the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              filter predicates:
                0 {(age > 20)}
                1 
              keys:
                0 department_id (type: int)
                1 id (type: int)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              filter predicates:
                0 {(age > 20)}
                1 
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

From the plan: when processing the d (department) table, all 6 rows are scanned and d is tagged with the filter predicate age > 20; then the u (user) table is fully scanned and left joined against it. During the join, the columns coming from the d table are gated by the primary-table condition age > 20: if a u row has age <= 20, every field from d is set to NULL. (This is my personal understanding and may not be exactly right. In short: the predicate is recorded first, the left join proceeds, and rows failing the condition keep the primary table's data while the non-primary table's data is discarded, i.e. set to NULL.) So a primary-table condition in ON does not affect the number of rows from the primary table; the left join still happens first, with this per-row filtering applied during it. It is not easy to grasp from prose; run it and look at the result yourself.

 

Now the result of a statement where the filter condition on the primary table is added in WHERE (same convention):

SELECT * from `user` u
left join department d
on d.id = u.department_id
where u.age > 20;

Next, look at the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3
STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: (age > 20) (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: (age > 20) (type: boolean)
              Statistics: Num rows: 5 Data size: 920 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

From the plan it is clear that with WHERE filtering on the primary table, the primary table's data is filtered first and then left joined: the scan reads 16 rows, the filter leaves 5, and those 5 are left joined to produce the final result.

A short summary (left join): filtering the primary table in ON and in WHERE also give different results. With WHERE, the primary table is filtered first and then left joined. With ON, the non-primary table is prepared first, the full primary table is left joined, and in any row that fails the condition the columns from the non-primary table are set to NULL.
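Side by side again (a sketch on this article's tables): a primary-table condition in ON never changes how many primary-table rows come back, while the same condition in WHERE does:

```sql
-- Condition in ON: all 16 user rows are returned; for rows with
-- age <= 20 the d columns are NULL even when the join keys match.
SELECT * FROM `user` u
LEFT JOIN department d
  ON d.id = u.department_id AND u.age > 20;

-- Condition in WHERE: user is filtered down to age > 20 first,
-- and only those rows are left joined and returned.
SELECT * FROM `user` u
LEFT JOIN department d
  ON d.id = u.department_id
WHERE u.age > 20;
```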

3.3、left/right join with a partition filter

Consider the following statement:

SELECT *
FROM `user` u
LEFT JOIN department d
  ON d.id = u.department_id
WHERE u.age > 20 AND u.`date` = '2020-12-24';

The result:

Now look at the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and (date = '2020-12-24')) (type: boolean)
            Statistics: Num rows: 8 Data size: 136 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (age > 20) (type: boolean)
              Statistics: Num rows: 2 Data size: 34 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-24' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

From the execution plan above, filtering on the partition column prevents a full-table scan: only one partition is read, so the input is just 8 rows.

3.4、Summary of on and where under left/right join

  1. Filtering on the partition column prevents a full-table scan; the partition filter is applied first.
  2. Filtering the main table with ON and with WHERE gives different results. WHERE filters the main table first and then LEFT JOINs. ON keeps every main-table row and sets the non-primary table's columns to NULL in rows that fail the filter condition.
  3. Filtering the non-primary table with ON and with WHERE also differs. A condition in ON filters the non-primary table first and then LEFT JOINs. A condition in WHERE runs the LEFT JOIN first and then filters the joined result.
  4. left/right join does not automatically add an IS NOT NULL filter on the join key, so watch out for NULL join keys under left/right join; this is a spot where optimization is possible.
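Point 3 is easy to verify with the same kind of SQLite sketch (again not Hive, but the outer-join semantics agree): a filter on the non-primary table behaves very differently in ON versus WHERE.

```python
import sqlite3

# SQLite sketch: filtering the NON-primary (right) table of a LEFT JOIN.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user(department_id INT, name TEXT);
CREATE TABLE department(id INT, name TEXT, cnt INT);
INSERT INTO user VALUES (1, 'zhangsan'), (3, 'liushen'), (8, 'zhangfei');
INSERT INTO department VALUES (1, 'dashuju', 8), (3, 'ui', 10);
""")

# Filter on the right table in ON: the right side is filtered first, then the
# LEFT JOIN still keeps every user row (NULLs where nothing matched).
on_rows = conn.execute("""
    SELECT u.name, d.name FROM user u
    LEFT JOIN department d ON d.id = u.department_id AND d.cnt > 9
""").fetchall()

# Same filter in WHERE: applied AFTER the join, so rows whose department
# columns are NULL (or fail cnt > 9) are discarded -- effectively an inner join.
where_rows = conn.execute("""
    SELECT u.name, d.name FROM user u
    LEFT JOIN department d ON d.id = u.department_id
    WHERE d.cnt > 9
""").fetchall()

print(on_rows)     # three rows, two with NULL department
print(where_rows)  # one row
```

Putting the non-primary-table condition in ON therefore preserves the LEFT JOIN shape without needing a subquery, while WHERE silently turns it into an inner join.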

Four、on and where under full join

4.1、full join with no filter condition

Start with a full join without any conditions:

SELECT *
FROM `user` u
FULL JOIN department d
  ON d.id = u.department_id;

The execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

As the plan shows, each table is sorted ascending on its join key, no automatic optimization (no added NULL filter) is applied, and the full join runs over both whole tables.

4.2、full join with a where condition

SELECT *
FROM `user` u
FULL JOIN department d
  ON d.id = u.department_id
WHERE u.age > 20 AND d.count > 9;

The result is as follows:

The execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: ((_col10 > 9) and (_col1 > 20)) (type: boolean)
            Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The plan shows that with a WHERE filter under a full join, both tables are scanned in full, the full join runs first, and only afterwards is the WHERE condition applied to the join result.

4.3、full join with on conditions (open question)

SELECT *
FROM `user` u
FULL JOIN department d
  ON d.id = u.department_id
  AND u.age > 20
  AND d.count > 9;

The result above may be a bit unexpected (my own ability is limited; readers who know better are welcome to explain). My understanding is that, just like an ON condition on the main table of a LEFT JOIN, the filtering happens during the full join itself, but all rows from both tables are preserved: a row pair joins with no NULLs only when both conditions hold. (If the behavior under full join is confusing, try expressing the conditions with WHERE and compare the results.)

The execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          filter predicates:
            0 {(VALUE._col0 > 20)}
            1 {(VALUE._col1 > 9)}
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

This remains an open question; if you understand the behavior, please leave a comment below.
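One way to make sense of the plan's "filter predicates" entries is a pure-Python simulation. This is only a sketch of the behavior the plan suggests (not Hive's actual implementation): during the full join, each side's ON predicate decides whether a key match counts as a match, but unmatched rows from both tables are still emitted with NULLs.

```python
# Pure-Python sketch of a FULL JOIN where the ON clause carries per-side
# filter predicates, mirroring the "filter predicates" in the plan above.
def full_join_with_on_filters(left, right, lkey, rkey, lpred, rpred):
    matched_right = set()
    out = []
    for l in left:
        hit = False
        for j, r in enumerate(right):
            # A pair joins only if the keys match AND both predicates hold.
            if l[lkey] == r[rkey] and lpred(l) and rpred(r):
                out.append((l, r))
                matched_right.add(j)
                hit = True
        if not hit:
            out.append((l, None))   # left row preserved, right side NULL
    for j, r in enumerate(right):
        if j not in matched_right:
            out.append((None, r))   # right row preserved, left side NULL
    return out

users = [{"department_id": 1, "age": 34, "name": "zhangsan"},
         {"department_id": 3, "age": 14, "name": "liushen"}]
depts = [{"id": 1, "name": "dashuju", "count": 8},
         {"id": 3, "name": "ui", "count": 10}]

rows = full_join_with_on_filters(
    users, depts, "department_id", "id",
    lambda u: u["age"] > 20,   # u.age > 20
    lambda d: d["count"] > 9)  # d.count > 9
for l, r in rows:
    print(l and l["name"], r and r["name"])
```

With this data no pair satisfies both predicates, so all four rows survive, each with NULLs on one side, which is consistent with the observation that full join preserves every row and only rows passing both conditions join without NULLs.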

4.4、full join with a partition filter

SELECT *
FROM department d
FULL JOIN `user` u
  ON d.id = u.department_id
WHERE u.`date` = '2020-12-24';

 

The execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 id (type: int)
            1 department_id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (_col10 = '2020-12-24') (type: boolean)
            Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: int), _col1 (type: string), _col2 (type: int), _col6 (type: int), _col7 (type: int), _col8 (type: string), _col9 (type: string), '2020-12-24' (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

According to the execution plan, in a full join even a partition filter is applied after the join: the full join runs first and the WHERE condition filters its result. It is therefore recommended to filter in a subquery first and then full join.
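A hedged sketch of that rewrite for the query in 4.4 (verify the resulting plan with EXPLAIN on your own cluster): pushing the partition filter into a subquery lets Hive scan only the 2020-12-24 partition of `user` before the full join, instead of joining both full tables and filtering afterwards.

```sql
-- Sketch: filter the partitioned table in a subquery, then full join,
-- so only one partition of `user` feeds the join.
SELECT *
FROM department d
FULL JOIN (
  SELECT * FROM `user` WHERE `date` = '2020-12-24'
) u
ON d.id = u.department_id;
```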

4.5、Summary of on and where in full join

  1. The behavior of ON conditions here still leaves some open questions.
  2. A WHERE filter, whether on the partition column or not, is applied only after the full join, so prefer filtering in a subquery first and then full joining.
  3. full join does not automatically add an IS NOT NULL condition on the join key.

Five、Summary

1、inner join

  • inner join can filter on the partition column first, preventing a full-table scan.
  • inner join automatically adds an IS NOT NULL filter on the join keys (on d.id=u.department_id).
  • Under inner join, conditions in ON and in WHERE filter identically; there is no ordering difference between them.
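The last point can be checked with a quick SQLite sketch (assumed to match Hive's inner-join semantics on this point): the same extra condition in ON and in WHERE yields identical results.

```python
import sqlite3

# SQLite sketch: for an INNER JOIN, an extra condition in ON vs in WHERE
# produces the same rows -- there is no ordering difference.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user(department_id INT, age INT, name TEXT);
CREATE TABLE department(id INT, name TEXT);
INSERT INTO user VALUES (1, 34, 'zhangsan'), (3, 14, 'liushen'), (8, 18, 'zhangfei');
INSERT INTO department VALUES (1, 'dashuju'), (3, 'ui');
""")

on_rows = conn.execute("""
    SELECT u.name, d.name FROM user u
    JOIN department d ON d.id = u.department_id AND u.age > 20
""").fetchall()

where_rows = conn.execute("""
    SELECT u.name, d.name FROM user u
    JOIN department d ON d.id = u.department_id
    WHERE u.age > 20
""").fetchall()

print(on_rows == where_rows)  # True: both forms return the same rows
```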

2、left/right join

  • left/right join can filter on the partition column, preventing a full-table scan; the partition filter is applied first.
  • left/right join: filtering the main table with ON and with WHERE gives different results. WHERE filters the main table first and then LEFT JOINs. ON keeps every main-table row and sets the non-primary table's columns to NULL in rows that fail the filter condition.
  • left/right join: filtering the non-primary table with ON and with WHERE also differs. A condition in ON filters the non-primary table first and then LEFT JOINs; a condition in WHERE runs the LEFT JOIN first and then filters the joined result. So when filtering the non-primary table, putting the condition in ON avoids writing a subquery.
  • left/right join does not automatically add an IS NOT NULL filter on the join key, so watch out for NULL join keys under left/right join; this is a spot where optimization is possible.

3、full join

  • The filtering behavior of ON conditions under full join is still in doubt and remains to be explored.
  • Under full join, WHERE filtering, whether on the partition column or not, always runs the full join first and then filters, so it is recommended to filter in a subquery before the full join.
  • full join does not automatically add an IS NOT NULL filter on the join key.

Copyright notice
This article was created by [itread01]. Please include the original link when reposting. Thanks.
https://cdmana.com/2020/12/20201224215001780V.html
