
Metadata statistics in Hive-based data warehouse construction: Spark

In data warehouse construction, metadata management is one of the most important links. According to Kimball's data warehouse theory, metadata can be divided into three categories:

  1. Technical metadata, such as the storage structure of a table and the paths of its files
  2. Business metadata, such as data lineage and business ownership
  3. Process metadata, such as the number of rows written per day, the HDFS space occupied, and the update time

A metadata system built on these three categories of metadata usually implements the following core functions:

1. Data lineage

Such as table-level / column-level lineage, which is mainly extracted from our daily SQL and ETL jobs.

2. Big data cluster computing resource management

For different computing engines such as Spark/Flink/MapReduce, we can collect the resource usage of the related tasks from YARN (or another resource manager), such as CPU, memory, and disk I/O usage, and then chart that usage. Through a visual interface we can spot abnormal behavior in some tasks, find the tables or services that consume the most resources, and notify the people in charge in time so they can analyze and optimize them.

3. Data synchronization and permission management

4. Metadata statistics for Hive databases and tables

Here, statistics for Hive databases and tables mainly refer to: number of rows, number of files, HDFS storage occupied, last operation time, and so on.

By continuously collecting these indicators and turning them into visual charts, data warehouse staff can discover data patterns or data quality problems from the charts, and people who develop business on top of the warehouse can analyze business volume trends from them. On top of that, we can also build data quality verification, value distribution exploration, and other functions.
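For example, if the collected indicators are landed into a daily statistics table, a simple data quality check can flag tables whose row count drops sharply from one day to the next. A minimal sketch, assuming a hypothetical landing table dw_meta.table_daily_stats(table_name, stat_date, num_rows, total_size) that is not defined in this article:

-- Hypothetical sketch: flag tables whose row count fell by more than half day over day.
-- dw_meta.table_daily_stats is an assumed landing table for the collected indicators.
select t.table_name,
       t.stat_date,
       t.num_rows as today_rows,
       y.num_rows as yesterday_rows
from dw_meta.table_daily_stats t
join dw_meta.table_daily_stats y
  on t.table_name = y.table_name
 and y.stat_date = date_sub(t.stat_date, 1)
where t.num_rows < y.num_rows * 0.5;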

This article mainly introduces how to use Spark to collect these indicators for Hive databases, partitioned tables, and non-partitioned tables.

In actual production, besides using the approaches below to keep the indicator information recorded in the relevant Hive metastore tables up to date and to retrieve it in time, we can also study how the following SQL statements are executed inside Hive/Spark and implement a set of business logic of our own.

Note: Spark does not collect the number of files by default.

Syntax support

1) Partitioned tables

Spark's metadata statistics for Hive partitioned tables differ slightly from Hive's native statistics support for partitioned tables.

Spark supports collecting metadata statistics for a specific partition, and it also supports table-level metadata statistics for the whole table (without processing the individual partitions).

-- Collect the total number of rows of tab_partition and the total HDFS space it occupies.
-- Hive currently does not support analyzing a partitioned table directly this way.
-- Note: executing this SQL does not collect statistics for the individual partitions of the table.
analyze table tab_partition COMPUTE STATISTICS;

-- Same as Hive
analyze table tab_partition partition(partCol) COMPUTE STATISTICS;

-- Same as Hive
analyze table tab_partition partition(partCol='20200722000000') COMPUTE STATISTICS;
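Besides querying the metastore directly (as done later in this article), the collected statistics can also be inspected from the SQL layer. A minimal sketch, assuming the partition used above actually exists:

-- Sketch: show the metadata of a single partition; once statistics have been collected,
-- they typically appear among the partition parameters / statistics fields of the output.
DESC FORMATTED tab_partition partition(partCol='20200722000000');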

2) Non-partitioned tables

analyze table tab_no_partition COMPUTE STATISTICS;
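Both engines also accept a couple of variants of this statement. A hedged sketch (the column names id and name are assumptions for illustration):

-- Sketch: collect only the statistics that do not require scanning the data (e.g. size).
analyze table tab_no_partition COMPUTE STATISTICS NOSCAN;

-- Sketch: additionally collect column-level statistics for the listed columns.
analyze table tab_no_partition COMPUTE STATISTICS FOR COLUMNS id, name;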

Let's look at a specific example:

1) Create a Hive table with Spark

Take the partitioned table testdb.test_analyze_spark as an example. When the table has just been created, the related records in the TABLE_PARAMS table of the Hive metastore are:

+------+------------------------------------+--------------------+
|TBL_ID|                           PARAM_KEY|         PARAM_VALUE|
+------+------------------------------------+--------------------+
|  3018|                            EXTERNAL|                TRUE|
|  3018|            spark.sql.create.version|               2.4.3|
|  3018|spark.sql.sources.schema.numPartCols|                   1|
|  3018|   spark.sql.sources.schema.numParts|                   1|
|  3018|     spark.sql.sources.schema.part.0|{"type":"struct",...|
|  3018|  spark.sql.sources.schema.partCol.0|                  dt|
|  3018|               transient_lastDdlTime|          1595409374|
+------+------------------------------------+--------------------+
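The article does not show the DDL that produced this table; a minimal sketch of what it could look like is below. The column list, partition column name, and storage format are assumptions based on the statements used later in the article, not taken from the original.

-- Hypothetical DDL sketch for a Spark datasource table with one partition column.
-- The snapshot above shows EXTERNAL=TRUE; an external table would additionally specify a LOCATION.
create table testdb.test_analyze_spark (
  id string,
  name string,
  partCol string
)
using parquet
partitioned by (partCol);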

2) Write data into table testdb.test_analyze and collect metadata statistics

insert overwrite table testdb.test_analyze partition(partCol=20200721000000) select id,name from testdb.test_partition1 where partCol=20190626000000;

After the above SQL is executed, Hive launches an internal task that collects partition-level metadata statistics for the Hive table, but numRows is not included. The result is as follows:

+------+------------------+------+-------+----------------------+-------+--------------------+-----------+
|  NAME|          TBL_NAME|TBL_ID|PART_ID|             PART_NAME|PART_ID|           PARAM_KEY|PARAM_VALUE|
+------+------------------+------+-------+----------------------+-------+--------------------+-----------+
|testdb|test_analyze_spark|  3018|  52977|partCol=20200721000000|  52977|            numFiles|          1|
|testdb|test_analyze_spark|  3018|  52977|partCol=20200721000000|  52977|           totalSize|        389|
|testdb|test_analyze_spark|  3018|  52977|partCol=20200721000000|  52977|transient_lastDdl...| 1595409909|
+------+------------------+------+-------+----------------------+-------+--------------------+-----------+

3) Connect to the Hive metastore database and query the metadata statistics of testdb.test_analyze_spark

connect jdbc where
 url="jdbc:mysql://localhost:3306/hive?useUnicode=true&characterEncoding=UTF-8"
 and driver="com.mysql.jdbc.Driver"
 and user="root"
 and password="root"
 as db_1;

-- load jdbc.`db_1.TABLE_PARAMS` as TABLE_PARAMS ;
load jdbc.`db_1.TBLS` as tbls;
load jdbc.`db_1.DBS` as dbs;
load jdbc.`db_1.PARTITIONS` as partitions;
load jdbc.`db_1.PARTITION_PARAMS` as partition_params;

select d.NAME,t.TBL_NAME,t.TBL_ID,p.PART_ID,p.PART_NAME,a.*   
from tbls t   
left join dbs d  
on t.DB_ID = d.DB_ID  
left join partitions p  
on t.TBL_ID = p.TBL_ID   
left join partition_params a  
on p.PART_ID=a.PART_ID
where t.TBL_NAME='test_analyze_spark' and d.NAME='testdb' ;
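The connect/load statements above are not plain SQL; the same join can also be run directly against the metastore's MySQL database, assuming a standard Hive metastore schema:

-- Sketch: equivalent query executed directly in MySQL against the metastore database (hive).
select d.NAME, t.TBL_NAME, t.TBL_ID, p.PART_ID, p.PART_NAME, a.PARAM_KEY, a.PARAM_VALUE
from TBLS t
left join DBS d on t.DB_ID = d.DB_ID
left join PARTITIONS p on t.TBL_ID = p.TBL_ID
left join PARTITION_PARAMS a on p.PART_ID = a.PART_ID
where t.TBL_NAME = 'test_analyze_spark' and d.NAME = 'testdb';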

4) Result

-- When Spark executes analyze table mlsql_test.test_analyze_spark partition(dt='20200721000000') COMPUTE STATISTICS; , the number of rows in the partition is also collected:
+------+------------------+------+-------+----------------------+-------+-------------------------------+-----------+
|  NAME|          TBL_NAME|TBL_ID|PART_ID|             PART_NAME|PART_ID|                      PARAM_KEY|PARAM_VALUE|
+------+------------------+------+-------+----------------------+-------+-------------------------------+-----------+
|testdb|test_analyze_spark|  3018|  52977|partCol=20200721000000|  52977|                       numFiles|          1|
|testdb|test_analyze_spark|  3018|  52977|partCol=20200721000000|  52977|   spark.sql.statistics.numRows|          1|
|testdb|test_analyze_spark|  3018|  52977|partCol=20200721000000|  52977| spark.sql.statistics.totalSize|        389|
|testdb|test_analyze_spark|  3018|  52977|partCol=20200721000000|  52977|                      totalSize|        389|
|testdb|test_analyze_spark|  3018|  52977|partCol=20200721000000|  52977|          transient_lastDdlTime| 1595410238|
+------+------------------+------+-------+----------------------+-------+-------------------------------+-----------+

5) Use Spark to collect table-level metadata statistics for the whole Hive partitioned table

-- 1. Execute: analyze table testdb.test_analyze_spark COMPUTE STATISTICS;
-- 2. The TABLE_PARAMS table in the Hive metastore then contains the following information for testdb.test_analyze_spark:

connect jdbc where
 url="jdbc:mysql://localhost:3306/hive?useUnicode=true&characterEncoding=UTF-8"
 and driver="com.mysql.jdbc.Driver"
 and user="root"
 and password="root"
 as db_1;

-- Get the DB_ID of testdb (49)
load jdbc.`db_1.DBS` as dbs;
select DB_ID from dbs where NAME='testdb' as db;

-- Get the TBL_ID of test_analyze_spark (3018)
load jdbc.`db_1.TBLS` as tbls;
select TBL_ID from tbls where DB_ID=49 and TBL_NAME='test_analyze_spark' as t2;

-- Get the table-level statistics of testdb.test_analyze_spark
load jdbc.`db_1.TABLE_PARAMS` as TABLE_PARAMS ;
select * from TABLE_PARAMS where TBL_ID=3018 ;

-- Result
+------+------------------------------------+--------------------+
|TBL_ID|                           PARAM_KEY|         PARAM_VALUE|
+------+------------------------------------+--------------------+
|  3018|                            EXTERNAL|                TRUE|
|  3018|            spark.sql.create.version|               2.4.3|
|  3018|spark.sql.sources.schema.numPartCols|                   1|
|  3018|   spark.sql.sources.schema.numParts|                   1|
|  3018|     spark.sql.sources.schema.part.0|{"type":"struct",...|
|  3018|  spark.sql.sources.schema.partCol.0|             partCol|
|  3018|        spark.sql.statistics.numRows|                   1|
|  3018|      spark.sql.statistics.totalSize|                 389|
|  3018|               transient_lastDdlTime|          1595410958|
+------+------------------------------------+--------------------+
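Instead of reading TABLE_PARAMS directly, the same table-level properties can also be listed from the SQL layer; a minimal sketch:

-- Sketch: the spark.sql.statistics.* keys shown above should appear in this output
-- once the ANALYZE statement has been executed.
SHOW TBLPROPERTIES testdb.test_analyze_spark;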

Main differences between Hive and Spark in metadata statistics for Hive databases and tables

  1. The supported SQL syntax for collecting Hive table metadata statistics differs. For example, Spark supports table-level statistics on a Hive partitioned table, while Hive requires a specific partition to be specified.
  2. The way the statistics are recorded in the Hive metastore differs. For example, for the same row count, Hive uses numRows while Spark uses spark.sql.statistics.numRows.
  3. Spark does not collect the number of files by default, while Hive does.

The differences between Hive and Spark in these statistics are not limited to the three listed above. For more details, see the related articles on the WeChat official account "Big data learning and sharing".
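Because the two engines write the same metric under different parameter keys, a query against the metastore that has to work for both can simply look at both names. A minimal sketch against the TABLE_PARAMS table used above, run in MySQL:

-- Sketch: read a row count regardless of whether Hive (numRows) or
-- Spark (spark.sql.statistics.numRows) wrote the statistic.
select TBL_ID,
       max(case when PARAM_KEY in ('numRows', 'spark.sql.statistics.numRows')
                then cast(PARAM_VALUE as unsigned) end) as num_rows
from TABLE_PARAMS
group by TBL_ID;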

 

