In data warehouse construction, metadata management is one of the most important components. According to Kimball's data warehouse theory, metadata can be divided into three categories:

  1. Technical metadata, such as a table's storage structure or the path to a file
  2. Business metadata, such as data lineage and business ownership
  3. Process metadata, such as rows loaded per day, HDFS space occupied, and update time

A metadata system built on these three categories of metadata usually implements the following core functions:

1. Data lineage

For example, table-level and field-level lineage, which is mainly derived from our daily SQL and ETL jobs.
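
Both levels of lineage can be parsed out of an everyday ETL statement like the following sketch (the table and column names are hypothetical, for illustration only):

-- Table-level lineage recoverable here: ods.orders -> dw.orders_daily
-- Field-level lineage, e.g.: ods.orders.amount -> dw.orders_daily.total_amount
insert overwrite table dw.orders_daily partition(dt='20200721')
select order_date,
       sum(amount) as total_amount
from ods.orders
where dt='20200721'
group by order_date;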

2. Big data cluster computing resource management

For tasks running on different computing engines such as Spark/Flink/MapReduce, you can collect each task's resource usage, such as CPU, memory, and disk I/O, from YARN (or whichever resource manager is in use), and then chart that usage. Through a visual interface you can spot abnormal tasks, identify tables or services with heavy resource consumption, and promptly notify the responsible owners for targeted analysis and optimization.
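
For example, if the usage collected from YARN is landed in a table, the heaviest consumers can be surfaced with a simple query. A minimal sketch, assuming a hypothetical table yarn_app_usage that stores one row per task per day:

-- Top 10 resource consumers for one day, ranked by aggregated memory usage
select owner, app_id, queue, memory_seconds, vcore_seconds
from yarn_app_usage
where dt='20200721'
order by memory_seconds desc
limit 10;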

3. Data synchronization and permission management

4. Hive table metadata statistics

Here, Hive table statistics mainly refers to metrics such as row count, number of files, HDFS storage size, and last operation time.
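
These metrics live in the table parameters of the Hive metastore; in Hive they can be inspected directly, for example (testdb.some_table is a placeholder name):

-- The "Table Parameters" section of the output contains keys such as
-- numFiles, numRows, totalSize and transient_lastDdlTime, once collected
DESCRIBE FORMATTED testdb.some_table;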

By continuously collecting these metrics and turning them into charts, data warehouse staff can discover data patterns or data quality problems. Those who use the warehouse for business development can analyze business volume trends from the same charts. On top of this, you can also build features such as data quality verification and value distribution exploration.

This article mainly introduces how to use Spark to collect these metrics for Hive databases, for both partitioned and non-partitioned tables.

In actual production, besides using the approaches below to promptly update and read the metric information recorded in the Hive metastore tables, we can also study how the corresponding SQL is executed inside Hive/Spark and implement our own business logic on top of it.

Note: by default, Spark does not collect the number of files.

Syntax support

1) Partitioned tables

Spark's metadata statistics for Hive partitioned tables differ slightly from Hive's native statistics support for partitioned tables.

Spark can collect metadata statistics for specific partitions, and also at the whole-table level (though the latter does not touch individual partitions):

-- Collect the data volume of tab_partition: total HDFS space used, total row count, etc.
-- Hive currently does not support analyzing a partitioned table directly this way.
-- Note: executing this SQL does not collect statistics for the table's individual partitions.
analyze table tab_partition COMPUTE STATISTICS;

-- Same as Hive
analyze table tab_partition partition(partCol) COMPUTE STATISTICS;

-- Same as Hive
analyze table tab_partition partition(partCol='20200722000000') COMPUTE STATISTICS;

2) Non-partitioned tables

analyze table tab_no_partition COMPUTE STATISTICS;
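
Once collected, the statistics can be read back from Spark itself; a quick way to verify, using the table from the statement above:

-- When statistics exist, the DESC EXTENDED output includes a "Statistics" row
-- showing the size in bytes and, if collected, the row count
DESC EXTENDED tab_no_partition;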

Let's take a look at a specific example:

1) Create a Hive table with Spark
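
The original DDL is not shown in this article. A minimal sketch of what the creation might have looked like (the columns id and name, the partition column partCol, the storage format, and the location are all assumptions inferred from the statements and metastore dumps below):

-- Hypothetical DDL; the format and location are placeholders
create external table testdb.test_analyze_spark (
  id int,
  name string
)
partitioned by (partCol string)
stored as parquet
location '/path/to/test_analyze_spark';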

Take the partitioned table testdb.test_analyze_spark as an example. When the table has just been created, the TABLE_PARAMS table in the Hive metastore database contains the following information about it:

+------+------------------------------------+--------------------+
|TBL_ID| PARAM_KEY| PARAM_VALUE|
+------+------------------------------------+--------------------+
| 3018| EXTERNAL| TRUE|
| 3018| spark.sql.create.version| 2.4.3|
| 3018|spark.sql.sources.schema.numPartCols| 1|
| 3018| spark.sql.sources.schema.numParts| 1|
| 3018| spark.sql.sources.schema.part.0|{"type":"struct",...|
| 3018| spark.sql.sources.schema.partCol.0| dt|
| 3018| transient_lastDdlTime| 1595409374|
+------+------------------------------------+--------------------+

2) Write data into testdb.test_analyze_spark and collect metadata statistics

insert overwrite table testdb.test_analyze_spark partition(partCol=20200721000000) select id,name from testdb.test_partition1 where partCol=20190626000000;

After the SQL above is executed, Hive launches an internal task that collects partition-level metadata statistics for the table, but without numRows, as follows:

+------+------------------+------+-------+----------------------+-------+--------------------+-----------+
| NAME| TBL_NAME|TBL_ID|PART_ID| PART_NAME|PART_ID| PARAM_KEY|PARAM_VALUE|
+------+------------------+------+-------+----------------------+-------+--------------------+-----------+
|testdb|test_analyze_spark| 3018| 52977|partCol=20200721000000| 52977| numFiles| 1|
|testdb|test_analyze_spark| 3018| 52977|partCol=20200721000000| 52977| totalSize| 389|
|testdb|test_analyze_spark| 3018| 52977|partCol=20200721000000| 52977|transient_lastDdl...| 1595409909|
+------+------------------+------+-------+----------------------+-------+--------------------+-----------+

3) Connect to the Hive metastore database and query the metadata statistics of testdb.test_analyze_spark

connect jdbc where
url="jdbc:mysql://localhost:3306/hive?useUnicode=true&characterEncoding=UTF-8"
and driver="com.mysql.jdbc.Driver"
and user="root"
and password="root"
as db_1;

-- load jdbc.`db_1.TABLE_PARAMS` as TABLE_PARAMS;
load jdbc.`db_1.TBLS` as tbls;
load jdbc.`db_1.DBS` as dbs;
load jdbc.`db_1.PARTITIONS` as partitions;
load jdbc.`db_1.PARTITION_PARAMS` as partition_params;

select d.NAME,t.TBL_NAME,t.TBL_ID,p.PART_ID,p.PART_NAME,a.*
from tbls t
left join dbs d
on t.DB_ID = d.DB_ID
left join partitions p
on t.TBL_ID = p.TBL_ID
left join partition_params a
on p.PART_ID = a.PART_ID
where t.TBL_NAME='test_analyze_spark' and d.NAME='testdb';
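
The script above uses MLSQL-style connect/load statements. The same lookup can also be run as plain SQL directly against the metastore MySQL database (a sketch, assuming the default metastore schema):

-- Same join, executed directly in MySQL against the hive metastore database
select d.NAME, t.TBL_NAME, t.TBL_ID, p.PART_ID, p.PART_NAME,
       a.PARAM_KEY, a.PARAM_VALUE
from TBLS t
left join DBS d on t.DB_ID = d.DB_ID
left join PARTITIONS p on t.TBL_ID = p.TBL_ID
left join PARTITION_PARAMS a on p.PART_ID = a.PART_ID
where t.TBL_NAME = 'test_analyze_spark' and d.NAME = 'testdb';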

4) Results

-- When Spark executes: analyze table testdb.test_analyze_spark partition(partCol='20200721000000') COMPUTE STATISTICS;
-- the partition row count is also collected:
+------+------------------+------+-------+----------------------+-------+-------------------------------+-----------+
| NAME| TBL_NAME|TBL_ID|PART_ID| PART_NAME|PART_ID| PARAM_KEY|PARAM_VALUE|
+------+------------------+------+-------+----------------------+-------+-------------------------------+-----------+
|testdb|test_analyze_spark| 3018| 52977|partCol=20200721000000| 52977| numFiles| 1|
|testdb|test_analyze_spark| 3018| 52977|partCol=20200721000000| 52977| spark.sql.statistics.numRows| 1|
|testdb|test_analyze_spark| 3018| 52977|partCol=20200721000000| 52977| spark.sql.statistics.totalSize| 389|
|testdb|test_analyze_spark| 3018| 52977|partCol=20200721000000| 52977| totalSize| 389|
|testdb|test_analyze_spark| 3018| 52977|partCol=20200721000000| 52977| transient_lastDdlTime| 1595410238|
+------+------------------+------+-------+----------------------+-------+-------------------------------+-----------+

5) Use Spark to collect table-level metadata statistics for the whole Hive partitioned table

-- 1. Execute: analyze table testdb.test_analyze_spark COMPUTE STATISTICS;
-- 2. The information about testdb.test_analyze_spark in the TABLE_PARAMS table
--    of the Hive metastore database is then as follows:

connect jdbc where
url="jdbc:mysql://localhost:3306/hive?useUnicode=true&characterEncoding=UTF-8"
and driver="com.mysql.jdbc.Driver"
and user="root"
and password="root"
as db_1;

-- Get the DB_ID of testdb (49)
load jdbc.`db_1.DBS` as dbs;
select DB_ID from dbs where NAME='testdb' as db;

-- Get the TBL_ID of test_analyze_spark (3018)
load jdbc.`db_1.TBLS` as tbls;
select TBL_ID from tbls where DB_ID=49 and TBL_NAME='test_analyze_spark' as t2;

-- Get the table-level statistics of testdb.test_analyze_spark
load jdbc.`db_1.TABLE_PARAMS` as TABLE_PARAMS;
select * from TABLE_PARAMS where TBL_ID=3018;

-- Result:
+------+------------------------------------+--------------------+
|TBL_ID| PARAM_KEY| PARAM_VALUE|
+------+------------------------------------+--------------------+
| 3018| EXTERNAL| TRUE|
| 3018| spark.sql.create.version| 2.4.3|
| 3018|spark.sql.sources.schema.numPartCols| 1|
| 3018| spark.sql.sources.schema.numParts| 1|
| 3018| spark.sql.sources.schema.part.0|{"type":"struct",...|
| 3018|  spark.sql.sources.schema.partCol.0|             partCol|
| 3018| spark.sql.statistics.numRows| 1|
| 3018| spark.sql.statistics.totalSize| 389|
| 3018| transient_lastDdlTime| 1595410958|
+------+------------------------------------+--------------------+

Main differences between Hive and Spark in Hive table metadata statistics

  1. The SQL syntax supported for collecting Hive table metadata statistics differs: for example, Spark supports table-level statistics for a Hive partitioned table, while Hive requires a specific partition to be specified.
  2. The statistics are recorded in the Hive metastore database under different keys: for the same row count, Hive uses numRows while Spark uses spark.sql.statistics.numRows (see the sketch after this list).
  3. Spark does not collect the number of files by default, but Hive does.
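
Because of the different parameter keys, a metadata system that reads row counts from the metastore has to handle both engines. A minimal sketch against the metastore's TABLE_PARAMS table in MySQL:

-- Read each table's row count regardless of which engine wrote it
select TBL_ID,
       max(case when PARAM_KEY in ('numRows', 'spark.sql.statistics.numRows')
                then cast(PARAM_VALUE as unsigned) end) as num_rows
from TABLE_PARAMS
group by TBL_ID;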

The differences between Hive and Spark in Hive table metadata statistics are not limited to the three above. For more details, see the related articles previously published on the WeChat official account "Big data learning and sharing".

Recommended articles:

Implementing an auto-increment sequence in Hive and its metadata

Classic SparkSQL/Hive-SQL/MySQL interview exercises

Data lake vs. data warehouse? Alibaba's integrated lakehouse architecture

How to effectively recover mistakenly deleted HDFS files

Compression formats supported by Hadoop, their application scenarios, and the Hadoop native library

SparkSQL and Hive metastore Parquet conversion

Integrating Spark with Spring for offline data
