The "data lake" is actually a long-standing concept that can be traced back to 2011.
Today, the data lake we usually refer to can be regarded as a centralized, secure repository in which users can store, manage, discover, and share all of their structured and unstructured data at any scale, without having to define a schema in advance.
Specifically, judging from current practice, the data placed in a data lake falls into roughly three types: structured, high-value data from business systems, which is not complicated; huge volumes of log-type operations and maintenance data, which has low value density but is indispensable to the normal running of the enterprise IT architecture; and audio, video, and other data known for its unstructured form, whose value is obvious but which is harder to aggregate.
Different types of data usually have to be kept on different storage devices. Putting them all into one pool and exposing them through a variety of interfaces raises many challenges, and this is exactly what the data lake addresses. With a data lake, the right data can be delivered to the right people at the right time, without anyone having to manage access to the different locations where the data is stored, and with strong guarantees for data confidentiality and security. Why not use one?
What else surrounds the data lake?
According to a survey by Aberdeen, organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth. The reason is that a data lake not only makes data storage convenient, it is also compatible with traditional data warehouse analysis while allowing new types of analysis. For example, machine learning can be applied to data stored in the lake from new sources such as log files, clickstreams, social media, and Internet-connected devices in order to make intelligent decisions.
As you can imagine, the data lake is a powerful foundation for machine learning and artificial intelligence. Machine learning uses statistical algorithms that learn from existing data, a process known as training, in order to make decisions about new data. Specifically, patterns and relationships in the data are identified during training to build a model, and the model becomes the key to intelligent decision-making. These characteristics make the data lake well suited to data scientists and researchers who carry out exploratory queries and analysis and work on research-oriented, forward-looking services.
Having covered so many of the data lake's strengths, how does it relate, at the technical level, to the database, the data warehouse, and the data middle platform that are so often mentioned today? As is well known, a database serves a single data application and stores that application's data, and it can be either relational or non-relational. A data warehouse is an optimized form of database used to analyze relational data from transactional systems and line-of-business applications. Its data structure and schema are defined in advance to optimize for fast SQL queries, and the results are usually used for operational reporting and analysis.
In fact, before the data warehouse there was the concept of the data mart, which can hold most department-level data. Jingshao considers the data warehouse to be more enterprise-level and larger in scale, and something enterprise IT cannot ignore. However, because of its limitations, the data warehouse cannot adapt to rapidly changing data, and so the era of the data lake has arrived. Compared with a data warehouse, a data lake stores both relational data from line-of-business applications and non-relational data from mobile applications, IoT devices, and social media. No data structure or schema is defined when the data is captured, so all data can be stored and then examined with different types of analysis (such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning) to gain the right insights.
With databases and data warehouses covered, what about the currently popular data middle platform? Zhang Xia, chief cloud computing enterprise strategy consultant at AWS, said that the data middle platform is in fact not a term specific to the data industry; it is more of a term for application architecture in the Internet era.
To sum up, the data lake emerged largely because cloud computing made massive storage, convenience, and high-performance computing possible. In other words, the data lake was born out of the technological innovation brought by the cloud.
How much value lies in the details of AWS data lake services?
When cloud computing was still in its infancy, AWS had already raised the curtain on an era of technological change. Counting from 2006, more than ten years have passed, and throughout that time its exploration of the data lake has never stopped. In a nutshell, AWS divides the data lake into data ingestion, data storage, and data analysis, with corresponding offerings for each: expert data migration services, the Amazon S3 storage service, and analysis services led by Amazon Redshift. In effect, it positions the data lake itself as a solution.
As we can see, AWS data lake services are built mainly on the object storage service S3. As a highly durable, cost-effective object storage service, Amazon S3 supports open data formats, decouples storage from compute, and integrates with all of the analysis services in the AWS technology portfolio. According to Jingshao, Amazon S3 provides 11 nines of durability (99.999999999%), a highly resilient architecture spanning three Availability Zones, additional cross-region replication options and isolation, and independent scaling of storage and compute, which makes it the best storage layer for a data lake.
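To make the "11 nines" figure more concrete, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes the figure can be read as a 99.999999999% chance that any given object survives one year, and the object count is a hypothetical example rather than a number from the article.

```python
from fractions import Fraction

# Assumption: "11 nines" is read as a 99.999999999% chance that a single
# object survives one year, i.e. an annual loss probability of 1 in 10**11.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the assumption above."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_single_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


if __name__ == "__main__":
    stored_objects = 10_000_000  # hypothetical size of a data lake
    print("expected losses per year:", float(expected_annual_losses(stored_objects)))
    print("years per single loss:", int(years_per_single_loss(stored_objects)))
```

Exact fractions are used instead of floats so that the very small probability does not pick up rounding noise.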
On closer study, the data lifecycle turns out to cover a great deal. Above all, the original data needs well-designed controls so that data quality is ensured at the source. "Data can be stored in Amazon S3 first, and then handled according to its volume, processing characteristics, and attributes. This is an automated lifecycle management capability," Zhang Xia concluded.
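The idea of automated lifecycle management can be sketched as follows. The dictionary uses the general shape of an S3 lifecycle configuration (rules with a prefix filter, transitions, and an expiration), but the rule ID, prefix, and day thresholds are illustrative assumptions, and `storage_class_for` is a hypothetical local helper that only simulates the outcome; it is not how S3 itself evaluates rules.

```python
# Hypothetical lifecycle rule in the general shape of an S3 lifecycle
# configuration. The ID, prefix, and day counts are illustrative assumptions.
LIFECYCLE_CONFIGURATION = {
    "Rules": [
        {
            "ID": "tier-down-raw-logs",
            "Filter": {"Prefix": "raw/logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}


def storage_class_for(key, age_days, config=LIFECYCLE_CONFIGURATION):
    """Simulate locally which storage class the rules imply for an object.

    Returns "EXPIRED" once the object is past the rule's expiration age,
    and "STANDARD" when no enabled rule matches the key.
    """
    for rule in config["Rules"]:
        if rule["Status"] != "Enabled":
            continue
        if not key.startswith(rule["Filter"]["Prefix"]):
            continue
        expiration = rule.get("Expiration")
        if expiration and age_days >= expiration["Days"]:
            return "EXPIRED"
        current = "STANDARD"
        # Apply transitions in order of age; the last one reached wins.
        for step in sorted(rule.get("Transitions", []), key=lambda s: s["Days"]):
            if age_days >= step["Days"]:
                current = step["StorageClass"]
        return current
    return "STANDARD"


if __name__ == "__main__":
    for age in (10, 45, 120, 400):
        print(age, storage_class_for("raw/logs/app.log.gz", age))
```

Keeping the policy as plain data separates what should happen to aging data from the mechanism that applies it, which is the point of the automated approach described in the quote.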
It is worth mentioning that, within AWS's large and comprehensive set of data lake services, an interactive query service called Amazon Athena stands out. What makes it special is that it adopts the currently popular serverless architecture: without setting up or managing any infrastructure, users can analyze data in Amazon S3 directly with standard SQL, with no complicated ETL process.
As far as we understand, Athena uses Presto, a distributed SQL engine, to run queries, and uses Apache Hive to create, drop, and alter tables and partitions. Users can quickly write Hive-compliant DDL statements and ANSI SQL statements in the query editor, and can also use complex joins, window functions, and complex data types. Because Athena takes an approach called schema-on-read, a schema can easily be projected onto the target data at the moment a query is executed.
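Schema-on-read can be illustrated without any AWS service at all. The sketch below is a plain Python analogy, not Athena's implementation: raw JSON lines are kept exactly as written, and a schema (here a hypothetical mapping of column names to casting functions) is applied only when a query runs. The sample records and function names are invented for illustration.

```python
import json

# Raw records are stored exactly as they arrived; nothing was validated or
# reshaped at write time, so fields may be missing or extra.
RAW_LINES = [
    '{"user": "u1", "ts": "2020-05-01T10:00:00", "amount": "19.90"}',
    '{"user": "u2", "ts": "2020-05-01T10:05:00", "amount": "5", "coupon": "X1"}',
    '{"user": "u1", "ts": "2020-05-02T09:30:00"}',
]

# The schema is chosen by the reader at query time: column name -> caster.
SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Project a schema onto raw lines while reading them."""
    for line in lines:
        record = json.loads(line)
        yield {
            column: (cast(record[column]) if column in record else None)
            for column, cast in schema.items()
        }


def total_by_user(lines, schema):
    """A tiny 'query': sum the amount column per user, skipping missing values."""
    totals = {}
    for row in read_with_schema(lines, schema):
        if row["amount"] is not None:
            totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    print(total_by_user(RAW_LINES, SCHEMA))
```

A different reader could apply a different schema to the same stored lines, for example one that includes `coupon`, without rewriting any data.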
In addition, AWS Glue is another important part of an Amazon S3 data lake. Like Athena, it benefits from serverless technology: there are no servers to host or operate, and it provides data catalog and transformation services for modern data analytics.
In general, it is a fully managed data catalog and ETL (extract, transform, load) service that simplifies and automates the difficult, time-consuming tasks of data discovery, transformation, and job scheduling. It has been observed that when users build analytics solutions on a data lake architecture, about 75% of their time usually goes to data integration tasks: extracting data from various sources, standardizing it, and loading it into a data store. AWS Glue removes all of the repetitive work on ETL operational infrastructure.
Jingshao has learned that AWS Glue crawls data sources and builds the data catalog with pre-built classifiers that recognize common data formats and data types, including CSV, Apache Parquet, and JSON. It can create a unified metadata repository across services, crawl data sources to discover schemas, populate the data catalog with new and modified table and partition definitions, and maintain schema versioning. Its fully managed ETL capability can also transform data or convert it to columnar formats to optimize cost and improve performance. On the whole, by simplifying the creation of ETL jobs, AWS Glue lets users build scalable, reliable data preparation platforms. These platforms can span thousands of ETL jobs, with built-in dependency resolution, scheduling, resource management, and monitoring, which makes it easier to search and manage all data across different data stores without moving it by hand.
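The idea of a classifier that recognizes a format and infers a schema can be shown with a toy example. This is only a sketch of the concept and makes no claim about how AWS Glue classifiers work internally: the heuristics, the function names, and the Hive-style type names (`bigint`, `double`, `string`) are all assumptions chosen for illustration.

```python
import csv
import io
import json


def classify(sample: str) -> str:
    """Guess whether a text sample is JSON lines or CSV (toy heuristic)."""
    lines = sample.strip().splitlines()
    if not lines:
        return "unknown"
    try:
        if isinstance(json.loads(lines[0]), dict):
            return "json"
    except json.JSONDecodeError:
        pass
    return "csv"


def infer_type(values) -> str:
    """Pick the narrowest type that can hold every string value."""
    for type_name, cast in (("bigint", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(sample: str):
    """Return (format, {column: type}) for a small sample of a data source."""
    data_format = classify(sample)
    if data_format == "json":
        rows = [json.loads(line) for line in sample.strip().splitlines()]
    elif data_format == "csv":
        rows = list(csv.DictReader(io.StringIO(sample.strip())))
    else:
        return data_format, {}
    columns = {}
    for row in rows:
        for name, value in row.items():
            # Compare everything as text so JSON numbers and CSV cells
            # go through the same inference path.
            columns.setdefault(name, []).append(str(value))
    return data_format, {name: infer_type(vals) for name, vals in columns.items()}


if __name__ == "__main__":
    print(infer_schema("id,name,price\n1,apple,0.5\n2,pear,1\n"))
    print(infer_schema('{"device": "d1", "temp": 21.5}\n{"device": "d2", "temp": 22}'))
```

A real catalog would also record partitions, locations, and schema versions; the sketch covers only the "recognize the format, then discover the columns" step.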
Even more noteworthy, AWS Glue can be integrated with serverless services such as AWS Lambda and AWS Step Functions and combined with machine learning and artificial intelligence, including working with Amazon SageMaker on more automated predictive analysis. Zhang Xia said that AWS now has more than 100 services to support any data lake use case, and that more serverless, in-place query and processing options shorten the time to results and reduce the cost of data insight.
"With AWS Glue now officially available in the AWS China (Ningxia) Region, operated by NWCD, customers in China can easily move and process data from any number of data sources, bring it into the data lake, and choose from a variety of AWS analysis services to start analyzing all of their data quickly," concluded Zhang Wenyi, AWS global vice president and executive director of Greater China.
Speaking of newly launched services that support data lake use cases, in August last year AWS released a new service called AWS Lake Formation. Although it is not yet available in China, it has attracted wide attention in the industry, mainly because it greatly simplifies the process of creating a data lake.
For example, a build that used to take months can be completed in days. This includes collecting and cataloging data from databases and object storage, moving the data into a new Amazon S3 data lake, cleaning and classifying the data with machine learning algorithms, and securing access to sensitive data.
In terms of technical details, AWS Lake Formation identifies existing data stored in S3, relational databases, or NoSQL databases and moves it into the data lake. It then crawls, catalogs, and prepares the data for analysis, so that users can access it through the analysis service of their choice. Other AWS services and third-party applications can also access the data through those services. With that, the three main elements of the data lake service, namely Amazon S3/Glacier, AWS Glue, and AWS Lake Formation, are all in place.
After so many technical details about AWS data lake services, many readers will want to know which size or tier of enterprise a data lake suits. On this point, Zhang Xia believes that enterprises of every size and in every field can use the data lake concept to build an internal data application platform; large enterprises simply use it for more, and more complex, data analysis than small and medium-sized ones. Take the data lake deployed for AWS's own internal business as an example: it runs as many as 600,000 analysis tasks covering user recommendations, operational information, inventory, and purchasing, all analyzed efficiently through data lake services, and it remains a core competitive strength to this day.
In addition, Jingshao has learned that Club Factory, founded in Hangzhou in 2016 by Jiayun Data, has also been using AWS data lake services all along to correlate data and find products of all kinds for customers around the world. It processes 1.5 billion behavior events every day and supports 180 data analysis tasks, which drive product recommendations for users, internal operational analysis, and innovation in supplier management.
There are of course many other examples. Xiaohongshu (Little Red Book), with more than 30 million users, uses an AWS data lake to store massive amounts of log data along with unstructured community data such as pictures, comments, and emoji in order to analyze user preferences. Liulishuo, with tens of millions of users, has used an AWS data lake to build a large "Chinese English speech database", and on that basis has developed spoken English assessment, an English writing scoring engine, and a deep adaptive learning system.
From 2011 to the present, the data lake has evolved from scattered open-source solutions into the key, unified, standard solution that AWS now assembles from its services, and its applications keep growing stronger. It is easy to imagine that, as new technologies such as the Internet of Things, 5G, and edge computing gather momentum, data lakes for data storage and analysis are entering a key stage of serious discussion and research, and that this period will bring more surprises in intelligent data mining.
This article was written by [Science and technology stars]. Please include a link to the original when reposting. Thank you.