Introduction: During Double 11 in 2020, the cloud-native real-time data warehouse was deployed for the first time in Alibaba's core Double 11 data scenarios, bringing real-time processing to the full commercial data link with millisecond-level processing of massive data. Data development efficiency for the search and recommendation business improved 4 times, Cainiao's logistics parcel data link was optimized from hours to 3 minutes, and Kaola's minute-level and hour-level jobs finished within 1 minute, so real-time analysis and decision-making can keep pace with a fast-changing market. Today we share Alibaba Cloud's real-time big data solutions, which help enterprises make decisions in real time.
Twice the performance at a quarter of the price: a new choice for real-time database synchronization
The first step in real-time analysis and decision-making is to synchronize data into a big data compute engine in real time. DataWorks Data Integration uses a self-developed high-performance engine: on the same machine specification, its real-time RDS synchronization performance is up to 2 times that of other data synchronization solutions, and the price can be as low as 1/4. With DataWorks Data Integration, enterprises get efficient, low-cost, and stable real-time data synchronization.
DataWorks Data Integration dates back to DataX 1.0 and 2.0 in 2011. Version 3.0 then began serving external users officially. Later, the public cloud, proprietary cloud, and Alibaba-internal versions were unified into the Data Integration service. In 2019, DataWorks Data Integration was commercialized: exclusive resource groups went online, and pay-as-you-go and monthly or yearly subscription billing became available to users. In 2020, the full and incremental real-time synchronization solution was officially released.
In the full and incremental real-time synchronization solution, data is first synchronized offline in full from MySQL, Oracle, IBM DB2, SQL Server, and PolarDB to big data products such as MaxCompute, Hologres, Elasticsearch, Kafka, and DataHub. Change records from the relational databases are then extracted in real time and synchronized to those products. For an offline data warehouse such as MaxCompute, changes are synchronized to a Log table, split into a Delta table, and merged into the Base table before the final write to MaxCompute. This is how real-time incremental synchronization is achieved there.
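The Log, Delta, and Base flow described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the DataWorks implementation: the change-record layout (sequence number, operation, primary key, row) and the function names `split_to_delta` and `merge_into_base` are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical change-log record: (sequence number, operation, primary key, row).
Change = Tuple[int, str, int, dict]


def split_to_delta(log: List[Change]) -> Dict[int, Tuple[str, dict]]:
    """Collapse the change log so that only the latest change per key remains."""
    delta: Dict[int, Tuple[str, dict]] = {}
    for _seq, op, key, row in sorted(log, key=lambda change: change[0]):
        delta[key] = (op, row)  # a later change overwrites an earlier one
    return delta


def merge_into_base(base: Dict[int, dict],
                    delta: Dict[int, Tuple[str, dict]]) -> Dict[int, dict]:
    """Apply the delta to a copy of the base table and return the new base."""
    merged = dict(base)
    for key, (op, row) in delta.items():
        if op == "DELETE":
            merged.pop(key, None)
        else:  # INSERT or UPDATE both end with the latest row image
            merged[key] = row
    return merged


base = {1: {"name": "a"}, 2: {"name": "b"}}  # result of the full synchronization
log = [
    (1, "UPDATE", 1, {"name": "a2"}),
    (2, "INSERT", 3, {"name": "c"}),
    (3, "DELETE", 2, {}),
    (4, "UPDATE", 3, {"name": "c2"}),
]
new_base = merge_into_base(base, split_to_delta(log))
print(new_base)
```

Collapsing the log before merging means each key is touched once per merge cycle, no matter how many times it changed in the meantime.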
DataWorks Data Integration extracts data from relational databases (MySQL, Oracle, PolarDB, and so on) by monitoring the databases in real time, and collects real-time message streams through message subscription. The collected data can then be processed with data filtering, string replacement, and, in the future, Groovy functions, which is a fairly standard ETL flow. Processed data can be written to multiple different data sources. Combined with real-time operations monitoring and alerting, this forms the whole-database full and incremental solution, giving real-time synchronization a complete link from whole-database full synchronization, to whole-database real-time incremental synchronization, to automatic merging of incremental data on the big data side.
In addition, the real-time synchronization architecture is highly available. DataWorks Data Integration has a primary and standby design in both the control layer and the execution layer, so if scheduling or a data transmission link goes down, tasks can switch to another link and keep running stably.
Real-time synchronization in Data Integration has its own dirty data collection mechanism. Throughout the ETL link, data that the reader and writer do not support is collected and output through the plug-in center to a user-configured destination, such as local logs, Loghub, or MaxCompute, so that it can be reprocessed later.
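To make the processing and dirty data ideas concrete, here is a minimal Python sketch of a pipeline that filters records, applies a string replacement, and diverts records it cannot handle to a separate sink. It is an illustration under stated assumptions, not the Data Integration plug-in API: `run_pipeline`, `transform`, the field names, and the list standing in for a dirty data destination are all hypothetical.

```python
from typing import Callable, List, Optional


def run_pipeline(records: List[dict],
                 transform: Callable[[dict], Optional[dict]],
                 writer: Callable[[dict], None],
                 dirty_sink: List[dict]) -> int:
    """Write transformed records; collect unprocessable ones instead of failing."""
    written = 0
    for record in records:
        try:
            row = transform(record)
            if row is None:  # filtered out on purpose, so not dirty data
                continue
            writer(row)
            written += 1
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the original record and the reason so it can be reprocessed.
            dirty_sink.append({"record": record, "error": repr(exc)})
    return written


def transform(record: dict) -> Optional[dict]:
    """Hypothetical ETL step: data filtering plus string replacement."""
    if record.get("status") == "test":
        return None
    return {"id": int(record["id"]), "city": record["city"].replace("_", " ")}


target: List[dict] = []  # stands in for the destination data source
dirty: List[dict] = []   # stands in for a local log, Loghub, or MaxCompute table
source = [
    {"id": "1", "city": "Hang_zhou", "status": "ok"},
    {"id": "x2", "city": "Beijing", "status": "ok"},   # id is not an integer
    {"id": "3", "city": "Shanghai", "status": "test"},  # filtered out
    {"id": "4", "status": "ok"},                        # city field is missing
]
count = run_pipeline(source, transform, target.append, dirty)
print(count, target, len(dirty))
```

Keeping the failed record together with its error is the design choice that matters here: the synchronization task keeps running, and nothing is silently dropped.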
In the big data on cloud solution, Data Integration sends offline data to the offline engines (EMR, MaxCompute) and real-time data to the real-time engines (MaxCompute Interactive Analytics (Hologres), Flink) for processing. The results converge in DataWorks for data development and data services, including model development on the machine learning platform PAI, and are finally opened to data applications such as QuickBI, DataV, and Tableau.
Several scenario solutions have been built on this cloud solution, including an intelligent real-time data warehouse solution, a real-time monitoring dashboard solution, and a data lake solution. The typical intelligent real-time data warehouse solution targets large-scale real-time data query scenarios in internet industries such as e-commerce, gaming, and social networking:
Step 1: Data collection. DataWorks Data Integration (batch + real time) and DataHub (real time) provide unified data collection and ingestion.
Step 2: Full-link data development is completed on DataWorks, including data integration, data development and ETL, transformation and KPI computation, as well as job scheduling, monitoring, and alerting. DataWorks provides security management and control for the data development link, and its data service module provides unified data service APIs.
Step 3: Depending on business needs, real-time data optionally goes through Flink for real-time ETL. The results are loaded into MaxCompute Interactive Analytics (Hologres) to build the real-time data warehouse and application data marts, providing real-time interactive query and analysis over massive data. Hologres supports federated queries across real-time and offline data.
Step 4: Alibaba's QuickBI Advanced edition or third-party data analysis tools such as Tableau handle data visualization, and data service portal applications are built for each business line.
The solution seamlessly connects the full link of the Alibaba Cloud real-time data warehouse with the offline data warehouse, delivering a cost-effective combination of one storage layer and two kinds of computing (real-time and offline).
Real-time data analysis based on Hologres and Flink
After Data Integration has synchronized the data, a real-time data warehouse is needed to put it to good use. The real-time data warehouse solution was outlined above. Next comes a detailed look at the real-time data analysis solution based on Hologres and Flink.
MaxCompute Interactive Analytics (Hologres) introduces the idea of a real-time data warehouse with "unified serving and analytics": a single big data engine handles both real-time OLAP insight analysis and high-QPS key-value point queries for feature serving. Integrating real-time analysis and serving in this way greatly simplifies real-time data warehouse architecture and helps customers analyze and decide in real time.
As digital transformation accelerates, data volumes are exploding and the demands on data computation keep rising: low latency, low resource consumption, high efficiency, and high accuracy. Quickly aggregating and analyzing massive historical data together with daily real-time incremental data, and mining business value from it, has become the most basic business need.
Along the way, many enterprises have introduced both batch processing and real-time computing, but offline batch data warehouses and real-time analysis are hard to reconcile. An offline data warehouse cannot meet business timeliness requirements, while an absolutely real-time data warehouse is impractical. "Near real time" is the sensible target, and building real-time or near-real-time analysis depends on building a real-time data warehouse system.
The Lambda architecture is the most widely used architecture in enterprise real-time data warehouse construction. To some extent it solved most enterprises' business problems in the early stage of digitalization, but as business grows rapidly, data volumes explode, and business demands change, the problems of the Lambda architecture are becoming more and more apparent. They mainly include:
1) Data is stored as multiple copies in different systems, which wastes space and makes data consistency hard to guarantee.
2) The data link is assembled from multiple engines and systems, so development, maintenance, and learning costs are high.
4) The learning cost is very high, which raises the cost of adoption.
As a result, simpler architecture, cost optimization, unified data, a low learning threshold, business agility, and self-service analysis have become urgent needs. Enterprises want a new big data product that supports real-time writes, real-time computing, and real-time insight; that unifies real-time and offline processing and reduces data movement; and that decouples business from technology to support self-service analysis, simplifying the overall business system architecture.
Against this background, Hologres introduced the HSAP concept. HSAP stands for Hybrid Serving & Analytical Processing: it supports high-QPS real-time writes and real-time queries, and handles complex multidimensional analysis within the same system. HSAP resembles a data warehouse plus online data serving, and is a superset of the two. Enterprises need unified storage for real-time and offline data; efficient query services that support high-QPS queries, complex analysis, and federated query and analysis; and direct connection to front-end applications for ad hoc analysis and unified data services with less data movement. Hologres is a product built on the HSAP concept. It belongs to MaxCompute, Alibaba's self-developed big data brand, supports high-concurrency, low-latency analysis and serving over PB-scale data, and covers scenarios such as real-time data warehousing and big data interactive analysis. Its core features are unified analytics and serving, a design built for real time, an architecture that separates storage and compute, and compatibility with the PostgreSQL (PG) ecosystem.
In full-link real-time data warehouse scenarios, Hologres is deeply integrated with Flink and can serve as a Flink sink table, source table, and dimension table. Businesses can use Flink for real-time ETL cleansing and transformation, store detail data, lightly aggregated data, and business summary data in Hologres, and then query Hologres in real time and output data to third-party analysis tools for real-time analysis.
MaxCompute + Hologres enables second-level interactive analysis. Hologres can accelerate queries on the MaxCompute data warehouse directly, with no data movement, and connect to BI tools for real-time analysis of offline data. MaxCompute data can also be imported quickly into Hologres and indexed, providing query services with higher QPS and faster responses.
Hologres + Flink + MaxCompute delivers a unified "real-time, offline, analytics, and serving" solution. Cold data is stored in MaxCompute and hot data is stored in Hologres.
Hologres is also deeply integrated with the vector engine Proxima, which suits real-time recommendation scenarios. Real-time recommendation relies on feature lookups, real-time metric computation, and vector-retrieval recall. The integration of Hologres vector queries with Proxima provides high-performance vector query services. Combined with Flink and PAI, it can be applied to real-time personalized recommendation and to image, video, and face scenarios, improving ad retention.
Hologres is already used by many customers and in many scenarios for big data analysis and decision-making.
1) Xiaohongshu had built a large ClickHouse cluster, but after it had run for a while, ClickHouse's shortcomings became apparent: high cost, slow queries, instability, and complex cluster operations. After adopting Hologres, Xiaohongshu gained an architecture that separates storage and compute, can easily store 15 days of data, and can quickly query 7 or even 15 days of data, with greatly improved query performance. Primary-key deduplication (insert or ignore) means upstream failover has no impact, and no operations work is required. Customer satisfaction is very high.
2) The Cainiao intelligent logistics engine originally used a Flink + HBase + OLAP solution. Long data import times, wasted resources, and data silos seriously troubled the business teams. After adopting Hologres, end-to-end processing of 200 million records across the full link was optimized to 3 minutes, development efficiency improved greatly, and overall hardware cost fell by 60%.
3) Alibaba's Customer Experience division (CCO) previously used a DataHub + Flink + OLAP + Lindorm data warehouse solution, with pain points such as duplicated tasks, redundant data storage, metadata management, and complex processing links. For this year's Double 11, Hologres helped CCO build a real-time, self-service, systematic real-time data warehouse for user experience. It handled the Double 11 scenario well, supporting more than a thousand service dashboards, cutting peaks by 30%, and saving close to 30% in overall cost. Flink real-time write TPS peaked at over 1 million per second (100w+/s), write latency stayed within 500 us, average latency on Double 11 was 142 ms, and 99.99% of queries completed within 200 ms.
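The primary-key deduplication (insert or ignore) mentioned in the Xiaohongshu case is what makes upstream replays harmless. The sketch below models that behavior in plain Python. It is an assumption-laden illustration rather than Hologres code: the `PrimaryKeyTable` class and the event keys are hypothetical.

```python
from typing import Dict, List, Tuple


class PrimaryKeyTable:
    """In-memory stand-in for a table with a primary key and insert-or-ignore writes."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def insert_or_ignore(self, key: str, row: dict) -> bool:
        """Insert the row unless the key already exists; report whether it was written."""
        if key in self._rows:
            return False  # duplicate key: keep the existing row
        self._rows[key] = row
        return True

    def __len__(self) -> int:
        return len(self._rows)


table = PrimaryKeyTable()
batch: List[Tuple[str, dict]] = [
    ("evt-1", {"v": 1}),
    ("evt-2", {"v": 2}),
    ("evt-3", {"v": 3}),
]
first = [table.insert_or_ignore(key, row) for key, row in batch]

# Simulated upstream failover: the job restarts from an earlier checkpoint,
# replays the last two events, and then delivers one new event.
replay = batch[1:] + [("evt-4", {"v": 4})]
second = [table.insert_or_ignore(key, row) for key, row in replay]

print(first, second, len(table))
```

Because replayed keys are ignored, the table ends with exactly one row per event whether or not the upstream job restarted.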
Low-cost real-time log monitoring and analysis based on ELK
Besides the real-time data stored in big data engines, there is also a large amount of unstructured log data. Alibaba Cloud Elasticsearch provides a low-cost hot and cold storage solution as a fully managed service, making it easy for enterprises to build a unified full-observability operations monitoring platform on the cloud, monitor and analyze massive data in real time, and improve the efficiency of automated operations management.
Enterprise big data IT operations have evolved from simple operations tools to operations platforms, then to automated operations and fault-prevention operations, and are now moving toward intelligent operations. However, existing approaches to big data operations analysis still have obvious problems: there are many atomic tools with a high barrier to entry; the tools are hard to connect; Monitoring, Logging, and Tracing cannot build on each other to deliver more value; and the benefit in real business depends entirely on the user's architecture skills.
The pain points of operations monitoring in full-observability scenarios are largely the same everywhere: logs and metrics are obtained in different ways, so ingestion is costly; log and metric formats are a major challenge; operations scalability and peak stability are hard to ensure; long-term storage of massive data is expensive; time-series analysis is difficult and log analysis tools hit retrieval performance bottlenecks; and scalability requirements are high. Elasticsearch emerged to address these problems. Open-source Elasticsearch is a real-time distributed search and analytics engine built on Lucene and released under Apache open-source terms. It provides a distributed service that can store, query, and analyze large datasets quickly in near real time. Because it is fast to query and easy to use, it is often used to build complex query features.
Elasticsearch is part of the Elastic Stack open-source ecosystem, which includes Beats (lightweight data collection tools), Logstash (a tool for collecting, filtering, and transporting data), Elasticsearch, and Kibana (a flexible visualization tool).
The capabilities of the Elastic Stack largely address the 6 pain points:
1) Log and metric collection with Beats: Beats agents with Autodiscover support collect all kinds of data in a unified way.
2) Rich log and metric formatting: logs and metrics in open-source software and network formats are handled by templates with no manual formatting, and real-time data processing extension components provide a rich set of transformation UDFs and plugins.
3) High stability: a distributed architecture ensures the cluster's baseline throughput and performance, backed by cross-data-center deployment, same-city disaster recovery, and scenario-based kernel optimization.
4) Low cost: Alibaba Cloud Elastic Stack provides a four-tier hot, warm, cold, and frozen data storage model and uses dedicated storage compression to cut storage costs dramatically.
5) One-stop log analysis, monitoring, and tracing, with an engine deeply optimized for time-series scenarios to ensure the performance of time-series log monitoring and analysis.
6) Extensibility: a distributed architecture, a flexible and open REST API and plugin framework, and an open-source community that keeps providing integrations for new technology stacks.
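The four-tier storage model in item 4 amounts to routing each index to a tier according to how old its data is. A minimal sketch of that idea follows. The age thresholds (7, 30, and 90 days) and the function name `tier_for` are assumptions chosen for illustration; they are not Alibaba Cloud defaults.

```python
from datetime import date

# Hypothetical upper age bounds in days for each tier, youngest tier first.
TIERS = [(7, "hot"), (30, "warm"), (90, "cold")]


def tier_for(index_date: date, today: date) -> str:
    """Return the storage tier for a daily index based on its age in days."""
    age_days = (today - index_date).days
    for max_age, tier in TIERS:
        if age_days < max_age:
            return tier
    return "frozen"  # anything older than the last threshold


today = date(2021, 1, 31)
for index_date in (date(2021, 1, 30), date(2021, 1, 10),
                   date(2020, 12, 1), date(2020, 6, 1)):
    print(index_date.isoformat(), tier_for(index_date, today))
```

Recent indexes, which receive most queries and writes, stay on the fastest storage, while older data moves to progressively cheaper tiers.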
With this open-source ecosystem, ELK can analyze logs, metrics, APM, and business data together on one platform, creating unified visual views, aligned time ranges and filter conditions, unified rule-based monitoring and alerting, and unified machine-learning-based intelligent monitoring and alerting. It can connect to open-source processing tools such as Spark and Flink for more uniform formatting. The data is finally stored in Elasticsearch and served to Kibana for visualization, monitoring, and alerting. Correlation analysis and machine learning make full use of data scattered across every layer of the system and bring out more of its value.
Elastic, the company behind Elasticsearch, began a strategic partnership with Alibaba Cloud in 2017 to provide a fully managed Elastic Stack service on Alibaba Cloud. The service is 100% compatible with open source, includes the X-Pack commercial plugins for free, is ready to use once activated, and is billed pay-as-you-go. It also adds deep functional and kernel performance optimization, providing richer analysis and retrieval capabilities and more secure, highly available services.
Compared with a self-managed open-source deployment, Alibaba Cloud Elasticsearch works out of the box with no operations burden. Enterprises can migrate to the cloud at zero cost while gaining stronger functionality and performance, and TCO is estimated at 75% of a self-managed deployment.
On stability, when traffic peaks arrive, Alibaba Cloud Elasticsearch's self-developed QoS rate-limiting plugin applies index-level read and write flow control. When query or write pressure on a single index is too high, the specified index is downgraded appropriately according to business priority, which keeps traffic within a suitable range.
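Index-level flow control of this kind can be pictured as one rate limiter per index, with lower-priority indexes given smaller budgets. The sketch below uses a token bucket, which is a common rate-limiting technique chosen here for illustration; the source does not say how the QoS plugin is implemented. The class `IndexRateLimiter`, the index names, and the rates are all hypothetical.

```python
class IndexRateLimiter:
    """Token bucket for one index: `rate` tokens per second, at most `burst` stored."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = burst  # start with a full bucket
        self.last = 0.0      # timestamp of the previous request, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill tokens for the elapsed time, then admit the request if affordable."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A high-priority business index gets a larger budget than a debug index.
limits = {
    "orders-logs": IndexRateLimiter(rate_per_sec=100.0, burst=5.0),
    "debug-logs": IndexRateLimiter(rate_per_sec=1.0, burst=2.0),
}


def admit(index: str, now: float) -> bool:
    """Decide whether a write to `index` arriving at time `now` is accepted."""
    return limits[index].allow(now)


results = [admit("debug-logs", 0.0) for _ in range(4)]
print(results)
```

Time is passed in explicitly so the example is deterministic; a real limiter would read a monotonic clock instead.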
On cost, Alibaba Cloud ES provides fully managed elastic scaling to avoid wasting resources during off-peak periods: users can purchase elastic data nodes, configure scheduled scale-out and scale-in in the console, and scale dynamically with business traffic.
In addition, the Alibaba Cloud Elasticsearch log-enhanced edition separates compute from storage. It uses NFS shared storage as the underlying storage for nodes, with primary and replica shards: primary shards are readable and writable, and replica shards are read-only. This delivers storage cost savings of 100%, a write performance improvement of 100%, and second-level elastic scaling.
In January 2021, Alibaba Cloud will launch the Elasticsearch time-series write Serverless edition, which can greatly reduce costs in time-series and log scenarios. Users will no longer need to watch the ES cluster's write resources and write pressure. When business requests change, the cloud-side Serverless service allocates physical resources, offering on-demand usage, pay-as-you-go billing, and highly elastic scaling, along with low-cost local compute and storage nodes that reduce data storage costs.
At the business level, TAL (Good Future, formerly XRS) is a benchmark customer of Alibaba Cloud Elasticsearch. Its live-streaming cloud business supports online classes at the scale of millions and the interaction between teachers and students, promising no stuttering and HD quality at 500 ms latency. But as monitoring metrics multiplied, it became hard to guarantee live-stream quality in real time. To safeguard customer experience, TAL also needed fine-grained data permission analysis over a large shared pool of data, and had to handle the education industry's high and volatile traffic during winter and summer vacations.
Elasticsearch gives TAL rich heterogeneous data collection, template-based log parsing and processing, and data permissions that can be split down to the field level. Users can customize the permission system flexibly and connect it to the enterprise's own permission system. Smooth scaling and hot cluster changes have zero impact on services, which meets the customer's needs for real-time live-stream quality monitoring and stability under heavy traffic.
Building a Hadoop-ecosystem real-time data warehouse with unified batch and stream processing on Alibaba Cloud Databricks Data Insight
Besides building a real-time data warehouse with Hologres + Flink, many enterprises build their big data analytics platforms on Hadoop-ecosystem engines, and their offline data warehouses are already mature. Alibaba Cloud Databricks Data Insight can build a real-time data warehouse with unified batch and stream processing on the Hadoop ecosystem, upgrading the enterprise's existing architecture to meet real-time analysis and decision-making requirements.
Enterprises that build big data platforms on the Hadoop ecosystem run into some unavoidable problems, such as:
1) They want to tune jobs but do not understand the kernel; the technical threshold is high and operations staff are scarce.
2) Cluster maintenance is costly, and over time the cost of storing data on HDFS keeps rising.
3) Both streaming and batch data must be processed, so the technical architecture is complex, hard to maintain, and bug-prone.
4) Data needs to be inserted, deleted, and updated, but transactional support is hard to provide at big data scale.
5) Data engineers and data analysts each have their own environments, which makes sharing and collaboration difficult.
Enterprises typically respond by buying expert services or simply adding compute resources, by using fully managed cloud computing services, or by combining multiple engines, one for stream processing and one for batch processing. These approaches usually need several products working together.
Enterprises would rather meet these data analysis needs with a single product. After data lands in the data lake or event data arrives from Kafka, and before stream analytics and BI reporting, they want one engine in the middle. This engine should have a data lake architecture that separates storage and compute, process both streaming and batch data, and support incremental data writes. Alibaba Cloud therefore launched Databricks Data Insight, a product built jointly by Alibaba Cloud and Databricks. Databricks, a US technology unicorn, is the company behind Apache Spark. It was placed in the Leaders quadrant of Gartner's 2020 Magic Quadrant for Data Science and Machine Learning (DSML) Platforms, and has more than 5,000 customers and more than 450 partners worldwide. With Alibaba Cloud Databricks Data Insight, enterprises can build a Hadoop-ecosystem real-time data warehouse with unified batch and stream processing.
Databricks Data Insight has many advantages in ETL and data science. The most significant are: performance on standard test sets up to 50 times that of open source; the common Parquet data storage format, with high extensibility and support for customized deployment; enterprise-grade Spark that is fully compatible with open-source Spark, so migration needs essentially no API-level changes; Z-Order optimization, which reduces the amount of data read by 95% for a 20x performance improvement; caching of commonly used tables and queries, for a 30x performance improvement; PB-level scalability; interactive collaborative notebooks that let data engineers and data scientists edit jobs and share results; and big data and AI unified on one platform with shared underlying data. Architecturally, Databricks Data Insight is data lake analytics on the Delta architecture, and also a real-time data warehouse with unified batch and stream processing.
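Z-Order optimization reduces the data read because rows that are close in several columns end up in the same files, so whole files can be skipped using their minimum and maximum values. The toy sketch below shows the general technique with a 2-D Morton (bit-interleaving) code. It is not Databricks code: the functions `interleave_bits` and `files_touched`, the 4x4 grid, and the file size of 4 rows are assumptions for illustration, and the result says nothing about the 95% figure above.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton code: bit i of x goes to position 2i, bit i of y to position 2i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def files_touched(sorted_points: List[Point], file_size: int,
                  x_max: int, y_max: int) -> int:
    """Count files whose minimum values cannot rule out the filter x < x_max and y < y_max."""
    touched = 0
    for start in range(0, len(sorted_points), file_size):
        chunk = sorted_points[start:start + file_size]
        min_x = min(p[0] for p in chunk)
        min_y = min(p[1] for p in chunk)
        if min_x < x_max and min_y < y_max:
            touched += 1
    return touched


points = [(x, y) for x in range(4) for y in range(4)]
row_major = sorted(points)  # ordered by x, then y
z_ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))

# Query: x < 2 and y < 2, with 4 rows per file.
print(files_touched(row_major, 4, 2, 2), files_touched(z_ordered, 4, 2, 2))
```

With a plain sort on x, the filter on y cannot skip anything, so two files are read. With the Z-ordered layout, the four matching rows sit in one file and the other three are skipped.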
Databricks Data Insight has begun providing real-time big data analysis and decision-making capabilities across industries. In the financial industry, companies need to use data and machine learning (ML) to iterate quickly on consumer mobile apps and attract more customers. Databricks notebook data sharing and the unified stream and batch data architecture let customers process and identify streaming and batch data for millions of users through a unified interface. Customer app engagement increased 4.5 times, data processing time fell from 6 hours to 6 seconds, and a data lake replaced the original 14+ databases, greatly improving efficiency.
In the new retail industry, enterprises need real-time supply chain data collection and faster data processing to support real-time decisions (detecting problems quickly and reducing economic losses). After building a real-time data warehouse with Databricks, data latency dropped from 2 hours to 15 seconds. Because the data link was streamlined, the amount of business code shrank as well: Python code went from 565 lines to 317 lines, and YML configuration from 252 lines to 23 lines.
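The figures in this case are easier to compare as ratios. The short calculation below derives them from the numbers quoted above; the helper name `reduction_pct` is only for this example.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`, rounded to one decimal place."""
    return round((before - after) / before * 100, 1)


python_cut = reduction_pct(565, 317)   # Python lines of code
yml_cut = reduction_pct(252, 23)       # YML configuration lines
latency_speedup = (2 * 60 * 60) / 15   # 2 hours versus 15 seconds

print(python_cut, yml_cut, latency_speedup)
```

That is roughly 44% less Python code, about 91% less YML configuration, and data latency lower by a factor of 480.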
Real-time big data analysis and decision-making is a hot topic today. Enterprises want technology to respond to business needs more quickly, and Alibaba Cloud hopes that its productized capabilities will help enterprises use data faster and better and respond to their data needs in time.
This article is original content from Alibaba Cloud and may not be reproduced without permission.
This article was created by [Alibaba cloud technology blog]. Please include a link to the original when reposting. Thank you.