On October 24, the fourth Yiguan Shuke OLAP Algorithm Contest officially came to a close. The contest was sponsored by UCloud and supported by technical media including SegmentFault, InfoQ and CSDN. Over more than 40 days of registration, trial runs and the official competition, the Krypton team, the Volcanic Soy Sauce team and the Modest team stood out from the field to take the top three places, winning cash prizes of 60,000, 30,000 and 10,000 yuan respectively. The award ceremony was held on October 24 at the Developer Day of the 2020 Yiguan A10 Data Intelligence Summit, where Yiguan CTO Guo Wei presented the awards to the winners.
Afterwards, the champion, the Krypton team, shared its experience of the contest. Last year the Krypton team won the third Yiguan OLAP Algorithm Contest by processing 800 million behavior records accurately in 300 milliseconds. This year, facing a different problem and different rules, the team was just as strong and took the crown again!
Understanding the contest problem
This year's contest is modeled on the user operations scenario of an app. The raw data is user behavior data at the billion-row scale: for example, a user visits the app and performs an action such as viewing a product or making a purchase. The platform also holds data on 50 million users. This means that user attributes and user segments have to be joined with a sequence of around a billion events, which is a real challenge in such a resource-limited environment.
The official competition ran on only three machines with 8 cores and 16 GB of memory each, and queries had to finish within a few seconds or even a few hundred milliseconds, which is a major challenge for the algorithm itself.
Difficulties of this year's contest
In this year's contest, the participants received their servers and the official competition data 3 days in advance, which looks like plenty of time for preprocessing. However, 15 minutes before the competition started, they received another 1 million rows of incremental data, and that window is too short to precompute over the full data set or to build a cube.
The theme of the fourth OLAP Algorithm Contest is event analysis. The problem requires joining the event table with the profile table, and with 1,000,000,000 rows against 50,000,000 rows the join is very expensive. In addition, the problem tests how contestants compute metrics such as distinct counts and medians in a distributed way, which is more complicated.
Problem-solving ideas and techniques
The Krypton team chose ClickHouse plus the Krypton real-time analysis platform for the contest. The data is partitioned by user ID and by date respectively, and the appropriate partitioning is then chosen for each type of question when running the computation. For storage, the team used columnar storage + low-cardinality optimization + data compression; for computation, a shared-nothing MPP architecture + CPU instruction set optimization + data warm-up.
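To see why partitioning by user ID can help with de-duplicated counts, consider the minimal Python sketch below. It is an illustration under assumptions, not the Krypton team's code: the names `shard_of` and `distinct_users_sharded`, the modulo routing rule and the sample events are all hypothetical, and the 3 shards simply mirror the three contest machines. When every event of a user lands on the same shard, each shard can count its distinct users locally and the per-shard counts can be added up.

```python
from collections import defaultdict

NUM_SHARDS = 3  # assumption: one shard per contest machine


def shard_of(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Route a user to a shard; all events of one user share a shard."""
    return user_id % num_shards


def distinct_users_sharded(events, num_shards: int = NUM_SHARDS) -> int:
    """Count distinct users by summing independent per-shard counts."""
    shards = defaultdict(set)
    for user_id, _event_name in events:
        shards[shard_of(user_id, num_shards)].add(user_id)
    # No user appears on two shards, so the local counts can be added
    # without shipping the user sets to a coordinator.
    return sum(len(users) for users in shards.values())


# Hypothetical sample data: (user_id, event_name)
events = [(1, "view"), (2, "view"), (1, "buy"), (4, "view"), (5, "buy"), (2, "buy")]
total = distinct_users_sharded(events)
```

With date-based partitioning the same user can appear in several partitions, so the counts would no longer be additive; that is one plausible reason to keep both layouts and pick one per question type.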
Joining 1 billion rows against 50 million rows is a relatively large join, so on three fairly low-spec machines memory and disk have to be used very carefully. The Krypton team's solution is:
First, an offline job merges the historical data into a wide table and sets aside the event rows that could not be matched to a profile;
Then, once the incremental data (1 million events and 50,000 profiles) arrives, the incremental events and the unmatched historical events are joined with the full profile table. The join is now only a few million rows against 50 million, which greatly reduces memory pressure;
Finally, multidimensional queries only need to run against the preprocessed single table, which is exactly what ClickHouse is good at.
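The three steps above can be sketched in a few lines of Python. This is a toy, in-memory illustration under assumptions, not the team's ClickHouse pipeline: the helper `join_events`, the field names `user_id`, `event` and `city`, and all sample rows are made up for the example. It also assumes that incremental profiles only add new users and never change the attributes of users already in the wide table.

```python
def join_events(events, profiles):
    """Split events into rows enriched with profile attributes and unmatched rows."""
    matched, unmatched = [], []
    for event in events:
        profile = profiles.get(event["user_id"])
        if profile is None:
            unmatched.append(event)
        else:
            matched.append({**event, **profile})
    return matched, unmatched


# Step 1 (offline): build the wide table from historical data.
historical_profiles = {1: {"city": "Beijing"}, 2: {"city": "Shanghai"}}
historical_events = [
    {"user_id": 1, "event": "view"},
    {"user_id": 2, "event": "buy"},
    {"user_id": 3, "event": "view"},  # no profile yet, so it is set aside
]
wide_table, pending = join_events(historical_events, historical_profiles)

# Step 2 (when the increment arrives): join only the small side,
# i.e. incremental events plus pending events, against the full profiles.
incremental_profiles = {3: {"city": "Shenzhen"}}
incremental_events = [{"user_id": 2, "event": "view"}, {"user_id": 3, "event": "buy"}]
full_profiles = {**historical_profiles, **incremental_profiles}
new_rows, still_unmatched = join_events(incremental_events + pending, full_profiles)
wide_table.extend(new_rows)

# Step 3 (query time): only the single wide table is scanned.
purchases_by_city = {}
for row in wide_table:
    if row["event"] == "buy":
        purchases_by_city[row["city"]] = purchases_by_city.get(row["city"], 0) + 1
```

The point of the design is that the expensive join over the historical events is paid once, offline, while the time-critical step touches only the rows that could not have been joined earlier.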
Should a contest solution go for ultimate speed, or for flexibility and the ability to roll back? In its talk, the Krypton team mentioned that LZ4 decompression performance was the main bottleneck; that computing subtotals and totals wasted one full scan, which the -Resample combinator could be used to merge into a single scan; and that pre-sorting or pre-grouping can save the cost of distinct counting.
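Two of these observations can be illustrated with a short Python sketch. It is a conceptual illustration only and does not show how ClickHouse or its -Resample combinator work internally; the function names and sample data are hypothetical. The first function produces subtotals and the grand total in one scan instead of two, and the second shows why pre-sorted input makes distinct counting cheap: neighbors are compared directly, so no hash set has to be built.

```python
def subtotals_and_total(rows):
    """Compute per-key subtotals and the grand total in a single scan."""
    subtotals = {}
    total = 0
    for key, amount in rows:
        subtotals[key] = subtotals.get(key, 0) + amount
        total += amount
    return subtotals, total


def count_distinct_sorted(sorted_ids):
    """Count distinct values of a sorted iterable in one pass with O(1) extra memory."""
    count = 0
    previous = object()  # sentinel that is not equal to any real ID
    for value in sorted_ids:
        if value != previous:
            count += 1
            previous = value
    return count


# Hypothetical sample data
rows = [("view", 2), ("buy", 5), ("view", 1)]
subtotals, total = subtotals_and_total(rows)

user_ids = sorted([7, 3, 7, 1, 3, 3, 9])
distinct = count_distinct_sorted(user_ids)
```

The sorted variant only pays off when the sort is done ahead of time, which fits the contest setting where players had 3 days of preprocessing before the timed queries.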
With the award ceremony over, the fourth Yiguan OLAP Algorithm Contest has come to a successful conclusion. Since registration opened in August, more than 100 teams signed up, from iQIYI, Bilibili, China Mobile, the Institute of Computing Technology of the Chinese Academy of Sciences, Central South University and elsewhere. In addition to the champion's talk at the summit, the runner-up Volcanic Soy Sauce team and the second runner-up Modest team also submitted their presentation slides, and the source code of the top three teams will be open-sourced for enthusiasts to browse and learn from.
For details, visit the official contest website: http://ds.analysys.cn/portal/2020-index.html
As an important event in China's OLAP algorithm field, the contest has been run by Yiguan for four years with the aim of promoting the exchange of OLAP technology in China, pooling algorithmic expertise, advancing data computing and application capabilities, and making data capabilities accessible to everyone. We look forward to more teams joining the contest next year!
This article was written by [Yiguan big data]. Please include a link to the original when reposting. Thank you.