
Big Data Development: Five Soul-Searching Questions About Spark

1. Spark computation relies heavily on memory. If we only have 10 GB of memory but need to sort 500 GB of data and write out the result, how do we do it?

① Split the 500 GB of data on disk into 100 chunks of 5 GB each. (Be careful to leave some space for the system!)

② Read each 5 GB chunk into memory in turn and sort it with quicksort.

③ Write the sorted data (still 5 GB) back to disk.

④ Repeat 100 times. Now all 100 chunks are individually sorted; the remaining work is to merge them.

⑤ From each of the 100 chunks, read 5 GB / 100 = 0.05 GB into memory (100 input buffers).

⑥ Perform a 100-way merge and stage the merged results in a 5 GB in-memory output buffer. Whenever the output buffer fills up to 5 GB, append it to the final file on disk and clear it; whenever one of the 100 input buffers is exhausted, read the next 0.05 GB from the corresponding chunk, until everything has been processed. (A minimal sketch of this merge follows below.)
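
The merge phase (steps ⑤ and ⑥) is the part worth spelling out. Below is a minimal standalone sketch of the k-way merge in Scala (not Spark's own ExternalSorter), assuming the 100 chunks have already been sorted and written back to disk as text files of longs, one value per line; the file format and the use of one record at a time instead of real 0.05 GB / 5 GB buffers are simplifying assumptions.

import java.io.PrintWriter
import scala.collection.mutable
import scala.io.Source

object ExternalMergeSketch {
  // Merge the already-sorted chunk files into a single sorted output file.
  def kWayMerge(chunkPaths: Seq[String], outputPath: String): Unit = {
    // One lazy iterator per chunk stands in for the 100 input buffers of 0.05 GB each.
    val chunkIters = chunkPaths.map(p => Source.fromFile(p).getLines().map(_.toLong))

    // Min-heap of (head value, chunk index); PriorityQueue is a max-heap, so reverse the ordering.
    val heap = mutable.PriorityQueue.empty[(Long, Int)](Ordering.by[(Long, Int), Long](_._1).reverse)
    chunkIters.zipWithIndex.foreach { case (it, i) => if (it.hasNext) heap.enqueue((it.next(), i)) }

    val out = new PrintWriter(outputPath) // stands in for the 5 GB output buffer plus the final file
    while (heap.nonEmpty) {
      val (smallest, i) = heap.dequeue()
      out.println(smallest)
      // Refill from the chunk we just consumed (the "read the next 0.05 GB from that block" step).
      if (chunkIters(i).hasNext) heap.enqueue((chunkIters(i).next(), i))
    }
    out.close()
  }
}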

2. The difference between countByValue and countByKey

First, look at the source code:

// PairRDDFunctions.scala
def countByKey(): Map[K, Long] = self.withScope {
  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}

// RDD.scala
def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
  map(value => (value, null)).countByKey()
} 

countByValue (RDD.scala)

  • Works on an ordinary RDD

  • Its implementation calls countByKey

countByKey (PairRDDFunctions.scala)

  • Works on a pair RDD

  • Counts the occurrences of each key

  • The result is collected to the driver, so use it with caution when there are many distinct keys

Questions:

  • Can countByKey be used on an ordinary RDD?

  • Can countByValue be used on a pair RDD?

val rdd1: RDD[Int] = sc.makeRDD(1 to 10)
val rdd2: RDD[(Int, Int)] = sc.makeRDD((1 to 10).toList.zipWithIndex)

val result1 = rdd1.countByValue() // OK
val result2 = rdd1.countByKey()   // compile error: countByKey needs a pair RDD

val result3 = rdd2.countByValue() // OK: each (key, value) pair is counted as a whole value
val result4 = rdd2.countByKey()   // OK

3. When does a join between two RDDs shuffle, and when does it not?

Join performance is an important benchmark for any data system. For Spark, testing a join is really testing the shuffle: shuffled data has to travel through disk and network, so the less data is shuffled, the better the performance, and sometimes a program can avoid the shuffle altogether. So under what circumstances does a join shuffle, and under what circumstances does it not?

3.1 Broadcast join

Broadcast join is easy to understand. Besides implementing it yourself, Spark SQL already provides it by default: the small table is simply distributed to all executors. The controlling parameter is spark.sql.autoBroadcastJoinThreshold, with a default of 10 MB; a table smaller than this threshold is automatically joined with a broadcast join. A sketch follows below.
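
As a rough Spark SQL sketch (the DataFrames, their sizes, and the app name are made up for illustration): broadcast(smallDf) is the explicit hint, while the threshold setting controls the automatic case.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-demo").master("local[*]").getOrCreate()

// Tables smaller than this threshold are broadcast automatically (default 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

val bigDf   = spark.range(0L, 1000000L).withColumnRenamed("id", "k")
val smallDf = spark.range(0L, 100L).withColumnRenamed("id", "k")

// Force a broadcast explicitly with the hint, regardless of the threshold.
val joined = bigDf.join(broadcast(smallDf), "k")
joined.explain() // the physical plan shows BroadcastHashJoin rather than SortMergeJoin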

3.2 Bucket join

The RDD approach here is similar to the bucketed-table approach; the difference is that the latter requires writing bucketed tables first, so this section focuses on the RDD way. The principle: if two RDDs are partitioned in advance with the same partitioning scheme so that their partition layouts line up, we can do a bucket join. There is no ready-made operator for this kind of join; you have to build the pre-partitioning into your program yourself. For bucketed-table joins, see "ByteDance's Core Optimization Practice on Spark SQL". Here is an example.

rdd1 and rdd2 are both pair RDDs

rdd1 and rdd2 contain exactly the same data

Cases where a shuffle must happen: the partition counts differ (5 vs 6), or the data is not co-partitioned by key (no common partitioner), for example:

rdd1 => 5 partitions
rdd2 => 6 partitions

rdd1 => 5 partitions => (1,0), (2,0) || (1,0), (2,0) || (1,0), (2,0) || (1,0), (2,0), (1,0) || (2,0), (1,0), (2,0)
rdd2 => 5 partitions => (1,0), (2,0) || (1,0), (2,0) || (1,0), (2,0) || (1,0), (2,0), (1,0) || (2,0), (1,0), (2,0)

Case where no shuffle is needed: both RDDs have been pre-partitioned with the same partitioner, so each key sits in the same partition index on both sides:

rdd1 => 5 partitions => (1,0), (1,0), (1,0), (1,0), (1,0) || (2,0), (2,0), (2,0), (2,0), (2,0), (2,0), (2,0) || empty || empty || empty
rdd2 => 5 partitions => (1,0), (1,0), (1,0), (1,0), (1,0) || (2,0), (2,0), (2,0), (2,0), (2,0), (2,0), (2,0) || empty || empty || empty

In general, for any operator that would normally shuffle, if the data has been partitioned in advance with partitionBy (and the partitioners match), in many cases no shuffle is needed, as the sketch below shows.
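
As a rough sketch (assuming an existing SparkContext named sc), pre-partitioning both pair RDDs with the same HashPartitioner lets the join run without an extra shuffle stage:

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(5)

// Co-partition both sides with the same partitioner and cache the partitioned layout.
val left  = sc.makeRDD((1 to 10).map(i => (i, 0))).partitionBy(partitioner).cache()
val right = sc.makeRDD((1 to 10).map(i => (i, 1))).partitionBy(partitioner).cache()

// Both sides carry the same partitioner, so this join needs no further shuffle.
val joined = left.join(right)

// Joining RDDs with different partition counts (say 5 vs 6), or with no partitioner
// at all, would instead shuffle both sides.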

Apart from the two approaches above, a join generally does involve a shuffle. For how Spark implements joins, see: Big Data Development - The Principles of Spark Join.

4. Is it true that a transformation never triggers a job?

There is one operator that is an exception: sortByKey. Under the hood it runs a sampling algorithm (reservoir sampling) and then builds a RangePartitioner from the sampled keys, so from the job perspective you will see two jobs: one for the sampling, plus the one triggered by the actual action. Remember the following chain (a quick demo follows it):

sortByKey → reservoir sampling → collect
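
A quick way to see this (assuming an existing SparkContext named sc): no action has been called after the second line below, yet the Spark UI already shows a sampling job, because building the RangePartitioner collects a sample of the keys.

import scala.util.Random

val pairs  = sc.parallelize(1 to 1000).map(i => (Random.nextInt(100), i))
val sorted = pairs.sortByKey() // submits a sampling job (reservoir sampling + collect)
sorted.count()                 // the real action submits a second job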

5. How are broadcast variables designed?

As we all know, a broadcast variable puts a copy of the data on every executor, and the broadcast data always starts out on the driver. What does that mean? If the table to be broadcast lives in a Hive table, it is stored across multiple blocks, which are read by multiple executors; the data is first pulled to the driver and broadcast from there. It is not pushed to everyone up front: an executor fetches the data the first time it is about to use it, and the transfer uses a BitTorrent-style protocol. What is a BitTorrent-style protocol? It is peer-to-peer transfer over the cluster: nodes pull the pieces they need from nearby peers, and every downloader is also an uploader, so not every task (executor) has to pull the data from the driver, which greatly reduces the pressure on the driver. In addition, in earlier versions of Spark the broadcast data was handled at the task level; now a shared lock is used so that all tasks on the same executor share one copy of the data.
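
A minimal usage sketch (assuming an existing SparkContext named sc): the lookup map is materialized on the driver, each executor fetches one copy via the torrent-style broadcast the first time a task reads it, and all tasks on that executor then share it.

val lookup = Map(1 -> "one", 2 -> "two", 3 -> "three")
val bc = sc.broadcast(lookup)                  // driver-side value wrapped in a Broadcast handle

val resolved = sc.parallelize(1 to 3)
  .map(i => bc.value.getOrElse(i, "unknown"))  // bc.value triggers the executor-side fetch once
  .collect()

bc.unpersist()                                 // release the cached copies on the executors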

References

https://juejin.cn/post/6844903989557854216

https://www.jianshu.com/p/6bf887bf52b2


