5.3.2 双 `Value` 类型交互

这里的"双 Value 类型交互"是指的两个 RDD[V] 进行交互.

1. `union(otherDataset)`

作用：求并集. 对源 RDD 和参数 RDD 求并集后返回一个新的 RDD

案例

需求: 创建两个RDD，求并集

scala> val rdd1 = sc.parallelize(1 to 6)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(4 to 10)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd1.union(rdd2)
res0: org.apache.spark.rdd.RDD[Int] = UnionRDD[4] at union at <console>:29

scala> res0.collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9, 10)

注意:

union和++是等价的

2. `subtract (otherDataset)`

作用: 计算差集. 从原 RDD 中减去原 RDD 和 otherDataset 中的共同的部分.

scala> rdd1.subtract(rdd2).collect
res4: Array[Int] = Array(2, 1, 3)

scala> rdd2.subtract(rdd1).collect
res5: Array[Int] = Array(8, 10, 7, 9)

3. `intersection(otherDataset)`

作用: 计算交集. 对源 RDD 和参数 RDD 求交集后返回一个新的 RDD

scala> rdd1.intersection(rdd2).collect
res8: Array[Int] = Array(4, 6, 5)

4. `cartesian(otherDataset)`

作用: 计算 2 个 RDD 的笛卡尔积. 尽量避免使用

scala> rdd1.cartesian(rdd2).collect
res11: Array[(Int, Int)] = Array((1,4), (1,5), (1,6), (2,4), (2,5), (2,6), (3,4), (3,5), (3,6), (1,7), (1,8), (1,9), (1,10), (2,7), (2,8), (2,9), (2,10), (3,7), (3,8), (3,9), (3,10), (4,4), (4,5), (4,6), (5,4), (5,5), (5,6), (6,4), (6,5), (6,6), (4,7), (4,8), (4,9), (4,10), (5,7), (5,8), (5,9), (5,10), (6,7), (6,8), (6,9), (6,10))

5. `zip(otherDataset)`

作用: 拉链操作. 需要注意的是, 在 Spark 中, 两个 RDD 的元素的数量和分区数都必须相同, 否则会抛出异常.(在 scala 中, 两个集合的长度可以不同)

其实本质就是要求的每个分区的元素的数量相同.

scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(11 to 15)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[35] at parallelize at <console>:24

scala> rdd1.zip(rdd2).collect
res17: Array[(Int, Int)] = Array((1,11), (2,12), (3,13), (4,14), (5,15))

5.3.2 双 Value 类型交互

5.3.2 双 `Value` 类型交互

1. `union(otherDataset)`

案例

注意:

2. `subtract (otherDataset)`

3. `intersection(otherDataset)`

4. `cartesian(otherDataset)`

5. `zip(otherDataset)`

results matching ""

No results matching ""

5.3.2 双 Value 类型交互

1. union(otherDataset)

案例

注意:

2. subtract (otherDataset)

3. intersection(otherDataset)

4. cartesian(otherDataset)

5. zip(otherDataset)

results matching ""

No results matching ""

5.3.2 双 `Value` 类型交互

1. `union(otherDataset)`

2. `subtract (otherDataset)`

3. `intersection(otherDataset)`

4. `cartesian(otherDataset)`

5. `zip(otherDataset)`