Use mapPartitionsWithIndex, as shown below:
// Create (1, 1), (2, 2), ..., (100, 100) dataset
// and partition by key so we know what to expect
val rdd = sc.parallelize((1 to 100).map(i => (i, i)), 16)
  .partitionBy(new org.apache.spark.HashPartitioner(8))

val zeroth = rdd
  // If partition number is not zero ignore data
  .mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter else Iterator())

// Check if we get expected results 8, 16, ..., 96
assert(zeroth.keys.map(_ % 8 == 0).reduce(_ && _) && zeroth.count == 12)
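As an alternative (not from the original post, so treat it as a sketch): `SparkContext.runJob` accepts a list of partition indices and only evaluates those partitions, which avoids touching the other partitions at all. A minimal example, assuming the same `rdd` as above:

```scala
// Fetch only partition 0 by running a job on that single partition.
// runJob(rdd, func, partitions) computes just the listed partitions,
// so the remaining 7 partitions are never evaluated.
val zerothData: Array[(Int, Int)] =
  sc.runJob(rdd, (iter: Iterator[(Int, Int)]) => iter.toArray, Seq(0)).head

// Keys landing in partition 0 of HashPartitioner(8) are multiples of 8
assert(zerothData.map(_._1).forall(_ % 8 == 0) && zerothData.length == 12)
```

Note that `runJob` collects the partition's contents to the driver, so it is only appropriate when a single partition comfortably fits in driver memory.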