Spark Development Notes (2017-05-04)

xiaoxiao 2021-02-28

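You cannot operate on one RDD from inside an operation on another RDD. The intent here is to filter dicRdd once for every value in valuesRdd. In a distributed system, however, each RDD is split into partitions that are shipped to JVMs on different machines, and each JVM holds a different slice of the data. From the point of view of a single JVM, there is no way to work on another RDD as a whole from within its own slice of data.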

valuesRdd.foreach { i =>
  val samevalueKeys = dicRdd.filter { d => d._2.equals(i) }.map(d => d._1).collect()
}

Error message:

Caused by: org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.

foreach performs an operation on every element and returns nothing. filter tests every element, keeping those that satisfy the condition and dropping the rest. Calling fileRdd.foreach does not change fileRdd itself in any way.
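One common way around the nested-RDD error is to bring the lookup data to the driver first and ship it to the executors as a broadcast variable. The sketch below is only an illustration under assumptions not stated in the notes: both RDDs are assumed to hold strings, dicRdd is assumed to be small enough to collect on the driver, and the names NestedRddFix and keysByValue are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object NestedRddFix {
  // For each value in valuesRdd, look up the keys of dicRdd that carry that value.
  // dicRdd is never referenced inside a closure that runs on the executors.
  def keysByValue(valuesRdd: RDD[String],
                  dicRdd: RDD[(String, String)]): RDD[(String, Seq[String])] = {
    val sc = valuesRdd.sparkContext
    // Assumption: dicRdd fits in driver memory, so it can be collected.
    val valueToKeys: Map[String, Seq[String]] =
      dicRdd.collect().toSeq
        .groupBy(_._2)
        .map { case (value, pairs) => value -> pairs.map(_._1) }
    // Broadcast the plain Scala map; executors read it through dict.value.
    val dict = sc.broadcast(valueToKeys)
    valuesRdd.map(v => (v, dict.value.getOrElse(v, Seq.empty[String])))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-rdd-fix")
      .getOrCreate()
    val sc = spark.sparkContext
    // Made-up sample data, for illustration only.
    val valuesRdd = sc.parallelize(Seq("x", "y"))
    val dicRdd = sc.parallelize(Seq(("a", "x"), ("b", "y"), ("c", "x")))
    keysByValue(valuesRdd, dicRdd).collect().foreach(println)
    spark.stop()
  }
}
```

If dicRdd is too large to collect, the usual alternative is a join keyed on the value, for example swapping dicRdd with map(_.swap) and joining it with valuesRdd keyed by itself, so that both datasets stay distributed.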

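Functional programming imposes a constraint here: a value declared with val is immutable and never changes. An RDD transformation also returns a new RDD rather than modifying the original, so the result of the map operation on filedata has to be bound to a new variable.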

val new_rdd = filedata.map( **************************** )
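As a small illustration of this point, the sketch below applies map to an RDD and then prints both the original and the result. The body of the map is elided in the notes, so the token-counting function used here is a hypothetical stand-in, as are the sample data and the names MapReturnsNewRdd and tokenCounts.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object MapReturnsNewRdd {
  // Hypothetical transformation standing in for the elided map body:
  // counts the whitespace-separated tokens on each line.
  def tokenCounts(filedata: RDD[String]): RDD[Int] =
    filedata.map(line => line.split("\\s+").length)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-returns-new-rdd")
      .getOrCreate()
    // Made-up sample data, for illustration only.
    val filedata = spark.sparkContext.parallelize(Seq("a b", "c d e"))
    // map is a transformation: it returns a new RDD and leaves filedata as it was.
    val new_rdd = tokenCounts(filedata)
    println(filedata.collect().mkString(" | ")) // still the original lines
    println(new_rdd.collect().mkString(" | "))  // the per-line token counts
    spark.stop()
  }
}
```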

If inspecting a file fails with "No such file or directory", check whether the file name ends with a trailing space.

[myhadoop@sunlight100 ~]$ hadoop fs -du /test/spark/hyp/nulllabelurl/test/baike
du: `/test/spark/hyp/nulllabelurl/test/baike': No such file or directory
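A trailing space is hard to spot in terminal output, so one option is to list the parent directory programmatically and print each name between brackets. The sketch below uses the Hadoop FileSystem API for that. It assumes the default Configuration on the classpath points at the right cluster; the parent path is taken from the command above, and the names TrailingSpaceCheck, hasEdgeWhitespace and suspiciousNames are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object TrailingSpaceCheck {
  // True when the name starts or ends with whitespace.
  def hasEdgeWhitespace(name: String): Boolean = name != name.trim

  // Lists the entries directly under `parent` and keeps the names
  // that start or end with whitespace.
  def suspiciousNames(fs: FileSystem, parent: Path): Seq[String] =
    fs.listStatus(parent).toSeq
      .map(_.getPath.getName)
      .filter(hasEdgeWhitespace)

  def main(args: Array[String]): Unit = {
    // Assumption: the default configuration resolves to the intended file system.
    val fs = FileSystem.get(new Configuration())
    val parent = new Path("/test/spark/hyp/nulllabelurl/test")
    // Brackets make leading or trailing spaces visible in the output.
    suspiciousNames(fs, parent).foreach(name => println(s"[$name]"))
  }
}
```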

