将spark的wordcount任务提交到Yarn上,然后计算结果输出到hdfs上。
参考:https://blog.csdn.net/u010886217/article/details/82795047
hdfs和yarn
注意:--conf spark.app.coalesce=1这个参数为分区数,决定最终写入hdfs上文件的数量。
结果:
19/11/14 17:18:09 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null) 19/11/14 17:18:09 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> hadoop, PROXY_URI_BASES -> http://hadoop:8088/proxy/application_1573707031333_0001), /proxy/application_1573707031333_0001 19/11/14 17:18:09 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter 19/11/14 17:18:10 INFO Client: Application report for application_1573707031333_0001 (state: RUNNING) 19/11/14 17:18:10 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: 192.168.0.8 ApplicationMaster RPC port: 0 queue: root.root start time: 1573723074065 final status: UNDEFINED tracking URL: http://hadoop:8088/proxy/application_1573707031333_0001/ user: root 19/11/14 17:18:10 INFO YarnClientSchedulerBackend: Application application_1573707031333_0001 has started running. 19/11/14 17:18:10 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39142. 19/11/14 17:18:10 INFO NettyBlockTransferService: Server created on 192.168.0.8:39142 19/11/14 17:18:10 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 19/11/14 17:18:10 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.8, 39142, None) 19/11/14 17:18:10 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.8:39142 with 413.9 MB RAM, BlockManagerId(driver, 192.168.0.8, 39142, None) 19/11/14 17:18:10 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.8, 39142, None) 19/11/14 17:18:10 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.8, 39142, None) 19/11/14 17:18:11 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms) 19/11/14 17:18:12 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 264.0 KB, free 413.7 MB) 19/11/14 17:18:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.3 KB, free 413.6 MB) 19/11/14 17:18:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.8:39142 (size: 21.3 KB, free: 413.9 MB) 19/11/14 17:18:13 INFO SparkContext: Created broadcast 0 from textFile at Wordcount_product.scala:34 19/11/14 17:18:13 INFO FileInputFormat: Total input paths to process : 1 19/11/14 17:18:14 INFO SparkContext: Starting job: sortByKey at Wordcount_product.scala:36 19/11/14 17:18:14 INFO DAGScheduler: Registering RDD 3 (map at Wordcount_product.scala:35) 19/11/14 17:18:14 INFO DAGScheduler: Got job 0 (sortByKey at Wordcount_product.scala:36) with 2 output partitions 19/11/14 17:18:14 INFO DAGScheduler: Final stage: ResultStage 1 (sortByKey at Wordcount_product.scala:36) 19/11/14 17:18:14 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 19/11/14 17:18:14 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0) 19/11/14 17:18:14 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at Wordcount_product.scala:35), which has no missing parents 19/11/14 17:18:15 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.8 KB, free 413.6 MB) 19/11/14 17:18:15 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.8 KB, free 413.6 MB) 19/11/14 17:18:15 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.0.8:39142 (size: 2.8 KB, free: 413.9 MB) 19/11/14 17:18:15 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996 19/11/14 17:18:15 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at Wordcount_product.scala:35) 19/11/14 17:18:15 INFO YarnScheduler: Adding task set 0.0 with 2 tasks 19/11/14 17:18:19 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.0.8:51107) with ID 1 19/11/14 17:18:19 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, hadoop, executor 1, partition 0, PROCESS_LOCAL, 6079 bytes) 19/11/14 17:18:19 INFO BlockManagerMasterEndpoint: Registering block manager hadoop:38675 with 413.9 MB RAM, BlockManagerId(1, hadoop, 38675, None) 19/11/14 17:18:21 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on hadoop:38675 (size: 2.8 KB, free: 413.9 MB) 19/11/14 17:18:21 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop:38675 (size: 21.3 KB, free: 413.9 MB) 19/11/14 17:18:23 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, hadoop, executor 1, partition 1, PROCESS_LOCAL, 6079 bytes) 19/11/14 17:18:23 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 4730 ms on hadoop (executor 1) (1/2) 19/11/14 17:18:24 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 309 ms on hadoop (executor 1) (2/2) 19/11/14 17:18:24 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 19/11/14 17:18:24 INFO DAGScheduler: ShuffleMapStage 0 (map at Wordcount_product.scala:35) finished in 8.982 s 19/11/14 17:18:24 INFO DAGScheduler: looking for newly runnable stages 19/11/14 17:18:24 INFO DAGScheduler: running: Set() 19/11/14 17:18:24 INFO DAGScheduler: waiting: Set(ResultStage 1) 19/11/14 17:18:24 INFO DAGScheduler: failed: Set() 19/11/14 17:18:24 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[7] at sortByKey at Wordcount_product.scala:36), which has no missing parents 19/11/14 17:18:24 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.1 KB, free 413.6 MB) 19/11/14 17:18:24 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.4 KB, free 413.6 MB) 19/11/14 17:18:24 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.0.8:39142 (size: 2.4 KB, free: 413.9 MB) 19/11/14 17:18:24 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:996 19/11/14 17:18:24 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[7] at sortByKey at Wordcount_product.scala:36) 19/11/14 17:18:24 INFO YarnScheduler: Adding task set 1.0 with 2 tasks 19/11/14 17:18:24 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, hadoop, executor 1, partition 0, NODE_LOCAL, 5825 bytes) 19/11/14 17:18:24 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on hadoop:38675 (size: 2.4 KB, free: 413.9 MB) 19/11/14 17:18:24 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 192.168.0.8:51107 19/11/14 17:18:24 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 148 bytes 19/11/14 17:18:24 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, hadoop, executor 1, partition 1, NODE_LOCAL, 5825 bytes) 19/11/14 17:18:24 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 404 ms on hadoop (executor 1) (1/2) 19/11/14 17:18:24 INFO DAGScheduler: ResultStage 1 (sortByKey at Wordcount_product.scala:36) finished in 0.512 s 19/11/14 17:18:24 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 138 ms on hadoop (executor 1) (2/2) 19/11/14 17:18:24 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool 19/11/14 17:18:24 INFO DAGScheduler: Job 0 finished: sortByKey at Wordcount_product.scala:36, took 10.570525 s 19/11/14 17:18:25 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 192.168.0.8:39142 in memory (size: 2.4 KB, free: 413.9 MB) 19/11/14 17:18:25 INFO BlockManagerInfo: Removed broadcast_2_piece0 on hadoop:38675 in memory (size: 2.4 KB, free: 413.9 MB) 19/11/14 17:18:25 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 19/11/14 17:18:25 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 19/11/14 17:18:25 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 19/11/14 17:18:25 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 19/11/14 17:18:25 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 19/11/14 17:18:25 INFO FileOutputCommitter: File Output Committer Algorithm version is 1 19/11/14 17:18:25 INFO SparkContext: Starting job: saveAsTextFile at Wordcount_product.scala:46 19/11/14 17:18:25 INFO DAGScheduler: Registering RDD 5 (map at Wordcount_product.scala:36) 19/11/14 17:18:25 INFO DAGScheduler: Got job 1 (saveAsTextFile at Wordcount_product.scala:46) with 1 output partitions 19/11/14 17:18:25 INFO DAGScheduler: Final stage: ResultStage 4 (saveAsTextFile at Wordcount_product.scala:46) 19/11/14 17:18:25 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 3) 19/11/14 17:18:25 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 3) 19/11/14 17:18:25 INFO DAGScheduler: Submitting ShuffleMapStage 3 (MapPartitionsRDD[5] at map at Wordcount_product.scala:36), which has no missing parents 19/11/14 17:18:25 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 4.0 KB, free 413.6 MB) 19/11/14 17:18:25 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.4 KB, free 413.6 MB) 19/11/14 17:18:25 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.0.8:39142 (size: 2.4 KB, free: 413.9 MB) 19/11/14 17:18:25 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:996 19/11/14 17:18:25 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 3 (MapPartitionsRDD[5] at map at Wordcount_product.scala:36) 19/11/14 17:18:25 INFO YarnScheduler: Adding task set 3.0 with 2 tasks 19/11/14 17:18:25 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 4, hadoop, executor 1, partition 0, NODE_LOCAL, 5820 bytes) 19/11/14 17:18:25 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on hadoop:38675 (size: 2.4 KB, free: 413.9 MB) 19/11/14 17:18:25 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 5, hadoop, executor 1, partition 1, NODE_LOCAL, 5820 bytes) 19/11/14 17:18:25 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 4) in 158 ms on hadoop (executor 1) (1/2) 19/11/14 17:18:25 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 5) in 137 ms on hadoop (executor 1) (2/2) 19/11/14 17:18:25 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool 19/11/14 17:18:25 INFO DAGScheduler: ShuffleMapStage 3 (map at Wordcount_product.scala:36) finished in 0.290 s 19/11/14 17:18:25 INFO DAGScheduler: looking for newly runnable stages 19/11/14 17:18:25 INFO DAGScheduler: running: Set() 19/11/14 17:18:25 INFO DAGScheduler: waiting: Set(ResultStage 4) 19/11/14 17:18:25 INFO DAGScheduler: failed: Set() 19/11/14 17:18:25 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[11] at saveAsTextFile at Wordcount_product.scala:46), which has no missing parents 19/11/14 17:18:25 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 69.3 KB, free 413.6 MB) 19/11/14 17:18:25 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 25.4 KB, free 413.5 MB) 19/11/14 17:18:25 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.0.8:39142 (size: 25.4 KB, free: 413.9 MB) 19/11/14 17:18:25 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:996 19/11/14 17:18:25 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[11] at saveAsTextFile at Wordcount_product.scala:46) 19/11/14 17:18:25 INFO YarnScheduler: Adding task set 4.0 with 1 tasks 19/11/14 17:18:25 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 6, hadoop, executor 1, partition 0, NODE_LOCAL, 6125 bytes) 19/11/14 17:18:25 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on hadoop:38675 (size: 25.4 KB, free: 413.9 MB) 19/11/14 17:18:26 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 192.168.0.8:51107 19/11/14 17:18:26 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 145 bytes 19/11/14 17:18:26 INFO DAGScheduler: ResultStage 4 (saveAsTextFile at Wordcount_product.scala:46) finished in 1.070 s 19/11/14 17:18:26 INFO DAGScheduler: Job 1 finished: saveAsTextFile at Wordcount_product.scala:46, took 1.484781 s 19/11/14 17:18:26 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 6) in 1070 ms on hadoop (executor 1) (1/1) 19/11/14 17:18:26 INFO YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool(1)hdfs上结果
(2)yarn的任务
(经测试成功~)