Why do these settings matter at all? Quoting the Spark 2.3.0 official docs: http://spark.apache.org/docs/latest/running-on-yarn.html#preparations
(My understanding: these are the jars Spark needs at runtime. Without the setting, every submission has to upload them into the cache on each node YARN manages, which is tedious and hurts performance. With it set, say to a location on HDFS, nothing has to be uploaded each time; the nodes read from HDFS instead, which is that little bit faster...)
(But then, if HDFS is configured to keep only three replicas, what happens when a Spark job needs 20 nodes to run...?)
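(One mitigation for that, I believe: HDFS replication is set per file, so the shared jars can carry a higher replica count than the cluster default, e.g.

hdfs dfs -setrep -w 10 /spark/jars/spark-2.1.2-jars

where the count 10 is an arbitrary example and the path is the one used later in this post. YARN also keeps the localized copy in each node's cache after the first download, so the replica count should mostly matter for the first wave of containers.)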
Preparations
Running Spark on YARN requires a binary distribution of Spark which is built with YARN support. Binary distributions can be downloaded from the downloads page of the project website. To build Spark yourself, refer to Building Spark.
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
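For reference, a minimal setup sketch along the lines the docs describe (the xxxxxx1:8020 namenode and the /spark/jars path follow the ones used below; spark-libs.jar is a placeholder name of my own):

jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put spark-libs.jar /spark/jars/

and then in spark-defaults.conf, either the archive form or the per-jar form (spark.yarn.jars accepts globs):

spark.yarn.archive hdfs://xxxxxx1:8020/spark/jars/spark-libs.jar
spark.yarn.jars hdfs://xxxxxx1:8020/spark/jars/spark-2.1.2-jars/*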
Without spark.yarn.archive or spark.yarn.jars set:
18/03/16 10:41:55 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/03/16 10:41:56 INFO Client: Uploading resource file:/tmp/spark-b54a2aa7-08e3-4d2f-a80d-001d5d4ed914/__spark_libs__5152489990053873120.zip -> hdfs://xxxxxx1:8020/user/yyyyyy/.sparkStaging/application_1520398813114_0040/__spark_libs__5152489990053873120.zip
18/03/16 10:41:58 INFO Client: Uploading resource file:/tmp/spark-b54a2aa7-08e3-4d2f-a80d-001d5d4ed914/__spark_conf__3105174281975574497.zip -> hdfs://xxxxxx1:8020/user/yyyyyy/.sparkStaging/application_1520398813114_0040/__spark_conf__.zip
With it set to
spark.yarn.archive hdfs://xxxxxx1:8020/spark/jars/spark-2.1.2-jars/
the job fails with an error:
18/03/16 10:38:50 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Can not create a Path from an empty string
18/03/16 10:38:50 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string
Changing it to
spark.yarn.archive hdfs://xxxxxx1:8020/spark/jars/spark-2.1.2-jars
and everything is OK. The only difference was a single trailing "/"???
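A plausible explanation, though this is my guess and not traced through the Spark source: the YARN client derives the localized resource's name from the last path component of spark.yarn.archive; with a trailing "/" that component is the empty string, and Hadoop's Path constructor rejects empty strings with exactly this message. A minimal Scala sketch of that failure mode (needs hadoop-common on the classpath; TrailingSlashDemo is just a name for this demo):

import org.apache.hadoop.fs.Path

object TrailingSlashDemo {
  def main(args: Array[String]): Unit = {
    val withSlash = "hdfs://xxxxxx1:8020/spark/jars/spark-2.1.2-jars/"
    // With a trailing "/", the text after the last '/' is the empty string.
    val lastComponent = withSlash.substring(withSlash.lastIndexOf('/') + 1)
    println(s"last component = '$lastComponent'")  // prints: last component = ''
    // Hadoop refuses to build a Path from it, with the same message as the log:
    // java.lang.IllegalArgumentException: Can not create a Path from an empty string
    new Path(lastComponent)
  }
}

So the fix really is just leaving the trailing slash off, as above.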