Compiling and Modifying the Spark 2.1 Source Code on Windows


The process for compiling the Spark source code on Windows: after modifying the Spark source, it has to be recompiled. Since the Linux VM currently cannot reach the network through a proxy and the company-provided Maven repository cannot be pinged, the source had to be compiled on Windows. The steps are as follows:

1. Download the Spark source from the official site, http://spark.apache.org/downloads.html, choosing the 2.1.0 source release.

2. Import the Spark source project into IDEA (with Maven configured correctly in IDEA) and build the Spark project. Once the build succeeds, move on to compiling.

Problem encountered during the build:

1) The SparkFlumeProtocol class could not be found. The cause is that the Spark Flume module is external, so the class is not available during the build. Open View => Tool Windows => Maven Projects, find the Spark Project External Flume Sink module, right-click it and choose Generate Sources and Update Folders, then run compile on that module from Lifecycle.

3. Compile the Spark source in Git Bash.

Spark has to be compiled in a bash environment; compiling directly under Windows fails with an error that bash commands are not supported:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project spark-core_2.11: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "bash" (in directory "D:\workspace\spark-2.1.0\core"): CreateProcess error=2, The system cannot find the file specified.

In Git Bash, switch to the Spark source directory and set the JVM memory for Maven; compilation uses a lot of memory and will run out if the limit is too small:

export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1000m -XX:ReservedCodeCacheSize=1000m"

Then specify the Hadoop version and start the build:

mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -DskipTests clean package

A long compilation follows; when output like the following appears, the build is complete:

[INFO] Spark Project Parent POM .......................... SUCCESS [3.035s]
[INFO] Spark Project Tags ................................ SUCCESS [5.896s]
[INFO] Spark Project Sketch .............................. SUCCESS [9.240s]
[INFO] Spark Project Networking .......................... SUCCESS [10.402s]
[INFO] Spark Project Shuffle Streaming Service ........... SUCCESS [7.100s]
[INFO] Spark Project Unsafe .............................. SUCCESS [11.549s]
[INFO] Spark Project Launcher ............................ SUCCESS [8.769s]
[INFO] Spark Project Core ................................ SUCCESS [2:46.378s]
[INFO] Spark Project ML Local Library .................... SUCCESS [29.300s]
[INFO] Spark Project GraphX .............................. SUCCESS [36.614s]
[INFO] Spark Project Streaming ........................... SUCCESS [1:05.139s]
[INFO] Spark Project Catalyst ............................ SUCCESS [2:45.713s]
[INFO] Spark Project SQL ................................. SUCCESS [3:32.211s]
[INFO] Spark Project ML Library .......................... SUCCESS [2:21.122s]
[INFO] Spark Project Tools ............................... SUCCESS [6.362s]
[INFO] Spark Project Hive ................................ SUCCESS [2:31.883s]
[INFO] Spark Project REPL ................................ SUCCESS [15.956s]
[INFO] Spark Project YARN Shuffle Service ................ SUCCESS [6.727s]
[INFO] Spark Project YARN ................................ SUCCESS [33.774s]
[INFO] Spark Project Assembly ............................ SUCCESS [6.532s]
[INFO] Spark Project External Flume Sink ................. SUCCESS [14.954s]
[INFO] Spark Project External Flume ...................... SUCCESS [19.958s]
[INFO] Spark Project External Flume Assembly ............. SUCCESS [2.034s]
[INFO] Spark Integration for Kafka 0.8 ................... SUCCESS [26.434s]
[INFO] Spark Project Examples ............................ SUCCESS [33.904s]
[INFO] Spark Project External Kafka Assembly ............. SUCCESS [10.769s]
[INFO] Spark Integration for Kafka 0.10 .................. SUCCESS [26.168s]
[INFO] Spark Integration for Kafka 0.10 Assembly ......... SUCCESS [10.089s]
[INFO] Kafka 0.10 Source for Structured Streaming ........ SUCCESS [27.096s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 21:05.732s
[INFO] Finished at: Fri May 05 10:31:06 CST 2017
[INFO] Final Memory: 135M/1845M
[INFO] ------------------------------------------------------------------------

4. Rebuild the spark-sql module after modifying the spark-sql source

The business requires loading, updating, and deleting data in HBase through Spark SQL, so the spark-sql source code needs to be modified.
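To give an idea of the shape such a change can take, here is a minimal, hypothetical sketch (not the actual modification) of a new logical command added inside the spark-sql source tree. The class name HBaseDeleteCommand and its fields are invented for illustration, and the extended SQL parser that would produce it is not shown.

// Hypothetical sketch only; the class name and fields are invented, this is not the real change.
// In the Spark 2.1 source tree, a new statement is typically backed by a RunnableCommand
// whose run() is invoked on the driver when the statement executes.
package org.apache.spark.sql.execution.command

import org.apache.spark.sql.{Row, SparkSession}

case class HBaseDeleteCommand(tableName: String, rowKey: String) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    // Here the command would open an HBase connection and issue a Delete for rowKey
    // against tableName (see the HBase client sketch after step 5).
    Seq.empty[Row]
  }
}

The extended parser would map a DELETE-style statement to such a command; the update and load paths would follow the same pattern.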

To support load/update/delete operations in spark-sql, part of the spark-sql source has to be modified. After the modification, the whole Spark project does not need to be recompiled; it is enough to repackage the spark-sql module with Maven. In the IDEA terminal, go to spark-sql => core and run:

mvn clean install -DskipTests

The rebuilt spark-sql jar can then be found in the source tree under spark-2.1.0\sql\core\target.

Because the source changes for the Spark-HBase connector reference HBase jars, the HBase dependencies have to be added to the spark-sql module's pom file (a sketch of the client calls these dependencies expose appears after step 5):

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-common</artifactId>
  <version>1.1.2</version>
  <exclusions>
    <exclusion>
      <groupId>asm</groupId>
      <artifactId>asm</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.jboss.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.jruby</groupId>
      <artifactId>jruby-complete</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-server</artifactId>
  <version>1.1.2</version>
  <exclusions>
    <exclusion>
      <groupId>asm</groupId>
      <artifactId>asm</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.jboss.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.jruby</groupId>
      <artifactId>jruby-complete</artifactId>
    </exclusion>
  </exclusions>
</dependency>

5. Replace the jar in the Spark 2 installation and test the modified code

In the Spark 2 installation directory /usr/hdp/2.3.4.0-3485/spark2/jars/, locate spark-sql_2.11-2.1.0.jar, back it up, and replace the original spark-sql jar with the newly built one. Then start spark-sql and test the modified functionality.
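As a rough illustration of what the modified code does against HBase once the replaced jar is in place, the following is a minimal sketch of the HBase 1.1.2 client calls that the hbase-common/hbase-server dependencies added in step 4 make available. The table name, column family, and row key are made-up examples, and it assumes a reachable HBase cluster with hbase-site.xml on the classpath.

// Minimal sketch of the HBase client calls behind the load/update/delete paths.
// All identifiers ("demo_table", "cf", "row1", "col1", "value1") are made-up examples.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseClientSketch {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()          // reads hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("demo_table"))
    try {
      // load/update: write (or overwrite) a cell for a given row key
      val put = new Put(Bytes.toBytes("row1"))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"))
      table.put(put)

      // delete: remove the whole row by key
      table.delete(new Delete(Bytes.toBytes("row1")))
    } finally {
      table.close()
      connection.close()
    }
  }
}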
