Preface: For work reasons this was my first contact with Maven and Hadoop. While learning I read a lot of blog posts, stepped into quite a few pitfalls, and collected some experience, which I am writing down roughly here. Take whatever you find useful.
1. What is Maven
See these two blog posts for details:
Maven introduction: http://www.cnblogs.com/now-fighting/p/4857625.html
Maven architecture: https://www.cnblogs.com/now-fighting/p/4858982.html
2. Hadoop cluster setup
See this series for details: http://www.powerxing.com/install-hadoop/
Note that my environment is CentOS 7, which came with only a JRE and no JDK, so I downloaded a JDK myself. Strictly speaking, a JRE is enough if you only need to run Java programs on that machine rather than compile them, but I installed the JDK anyway.
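If you are unsure whether a machine already has a full JDK, a quick check looks like the sketch below. The yum line is just one possible way to get a JDK on CentOS 7; I actually downloaded the JDK manually instead.

java -version         # works even with a plain JRE
javac -version        # fails if only a JRE is installed
# one option for installing a JDK on CentOS 7 (alternative to a manual download):
sudo yum install -y java-1.8.0-openjdk-devel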
3. Building a Maven project in IntelliJ
Reference: http://blog.csdn.net/qq_32588349/article/details/51461182
My environment:
The Maven project runs on Windows, while the Hadoop cluster runs on CentOS 7, set up as a pseudo-distributed cluster.
My pom.xml is configured as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.wordcount_1</groupId>
    <artifactId>wordcount_1</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <repositories>
        <repository>
            <id>apache</id>
            <url>http://maven.apache.org</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.0</version>
        </dependency>
    </dependencies>
</project>

Hadoop needs the following dependencies: hadoop-core and hadoop-common as the basics; hadoop-hdfs and hadoop-client if you need to read and write HDFS; and hbase-client if you need to read and write HBase.
After configuring pom.xml, write the code. My project follows the standard Maven layout, with the WordCount class in the job package under src/main/java.
The WordCount program:
package job;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: split each input line into tokens and emit (word, 1) for every token
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as combiner): sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that three files need to be placed in the resources directory: hdfs-site.xml, core-site.xml, and log4j.properties.
With these in place you can control the log output format, and the job reads and writes HDFS by default; otherwise it defaults to the local filesystem for input and output. These files come from the etc/hadoop directory of the Hadoop installation.
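As a rough illustration only (not copied verbatim from my cluster), the parts of these files that matter here look roughly like the following. The fs.defaultFS address assumes the same NameNode address used by the hadoop jar command in section 6, and the log4j.properties is just a minimal console logger:

core-site.xml (excerpt):

<configuration>
    <property>
        <!-- makes the job resolve paths against HDFS instead of the local filesystem -->
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.179.128:9000</value>
    </property>
</configuration>

log4j.properties (minimal):

# send Hadoop client logs to the console with a simple pattern
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n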
4. Compiling and running in IntelliJ
Choose Run -> Edit Configurations -> add an Application configuration. Fill in the Main class, then set the Program arguments. These two arguments are the input and output paths, both on HDFS.
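For reference, on my setup the two program arguments are simply the HDFS input and output paths, the same ones that reappear in the hadoop jar command in section 6:

hdfs://192.168.179.128:9000/user/pangmingyu/input hdfs://192.168.179.128:9000/user/pangmingyu/output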
5. Checking the input and output
Input:
[pangmingyu@Centos1 opt]$ hdfs dfs -ls /user/pangmingyu/input/
17/12/21 12:09:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 8 items
-rw-r--r--   1 pangmingyu supergroup       4436 2017-12-18 14:52 /user/pangmingyu/input/capacity-scheduler.xml
-rw-r--r--   1 pangmingyu supergroup        997 2017-12-18 14:52 /user/pangmingyu/input/core-site.xml
-rw-r--r--   1 pangmingyu supergroup       9683 2017-12-18 14:52 /user/pangmingyu/input/hadoop-policy.xml
-rw-r--r--   1 pangmingyu supergroup       1346 2017-12-18 14:52 /user/pangmingyu/input/hdfs-site.xml
-rw-r--r--   1 pangmingyu supergroup        620 2017-12-18 14:52 /user/pangmingyu/input/httpfs-site.xml
-rw-r--r--   1 pangmingyu supergroup       3523 2017-12-18 14:52 /user/pangmingyu/input/kms-acls.xml
-rw-r--r--   1 pangmingyu supergroup       5511 2017-12-18 14:52 /user/pangmingyu/input/kms-site.xml
-rw-r--r--   1 pangmingyu supergroup        690 2017-12-18 14:52 /user/pangmingyu/input/yarn-site.xml
[pangmingyu@Centos1 opt]$

These files were copied into HDFS from elsewhere, for example:
./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input

Output:
[pangmingyu@Centos1 opt]$ hdfs dfs -ls /opt/output
17/12/21 12:11:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 yangzhenyu supergroup          0 2017-12-21 12:02 /opt/output/_SUCCESS
-rw-r--r--   1 yangzhenyu supergroup      10426 2017-12-21 12:02 /opt/output/part-r-00000

6. Packaging and running the jar
File -> Project Structure -> Artifacts -> click "+".
Then fill in the parameters step by step.
After filling everything in, close the dialog and run the build (Build -> Build Artifacts).
Then look in the jar output directory and you will find the jar: wordcount_1.jar.
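As an alternative to the IDE artifact wizard, the jar can also be built on the command line with Maven. A minimal sketch (not the route I took above) adds maven-jar-plugin to the pom so the manifest records the main class:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <!-- lets "hadoop jar" find the entry point without naming the class -->
                        <mainClass>job.WordCount</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>

Running mvn clean package then leaves the jar under target/ (named wordcount_1-1.0-SNAPSHOT.jar with the coordinates above).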
Then copy the jar to the Linux machine and run:
hadoop jar wordcount_1.jar job.WordCount hdfs://192.168.179.128:9000/user/pangmingyu/input hdfs://192.168.179.128:9000/user/pangmingyu/output

Here job.WordCount is the class to run, and job is the package the class belongs to.
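Once the job finishes, the word counts can be inspected with hdfs dfs; for example, assuming the output directory passed as the second argument above:

hdfs dfs -cat hdfs://192.168.179.128:9000/user/pangmingyu/output/part-r-00000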
Finally, here is another well-written blog post on writing WordCount for Hadoop: https://www.polarxiong.com/archives/Hadoop-Intellij结合Maven本地运行和调试MapReduce程序-无需搭载Hadoop和HDFS环境.html