Testing a MapReduce Job Program in Local Mode

A MapReduce job runs in two processing phases: a map phase and a reduce phase. Each phase takes key-value pairs as input and produces key-value pairs as output, and the reducer's input types must match the mapper's output types. The map and reduce functions are listed below. In this example the map phase reads weather (NCDC) records and emits the year as the key, which the framework sorts on, with the temperature reading as the value; the reduce phase then picks the maximum temperature for each year from the map output it receives. (This example uses hadoop-2.8.5.)
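
To make the data flow concrete, here is an illustrative trace with made-up values (they are not taken from real NCDC files):

map output:           (1949, 111), (1949, 78), (1950, 0), (1950, 22), (1950, -11)
after sort and group: (1949, [111, 78]), (1950, [0, 22, -11])
reduce output:        (1949, 111), (1950, 22)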

1. Mapper class implementation:

package com.hadoop.ncdc.test;

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

 private static final int MISSING = 9999; // sentinel used in the NCDC data for a missing temperature reading

 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // In the Hadoop 2 (new) API, Context replaces the old API's JobConf, OutputCollector and Reporter.
  String line = value.toString();
  // the year occupies columns 15-18 of the fixed-width NCDC record
  String year = line.substring(15, 19);
  int airTemperature;
  if (line.charAt(87) == '+') {
   // skip the leading plus sign before parsing the temperature in columns 88-91
   airTemperature = Integer.parseInt(line.substring(88, 92));
  } else {
   airTemperature = Integer.parseInt(line.substring(87, 92));
  }
  String quality = line.substring(92, 93);
  // emit only readings that are present and carry an acceptable quality code
  if (airTemperature != MISSING && quality.matches("[01459]")) {
   context.write(new Text(year), new IntWritable(airTemperature));
  }
 }
}
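
The hard-coded offsets above follow the NCDC fixed-width record layout: the year sits in columns 15-18, the signed temperature in columns 87-91, and the quality code in column 92. The standalone sketch below is not part of the job and the class name RecordLayoutDemo is my own; it builds a synthetic record with made-up values so the parsing logic can be tried without Hadoop:

package com.hadoop.ncdc.test;

import java.util.Arrays;

// Illustrative only: constructs a synthetic fixed-width record and applies the
// same substring logic as MaxTemperatureMapper.
public class RecordLayoutDemo {

 public static void main(String[] args) {
  // 93 filler characters, then place the fields the mapper expects
  char[] filler = new char[93];
  Arrays.fill(filler, '0');
  StringBuilder record = new StringBuilder(new String(filler));
  record.replace(15, 19, "1950");  // year in columns 15-18
  record.replace(87, 92, "+0123"); // sign and temperature in columns 87-91
  record.replace(92, 93, "1");     // quality code in column 92

  String line = record.toString();
  String year = line.substring(15, 19);
  int airTemperature;
  if (line.charAt(87) == '+') {
   airTemperature = Integer.parseInt(line.substring(88, 92));
  } else {
   airTemperature = Integer.parseInt(line.substring(87, 92));
  }
  String quality = line.substring(92, 93);
  System.out.println(year + "\t" + airTemperature + "\t" + quality); // prints: 1950  123  1
 }
}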

2. Reducer class implementation:

package com.hadoop.ncdc.test;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

 public void reduce(Text key, Iterable<IntWritable> values, Context context)
   throws IOException, InterruptedException {
        // Again, Context replaces the old API's OutputCollector and Reporter.
  // find the maximum temperature among all readings for this year (key)
  int maxValue = Integer.MIN_VALUE;
  for (IntWritable value : values) {
   maxValue = Math.max(maxValue, value.get());
  }
  context.write(key, new IntWritable(maxValue));
 }
}

3. Main class (job driver):

package com.hadoop.ncdc.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class MaxTemperature {

 public static void main(String[] args) throws Exception {
        // The new API lives in the org.apache.hadoop.mapreduce package; the old API is in org.apache.hadoop.mapred.
  Configuration config = new Configuration();
        // In the new API job control goes through the Job class; it takes over the role of the old API's JobClient, which the new API no longer provides.
  Job job = Job.getInstance(config, "Max temperature");
  job.setJarByClass(MaxTemperature.class);
  // args[0]: the input path, the first command-line argument
  FileInputFormat.addInputPath(job, new Path(args[0]));
  // args[1]: the output path, the second command-line argument
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.setMapperClass(MaxTemperatureMapper.class);
  job.setReducerClass(MaxTemperatureReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  System.exit(job.waitForCompletion(true) ? 0 : 1);
 }

}

4. Package the classes into a jar file (hadoop-test.jar) and run the test.

mymacdeMac-mini:~ mymac$ hadoop jar /Users/mymac/Desktop/jjj/hadoop-test.jar com.hadoop.ncdc.test.MaxTemperature /Users/mymac/Desktop/NCDCData /Users/mymac/Desktop/output
(com.hadoop.ncdc.test.MaxTemperature is the fully qualified name of the main class; the input is the NCDCData directory and the results go to the output directory.)
Command-line form: hadoop jar <path to jar file> <fully qualified main class> <input path> <output path>
----------------------------------------------------------------------------------
Alternatively, if hadoop is given the fully qualified main class name directly (without the jar subcommand), the jar has to be appended to HADOOP_CLASSPATH. Add this line on the command line:
export HADOOP_CLASSPATH=/Users/mymac/Desktop/jjj/hadoop-test.jar (for testing only; the variable reverts to its default once the terminal is restarted)
Then run:
hadoop <fully qualified main class> <input path> <output path>
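
With the paths used above that becomes, for example (note that the output directory must not already exist, otherwise FileOutputFormat refuses to start the job):

hadoop com.hadoop.ncdc.test.MaxTemperature /Users/mymac/Desktop/NCDCData /Users/mymac/Desktop/output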

The test succeeds:

18/12/12 21:34:32 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/12/12 21:34:32 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/12/12 21:34:32 INFO input.FileInputFormat: Total input files to process : 2
18/12/12 21:34:32 INFO mapreduce.JobSubmitter: number of splits:2 // the job input is divided into 2 splits
18/12/12 21:34:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1917490337_0001
18/12/12 21:34:33 INFO mapreduce.Job: The url to track the job: http://localhost:8080/                              
18/12/12 21:34:33 INFO mapreduce.Job: Running job: job_local1917490337_0001 // the job ID
18/12/12 21:34:33 INFO mapred.LocalJobRunner: OutputCommitter set in config null
18/12/12 21:34:33 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/12/12 21:34:33 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
18/12/12 21:34:33 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Waiting for map tasks
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Starting task: attempt_local1917490337_0001_m_000000_0 // first attempt of the first map task
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Finishing task: attempt_local1917490337_0001_m_000000_0
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Starting task: attempt_local1917490337_0001_m_000001_0 // first attempt of the second map task
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Finishing task: attempt_local1917490337_0001_m_000001_0
18/12/12 21:34:33 INFO mapred.LocalJobRunner: map task executor complete. // map tasks finished
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Waiting for reduce tasks
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Starting task: attempt_local1917490337_0001_r_000000_0 // first attempt of the first reduce task starts
18/12/12 21:34:33 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1917490337_0001_m_000001_0 decomp: 72206 len: 72210 to MEMORY // the reduce task fetches the shuffled map output
18/12/12 21:34:33 INFO reduce.InMemoryMapOutput: Read 72206 bytes from map-output for attempt_local1917490337_0001_m_000001_0 // the reduce side reads the map output
18/12/12 21:34:33 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1917490337_0001_r_000000_0' to file:/Users/mymac/Desktop/output/_temporary/0/task_local1917490337_0001_r_000000
18/12/12 21:34:33 INFO mapred.LocalJobRunner: reduce > reduce
18/12/12 21:34:33 INFO mapred.Task: Task 'attempt_local1917490337_0001_r_000000_0' done.
// the task output has been committed to the configured output directory
18/12/12 21:34:33 INFO mapred.LocalJobRunner: Finishing task: attempt_local1917490337_0001_r_000000_0
18/12/12 21:34:33 INFO mapred.LocalJobRunner: reduce task executor complete. // reduce tasks finished
18/12/12 21:34:34 INFO mapreduce.Job: Job job_local1917490337_0001 running in uber mode : false
18/12/12 21:34:34 INFO mapreduce.Job:  map 100% reduce 100%
18/12/12 21:34:34 INFO mapreduce.Job: Job job_local1917490337_0001 completed successfully // job finished
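
With a single reducer the result should land in a part-r-00000 file inside the output directory, one year and its maximum temperature per line; for example:

cat /Users/mymac/Desktop/output/part-r-00000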
