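A while ago, while working on a small project with Lucene, I learned for the first time what an inverted index is. Back then I simply used it because the library provided it, without knowing how it works inside or what it actually builds.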


While studying Hadoop, I found that MapReduce makes it very simple to build an inverted index.
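Before the MapReduce version, it may help to see the data structure itself. The following is a minimal in-memory sketch, not Lucene's or Hadoop's implementation: the class name TinyInvertedIndex and its methods are made up for illustration, and positions here are word indexes rather than the byte offsets used later in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration; not part of Lucene or Hadoop.
class TinyInvertedIndex {
    // word -> list of "docId:wordIndex" postings, kept sorted by word
    private final Map<String, List<String>> postings = new TreeMap<>();

    // Split a document on whitespace and record where each word occurs.
    void add(String docId, String text) {
        String[] words = text.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) continue; // happens only for an empty document
            postings.computeIfAbsent(words[i], w -> new ArrayList<>()).add(docId + ":" + i);
        }
    }

    // Return the postings for a word, or an empty list if it never occurs.
    List<String> lookup(String word) {
        return postings.getOrDefault(word, new ArrayList<>());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("a.txt", "hope is a good thing");
        index.add("b.txt", "a good thing never dies");
        System.out.println(index.lookup("good")); // prints [a.txt:3, b.txt:1]
        System.out.println(index.lookup("hope")); // prints [a.txt:0]
    }
}
```

The point is only the shape of the result: a forward index maps a document to its words, while the inverted index maps each word back to the documents and positions where it appears.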


I did not have any files with a lot of data, so I downloaded 10 recommended books as TXT files from Project Gutenberg, which provides free eBooks, and used them as test data.

http://www.gutenberg.org/wiki/Main_Page


1. Copying the data into the Hadoop file system

hadoop@ubuntu:~/work$ cd data

hadoop@ubuntu:~/work/data$ ls

matrix_input.2x2   pg11.txt   pg1342.txt   pg30601.txt  pg5000.txt

matrixmulti.2x3x2  pg132.txt  pg27827.txt  pg4300.txt   prince.txt


hadoop@ubuntu:~/work$ hadoop dfs -put data /data

hadoop@ubuntu:~/work$ ../bin/hadoop/bin/hadoop dfs -ls /data/data

Found 8 items

-rw-r--r--   1 hadoop supergroup     167497 2012-11-11 07:06 /data/data/pg11.txt

-rw-r--r--   1 hadoop supergroup     343695 2012-11-11 07:06 /data/data/pg132.txt

-rw-r--r--   1 hadoop supergroup     704139 2012-11-11 07:06 /data/data/pg1342.txt

-rw-r--r--   1 hadoop supergroup     359504 2012-11-11 07:06 /data/data/pg27827.txt

-rw-r--r--   1 hadoop supergroup     384522 2012-11-11 07:06 /data/data/pg30601.txt

-rw-r--r--   1 hadoop supergroup    1573112 2012-11-11 07:06 /data/data/pg4300.txt

-rw-r--r--   1 hadoop supergroup    1423801 2012-11-11 07:06 /data/data/pg5000.txt

-rw-r--r--   1 hadoop supergroup      92295 2012-11-11 07:06 /data/data/prince.txt


2. Coding the MapReduce job that builds the inverted index (the code is so simple!!!!)


package hopeisagoodthing;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.io.LongWritable;

public class InvertedIndex {
	public static class InvertedIndexMapper extends Mapper<LongWritable,Text,Text,Text> {
		private Text outValue = new Text();
		private Text word = new Text();
		// name of the file this mapper's split belongs to; set once in setup()
		private String docId = null;
		
		// key is the byte offset of the line within the file, value is the line itself
		@Override
		public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			// returnDelims=true: each space comes back as its own token so it can be counted
			StringTokenizer iter = new StringTokenizer(value.toString()," ",true);
			long pos = key.get();
			while ( iter.hasMoreTokens() ) {
				String token = iter.nextToken();
				if(token.equals(" "))
				{
					pos = pos + 1;
				}
				else
				{
					// emit (word, "file:position"); token.length() equals the byte length only for ASCII text
					word.set(token);
					outValue.set(docId+":"+pos);
					pos = pos + token.length();
					context.write(word,outValue);
				}
			}
		}
		
		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			docId = ((FileSplit)context.getInputSplit()).getPath().getName();
		}
	}
	public static class InvertedIndexReducer extends Reducer<Text,Text,Text,Text> {
		private Text counter = new Text();
		// receives every "file:position" value emitted for one word and joins them with commas
		@Override
		public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
			StringBuilder countersb = new StringBuilder();
			for ( Text val : values ) {
				countersb.append(val);
				countersb.append(",");
			}
			// drop the trailing comma (reduce is only called with at least one value)
			countersb.setLength(countersb.length()-1);
			
			counter.set(countersb.toString());
			context.write(key,counter);
		}
	}
	public static void main(String[] args) throws Exception {		
		Configuration conf = new Configuration();
		// GenericOptionsParser consumes generic options such as "-jt local" and leaves the rest
		String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
		if ( otherArgs.length != 2 ) {
			System.err.println("Usage: invertedindex <in> <out>");
			System.exit(2);
		}
		
		Job job = new Job(conf,"invertedindex");
		job.setJarByClass(InvertedIndex.class);
		job.setMapperClass(InvertedIndexMapper.class);
		job.setReducerClass(InvertedIndexReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		// the "-jt local" run in step 3 still produced a single part-r-00000
		job.setNumReduceTasks(2);	

		FileInputFormat.addInputPath(job,new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job,new Path(otherArgs[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1 );
	}	
}
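The position bookkeeping in the mapper is the only subtle part, so here is a minimal sketch of just that logic in plain Java, runnable without Hadoop. The class and method names are hypothetical, and lineStart stands in for the LongWritable key (the offset of the line within the file).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Plain-Java stand-in for the map step; the names are hypothetical.
class PositionTokenizerSketch {
    // Returns "token@offset" for each space-separated token in the line.
    static List<String> tokenize(long lineStart, String line) {
        List<String> out = new ArrayList<>();
        // returnDelims=true makes every single space come back as its own token,
        // so runs of spaces still advance the position correctly.
        StringTokenizer iter = new StringTokenizer(line, " ", true);
        long pos = lineStart;
        while (iter.hasMoreTokens()) {
            String token = iter.nextToken();
            if (token.equals(" ")) {
                pos += 1;
            } else {
                out.add(token + "@" + pos);
                pos += token.length();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Note the double space before "or".
        System.out.println(tokenize(100, "to be  or not"));
        // prints [to@100, be@103, or@107, not@110]
    }
}
```

Because token.length() counts characters while the line offset is counted in bytes, the reported positions line up with the file only for ASCII text such as these Gutenberg books.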



3. Running it on Hadoop

hadoop@ubuntu:~/work$ ../bin/hadoop/bin/hadoop jar hopeisagoodthing.jar invertedindex -jt local /data/data /data/invertedindext_out

12/11/11 07:16:29 INFO util.NativeCodeLoader: Loaded the native-hadoop library

12/11/11 07:16:29 INFO input.FileInputFormat: Total input paths to process : 10

12/11/11 07:16:29 WARN snappy.LoadSnappy: Snappy native library not loaded

12/11/11 07:16:29 INFO mapred.JobClient: Running job: job_local_0001

12/11/11 07:16:30 INFO util.ProcessTree: setsid exited with exit code 0

12/11/11 07:16:30 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@93d6bc

12/11/11 07:16:30 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:30 INFO mapred.JobClient:  map 0% reduce 0%

12/11/11 07:16:32 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:32 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:33 INFO mapred.MapTask: Spilling map output: record full = true

12/11/11 07:16:33 INFO mapred.MapTask: bufstart = 0; bufend = 6288000; bufvoid = 99614720

12/11/11 07:16:33 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680

12/11/11 07:16:33 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:33 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:33 INFO mapred.MapTask: Finished spill 1

12/11/11 07:16:33 INFO mapred.Merger: Merging 2 sorted segments

12/11/11 07:16:33 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 6967137 bytes

12/11/11 07:16:34 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

12/11/11 07:16:35 INFO mapred.LocalJobRunner: 

12/11/11 07:16:35 INFO mapred.LocalJobRunner: 

12/11/11 07:16:35 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.

12/11/11 07:16:35 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1b64e6a

12/11/11 07:16:35 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:36 INFO mapred.JobClient:  map 100% reduce 0%

12/11/11 07:16:36 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:36 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:37 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:38 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:38 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting

12/11/11 07:16:38 INFO mapred.LocalJobRunner: 

12/11/11 07:16:38 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.

12/11/11 07:16:38 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@161dfb5

12/11/11 07:16:38 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:39 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:39 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:39 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:39 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:39 INFO mapred.Task: Task:attempt_local_0001_m_000002_0 is done. And is in the process of commiting

12/11/11 07:16:41 INFO mapred.LocalJobRunner: 

12/11/11 07:16:41 INFO mapred.Task: Task 'attempt_local_0001_m_000002_0' done.

12/11/11 07:16:42 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@c09554

12/11/11 07:16:42 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:42 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:42 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:42 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:42 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:42 INFO mapred.Task: Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting

12/11/11 07:16:45 INFO mapred.LocalJobRunner: 

12/11/11 07:16:45 INFO mapred.Task: Task 'attempt_local_0001_m_000003_0' done.

12/11/11 07:16:45 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1309e87

12/11/11 07:16:45 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:45 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:45 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:45 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:45 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:45 INFO mapred.Task: Task:attempt_local_0001_m_000004_0 is done. And is in the process of commiting

12/11/11 07:16:48 INFO mapred.LocalJobRunner: 

12/11/11 07:16:48 INFO mapred.Task: Task 'attempt_local_0001_m_000004_0' done.

12/11/11 07:16:48 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6c585a

12/11/11 07:16:48 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:48 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:48 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:48 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:48 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:48 INFO mapred.Task: Task:attempt_local_0001_m_000005_0 is done. And is in the process of commiting

12/11/11 07:16:51 INFO mapred.LocalJobRunner: 

12/11/11 07:16:51 INFO mapred.Task: Task 'attempt_local_0001_m_000005_0' done.

12/11/11 07:16:51 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@e3c624

12/11/11 07:16:51 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:51 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:51 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:51 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:51 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:51 INFO mapred.Task: Task:attempt_local_0001_m_000006_0 is done. And is in the process of commiting

12/11/11 07:16:54 INFO mapred.LocalJobRunner: 

12/11/11 07:16:54 INFO mapred.Task: Task 'attempt_local_0001_m_000006_0' done.

12/11/11 07:16:54 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1950198

12/11/11 07:16:54 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:54 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:54 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:54 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:54 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:54 INFO mapred.Task: Task:attempt_local_0001_m_000007_0 is done. And is in the process of commiting

12/11/11 07:16:57 INFO mapred.LocalJobRunner: 

12/11/11 07:16:57 INFO mapred.Task: Task 'attempt_local_0001_m_000007_0' done.

12/11/11 07:16:57 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@53fb57

12/11/11 07:16:57 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:57 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:57 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:57 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:57 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:57 INFO mapred.Task: Task:attempt_local_0001_m_000008_0 is done. And is in the process of commiting

12/11/11 07:17:00 INFO mapred.LocalJobRunner: 

12/11/11 07:17:00 INFO mapred.Task: Task 'attempt_local_0001_m_000008_0' done.

12/11/11 07:17:00 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1742700

12/11/11 07:17:00 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:17:00 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:17:00 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:17:00 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:17:00 INFO mapred.MapTask: Finished spill 0

12/11/11 07:17:00 INFO mapred.Task: Task:attempt_local_0001_m_000009_0 is done. And is in the process of commiting

12/11/11 07:17:03 INFO mapred.LocalJobRunner: 

12/11/11 07:17:03 INFO mapred.Task: Task 'attempt_local_0001_m_000009_0' done.

12/11/11 07:17:03 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@491c4c

12/11/11 07:17:03 INFO mapred.LocalJobRunner: 

12/11/11 07:17:03 INFO mapred.Merger: Merging 10 sorted segments

12/11/11 07:17:03 INFO mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 22414959 bytes

12/11/11 07:17:03 INFO mapred.LocalJobRunner: 

12/11/11 07:17:05 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting

12/11/11 07:17:05 INFO mapred.LocalJobRunner: 

12/11/11 07:17:05 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now

12/11/11 07:17:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /data/invertedindext_out

12/11/11 07:17:06 INFO mapred.LocalJobRunner: reduce > reduce

12/11/11 07:17:06 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.

12/11/11 07:17:06 INFO mapred.JobClient:  map 100% reduce 100%

12/11/11 07:17:06 INFO mapred.JobClient: Job complete: job_local_0001

12/11/11 07:17:06 INFO mapred.JobClient: Counters: 22

12/11/11 07:17:06 INFO mapred.JobClient:   File Output Format Counters 

12/11/11 07:17:06 INFO mapred.JobClient:     Bytes Written=16578538

12/11/11 07:17:06 INFO mapred.JobClient:   FileSystemCounters

12/11/11 07:17:06 INFO mapred.JobClient:     FILE_BYTES_READ=99911067

12/11/11 07:17:06 INFO mapred.JobClient:     HDFS_BYTES_READ=46741458

12/11/11 07:17:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=286139450

12/11/11 07:17:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=16578538

12/11/11 07:17:06 INFO mapred.JobClient:   File Input Format Counters 

12/11/11 07:17:06 INFO mapred.JobClient:     Bytes Read=5048729

12/11/11 07:17:06 INFO mapred.JobClient:   Map-Reduce Framework

12/11/11 07:17:06 INFO mapred.JobClient:     Map output materialized bytes=22414999

12/11/11 07:17:06 INFO mapred.JobClient:     Map input records=108530

12/11/11 07:17:06 INFO mapred.JobClient:     Reduce shuffle bytes=0

12/11/11 07:17:06 INFO mapred.JobClient:     Spilled Records=2015979

12/11/11 07:17:06 INFO mapred.JobClient:     Map output bytes=20666931

12/11/11 07:17:06 INFO mapred.JobClient:     Total committed heap usage (bytes)=2057838592

12/11/11 07:17:06 INFO mapred.JobClient:     CPU time spent (ms)=0

12/11/11 07:17:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1062

12/11/11 07:17:06 INFO mapred.JobClient:     Combine input records=0

12/11/11 07:17:06 INFO mapred.JobClient:     Reduce input records=874004

12/11/11 07:17:06 INFO mapred.JobClient:     Reduce input groups=94211

12/11/11 07:17:06 INFO mapred.JobClient:     Combine output records=0

12/11/11 07:17:06 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0

12/11/11 07:17:06 INFO mapred.JobClient:     Reduce output records=94211

12/11/11 07:17:06 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0

12/11/11 07:17:06 INFO mapred.JobClient:     Map output records=874004



4. Fetching the results and checking them

hadoop@ubuntu:~/work$ ../bin/hadoop/bin/hadoop dfs -get /data/invertedindext_out ./

hadoop@ubuntu:~/work$ cd invertedindext_out/

hadoop@ubuntu:~/work/invertedindext_out$ ls

part-r-00000  _SUCCESS


hadoop@ubuntu:~/work/invertedindext_out$ more part-r-00000

! prince.txt:41794

" pg27827.txt:25391,pg27827.txt:22695,pg27827.txt:23024,pg27827.txt:23250,pg27827.txt:22637,pg27827.txt:22398,pg

27827.txt:22343,pg27827.txt:23961,pg27827.txt:24079,pg27827.txt:24142,pg27827.txt:24191,pg27827.txt:24298,pg27827.txt:

22284,pg27827.txt:22173,pg27827.txt:22062,pg27827.txt:24690,pg27827.txt:24755,pg27827.txt:21931,pg27827.txt:21873,pg27

827.txt:21842,pg27827.txt:21807,pg27827.txt:21490,pg27827.txt:21300,pg27827.txt:21243,pg27827.txt:22812,pg27827.txt:24

933,pg27827.txt:24990,pg27827.txt:25037

"'After pg1342.txt:639410

"'My pg1342.txt:638650

"'Spells pg132.txt:249299

"'TIS pg11.txt:121592

"'Tis pg1342.txt:584553,pg1342.txt:609915

"'To prince.txt:64294

"'army' pg132.txt:15601

"(1) pg132.txt:264126

"(1)". pg27827.txt:336002

"(2)". pg27827.txt:335943

"(Lo)cra" pg5000.txt:656915

"--Exactly. prince.txt:81916

"--SAID pg11.txt:143118

"13 pg132.txt:25622,pg132.txt:19470,pg132.txt:37173,pg132.txt:18165

"1490 pg5000.txt:1354328

"1498," pg5000.txt:1372794

"35" pg5000.txt:723641

"40," pg5000.txt:628271

"A pg132.txt:106978,pg132.txt:316296,pg132.txt:143678,pg132.txt:295414,pg132.txt:233970,pg132.txt:295533,pg132.tx

t:211327,prince.txt:22778,prince.txt:27294,prince.txt:20386,prince.txt:48701,prince.txt:20453,prince.txt:22765,prince.

txt:51105,pg27827.txt:338327,pg27827.txt:250594,pg27827.txt:279032,pg27827.txt:287388,pg27827.txt:286979,pg27827.txt:2

40963,pg27827.txt:338358,pg1342.txt:136267,pg1342.txt:288024,pg1342.txt:428735,pg1342.txt:522298,pg1342.txt:399633,pg1

342.txt:671942,pg1342.txt:137439,pg1342.txt:269156,pg1342.txt:101072,pg1342.txt:600412,pg1342.txt:381033,pg1342.txt:40

1449,pg30601.txt:192068,pg30601.txt:116286,pg30601.txt:63986,pg30601.txt:191918,pg30601.txt:63841

"AS-IS". pg5000.txt:1419667

"A_ pg5000.txt:690824

"Abide pg132.txt:187432

"About pg132.txt:7622,pg1342.txt:101653,pg1342.txt:130436,pg27827.txt:115885

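To actually use this output as an index, each line has to be split back into its postings. The following is a rough sketch under two assumptions: that the key and value are separated by a tab (the default of Hadoop's text output format, even though it looks like a space in the listing above), and that the hypothetical class PostingLineParser is acceptable as a name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading one line of part-r-00000.
class PostingLineParser {
    // Parses "word<TAB>doc:pos,doc:pos,..." into doc -> list of positions.
    static Map<String, List<Long>> parse(String line) {
        Map<String, List<Long>> byDoc = new LinkedHashMap<>();
        int tab = line.indexOf('\t');
        if (tab < 0) return byDoc; // not a key/value line
        for (String posting : line.substring(tab + 1).split(",")) {
            int colon = posting.lastIndexOf(':');
            if (colon < 0) continue; // skip anything that is not "doc:pos"
            String doc = posting.substring(0, colon);
            long pos = Long.parseLong(posting.substring(colon + 1).trim());
            byDoc.computeIfAbsent(doc, d -> new ArrayList<>()).add(pos);
        }
        return byDoc;
    }

    public static void main(String[] args) {
        // One line copied from the output above, with the separator written as a tab.
        String line = "\"About\tpg132.txt:7622,pg1342.txt:101653,pg1342.txt:130436,pg27827.txt:115885";
        System.out.println(parse(line));
        // prints {pg132.txt=[7622], pg1342.txt=[101653, 130436], pg27827.txt=[115885]}
    }
}
```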


rekun,ekun 커뉴

Does a human being possess anything in this world more certain than a dream?

I put together a very simple example that uses MapReduce to count the words in a text file.


Korean text would need morphological analysis and several additional algorithms, so it would not make for a simple example. This example therefore just counts every whitespace-separated token as a word.


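Basically, with MapReduce you only implement two classes with one method each (map and reduce), and Hadoop takes care of running them, which makes it a good way to extract information from many large data sets.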
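To see what those two methods do without a cluster, here is a minimal plain-Java simulation of the map, group-by-key, and reduce steps. It is only a sketch: WordCountSketch and its method names are hypothetical, and a TreeMap stands in for the grouping and sorting that Hadoop performs between the two phases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of a word-count job.
class WordCountSketch {
    // "map": emit (word, 1) for every whitespace-separated token in one line.
    static void map(String line, Map<String, List<Integer>> shuffle) {
        StringTokenizer iter = new StringTokenizer(line);
        while (iter.hasMoreTokens()) {
            shuffle.computeIfAbsent(iter.nextToken(), w -> new ArrayList<>()).add(1);
        }
    }

    // "reduce": sum all the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    static Map<String, Integer> count(List<String> lines) {
        // Stand-in for the shuffle: values grouped by key, keys kept sorted.
        Map<String, List<Integer>> shuffle = new TreeMap<>();
        for (String line : lines) map(line, shuffle);
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("a rose is a rose", "is it")));
        // prints {a=2, is=2, it=1, rose=2}
    }
}
```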


Driver.java (class registration)

package hopeisagoodthing;

import org.apache.hadoop.util.ProgramDriver;

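public class Driver {
        public static void main(String[] args) {
                int exitCode = -1;
                ProgramDriver pgd = new ProgramDriver();
                try {
                        // register the job under the name used on the command line: hadoop jar <jar> counter ...
                        pgd.addClass("counter",Counter.class,"Counts the words in the input files.");
                        pgd.driver(args);
                        exitCode = 0;
                }
                catch(Throwable e) {
                        e.printStackTrace();
                }

                // report a failure with a non-zero exit code instead of always exiting with 0
                System.exit(exitCode);
        }
}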




Counter.java (map and reduce)

package hopeisagoodthing;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Counter {
        public static class CounterMapper extends Mapper<Object,Text,Text,IntWritable> {
                private final static IntWritable FOUND = new IntWritable(1);
                private Text word = new Text();
                // emit (token, 1) for every whitespace-separated token of the input line
                @Override
                public void map(Object key, Text value, Context context)throws IOException, InterruptedException {
                        StringTokenizer iter = new StringTokenizer(value.toString());
                        while ( iter.hasMoreTokens() ) {
                                word.set(iter.nextToken());
                                context.write(word,FOUND);
                        }
                }
        }
        public static class CounterReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
                private IntWritable count = new IntWritable();
                // add up all the 1s emitted for the same word
                @Override
                public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                        int sum = 0;
                        for ( IntWritable val : values ) {
                                sum += val.get();
                        }
                        count.set(sum);
                        context.write(key,count);
                }
        }
        public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
                Job job = new Job(conf,"counter");
                job.setJarByClass(Counter.class);
                job.setMapperClass(CounterMapper.class);
                job.setReducerClass(CounterReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                job.setNumReduceTasks(2);
                FileInputFormat.addInputPath(job,new Path(otherArgs[0]));
                FileOutputFormat.setOutputPath(job,new Path(otherArgs[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1 );
        }
}

How to run

hadoop@ubuntu:~/work$ hadoop dfs -mkdir /data

hadoop@ubuntu:~/work$ hadoop dfs -put data/prince.txt /data/prince.txt

hadoop@ubuntu:~/work$ hadoop jar hopeisagoodthing.jar counter -jt local /data/prince.txt /data/prince_out


Execution output

12/11/08 07:47:31 INFO util.NativeCodeLoader: Loaded the native-hadoop library

12/11/08 07:47:31 INFO input.FileInputFormat: Total input paths to process : 1

12/11/08 07:47:31 WARN snappy.LoadSnappy: Snappy native library not loaded

12/11/08 07:47:31 INFO mapred.JobClient: Running job: job_local_0001

12/11/08 07:47:32 INFO util.ProcessTree: setsid exited with exit code 0

12/11/08 07:47:32 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1807ca8

12/11/08 07:47:32 INFO mapred.MapTask: io.sort.mb = 100

12/11/08 07:47:32 INFO mapred.JobClient:  map 0% reduce 0%

12/11/08 07:47:33 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/08 07:47:33 INFO mapred.MapTask: record buffer = 262144/327680

12/11/08 07:47:34 INFO mapred.MapTask: Starting flush of map output

12/11/08 07:47:34 INFO mapred.MapTask: Finished spill 0

12/11/08 07:47:34 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

12/11/08 07:47:35 INFO mapred.LocalJobRunner:

12/11/08 07:47:35 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.

12/11/08 07:47:35 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@76fba0

12/11/08 07:47:35 INFO mapred.LocalJobRunner:

12/11/08 07:47:35 INFO mapred.Merger: Merging 1 sorted segments

12/11/08 07:47:35 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 190948 bytes

12/11/08 07:47:35 INFO mapred.LocalJobRunner:

12/11/08 07:47:35 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting

12/11/08 07:47:35 INFO mapred.LocalJobRunner:

12/11/08 07:47:35 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now

12/11/08 07:47:35 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /data/prince_out

12/11/08 07:47:35 INFO mapred.JobClient:  map 100% reduce 0%

12/11/08 07:47:38 INFO mapred.LocalJobRunner: reduce > reduce

12/11/08 07:47:38 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.

12/11/08 07:47:38 INFO mapred.JobClient:  map 100% reduce 100%

12/11/08 07:47:38 INFO mapred.JobClient: Job complete: job_local_0001

12/11/08 07:47:38 INFO mapred.JobClient: Counters: 22

12/11/08 07:47:38 INFO mapred.JobClient:   File Output Format Counters

12/11/08 07:47:38 INFO mapred.JobClient:     Bytes Written=36074

12/11/08 07:47:39 INFO mapred.JobClient:   FileSystemCounters

12/11/08 07:47:39 INFO mapred.JobClient:     FILE_BYTES_READ=327856

12/11/08 07:47:39 INFO mapred.JobClient:     HDFS_BYTES_READ=184590

12/11/08 07:47:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=600714

12/11/08 07:47:39 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=36074

12/11/08 07:47:39 INFO mapred.JobClient:   File Input Format Counters

12/11/08 07:47:39 INFO mapred.JobClient:     Bytes Read=92295

12/11/08 07:47:39 INFO mapred.JobClient:   Map-Reduce Framework

12/11/08 07:47:39 INFO mapred.JobClient:     Map output materialized bytes=190952

12/11/08 07:47:39 INFO mapred.JobClient:     Map input records=1660

12/11/08 07:47:39 INFO mapred.JobClient:     Reduce shuffle bytes=0

12/11/08 07:47:39 INFO mapred.JobClient:     Spilled Records=33718

12/11/08 07:47:39 INFO mapred.JobClient:     Map output bytes=157228

12/11/08 07:47:39 INFO mapred.JobClient:     Total committed heap usage (bytes)=231874560

12/11/08 07:47:39 INFO mapred.JobClient:     CPU time spent (ms)=0

12/11/08 07:47:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=100

12/11/08 07:47:39 INFO mapred.JobClient:     Combine input records=0

12/11/08 07:47:39 INFO mapred.JobClient:     Reduce input records=16859

12/11/08 07:47:39 INFO mapred.JobClient:     Reduce input groups=3714

12/11/08 07:47:39 INFO mapred.JobClient:     Combine output records=0

12/11/08 07:47:39 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0

12/11/08 07:47:39 INFO mapred.JobClient:     Reduce output records=3714

12/11/08 07:47:39 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0

12/11/08 07:47:39 INFO mapred.JobClient:     Map output records=16859



Fetching the results

hadoop@ubuntu:~/work$ hadoop dfs -get /data/prince_out ./


Checking the results

hadoop@ubuntu:~/work/prince_out$ ls

part-r-00000  _SUCCESS
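The listing only shows that the output file exists. To look inside it, a small sketch like the following could pick out the most frequent words. It assumes each line of part-r-00000 has the form word, tab, count (the default text output format); the class name TopWordsSketch and the sample lines are made up, not taken from the real prince.txt result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for inspecting word-count output.
class TopWordsSketch {
    // Returns the n most frequent entries as "word=count" from "word<TAB>count" lines.
    static List<String> top(List<String> lines, int n) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] kv = line.split("\t");
            if (kv.length == 2) rows.add(kv); // skip blank or malformed lines
        }
        // Highest count first; List.sort is stable, so ties keep their input order.
        rows.sort(Comparator.comparingInt((String[] kv) -> Integer.parseInt(kv[1])).reversed());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(n, rows.size()); i++) {
            out.add(rows.get(i)[0] + "=" + rows.get(i)[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration only.
        List<String> sample = Arrays.asList("prince\t3", "rose\t7", "the\t5");
        System.out.println(top(sample, 2)); // prints [rose=7, the=5]
    }
}
```

For the real file, the lines could be loaded with Files.readAllLines(Paths.get("prince_out/part-r-00000")) and passed to top in the same way.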


A while ago I installed Linux in VMware on Windows, set up a single-node Hadoop, and tested it.

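Hadoop exists to connect multiple nodes for distributed processing, but I did not have that kind of environment, so I tested it purely for learning purposes. My memory of it is already getting hazy, so I dug out a very old laptop that had been shelved for years, installed Ubuntu on it, and set up a 2-node Hadoop together with the Linux VM on my desktop.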


Below is a summary of the installation steps, so that I do not forget them the next time I install.


First, screenshots of the jps output once the installation was complete.

1. master node(master,slave02)




2. slave01 node




The hosts file must contain the same entries on every node, as shown below. (The master node, which runs on the desktop, also plays the role of slave02.)

192.168.0.20    master

192.168.0.21    slave01

192.168.0.20    slave02
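Since master and slave02 share one address, a quick way to double-check such a file is to group the names by IP. This is a rough sketch in plain Java; HostsSketch and namesByIp are hypothetical names, and the input is the three entries above rather than a real /etc/hosts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups host names by IP from hosts-style lines.
class HostsSketch {
    static Map<String, List<String>> namesByIp(List<String> lines) {
        Map<String, List<String>> byIp = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip blanks and comments
            String[] parts = line.split("\\s+");
            // parts[0] is the address, every following field is a name for it
            for (int i = 1; i < parts.length; i++) {
                byIp.computeIfAbsent(parts[0], ip -> new ArrayList<>()).add(parts[i]);
            }
        }
        return byIp;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
                "192.168.0.20    master",
                "192.168.0.21    slave01",
                "192.168.0.20    slave02");
        System.out.println(namesByIp(hosts));
        // prints {192.168.0.20=[master, slave02], 192.168.0.21=[slave01]}
    }
}
```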



Installation

1. Prerequisites



2. Add an account for Hadoop

$sudo adduser hadoop

 


3. Create the directory structure Hadoop will use

/home/hadoop/temp --> temporary directory (scratch space used during the map and reduce phases when Hadoop runs)


4. Generate an ssh key and register it in authorized_keys (this allows logging in without typing a password)

hadoop@ubuntu:~$ssh-keygen -t rsa -P ""

hadoop@ubuntu:~$cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys

hadoop@ubuntu:~$scp ~/.ssh/authorized_keys hadoop@[slave servers]:~/.ssh/


5. Download the Hadoop package and install it (extract the archive)

hadoop@ubuntu:~$tar xvfz hadoop-1.0.4.tar.gz

hadoop@ubuntu:~$mkdir bin

hadoop@ubuntu:~$mv hadoop-1.0.4 ./bin/

hadoop@ubuntu:~$cd bin

hadoop@ubuntu:~/bin$ln -s hadoop-1.0.4 hadoop


6. Install the Java JDK

hadoop@ubuntu:~$./jdk-6u37-linux-i586.bin

hadoop@ubuntu:~$sudo mv jdk1.6.0_37 /usr/local/

hadoop@ubuntu:~$cd /usr/local

hadoop@ubuntu:/usr/local$sudo chown -R root:root /usr/local/jdk1.6.0_37

hadoop@ubuntu:/usr/local$sudo ln -s jdk1.6.0_37 java-6-sun


7. Edit the Hadoop configuration files

1) Set the Java home (hadoop-env.sh)

hadoop@ubuntu:~/bin/hadoop/conf$ cat hadoop-env.sh 

# Set Hadoop-specific environment variables here.


# The only required environment variable is JAVA_HOME.  All others are

# optional.  When running a distributed configuration it is best to

# set JAVA_HOME in this file, so that it is correctly defined on

# remote nodes.


# The java implementation to use.  Required.

export JAVA_HOME=/usr/local/java-6-sun


2) Edit the site files (core-site.xml, hdfs-site.xml, mapred-site.xml)


hadoop@ubuntu:~/bin/hadoop/conf$ cat core-site.xml 

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://master:10001</value>

</property>

<property>

<name>hadoop.tmp.dir</name>

<value>/home/hadoop/temp</value>

</property>

</configuration>


hadoop@ubuntu:~/bin/hadoop/conf$ cat hdfs-site.xml 

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

</configuration>


hadoop@ubuntu:~/bin/hadoop/conf$ cat mapred-site.xml 

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>

<property>

<name>mapred.job.tracker</name>

<value>master:10002</value>

</property>

</configuration>
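All three site files share the same shape: a configuration element holding property entries, each with a name and a value. As a sanity check after editing, the name/value pairs can be listed with the JDK's built-in XML parser. This is only a sketch and not how Hadoop itself loads its configuration; SiteXmlSketch and readProperties are hypothetical names, and the XML string repeats the core-site.xml values shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists the properties of a Hadoop-style site file.
class SiteXmlSketch {
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            // malformed XML or a property without name/value ends up here
            throw new IllegalArgumentException("not a valid site file", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<configuration>"
                + "<property><name>fs.default.name</name><value>hdfs://master:10001</value></property>"
                + "<property><name>hadoop.tmp.dir</name><value>/home/hadoop/temp</value></property>"
                + "</configuration>";
        System.out.println(readProperties(xml));
        // prints {fs.default.name=hdfs://master:10001, hadoop.tmp.dir=/home/hadoop/temp}
    }
}
```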


8. Format the Hadoop file system

hadoop@ubuntu:~/bin/hadoop$ bin/hadoop namenode -format

12/11/07 06:32:29 INFO namenode.NameNode: STARTUP_MSG: 

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG:   host = ubuntu/127.0.1.1

STARTUP_MSG:   args = [-format]

STARTUP_MSG:   version = 1.0.4

STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct  3 05:13:58 UTC 2012

************************************************************/

12/11/07 06:32:29 INFO util.GSet: VM type       = 32-bit

12/11/07 06:32:29 INFO util.GSet: 2% max memory = 19.33375 MB

12/11/07 06:32:29 INFO util.GSet: capacity      = 2^22 = 4194304 entries

12/11/07 06:32:29 INFO util.GSet: recommended=4194304, actual=4194304

12/11/07 06:32:30 INFO namenode.FSNamesystem: fsOwner=hadoop

12/11/07 06:32:30 INFO namenode.FSNamesystem: supergroup=supergroup

12/11/07 06:32:30 INFO namenode.FSNamesystem: isPermissionEnabled=true

12/11/07 06:32:30 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100

12/11/07 06:32:30 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)

12/11/07 06:32:30 INFO namenode.NameNode: Caching file names occuring more than 10 times 

12/11/07 06:32:31 INFO common.Storage: Image file of size 112 saved in 0 seconds.

12/11/07 06:32:31 INFO common.Storage: Storage directory /home/hadoop/temp/dfs/name has been successfully formatted.

12/11/07 06:32:31 INFO namenode.NameNode: SHUTDOWN_MSG: 

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1

************************************************************/


9. Start Hadoop (it must be started from the master node)

hadoop@ubuntu:~/bin/hadoop/bin$ ./start-all.sh 

starting namenode, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-namenode-ubuntu.out

slave02: starting datanode, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-datanode-ubuntu.out

slave01: starting datanode, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-datanode-nuke-Satellite-A10.out

master: starting secondarynamenode, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out

starting jobtracker, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-jobtracker-ubuntu.out

slave02: starting tasktracker, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-tasktracker-ubuntu.out

slave01: starting tasktracker, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-tasktracker-nuke-Satellite-A10.out

hadoop@ubuntu:~/bin/hadoop/bin$ /usr/local/java-6-sun/bin/jps 

5471 NameNode

6010 JobTracker

6316 Jps

5710 DataNode

5927 SecondaryNameNode

6239 TaskTracker
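The jps listing above is the thing to compare against after every start. A minimal sketch of that comparison in Java follows; the expected daemon names are taken from the listing (this master also acts as slave02, so it runs DataNode and TaskTracker too), while JpsCheckSketch and missing are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reports which expected daemons are absent from jps output.
class JpsCheckSketch {
    static final List<String> MASTER_DAEMONS = Arrays.asList(
            "NameNode", "SecondaryNameNode", "JobTracker", "DataNode", "TaskTracker");

    static Set<String> missing(String jpsOutput, List<String> expected) {
        Set<String> running = new LinkedHashSet<>();
        for (String line : jpsOutput.split("\\R")) {
            // each useful jps line is "<pid> <main class name>"
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) running.add(parts[1]);
        }
        Set<String> missing = new LinkedHashSet<>(expected);
        missing.removeAll(running);
        return missing;
    }

    public static void main(String[] args) {
        String jps = "5471 NameNode\n6010 JobTracker\n6316 Jps\n5710 DataNode\n"
                + "5927 SecondaryNameNode\n6239 TaskTracker\n";
        System.out.println(missing(jps, MASTER_DAEMONS)); // prints []
    }
}
```

An empty result means every expected daemon is up; any name that is printed points at the log file to read first.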


Starting with the next post, I plan to upload example code that uses Hadoop.

