At some point Ubuntu started shipping OpenJDK by default, and /usr/bin/java points to OpenJDK out of the box.


After installing the Sun JDK distributed by Oracle, switching the default is as simple as the following (setting the PATH would also work).


This assumes the Sun JDK was installed under /usr/local/java-6-sun.


nuke@ubuntu:~$ sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java-6-sun/bin/java" 1


nuke@ubuntu:~$ sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/java-6-sun/bin/javac" 1

update-alternatives: using /usr/local/java-6-sun/bin/javac to provide /usr/bin/javac (javac) in auto mode.


nuke@ubuntu:~$ sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/local/java-6-sun/bin/javaws" 1

update-alternatives: using /usr/local/java-6-sun/bin/javaws to provide /usr/bin/javaws (javaws) in auto mode.


nuke@ubuntu:~$ sudo update-alternatives --config java

There are 2 choices for the alternative java (providing /usr/bin/java).


  Selection    Path                                           Priority   Status

------------------------------------------------------------

* 0            /usr/lib/jvm/java-6-openjdk-i386/jre/bin/java   1061      auto mode

  1            /usr/lib/jvm/java-6-openjdk-i386/jre/bin/java   1061      manual mode

  2            /usr/local/java-6-sun/bin/java                         1         manual mode


Press enter to keep the current choice[*], or type selection number: 2

update-alternatives: using /usr/local/java-6-sun/bin/java to provide /usr/bin/java (java) in manual mode.


nuke@ubuntu:~$ java -version

java version "1.6.0_37"

Java(TM) SE Runtime Environment (build 1.6.0_37-b06)

Java HotSpot(TM) Client VM (build 20.12-b01, mixed mode, sharing)


Previously I adjusted the PATH so that the Oracle Java was picked up first, but since update-alternatives lets you register each candidate and then choose the default, this is the more convenient way to do it from now on.
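Note that the Sun JDK was registered above with priority 1, lower than OpenJDK's 1061, so auto mode will keep preferring OpenJDK and the Sun JDK has to be chosen manually with --config. If you would rather have auto mode pick it on its own, you could register it with a higher priority instead (a hedged variation; any number above 1061 works):

nuke@ubuntu:~$ sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java-6-sun/bin/java" 2000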


A while ago, working on a small project with Lucene, I came across an inverted index for the first time; back then I just used what the library provided without understanding how it works inside or what it actually builds.


While studying Hadoop, I found that MapReduce makes building an inverted index surprisingly simple.


Since I didn't have files with much data on hand, I downloaded ten recommended books as TXT from Project Gutenberg, which offers free eBooks, and used them as test data.

http://www.gutenberg.org/wiki/Main_Page


1. Moving the data into the Hadoop file system

hadoop@ubuntu:~/work$ cd data

hadoop@ubuntu:~/work/data$ ls

matrix_input.2x2   pg11.txt   pg1342.txt   pg30601.txt  pg5000.txt

matrixmulti.2x3x2  pg132.txt  pg27827.txt  pg4300.txt   prince.txt


hadoop@ubuntu:~/work$ hadoop dfs -put data /data

hadoop@ubuntu:~/work$ ../bin/hadoop/bin/hadoop dfs -ls /data/data

Found 8 items

-rw-r--r--   1 hadoop supergroup     167497 2012-11-11 07:06 /data/data/pg11.txt

-rw-r--r--   1 hadoop supergroup     343695 2012-11-11 07:06 /data/data/pg132.txt

-rw-r--r--   1 hadoop supergroup     704139 2012-11-11 07:06 /data/data/pg1342.txt

-rw-r--r--   1 hadoop supergroup     359504 2012-11-11 07:06 /data/data/pg27827.txt

-rw-r--r--   1 hadoop supergroup     384522 2012-11-11 07:06 /data/data/pg30601.txt

-rw-r--r--   1 hadoop supergroup    1573112 2012-11-11 07:06 /data/data/pg4300.txt

-rw-r--r--   1 hadoop supergroup    1423801 2012-11-11 07:06 /data/data/pg5000.txt

-rw-r--r--   1 hadoop supergroup      92295 2012-11-11 07:06 /data/data/prince.txt


2. Writing the MapReduce job that builds the inverted index (the code is really short!)


package hopeisagoodthing;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.io.LongWritable;

public class InvertedIndex {
	public static class InvertedIndexMapper extends Mapper<Object,Text,Text,Text> {
		private Text outValue = new Text();
		private Text word = new Text();
		private static String docId = null;
		
		public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
			// tokenize on spaces, keeping the delimiters so character positions can be tracked
			StringTokenizer iter = new StringTokenizer(value.toString()," ",true);
			// key is the byte offset of this line within the file; use it as the running position
			Long pos = ((LongWritable)key).get();
			while ( iter.hasMoreTokens() ) {
				String token = iter.nextToken();
				if(token.equals(" "))
				{
					pos = pos + 1;
				}
				else
				{
					word.set(token);
					// emit word -> "docId:position"
					outValue.set(docId+":"+pos);
					pos = pos + token.length();
					context.write(word,outValue);
				}
			}
		}
		
		
	
		// the document id is simply the name of the input file this mapper is reading
		protected void setup(Context context) throws IOException, InterruptedException {
			docId = ((FileSplit)context.getInputSplit()).getPath().getName();
		}
	}
	public static class InvertedIndexReducer extends Reducer<Text,Text,Text,Text> {
		private Text counter = new Text();
		public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
			// concatenate every docId:position posting for this word into one comma-separated list
			StringBuilder countersb = new StringBuilder();
			for ( Text val : values ) {
				countersb.append(val);
				countersb.append(",");
			}
			countersb.setLength(countersb.length()-1);	// drop the trailing comma
			
			counter.set(countersb.toString());
			context.write(key,counter);
		}
	}
	public static void main(String[] args) throws Exception {		
		Configuration conf = new Configuration();
		String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
		
		Job job = new Job(conf,"invertedindex");
		job.setJarByClass(InvertedIndex.class);
		job.setMapperClass(InvertedIndexMapper.class);
		job.setReducerClass(InvertedIndexReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		job.setNumReduceTasks(2);	

		FileInputFormat.addInputPath(job,new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job,new Path(otherArgs[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1 );
	}	
}
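The run command below launches the job by the program name invertedindex, which suggests this class is registered in the same ProgramDriver-based Driver shown in the word count post further down. A minimal sketch of that registration, assuming that Driver is reused (this line is not in the original code):

                        pgd.addClass("invertedindex",InvertedIndex.class,"");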



3. Running it on Hadoop

hadoop@ubuntu:~/work$ ../bin/hadoop/bin/hadoop jar hopeisagoodthing.jar invertedindex -jt local /data/data /data/invertedindext_out

12/11/11 07:16:29 INFO util.NativeCodeLoader: Loaded the native-hadoop library

12/11/11 07:16:29 INFO input.FileInputFormat: Total input paths to process : 10

12/11/11 07:16:29 WARN snappy.LoadSnappy: Snappy native library not loaded

12/11/11 07:16:29 INFO mapred.JobClient: Running job: job_local_0001

12/11/11 07:16:30 INFO util.ProcessTree: setsid exited with exit code 0

12/11/11 07:16:30 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@93d6bc

12/11/11 07:16:30 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:30 INFO mapred.JobClient:  map 0% reduce 0%

12/11/11 07:16:32 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:32 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:33 INFO mapred.MapTask: Spilling map output: record full = true

12/11/11 07:16:33 INFO mapred.MapTask: bufstart = 0; bufend = 6288000; bufvoid = 99614720

12/11/11 07:16:33 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680

12/11/11 07:16:33 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:33 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:33 INFO mapred.MapTask: Finished spill 1

12/11/11 07:16:33 INFO mapred.Merger: Merging 2 sorted segments

12/11/11 07:16:33 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 6967137 bytes

12/11/11 07:16:34 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

12/11/11 07:16:35 INFO mapred.LocalJobRunner: 

12/11/11 07:16:35 INFO mapred.LocalJobRunner: 

12/11/11 07:16:35 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.

12/11/11 07:16:35 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1b64e6a

12/11/11 07:16:35 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:36 INFO mapred.JobClient:  map 100% reduce 0%

12/11/11 07:16:36 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:36 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:37 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:38 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:38 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting

12/11/11 07:16:38 INFO mapred.LocalJobRunner: 

12/11/11 07:16:38 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.

12/11/11 07:16:38 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@161dfb5

12/11/11 07:16:38 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:39 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:39 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:39 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:39 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:39 INFO mapred.Task: Task:attempt_local_0001_m_000002_0 is done. And is in the process of commiting

12/11/11 07:16:41 INFO mapred.LocalJobRunner: 

12/11/11 07:16:41 INFO mapred.Task: Task 'attempt_local_0001_m_000002_0' done.

12/11/11 07:16:42 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@c09554

12/11/11 07:16:42 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:42 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:42 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:42 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:42 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:42 INFO mapred.Task: Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting

12/11/11 07:16:45 INFO mapred.LocalJobRunner: 

12/11/11 07:16:45 INFO mapred.Task: Task 'attempt_local_0001_m_000003_0' done.

12/11/11 07:16:45 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1309e87

12/11/11 07:16:45 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:45 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:45 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:45 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:45 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:45 INFO mapred.Task: Task:attempt_local_0001_m_000004_0 is done. And is in the process of commiting

12/11/11 07:16:48 INFO mapred.LocalJobRunner: 

12/11/11 07:16:48 INFO mapred.Task: Task 'attempt_local_0001_m_000004_0' done.

12/11/11 07:16:48 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6c585a

12/11/11 07:16:48 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:48 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:48 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:48 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:48 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:48 INFO mapred.Task: Task:attempt_local_0001_m_000005_0 is done. And is in the process of commiting

12/11/11 07:16:51 INFO mapred.LocalJobRunner: 

12/11/11 07:16:51 INFO mapred.Task: Task 'attempt_local_0001_m_000005_0' done.

12/11/11 07:16:51 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@e3c624

12/11/11 07:16:51 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:51 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:51 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:51 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:51 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:51 INFO mapred.Task: Task:attempt_local_0001_m_000006_0 is done. And is in the process of commiting

12/11/11 07:16:54 INFO mapred.LocalJobRunner: 

12/11/11 07:16:54 INFO mapred.Task: Task 'attempt_local_0001_m_000006_0' done.

12/11/11 07:16:54 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1950198

12/11/11 07:16:54 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:54 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:54 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:54 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:54 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:54 INFO mapred.Task: Task:attempt_local_0001_m_000007_0 is done. And is in the process of commiting

12/11/11 07:16:57 INFO mapred.LocalJobRunner: 

12/11/11 07:16:57 INFO mapred.Task: Task 'attempt_local_0001_m_000007_0' done.

12/11/11 07:16:57 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@53fb57

12/11/11 07:16:57 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:16:57 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:16:57 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:16:57 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:16:57 INFO mapred.MapTask: Finished spill 0

12/11/11 07:16:57 INFO mapred.Task: Task:attempt_local_0001_m_000008_0 is done. And is in the process of commiting

12/11/11 07:17:00 INFO mapred.LocalJobRunner: 

12/11/11 07:17:00 INFO mapred.Task: Task 'attempt_local_0001_m_000008_0' done.

12/11/11 07:17:00 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1742700

12/11/11 07:17:00 INFO mapred.MapTask: io.sort.mb = 100

12/11/11 07:17:00 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/11 07:17:00 INFO mapred.MapTask: record buffer = 262144/327680

12/11/11 07:17:00 INFO mapred.MapTask: Starting flush of map output

12/11/11 07:17:00 INFO mapred.MapTask: Finished spill 0

12/11/11 07:17:00 INFO mapred.Task: Task:attempt_local_0001_m_000009_0 is done. And is in the process of commiting

12/11/11 07:17:03 INFO mapred.LocalJobRunner: 

12/11/11 07:17:03 INFO mapred.Task: Task 'attempt_local_0001_m_000009_0' done.

12/11/11 07:17:03 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@491c4c

12/11/11 07:17:03 INFO mapred.LocalJobRunner: 

12/11/11 07:17:03 INFO mapred.Merger: Merging 10 sorted segments

12/11/11 07:17:03 INFO mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 22414959 bytes

12/11/11 07:17:03 INFO mapred.LocalJobRunner: 

12/11/11 07:17:05 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting

12/11/11 07:17:05 INFO mapred.LocalJobRunner: 

12/11/11 07:17:05 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now

12/11/11 07:17:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /data/invertedindext_out

12/11/11 07:17:06 INFO mapred.LocalJobRunner: reduce > reduce

12/11/11 07:17:06 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.

12/11/11 07:17:06 INFO mapred.JobClient:  map 100% reduce 100%

12/11/11 07:17:06 INFO mapred.JobClient: Job complete: job_local_0001

12/11/11 07:17:06 INFO mapred.JobClient: Counters: 22

12/11/11 07:17:06 INFO mapred.JobClient:   File Output Format Counters 

12/11/11 07:17:06 INFO mapred.JobClient:     Bytes Written=16578538

12/11/11 07:17:06 INFO mapred.JobClient:   FileSystemCounters

12/11/11 07:17:06 INFO mapred.JobClient:     FILE_BYTES_READ=99911067

12/11/11 07:17:06 INFO mapred.JobClient:     HDFS_BYTES_READ=46741458

12/11/11 07:17:06 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=286139450

12/11/11 07:17:06 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=16578538

12/11/11 07:17:06 INFO mapred.JobClient:   File Input Format Counters 

12/11/11 07:17:06 INFO mapred.JobClient:     Bytes Read=5048729

12/11/11 07:17:06 INFO mapred.JobClient:   Map-Reduce Framework

12/11/11 07:17:06 INFO mapred.JobClient:     Map output materialized bytes=22414999

12/11/11 07:17:06 INFO mapred.JobClient:     Map input records=108530

12/11/11 07:17:06 INFO mapred.JobClient:     Reduce shuffle bytes=0

12/11/11 07:17:06 INFO mapred.JobClient:     Spilled Records=2015979

12/11/11 07:17:06 INFO mapred.JobClient:     Map output bytes=20666931

12/11/11 07:17:06 INFO mapred.JobClient:     Total committed heap usage (bytes)=2057838592

12/11/11 07:17:06 INFO mapred.JobClient:     CPU time spent (ms)=0

12/11/11 07:17:06 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1062

12/11/11 07:17:06 INFO mapred.JobClient:     Combine input records=0

12/11/11 07:17:06 INFO mapred.JobClient:     Reduce input records=874004

12/11/11 07:17:06 INFO mapred.JobClient:     Reduce input groups=94211

12/11/11 07:17:06 INFO mapred.JobClient:     Combine output records=0

12/11/11 07:17:06 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0

12/11/11 07:17:06 INFO mapred.JobClient:     Reduce output records=94211

12/11/11 07:17:06 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0

12/11/11 07:17:06 INFO mapred.JobClient:     Map output records=874004



4. Fetching the results and checking them

hadoop@ubuntu:~/work$ ../bin/hadoop/bin/hadoop dfs -get /data/invertedindext_out ./

hadoop@ubuntu:~/work$ cd invertedindext_out/

hadoop@ubuntu:~/work/invertedindext_out$ ls

part-r-00000  _SUCCESS


hadoop@ubuntu:~/work/invertedindext_out$ more part-r-00000

! prince.txt:41794

" pg27827.txt:25391,pg27827.txt:22695,pg27827.txt:23024,pg27827.txt:23250,pg27827.txt:22637,pg27827.txt:22398,pg

27827.txt:22343,pg27827.txt:23961,pg27827.txt:24079,pg27827.txt:24142,pg27827.txt:24191,pg27827.txt:24298,pg27827.txt:

22284,pg27827.txt:22173,pg27827.txt:22062,pg27827.txt:24690,pg27827.txt:24755,pg27827.txt:21931,pg27827.txt:21873,pg27

827.txt:21842,pg27827.txt:21807,pg27827.txt:21490,pg27827.txt:21300,pg27827.txt:21243,pg27827.txt:22812,pg27827.txt:24

933,pg27827.txt:24990,pg27827.txt:25037

"'After pg1342.txt:639410

"'My pg1342.txt:638650

"'Spells pg132.txt:249299

"'TIS pg11.txt:121592

"'Tis pg1342.txt:584553,pg1342.txt:609915

"'To prince.txt:64294

"'army' pg132.txt:15601

"(1) pg132.txt:264126

"(1)". pg27827.txt:336002

"(2)". pg27827.txt:335943

"(Lo)cra" pg5000.txt:656915

"--Exactly. prince.txt:81916

"--SAID pg11.txt:143118

"13 pg132.txt:25622,pg132.txt:19470,pg132.txt:37173,pg132.txt:18165

"1490 pg5000.txt:1354328

"1498," pg5000.txt:1372794

"35" pg5000.txt:723641

"40," pg5000.txt:628271

"A pg132.txt:106978,pg132.txt:316296,pg132.txt:143678,pg132.txt:295414,pg132.txt:233970,pg132.txt:295533,pg132.tx

t:211327,prince.txt:22778,prince.txt:27294,prince.txt:20386,prince.txt:48701,prince.txt:20453,prince.txt:22765,prince.

txt:51105,pg27827.txt:338327,pg27827.txt:250594,pg27827.txt:279032,pg27827.txt:287388,pg27827.txt:286979,pg27827.txt:2

40963,pg27827.txt:338358,pg1342.txt:136267,pg1342.txt:288024,pg1342.txt:428735,pg1342.txt:522298,pg1342.txt:399633,pg1

342.txt:671942,pg1342.txt:137439,pg1342.txt:269156,pg1342.txt:101072,pg1342.txt:600412,pg1342.txt:381033,pg1342.txt:40

1449,pg30601.txt:192068,pg30601.txt:116286,pg30601.txt:63986,pg30601.txt:191918,pg30601.txt:63841

"AS-IS". pg5000.txt:1419667

"A_ pg5000.txt:690824

"Abide pg132.txt:187432

"About pg132.txt:7622,pg1342.txt:101653,pg1342.txt:130436,pg27827.txt:115885




Here is a very simple example that uses MapReduce to count words from text file input.


Korean text would need morphological analysis and other additional processing, so it can't be this simple; this example just treats every whitespace-separated token as a word.


Basically, with MapReduce you implement two classes and just two methods and Hadoop takes care of running them, which makes it a handy way to extract and process information from large data sets.


Driver.java (class registration)

package hopeisagoodthing;

import org.apache.hadoop.util.ProgramDriver;

public class Driver {
        public static void main(String[] args) {                
                ProgramDriver pgd = new ProgramDriver();
                try {
                        pgd.addClass("counter",Counter.class,"");
                        pgd.driver(args);                       
                }
                catch(Throwable e) {
                        e.printStackTrace();
                }

                System.exit(0);
        }
}




Counter.java (map / reduce)

package hopeisagoodthing;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Counter {
        public static class CounterMapper extends Mapper<Object,Text,Text,IntWritable> {
                private final static IntWritable FOUND = new IntWritable(1);
                private Text word = new Text();
                public void map(Object key, Text value, Context context)throws IOException, InterruptedException {
                        // split the line on whitespace and emit (word, 1) for every token
                        StringTokenizer iter = new StringTokenizer(value.toString());
                        while ( iter.hasMoreTokens() ) {
                                word.set(iter.nextToken());
                                context.write(word,FOUND);
                        }
                }
        }
        public static class CounterReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
                private IntWritable count = new IntWritable();
                public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                        // sum up the 1s emitted by the mappers for this word
                        int sum = 0;
                        for ( IntWritable val : values ) {
                                sum += val.get();
                        }
                        count.set(sum);
                        context.write(key,count);
                }
        }
        public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs();
                Job job = new Job(conf,"counter");
                job.setJarByClass(Counter.class);
                job.setMapperClass(CounterMapper.class);
                job.setReducerClass(CounterReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                job.setNumReduceTasks(2);
                FileInputFormat.addInputPath(job,new Path(otherArgs[0]));
                FileOutputFormat.setOutputPath(job,new Path(otherArgs[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1 );
        }
}
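The counters in the run output below show Combine input records=0. Since this reducer just sums integers, it is associative and commutative, so it could also be registered as a combiner to shrink the data shuffled between map and reduce. A hedged one-line addition to main(), not in the original code:

                job.setCombinerClass(CounterReducer.class);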

How to run

hadoop@ubuntu:~/work$ hadoop dfs -mkdir /data

hadoop@ubuntu:~/work$ hadoop dfs -put data/prince.txt /data/prince.txt

hadoop@ubuntu:~/work$ hadoop jar hopeisagoodthing.jar counter -jt local /data/prince.txt /data/prince_out


Run output

12/11/08 07:47:31 INFO util.NativeCodeLoader: Loaded the native-hadoop library

12/11/08 07:47:31 INFO input.FileInputFormat: Total input paths to process : 1

12/11/08 07:47:31 WARN snappy.LoadSnappy: Snappy native library not loaded

12/11/08 07:47:31 INFO mapred.JobClient: Running job: job_local_0001

12/11/08 07:47:32 INFO util.ProcessTree: setsid exited with exit code 0

12/11/08 07:47:32 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1807ca8

12/11/08 07:47:32 INFO mapred.MapTask: io.sort.mb = 100

12/11/08 07:47:32 INFO mapred.JobClient:  map 0% reduce 0%

12/11/08 07:47:33 INFO mapred.MapTask: data buffer = 79691776/99614720

12/11/08 07:47:33 INFO mapred.MapTask: record buffer = 262144/327680

12/11/08 07:47:34 INFO mapred.MapTask: Starting flush of map output

12/11/08 07:47:34 INFO mapred.MapTask: Finished spill 0

12/11/08 07:47:34 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

12/11/08 07:47:35 INFO mapred.LocalJobRunner:

12/11/08 07:47:35 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.

12/11/08 07:47:35 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@76fba0

12/11/08 07:47:35 INFO mapred.LocalJobRunner:

12/11/08 07:47:35 INFO mapred.Merger: Merging 1 sorted segments

12/11/08 07:47:35 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 190948 bytes

12/11/08 07:47:35 INFO mapred.LocalJobRunner:

12/11/08 07:47:35 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting

12/11/08 07:47:35 INFO mapred.LocalJobRunner:

12/11/08 07:47:35 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now

12/11/08 07:47:35 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /data/prince_out

12/11/08 07:47:35 INFO mapred.JobClient:  map 100% reduce 0%

12/11/08 07:47:38 INFO mapred.LocalJobRunner: reduce > reduce

12/11/08 07:47:38 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.

12/11/08 07:47:38 INFO mapred.JobClient:  map 100% reduce 100%

12/11/08 07:47:38 INFO mapred.JobClient: Job complete: job_local_0001

12/11/08 07:47:38 INFO mapred.JobClient: Counters: 22

12/11/08 07:47:38 INFO mapred.JobClient:   File Output Format Counters

12/11/08 07:47:38 INFO mapred.JobClient:     Bytes Written=36074

12/11/08 07:47:39 INFO mapred.JobClient:   FileSystemCounters

12/11/08 07:47:39 INFO mapred.JobClient:     FILE_BYTES_READ=327856

12/11/08 07:47:39 INFO mapred.JobClient:     HDFS_BYTES_READ=184590

12/11/08 07:47:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=600714

12/11/08 07:47:39 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=36074

12/11/08 07:47:39 INFO mapred.JobClient:   File Input Format Counters

12/11/08 07:47:39 INFO mapred.JobClient:     Bytes Read=92295

12/11/08 07:47:39 INFO mapred.JobClient:   Map-Reduce Framework

12/11/08 07:47:39 INFO mapred.JobClient:     Map output materialized bytes=190952

12/11/08 07:47:39 INFO mapred.JobClient:     Map input records=1660

12/11/08 07:47:39 INFO mapred.JobClient:     Reduce shuffle bytes=0

12/11/08 07:47:39 INFO mapred.JobClient:     Spilled Records=33718

12/11/08 07:47:39 INFO mapred.JobClient:     Map output bytes=157228

12/11/08 07:47:39 INFO mapred.JobClient:     Total committed heap usage (bytes)=231874560

12/11/08 07:47:39 INFO mapred.JobClient:     CPU time spent (ms)=0

12/11/08 07:47:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=100

12/11/08 07:47:39 INFO mapred.JobClient:     Combine input records=0

12/11/08 07:47:39 INFO mapred.JobClient:     Reduce input records=16859

12/11/08 07:47:39 INFO mapred.JobClient:     Reduce input groups=3714

12/11/08 07:47:39 INFO mapred.JobClient:     Combine output records=0

12/11/08 07:47:39 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0

12/11/08 07:47:39 INFO mapred.JobClient:     Reduce output records=3714

12/11/08 07:47:39 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0

12/11/08 07:47:39 INFO mapred.JobClient:     Map output records=16859



Fetching the results

hadoop@ubuntu:~/work$ hadoop dfs -get /data/prince_out ./


Checking the results

hadoop@ubuntu:~/work/prince_out$ ls

part-r-00000  _SUCCESS
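The contents aren't shown here, but since the reducer writes (Text, IntWritable) pairs through the default TextOutputFormat, each line of part-r-00000 should be a word and its count separated by a tab. A quick spot check (output not reproduced):

hadoop@ubuntu:~/work/prince_out$ head part-r-00000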



A while ago I installed Linux in VMware on Windows and tried out a single-node Hadoop installation.

Hadoop exists to spread processing across multiple nodes, but without the hardware I had only tested it for pure study purposes, and my memory of it was already fading. So I installed Ubuntu on a very old laptop that had been shelved and untouched for ages and, together with a Linux VM on my desktop, set up a 2-node Hadoop cluster.


Below is a write-up of the installation steps so I don't forget them next time.


First, screenshots of jps after the installation was finished.

1. master node(master,slave02)




2. slave01 node




The hosts file must have the same contents on every node, as shown below (the desktop acting as the master node also plays the role of slave02).

192.168.0.20    master

192.168.0.21    slave01

192.168.0.20    slave02



Installation steps

1. Prerequisites

hadoop-1.0.4.tar.gz and jdk-6u37-linux-i586.bin (the two archives used in steps 5 and 6)

2. Add a hadoop user account

$sudo adduser hadoop

 


3. Create the directories Hadoop will use

/home/hadoop/temp --> scratch directory (used as temporary space during the map and reduce phases once Hadoop is running)
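For example, assuming the hadoop account's home directory is /home/hadoop:

hadoop@ubuntu:~$mkdir /home/hadoop/temp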


4. Generate an SSH key and register it as authorized_keys (so the nodes can be reached without a password prompt)

hadoop@ubuntu:~$ssh-keygen -t rsa -P ""

hadoop@ubuntu:~$cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys

hadoop@ubuntu:~$scp ~/.ssh/authorized_keys hadoop@[slave servers]:~/.ssh/
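It's worth confirming afterwards that each slave can be reached without being asked for a password, for example:

hadoop@ubuntu:~$ssh slave01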


5. Download and unpack the Hadoop package

hadoop@ubuntu:~$tar xvfz hadoop-1.0.4.tar.gz

hadoop@ubuntu:~$mkdir bin

hadoop@ubuntu:~$mv hadoop-1.0.4 ./bin/

hadoop@ubuntu:~$cd bin

hadoop@ubuntu:~$ln -s hadoop-1.0.4 hadoop
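Optionally, putting the bin directory on the PATH lets you run hadoop directly instead of typing the full path (as some of the commands in the posts above do); something along these lines in ~/.bashrc:

export PATH=$PATH:/home/hadoop/bin/hadoop/bin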


6. Install the Java JDK

hadoop@ubuntu:~$./jdk-6u37-linux-i586.bin

hadoop@ubuntu:~$sudo mv jdk1.6.0_37 /usr/local/

hadoop@ubuntu:~$cd /usr/local

hadoop@ubuntu:~$sudo chown -R root:root /usr/local/jdk1.6.0_37

hadoop@ubuntu:~$sudo ln -s jdk1.6.0_37 java-6-sun


7. Edit the Hadoop configuration files

1) Set JAVA_HOME (hadoop-env.sh)

hadoop@ubuntu:~/bin/hadoop/conf$ cat hadoop-env.sh 

# Set Hadoop-specific environment variables here.


# The only required environment variable is JAVA_HOME.  All others are

# optional.  When running a distributed configuration it is best to

# set JAVA_HOME in this file, so that it is correctly defined on

# remote nodes.


# The java implementation to use.  Required.

export JAVA_HOME=/usr/local/java-6-sun


2) Edit the site files (core-site.xml, hdfs-site.xml, mapred-site.xml)


hadoop@ubuntu:~/bin/hadoop/conf$ cat core-site.xml 

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://master:10001</value>

</property>

<property>

<name>hadoop.tmp.dir</name>

<value>/home/hadoop/temp</value>

</property>

</configuration>


hadoop@ubuntu:~/bin/hadoop/conf$ cat hdfs-site.xml 

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

</configuration>


hadoop@ubuntu:~/bin/hadoop/conf$ cat mapred-site.xml 

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>

<property>

<name>mapred.job.tracker</name>

<value>master:10002</value>

</property>

</configuration>
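One thing the steps above don't show: for start-all.sh to launch the daemons on the right machines, conf/masters and conf/slaves also have to list the nodes. Judging from the hosts file above and the start-up log in step 9, they would presumably look like this (an assumption, not taken from the original post):

hadoop@ubuntu:~/bin/hadoop/conf$ cat masters
master

hadoop@ubuntu:~/bin/hadoop/conf$ cat slaves
slave01
slave02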


8. Format the NameNode

hadoop@ubuntu:~/bin/hadoop$ bin/hadoop namenode -format

12/11/07 06:32:29 INFO namenode.NameNode: STARTUP_MSG: 

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG:   host = ubuntu/127.0.1.1

STARTUP_MSG:   args = [-format]

STARTUP_MSG:   version = 1.0.4

STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct  3 05:13:58 UTC 2012

************************************************************/

12/11/07 06:32:29 INFO util.GSet: VM type       = 32-bit

12/11/07 06:32:29 INFO util.GSet: 2% max memory = 19.33375 MB

12/11/07 06:32:29 INFO util.GSet: capacity      = 2^22 = 4194304 entries

12/11/07 06:32:29 INFO util.GSet: recommended=4194304, actual=4194304

12/11/07 06:32:30 INFO namenode.FSNamesystem: fsOwner=hadoop

12/11/07 06:32:30 INFO namenode.FSNamesystem: supergroup=supergroup

12/11/07 06:32:30 INFO namenode.FSNamesystem: isPermissionEnabled=true

12/11/07 06:32:30 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100

12/11/07 06:32:30 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)

12/11/07 06:32:30 INFO namenode.NameNode: Caching file names occuring more than 10 times 

12/11/07 06:32:31 INFO common.Storage: Image file of size 112 saved in 0 seconds.

12/11/07 06:32:31 INFO common.Storage: Storage directory /home/hadoop/temp/dfs/name has been successfully formatted.

12/11/07 06:32:31 INFO namenode.NameNode: SHUTDOWN_MSG: 

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1

************************************************************/


9. Start Hadoop (always start it from the master node)

hadoop@ubuntu:~/bin/hadoop/bin$ ./start-all.sh 

starting namenode, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-namenode-ubuntu.out

slave02: starting datanode, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-datanode-ubuntu.out

slave01: starting datanode, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-datanode-nuke-Satellite-A10.out

master: starting secondarynamenode, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out

starting jobtracker, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-jobtracker-ubuntu.out

slave02: starting tasktracker, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-tasktracker-ubuntu.out

slave01: starting tasktracker, logging to /home/hadoop/bin/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-tasktracker-nuke-Satellite-A10.out

hadoop@ubuntu:~/bin/hadoop/bin$ /usr/local/java-6-sun/bin/jps 

5471 NameNode

6010 JobTracker

6316 Jps

5710 DataNode

5927 SecondaryNameNode

6239 TaskTracker


From the next post on, I plan to put up example code that uses Hadoop.



I've been using Linux for well over ten years, but reinstalling and setting up environments is mostly automated these days, so I had nearly forgotten how to change an IP address or configure the network in a console-only environment.


Today, while trying to add a package to test my Hadoop setup, resolving stalled at 0%, apparently because the nameserver configuration had been wrong all along.


So I'm taking the chance to write down how to configure network interfaces on Ubuntu.


1. Configuring the interface


hadoop@nuke-Satellite-A10:~$sudo vi /etc/network/interfaces


auto lo

iface lo inet loopback


auto eth0

iface eth0 inet static

address 192.168.0.11  --> the static IP address you want goes here

netmask 255.255.255.0

network 192.168.0.0

broadcast 192.168.0.255

gateway 192.168.0.1  --> if you are behind a home router, the router is usually the gateway; put your gateway address here

dns-nameservers 168.126.63.1 168.126.63.2  --> the two nameserver addresses provided by KT



2. Register the nameservers

hadoop@nuke-Satellite-A10:~$sudo vi /etc/resolv.conf


nameserver 168.126.63.1

nameserver 168.126.63.2
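After editing these files, the interface has to be brought back up for the changes to take effect; on an Ubuntu of this vintage either of the following should do it:

hadoop@nuke-Satellite-A10:~$sudo /etc/init.d/networking restart

hadoop@nuke-Satellite-A10:~$sudo ifdown eth0 && sudo ifup eth0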
