A while back I worked on a small project with Lucene, and that was my first encounter with an inverted index. At the time I just used what the library provided, without understanding what it was actually building or how it worked internally.
While studying Hadoop, I found that MapReduce makes it remarkably simple to build an inverted index yourself.
Since I didn't have any files with a lot of data on hand, I downloaded ten recommended books as TXT files from Project Gutenberg, which offers free eBooks, and used them as test data.
http://www.gutenberg.org/wiki/Main_Page
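For context, an inverted index maps each term to the documents (and positions) where it occurs. Here is a minimal in-memory sketch of the idea — the class and method names are my own, not from Lucene or Hadoop; the MapReduce job below builds the same structure at scale:

```java
import java.util.*;

// Minimal in-memory inverted index: term -> list of "docId:position" postings.
public class TinyInvertedIndex {
    private final Map<String, List<String>> index = new HashMap<>();

    // Record every token of a document together with its character offset.
    public void add(String docId, String text) {
        int pos = 0;
        for (String token : text.split(" ")) {
            if (!token.isEmpty()) {
                index.computeIfAbsent(token, k -> new ArrayList<>())
                     .add(docId + ":" + pos);
            }
            pos += token.length() + 1; // +1 for the space delimiter
        }
    }

    public List<String> lookup(String term) {
        return index.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add("a.txt", "to be or not to be");
        idx.add("b.txt", "to do");
        System.out.println(idx.lookup("to")); // "to" at offsets 0 and 13 in a.txt, 0 in b.txt
    }
}
```

Answering "which documents contain this word, and where?" is then a single map lookup — that is the whole point of the structure.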
1. Moving the data into the Hadoop filesystem
hadoop@ubuntu:~/work$ cd data
hadoop@ubuntu:~/work/data$ ls
matrix_input.2x2 pg11.txt pg1342.txt pg30601.txt pg5000.txt
matrixmulti.2x3x2 pg132.txt pg27827.txt pg4300.txt prince.txt
hadoop@ubuntu:~/work$ hadoop dfs -put data /data
hadoop@ubuntu:~/work$ ../bin/hadoop/bin/hadoop dfs -ls /data/data
Found 8 items
-rw-r--r-- 1 hadoop supergroup 167497 2012-11-11 07:06 /data/data/pg11.txt
-rw-r--r-- 1 hadoop supergroup 343695 2012-11-11 07:06 /data/data/pg132.txt
-rw-r--r-- 1 hadoop supergroup 704139 2012-11-11 07:06 /data/data/pg1342.txt
-rw-r--r-- 1 hadoop supergroup 359504 2012-11-11 07:06 /data/data/pg27827.txt
-rw-r--r-- 1 hadoop supergroup 384522 2012-11-11 07:06 /data/data/pg30601.txt
-rw-r--r-- 1 hadoop supergroup 1573112 2012-11-11 07:06 /data/data/pg4300.txt
-rw-r--r-- 1 hadoop supergroup 1423801 2012-11-11 07:06 /data/data/pg5000.txt
-rw-r--r-- 1 hadoop supergroup 92295 2012-11-11 07:06 /data/data/prince.txt
2. Writing the MapReduce job that builds the inverted index (the code is surprisingly short!)
package hopeisagoodthing;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {

    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
        private Text outValue = new Text();
        private Text word = new Text();
        private String docId = null;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The document id is simply the name of the file this split came from.
            docId = ((FileSplit) context.getInputSplit()).getPath().getName();
        }

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // The key is the byte offset of this line within the file; by asking the
            // tokenizer to return the space delimiters too, we can track the absolute
            // position of every token.
            StringTokenizer iter = new StringTokenizer(value.toString(), " ", true);
            long pos = ((LongWritable) key).get();
            while (iter.hasMoreTokens()) {
                String token = iter.nextToken();
                if (token.equals(" ")) {
                    pos = pos + 1;               // a space advances the position by one
                } else {
                    word.set(token);
                    outValue.set(docId + ":" + pos); // emit "word -> docId:position"
                    pos = pos + token.length();
                    context.write(word, outValue);
                }
            }
        }
    }

    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        private Text postings = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Concatenate every "docId:position" entry for this word, comma-separated.
            StringBuilder sb = new StringBuilder();
            for (Text val : values) {
                sb.append(val);
                sb.append(",");
            }
            sb.setLength(sb.length() - 1); // drop the trailing comma
            postings.set(sb.toString());
            context.write(key, postings);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        Job job = new Job(conf, "invertedindex");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setReducerClass(InvertedIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(2);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
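The only subtle part of the mapper is the position arithmetic. Here is the same logic reproduced without Hadoop, so it can be tried on a single line (the class and method names here are my own, purely for illustration): given a line and the byte offset at which that line starts (the map key), it emits one "token → docId:absolutePosition" pair per token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Stand-alone reproduction of the mapper's position tracking.
public class PositionDemo {
    static List<String> emit(String docId, long lineStart, String line) {
        List<String> pairs = new ArrayList<>();
        // Third argument 'true' makes the tokenizer return the spaces as tokens too.
        StringTokenizer iter = new StringTokenizer(line, " ", true);
        long pos = lineStart;
        while (iter.hasMoreTokens()) {
            String token = iter.nextToken();
            if (token.equals(" ")) {
                pos += 1;                        // a delimiter advances the cursor by one
            } else {
                pairs.add(token + "\t" + docId + ":" + pos);
                pos += token.length();           // a word advances by its own length
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // A line that starts at byte offset 100 of pg11.txt:
        for (String p : emit("pg11.txt", 100, "down the rabbit hole")) {
            System.out.println(p);
        }
    }
}
```

Note that this counts character positions within the line plus the line's starting byte offset, so for multi-byte or multi-space input the "positions" are approximate — good enough for this experiment.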
3. Running it on Hadoop
hadoop@ubuntu:~/work$ ../bin/hadoop/bin/hadoop jar hopeisagoodthing.jar invertedindex -jt local /data/data /data/invertedindext_out
12/11/11 07:16:29 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/11/11 07:16:29 INFO input.FileInputFormat: Total input paths to process : 10
12/11/11 07:16:29 WARN snappy.LoadSnappy: Snappy native library not loaded
12/11/11 07:16:29 INFO mapred.JobClient: Running job: job_local_0001
12/11/11 07:16:30 INFO util.ProcessTree: setsid exited with exit code 0
12/11/11 07:16:30 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@93d6bc
12/11/11 07:16:30 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:16:30 INFO mapred.JobClient: map 0% reduce 0%
12/11/11 07:16:32 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:16:32 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:16:33 INFO mapred.MapTask: Spilling map output: record full = true
12/11/11 07:16:33 INFO mapred.MapTask: bufstart = 0; bufend = 6288000; bufvoid = 99614720
12/11/11 07:16:33 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
12/11/11 07:16:33 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:16:33 INFO mapred.MapTask: Finished spill 0
12/11/11 07:16:33 INFO mapred.MapTask: Finished spill 1
12/11/11 07:16:33 INFO mapred.Merger: Merging 2 sorted segments
12/11/11 07:16:33 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 6967137 bytes
12/11/11 07:16:34 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/11/11 07:16:35 INFO mapred.LocalJobRunner:
12/11/11 07:16:35 INFO mapred.LocalJobRunner:
12/11/11 07:16:35 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/11/11 07:16:35 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1b64e6a
12/11/11 07:16:35 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:16:36 INFO mapred.JobClient: map 100% reduce 0%
12/11/11 07:16:36 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:16:36 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:16:37 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:16:38 INFO mapred.MapTask: Finished spill 0
12/11/11 07:16:38 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
12/11/11 07:16:38 INFO mapred.LocalJobRunner:
12/11/11 07:16:38 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
(… near-identical log blocks for map tasks attempt_local_0001_m_000002_0 through attempt_local_0001_m_000009_0 omitted …)
12/11/11 07:17:03 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@491c4c
12/11/11 07:17:03 INFO mapred.LocalJobRunner:
12/11/11 07:17:03 INFO mapred.Merger: Merging 10 sorted segments
12/11/11 07:17:03 INFO mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 22414959 bytes
12/11/11 07:17:03 INFO mapred.LocalJobRunner:
12/11/11 07:17:05 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/11/11 07:17:05 INFO mapred.LocalJobRunner:
12/11/11 07:17:05 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/11/11 07:17:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /data/invertedindext_out
12/11/11 07:17:06 INFO mapred.LocalJobRunner: reduce > reduce
12/11/11 07:17:06 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/11/11 07:17:06 INFO mapred.JobClient: map 100% reduce 100%
12/11/11 07:17:06 INFO mapred.JobClient: Job complete: job_local_0001
12/11/11 07:17:06 INFO mapred.JobClient: Counters: 22
12/11/11 07:17:06 INFO mapred.JobClient: File Output Format Counters
12/11/11 07:17:06 INFO mapred.JobClient: Bytes Written=16578538
12/11/11 07:17:06 INFO mapred.JobClient: FileSystemCounters
12/11/11 07:17:06 INFO mapred.JobClient: FILE_BYTES_READ=99911067
12/11/11 07:17:06 INFO mapred.JobClient: HDFS_BYTES_READ=46741458
12/11/11 07:17:06 INFO mapred.JobClient: FILE_BYTES_WRITTEN=286139450
12/11/11 07:17:06 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=16578538
12/11/11 07:17:06 INFO mapred.JobClient: File Input Format Counters
12/11/11 07:17:06 INFO mapred.JobClient: Bytes Read=5048729
12/11/11 07:17:06 INFO mapred.JobClient: Map-Reduce Framework
12/11/11 07:17:06 INFO mapred.JobClient: Map output materialized bytes=22414999
12/11/11 07:17:06 INFO mapred.JobClient: Map input records=108530
12/11/11 07:17:06 INFO mapred.JobClient: Reduce shuffle bytes=0
12/11/11 07:17:06 INFO mapred.JobClient: Spilled Records=2015979
12/11/11 07:17:06 INFO mapred.JobClient: Map output bytes=20666931
12/11/11 07:17:06 INFO mapred.JobClient: Total committed heap usage (bytes)=2057838592
12/11/11 07:17:06 INFO mapred.JobClient: CPU time spent (ms)=0
12/11/11 07:17:06 INFO mapred.JobClient: SPLIT_RAW_BYTES=1062
12/11/11 07:17:06 INFO mapred.JobClient: Combine input records=0
12/11/11 07:17:06 INFO mapred.JobClient: Reduce input records=874004
12/11/11 07:17:06 INFO mapred.JobClient: Reduce input groups=94211
12/11/11 07:17:06 INFO mapred.JobClient: Combine output records=0
12/11/11 07:17:06 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
12/11/11 07:17:06 INFO mapred.JobClient: Reduce output records=94211
12/11/11 07:17:06 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
12/11/11 07:17:06 INFO mapred.JobClient: Map output records=874004
4. Fetching the results and taking a look (only part-r-00000 was produced: the local job runner used here appears to run just a single reduce task, despite setNumReduceTasks(2))
hadoop@ubuntu:~/work$ ../bin/hadoop/bin/hadoop dfs -get /data/invertedindext_out ./
hadoop@ubuntu:~/work$ cd invertedindext_out/
hadoop@ubuntu:~/work/invertedindext_out$ ls
part-r-00000 _SUCCESS
hadoop@ubuntu:~/work/invertedindext_out$ more part-r-00000
! prince.txt:41794
" pg27827.txt:25391,pg27827.txt:22695,pg27827.txt:23024,pg27827.txt:23250,pg27827.txt:22637,pg27827.txt:22398,pg
27827.txt:22343,pg27827.txt:23961,pg27827.txt:24079,pg27827.txt:24142,pg27827.txt:24191,pg27827.txt:24298,pg27827.txt:
22284,pg27827.txt:22173,pg27827.txt:22062,pg27827.txt:24690,pg27827.txt:24755,pg27827.txt:21931,pg27827.txt:21873,pg27
827.txt:21842,pg27827.txt:21807,pg27827.txt:21490,pg27827.txt:21300,pg27827.txt:21243,pg27827.txt:22812,pg27827.txt:24
933,pg27827.txt:24990,pg27827.txt:25037
"'After pg1342.txt:639410
"'My pg1342.txt:638650
"'Spells pg132.txt:249299
"'TIS pg11.txt:121592
"'Tis pg1342.txt:584553,pg1342.txt:609915
"'To prince.txt:64294
"'army' pg132.txt:15601
"(1) pg132.txt:264126
"(1)". pg27827.txt:336002
"(2)". pg27827.txt:335943
"(Lo)cra" pg5000.txt:656915
"--Exactly. prince.txt:81916
"--SAID pg11.txt:143118
"13 pg132.txt:25622,pg132.txt:19470,pg132.txt:37173,pg132.txt:18165
"1490 pg5000.txt:1354328
"1498," pg5000.txt:1372794
"35" pg5000.txt:723641
"40," pg5000.txt:628271
"A pg132.txt:106978,pg132.txt:316296,pg132.txt:143678,pg132.txt:295414,pg132.txt:233970,pg132.txt:295533,pg132.tx
t:211327,prince.txt:22778,prince.txt:27294,prince.txt:20386,prince.txt:48701,prince.txt:20453,prince.txt:22765,prince.
txt:51105,pg27827.txt:338327,pg27827.txt:250594,pg27827.txt:279032,pg27827.txt:287388,pg27827.txt:286979,pg27827.txt:2
40963,pg27827.txt:338358,pg1342.txt:136267,pg1342.txt:288024,pg1342.txt:428735,pg1342.txt:522298,pg1342.txt:399633,pg1
342.txt:671942,pg1342.txt:137439,pg1342.txt:269156,pg1342.txt:101072,pg1342.txt:600412,pg1342.txt:381033,pg1342.txt:40
1449,pg30601.txt:192068,pg30601.txt:116286,pg30601.txt:63986,pg30601.txt:191918,pg30601.txt:63841
"AS-IS". pg5000.txt:1419667
"A_ pg5000.txt:690824
"Abide pg132.txt:187432
"About pg132.txt:7622,pg1342.txt:101653,pg1342.txt:130436,pg27827.txt:115885
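Each output line has the form `term<TAB>docId:pos,docId:pos,...`. A small parser for reading the result back — the class and helper names are mine, not part of the job:

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Parses one line of the job's output into a term and its postings list.
public class PostingsParser {
    static Map.Entry<String, List<String>> parse(String line) {
        String[] kv = line.split("\t", 2);                 // term, then the postings
        List<String> postings = Arrays.asList(kv[1].split(","));
        return new AbstractMap.SimpleEntry<>(kv[0], postings);
    }

    public static void main(String[] args) {
        Map.Entry<String, List<String>> e =
            parse("\"'Tis\tpg1342.txt:584553,pg1342.txt:609915");
        System.out.println(e.getKey() + " appears " + e.getValue().size() + " times");
    }
}
```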