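A while ago, while working on a small project with Lucene, I learned for the first time what an inverted index is. Back then I just used it because the library provided it, without knowing how it worked internally or what it was actually building.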
While studying Hadoop, I found that MapReduce makes it very simple to build an inverted index.
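To make the idea concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps in plain Python. It is not the code inside the jar used in this post. The function names (map_line, reduce_postings, build_inverted_index) are hypothetical, and the assumptions that tokens are split on whitespace and that each posting is "file:byte offset of the line" are inferred only from the job output shown in this post.

```python
from collections import defaultdict


def map_line(filename, offset, line):
    """Map step (assumed): emit (token, "file:offset") per whitespace-separated token."""
    for token in line.split():
        yield token, f"{filename}:{offset}"


def reduce_postings(token, postings):
    """Reduce step (assumed): join all postings of one token into a single record."""
    return token, ",".join(postings)


def build_inverted_index(files):
    """Simulate map, shuffle, and reduce in memory.

    files: dict mapping a file name to its full text.
    Returns a dict mapping each token to its comma-separated postings.
    """
    grouped = defaultdict(list)  # stands in for the shuffle: group values by key
    for filename, text in files.items():
        offset = 0  # byte offset of the current line within the file
        for line in text.splitlines(keepends=True):
            for token, posting in map_line(filename, offset, line):
                grouped[token].append(posting)
            offset += len(line.encode("utf-8"))
    # Keys are visited in sorted order, as a reducer would see them.
    return dict(reduce_postings(t, p) for t, p in sorted(grouped.items()))


if __name__ == "__main__":
    sample = {"a.txt": "hello world\nhello again\n", "b.txt": "world\n"}
    for token, postings in build_inverted_index(sample).items():
        print(token, postings)
```

In a real Hadoop job the framework does the grouping and sorting between the two steps, so only the map and reduce functions need to be written, which is why the job ends up so short.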
I didn't have any large data files on hand, so I downloaded 10 recommended books as TXT files from Project Gutenberg, which provides free eBooks, and used them as test data.
hadoop@ubuntu:~/work$ ../bin/hadoop/bin/hadoop jar hopeisagoodthing.jar invertedindex -jt local /data/data /data/invertedindext_out
12/11/11 07:16:29 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/11/11 07:16:29 INFO input.FileInputFormat: Total input paths to process : 10
12/11/11 07:16:29 WARN snappy.LoadSnappy: Snappy native library not loaded
12/11/11 07:16:29 INFO mapred.JobClient: Running job: job_local_0001
12/11/11 07:16:30 INFO util.ProcessTree: setsid exited with exit code 0
12/11/11 07:16:30 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@93d6bc
12/11/11 07:16:30 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:16:30 INFO mapred.JobClient: map 0% reduce 0%
12/11/11 07:16:32 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:16:32 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:16:33 INFO mapred.MapTask: Spilling map output: record full = true
12/11/11 07:16:33 INFO mapred.MapTask: bufstart = 0; bufend = 6288000; bufvoid = 99614720
12/11/11 07:16:33 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
12/11/11 07:16:33 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:16:33 INFO mapred.MapTask: Finished spill 0
12/11/11 07:16:33 INFO mapred.MapTask: Finished spill 1
12/11/11 07:16:33 INFO mapred.Merger: Merging 2 sorted segments
12/11/11 07:16:33 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 6967137 bytes
12/11/11 07:16:34 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/11/11 07:16:35 INFO mapred.LocalJobRunner:
12/11/11 07:16:35 INFO mapred.LocalJobRunner:
12/11/11 07:16:35 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/11/11 07:16:35 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1b64e6a
12/11/11 07:16:35 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:16:36 INFO mapred.JobClient: map 100% reduce 0%
12/11/11 07:16:36 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:16:36 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:16:37 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:16:38 INFO mapred.MapTask: Finished spill 0
12/11/11 07:16:38 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
12/11/11 07:16:38 INFO mapred.LocalJobRunner:
12/11/11 07:16:38 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
12/11/11 07:16:38 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@161dfb5
12/11/11 07:16:38 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:16:39 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:16:39 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:16:39 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:16:39 INFO mapred.MapTask: Finished spill 0
12/11/11 07:16:39 INFO mapred.Task: Task:attempt_local_0001_m_000002_0 is done. And is in the process of commiting
12/11/11 07:16:41 INFO mapred.LocalJobRunner:
12/11/11 07:16:41 INFO mapred.Task: Task 'attempt_local_0001_m_000002_0' done.
12/11/11 07:16:42 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@c09554
12/11/11 07:16:42 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:16:42 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:16:42 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:16:42 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:16:42 INFO mapred.MapTask: Finished spill 0
12/11/11 07:16:42 INFO mapred.Task: Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting
12/11/11 07:16:45 INFO mapred.LocalJobRunner:
12/11/11 07:16:45 INFO mapred.Task: Task 'attempt_local_0001_m_000003_0' done.
12/11/11 07:16:45 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1309e87
12/11/11 07:16:45 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:16:45 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:16:45 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:16:45 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:16:45 INFO mapred.MapTask: Finished spill 0
12/11/11 07:16:45 INFO mapred.Task: Task:attempt_local_0001_m_000004_0 is done. And is in the process of commiting
12/11/11 07:16:48 INFO mapred.LocalJobRunner:
12/11/11 07:16:48 INFO mapred.Task: Task 'attempt_local_0001_m_000004_0' done.
12/11/11 07:16:48 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6c585a
12/11/11 07:16:48 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:16:48 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:16:48 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:16:48 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:16:48 INFO mapred.MapTask: Finished spill 0
12/11/11 07:16:48 INFO mapred.Task: Task:attempt_local_0001_m_000005_0 is done. And is in the process of commiting
12/11/11 07:16:51 INFO mapred.LocalJobRunner:
12/11/11 07:16:51 INFO mapred.Task: Task 'attempt_local_0001_m_000005_0' done.
12/11/11 07:16:51 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@e3c624
12/11/11 07:16:51 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:16:51 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:16:51 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:16:51 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:16:51 INFO mapred.MapTask: Finished spill 0
12/11/11 07:16:51 INFO mapred.Task: Task:attempt_local_0001_m_000006_0 is done. And is in the process of commiting
12/11/11 07:16:54 INFO mapred.LocalJobRunner:
12/11/11 07:16:54 INFO mapred.Task: Task 'attempt_local_0001_m_000006_0' done.
12/11/11 07:16:54 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1950198
12/11/11 07:16:54 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:16:54 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:16:54 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:16:54 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:16:54 INFO mapred.MapTask: Finished spill 0
12/11/11 07:16:54 INFO mapred.Task: Task:attempt_local_0001_m_000007_0 is done. And is in the process of commiting
12/11/11 07:16:57 INFO mapred.LocalJobRunner:
12/11/11 07:16:57 INFO mapred.Task: Task 'attempt_local_0001_m_000007_0' done.
12/11/11 07:16:57 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@53fb57
12/11/11 07:16:57 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:16:57 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:16:57 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:16:57 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:16:57 INFO mapred.MapTask: Finished spill 0
12/11/11 07:16:57 INFO mapred.Task: Task:attempt_local_0001_m_000008_0 is done. And is in the process of commiting
12/11/11 07:17:00 INFO mapred.LocalJobRunner:
12/11/11 07:17:00 INFO mapred.Task: Task 'attempt_local_0001_m_000008_0' done.
12/11/11 07:17:00 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1742700
12/11/11 07:17:00 INFO mapred.MapTask: io.sort.mb = 100
12/11/11 07:17:00 INFO mapred.MapTask: data buffer = 79691776/99614720
12/11/11 07:17:00 INFO mapred.MapTask: record buffer = 262144/327680
12/11/11 07:17:00 INFO mapred.MapTask: Starting flush of map output
12/11/11 07:17:00 INFO mapred.MapTask: Finished spill 0
12/11/11 07:17:00 INFO mapred.Task: Task:attempt_local_0001_m_000009_0 is done. And is in the process of commiting
12/11/11 07:17:03 INFO mapred.LocalJobRunner:
12/11/11 07:17:03 INFO mapred.Task: Task 'attempt_local_0001_m_000009_0' done.
12/11/11 07:17:03 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@491c4c
12/11/11 07:17:03 INFO mapred.LocalJobRunner:
12/11/11 07:17:03 INFO mapred.Merger: Merging 10 sorted segments
12/11/11 07:17:03 INFO mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 22414959 bytes
12/11/11 07:17:03 INFO mapred.LocalJobRunner:
12/11/11 07:17:05 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/11/11 07:17:05 INFO mapred.LocalJobRunner:
12/11/11 07:17:05 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/11/11 07:17:05 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /data/invertedindext_out
12/11/11 07:17:06 INFO mapred.LocalJobRunner: reduce > reduce
12/11/11 07:17:06 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/11/11 07:17:06 INFO mapred.JobClient: map 100% reduce 100%
12/11/11 07:17:06 INFO mapred.JobClient: Job complete: job_local_0001
12/11/11 07:17:06 INFO mapred.JobClient: Counters: 22
12/11/11 07:17:06 INFO mapred.JobClient: File Output Format Counters
12/11/11 07:17:06 INFO mapred.JobClient: Bytes Written=16578538
12/11/11 07:17:06 INFO mapred.JobClient: FileSystemCounters
12/11/11 07:17:06 INFO mapred.JobClient: FILE_BYTES_READ=99911067
12/11/11 07:17:06 INFO mapred.JobClient: HDFS_BYTES_READ=46741458
12/11/11 07:17:06 INFO mapred.JobClient: FILE_BYTES_WRITTEN=286139450
12/11/11 07:17:06 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=16578538
12/11/11 07:17:06 INFO mapred.JobClient: File Input Format Counters
12/11/11 07:17:06 INFO mapred.JobClient: Bytes Read=5048729
12/11/11 07:17:06 INFO mapred.JobClient: Map-Reduce Framework
12/11/11 07:17:06 INFO mapred.JobClient: Map output materialized bytes=22414999
12/11/11 07:17:06 INFO mapred.JobClient: Map input records=108530
12/11/11 07:17:06 INFO mapred.JobClient: Reduce shuffle bytes=0
12/11/11 07:17:06 INFO mapred.JobClient: Spilled Records=2015979
12/11/11 07:17:06 INFO mapred.JobClient: Map output bytes=20666931
12/11/11 07:17:06 INFO mapred.JobClient: Total committed heap usage (bytes)=2057838592
12/11/11 07:17:06 INFO mapred.JobClient: CPU time spent (ms)=0
12/11/11 07:17:06 INFO mapred.JobClient: SPLIT_RAW_BYTES=1062
12/11/11 07:17:06 INFO mapred.JobClient: Combine input records=0
12/11/11 07:17:06 INFO mapred.JobClient: Reduce input records=874004
12/11/11 07:17:06 INFO mapred.JobClient: Reduce input groups=94211
12/11/11 07:17:06 INFO mapred.JobClient: Combine output records=0
12/11/11 07:17:06 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
12/11/11 07:17:06 INFO mapred.JobClient: Reduce output records=94211
12/11/11 07:17:06 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
12/11/11 07:17:06 INFO mapred.JobClient: Map output records=874004
hadoop@ubuntu:~/work$ ../bin/hadoop/bin/hadoop dfs -get /data/invertedindext_out ./
hadoop@ubuntu:~/work$ cd invertedindext_out/
hadoop@ubuntu:~/work/invertedindext_out$ ls
part-r-00000 _SUCCESS
hadoop@ubuntu:~/work/invertedindext_out$ more part-r-00000
! prince.txt:41794
" pg27827.txt:25391,pg27827.txt:22695,pg27827.txt:23024,pg27827.txt:23250,pg27827.txt:22637,pg27827.txt:22398,pg27827.txt:22343,pg27827.txt:23961,pg27827.txt:24079,pg27827.txt:24142,pg27827.txt:24191,pg27827.txt:24298,pg27827.txt:22284,pg27827.txt:22173,pg27827.txt:22062,pg27827.txt:24690,pg27827.txt:24755,pg27827.txt:21931,pg27827.txt:21873,pg27827.txt:21842,pg27827.txt:21807,pg27827.txt:21490,pg27827.txt:21300,pg27827.txt:21243,pg27827.txt:22812,pg27827.txt:24933,pg27827.txt:24990,pg27827.txt:25037
"'After pg1342.txt:639410
"'My pg1342.txt:638650
"'Spells pg132.txt:249299
"'TIS pg11.txt:121592
"'Tis pg1342.txt:584553,pg1342.txt:609915
"'To prince.txt:64294
"'army' pg132.txt:15601
"(1) pg132.txt:264126
"(1)". pg27827.txt:336002
"(2)". pg27827.txt:335943
"(Lo)cra" pg5000.txt:656915
"--Exactly. prince.txt:81916
"--SAID pg11.txt:143118
"13 pg132.txt:25622,pg132.txt:19470,pg132.txt:37173,pg132.txt:18165
"1490 pg5000.txt:1354328
"1498," pg5000.txt:1372794
"35" pg5000.txt:723641
"40," pg5000.txt:628271
"A pg132.txt:106978,pg132.txt:316296,pg132.txt:143678,pg132.txt:295414,pg132.txt:233970,pg132.txt:295533,pg132.txt:211327,prince.txt:22778,prince.txt:27294,prince.txt:20386,prince.txt:48701,prince.txt:20453,prince.txt:22765,prince.txt:51105,pg27827.txt:338327,pg27827.txt:250594,pg27827.txt:279032,pg27827.txt:287388,pg27827.txt:286979,pg27827.txt:240963,pg27827.txt:338358,pg1342.txt:136267,pg1342.txt:288024,pg1342.txt:428735,pg1342.txt:522298,pg1342.txt:399633,pg1342.txt:671942,pg1342.txt:137439,pg1342.txt:269156,pg1342.txt:101072,pg1342.txt:600412,pg1342.txt:381033,pg1342.txt:401449,pg30601.txt:192068,pg30601.txt:116286,pg30601.txt:63986,pg30601.txt:191918,pg30601.txt:63841
"AS-IS". pg5000.txt:1419667
"A_ pg5000.txt:690824
"Abide pg132.txt:187432
"About pg132.txt:7622,pg1342.txt:101653,pg1342.txt:130436,pg27827.txt:115885