大数据教程(10.1)倒排索引建立


In the previous post I covered implementing SQL-style joins with MapReduce; in this installment we build an inverted index.

1. Requirements

In many projects we need to build an index over our documents (forum posts, for example): for each term, we record how many times it appears in each document, so that searches can be answered from the index. This is the most basic building block of a search engine. There are open-source tokenizers (such as CJK analyzers) and search frameworks (such as Lucene), but when the number of files to index grows very large, indexing them directly with Lucene becomes inefficient; at that point we can build the index ourselves with Hadoop.

Figure 1: the documents to be indexed (image not preserved)

Figure 2: the resulting index file (image not preserved)
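To make the goal concrete before diving into the code, here is the shape of the data after each step, taken from the actual outputs shown in section 5. Step one emits one count per word-file pair; step two folds those lines into a single posting list per word:

# after step 1: one count per (word, file) pair
hello--a.txt    2
hello--b.txt    4
hello--c.txt    8

# after step 2: one line per word, each file with its count
hello   a.txt-->2       b.txt-->4       c.txt-->8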

2. Code implementation

Step 1: map-reduce job (count each word's occurrences per file)

package com.empire.hadoop.mr.inverindex;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InverIndexStepOne {

    static class InverIndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        Text        k = new Text();
        IntWritable v = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(" ");
            // The input split tells us which file the current line came from
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String fileName = inputSplit.getPath().getName();
            for (String word : words) {
                // Emit "word--fileName" -> 1 so the reducer can count per-file occurrences
                k.set(word + "--" + fileName);
                context.write(k, v);
            }
        }
    }

    static class InverIndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the 1s for one (word, file) pair
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverIndexStepOne.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(InverIndexStepOneMapper.class);
        job.setReducerClass(InverIndexStepOneReducer.class);

        job.waitForCompletion(true);
    }
}
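One optional tweak, not present in the original driver: since the step-one reducer only sums integers, the same class can also be registered as a combiner, pre-aggregating counts on the map side and shrinking shuffle traffic. A one-line sketch, added alongside the other job.set* calls in the step-one main:

// Optional (not in the original driver): pre-aggregate (word--file, 1) pairs
// map-side. Reusing the reducer as a combiner is safe here because integer
// addition is associative and commutative.
job.setCombinerClass(InverIndexStepOneReducer.class);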

Step 2: map-reduce job (group the per-file counts into one posting list per word)

package com.empire.hadoop.mr.inverindex;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexStepTwo {

    public static class IndexStepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Step-one output lines look like "hello--a.txt\t2";
            // split off the word and keep "a.txt\t2" as the value
            String line = value.toString();
            String[] files = line.split("--");
            context.write(new Text(files[0]), new Text(files[1]));
        }
    }

    public static class IndexStepTwoReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuffer sb = new StringBuffer();
            for (Text text : values) {
                // Turn "a.txt\t2" into "a.txt-->2" and concatenate the postings
                sb.append(text.toString().replace("\t", "-->") + "\t");
            }
            context.write(key, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        // Fall back to local test paths when no arguments are given;
        // the null check must come before args.length is read
        if (args == null || args.length < 2) {
            args = new String[] { "D:/temp/out/part-r-00000", "D:/temp/out2" };
        }

        Configuration config = new Configuration();
        Job job = Job.getInstance(config);
        job.setJarByClass(IndexStepTwo.class);
        job.setMapperClass(IndexStepTwoMapper.class);
        job.setReducerClass(IndexStepTwoReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit 0 on success, 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
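If you would rather submit one jar than two, both steps can be chained in a single driver that waits for step one to finish before pointing step two at its output directory. A minimal sketch under the same package; the InverIndexDriver class and its three-argument convention are illustrative assumptions, not from the original code:

package com.empire.hadoop.mr.inverindex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver that chains the two jobs.
// Usage: hadoop jar index.jar ...InverIndexDriver <input> <intermediate> <output>
public class InverIndexDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job stepOne = Job.getInstance(conf, "inverted-index-step-one");
        stepOne.setJarByClass(InverIndexStepOne.class);
        stepOne.setMapperClass(InverIndexStepOne.InverIndexStepOneMapper.class);
        stepOne.setReducerClass(InverIndexStepOne.InverIndexStepOneReducer.class);
        stepOne.setOutputKeyClass(Text.class);
        stepOne.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(stepOne, new Path(args[0]));
        FileOutputFormat.setOutputPath(stepOne, new Path(args[1]));

        // Step two must not start unless step one succeeded
        if (!stepOne.waitForCompletion(true)) {
            System.exit(1);
        }

        Job stepTwo = Job.getInstance(conf, "inverted-index-step-two");
        stepTwo.setJarByClass(IndexStepTwo.class);
        stepTwo.setMapperClass(IndexStepTwo.IndexStepTwoMapper.class);
        stepTwo.setReducerClass(IndexStepTwo.IndexStepTwoReducer.class);
        stepTwo.setOutputKeyClass(Text.class);
        stepTwo.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(stepTwo, new Path(args[1]));
        FileOutputFormat.setOutputPath(stepTwo, new Path(args[2]));

        System.exit(stepTwo.waitForCompletion(true) ? 0 : 1);
    }
}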

3. Running the program

# Upload the jars and test data (Alt+p opens the sftp session in the terminal client)
lcd d:/
put IndexStepOne.jar IndexStepTwo.jar
put a.txt b.txt c.txt

# Prepare the data files for Hadoop to process
cd /home/hadoop
hadoop fs -mkdir -p /index/indexinput
hdfs dfs -put a.txt b.txt c.txt /index/indexinput

# Run the two jobs in order
hadoop jar IndexStepOne.jar com.empire.hadoop.mr.inverindex.InverIndexStepOne /index/indexinput /index/indexsteponeoutput
hadoop jar IndexStepTwo.jar com.empire.hadoop.mr.inverindex.IndexStepTwo /index/indexsteponeoutput /index/indexsteptwooutput
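One caveat when re-running: FileOutputFormat fails fast if a job's output directory already exists, so clear old results first (e.g. hadoop fs -rm -r /index/indexsteponeoutput /index/indexsteptwooutput).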

4. Run log

[hadoop@centos-aaron-h1 ~]$ hadoop jar IndexStepOne.jar  com.empire.hadoop.mr.inverindex.InverIndexStepOne /index/indexinput /index/indexsteponeoutput   
18/12/19 07:08:42 INFO client.RMProxy: Connecting to ResourceManager at centos-aaron-h1/192.168.29.144:8032
18/12/19 07:08:43 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/12/19 07:08:43 INFO input.FileInputFormat: Total input files to process : 3
18/12/19 07:08:43 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/12/19 07:08:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545173547743_0001
18/12/19 07:08:45 INFO impl.YarnClientImpl: Submitted application application_1545173547743_0001
18/12/19 07:08:45 INFO mapreduce.Job: The url to track the job: http://centos-aaron-h1:8088/proxy/application_1545173547743_0001/
18/12/19 07:08:45 INFO mapreduce.Job: Running job: job_1545173547743_0001
18/12/19 07:08:56 INFO mapreduce.Job: Job job_1545173547743_0001 running in uber mode : false
18/12/19 07:08:56 INFO mapreduce.Job:  map 0% reduce 0%
18/12/19 07:09:05 INFO mapreduce.Job:  map 33% reduce 0%
18/12/19 07:09:20 INFO mapreduce.Job:  map 67% reduce 0%
18/12/19 07:09:21 INFO mapreduce.Job:  map 100% reduce 100%
18/12/19 07:09:23 INFO mapreduce.Job: Job job_1545173547743_0001 completed successfully
18/12/19 07:09:23 INFO mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=1252
                FILE: Number of bytes written=791325
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=689
                HDFS: Number of bytes written=297
                HDFS: Number of read operations=12
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Killed map tasks=1
                Launched map tasks=4
                Launched reduce tasks=1
                Data-local map tasks=4
                Total time spent by all maps in occupied slots (ms)=53828
                Total time spent by all reduces in occupied slots (ms)=13635
                Total time spent by all map tasks (ms)=53828
                Total time spent by all reduce tasks (ms)=13635
                Total vcore-milliseconds taken by all map tasks=53828
                Total vcore-milliseconds taken by all reduce tasks=13635
                Total megabyte-milliseconds taken by all map tasks=55119872
                Total megabyte-milliseconds taken by all reduce tasks=13962240
        Map-Reduce Framework
                Map input records=14
                Map output records=70
                Map output bytes=1106
                Map output materialized bytes=1264
                Input split bytes=345
                Combine input records=0
                Combine output records=0
                Reduce input groups=21
                Reduce shuffle bytes=1264
                Reduce input records=70
                Reduce output records=21
                Spilled Records=140
                Shuffled Maps =3
                Failed Shuffles=0
                Merged Map outputs=3
                GC time elapsed (ms)=1589
                CPU time spent (ms)=5600
                Physical memory (bytes) snapshot=749715456
                Virtual memory (bytes) snapshot=3382075392
                Total committed heap usage (bytes)=380334080
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=344
        File Output Format Counters 
                Bytes Written=297
[hadoop@centos-aaron-h1 ~]$ 
[hadoop@centos-aaron-h1 ~]$  hadoop jar IndexStepTwo.jar  com.empire.hadoop.mr.inverindex.IndexStepTwo /index/indexsteponeoutput /index/indexsteptwooutput
18/12/19 07:11:27 INFO client.RMProxy: Connecting to ResourceManager at centos-aaron-h1/192.168.29.144:8032
18/12/19 07:11:27 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/12/19 07:11:27 INFO input.FileInputFormat: Total input files to process : 1
18/12/19 07:11:28 INFO mapreduce.JobSubmitter: number of splits:1
18/12/19 07:11:28 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/12/19 07:11:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545173547743_0002
18/12/19 07:11:28 INFO impl.YarnClientImpl: Submitted application application_1545173547743_0002
18/12/19 07:11:29 INFO mapreduce.Job: The url to track the job: http://centos-aaron-h1:8088/proxy/application_1545173547743_0002/
18/12/19 07:11:29 INFO mapreduce.Job: Running job: job_1545173547743_0002
18/12/19 07:11:36 INFO mapreduce.Job: Job job_1545173547743_0002 running in uber mode : false
18/12/19 07:11:36 INFO mapreduce.Job:  map 0% reduce 0%
18/12/19 07:11:42 INFO mapreduce.Job:  map 100% reduce 0%
18/12/19 07:11:48 INFO mapreduce.Job:  map 100% reduce 100%
18/12/19 07:11:48 INFO mapreduce.Job: Job job_1545173547743_0002 completed successfully
18/12/19 07:11:48 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=324
                FILE: Number of bytes written=394987
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=427
                HDFS: Number of bytes written=253
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3234
                Total time spent by all reduces in occupied slots (ms)=3557
                Total time spent by all map tasks (ms)=3234
                Total time spent by all reduce tasks (ms)=3557
                Total vcore-milliseconds taken by all map tasks=3234
                Total vcore-milliseconds taken by all reduce tasks=3557
                Total megabyte-milliseconds taken by all map tasks=3311616
                Total megabyte-milliseconds taken by all reduce tasks=3642368
        Map-Reduce Framework
                Map input records=21
                Map output records=21
                Map output bytes=276
                Map output materialized bytes=324
                Input split bytes=130
                Combine input records=0
                Combine output records=0
                Reduce input groups=7
                Reduce shuffle bytes=324
                Reduce input records=21
                Reduce output records=7
                Spilled Records=42
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=210
                CPU time spent (ms)=990
                Physical memory (bytes) snapshot=339693568
                Virtual memory (bytes) snapshot=1694265344
                Total committed heap usage (bytes)=137760768
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=297
        File Output Format Counters 
                Bytes Written=253
[hadoop@centos-aaron-h1 ~]$ 

5. Results

[hadoop@centos-aaron-h1 ~]$  hdfs dfs -cat  /index/indexsteponeoutput/part-r-00000
boby--a.txt     1
boby--b.txt     2
boby--c.txt     4
fork--a.txt     2
fork--b.txt     4
fork--c.txt     8
hello--a.txt    2
hello--b.txt    4
hello--c.txt    8
integer--a.txt  1
integer--b.txt  2
integer--c.txt  4
source--a.txt   1
source--b.txt   2
source--c.txt   4
tom--a.txt      1
tom--b.txt      2
tom--c.txt      4
[hadoop@centos-aaron-h1 ~]$ 
[hadoop@centos-aaron-h1 ~]$  hdfs dfs -cat  /index/indexsteptwooutput/part-r-00000
boby    a.txt-->1       b.txt-->2       c.txt-->4
fork    a.txt-->2       b.txt-->4       c.txt-->8
hello   b.txt-->4       c.txt-->8       a.txt-->2
integer a.txt-->1       b.txt-->2       c.txt-->4
source  a.txt-->1       b.txt-->2       c.txt-->4
tom     a.txt-->1       b.txt-->2       c.txt-->4
[hadoop@centos-aaron-h1 ~]$ 

Final words: that is all for this post. If you found it helpful, please give it a like; if you are interested in the blogger's other big data articles, follow the blog, and feel free to reach out any time.

Reposted from: https://my.oschina.net/u/2371923/blog/2990230
