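Peking University Tianwang Search Engine TSE, Analysis and Full Annotation [5]: Building the Inverted Index and an Overview of the Files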

Sorry to have kept everyone waiting. I was busy with exams for a while, and they are finally over. Without further ado, let's get started.

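TSE loads all of the crawled web pages into one large file and then builds the index over the data in that single file as a whole. The process consists of the following steps.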

1.  The document index (Doc.idx) keeps information about each document.
It is a fixed width ISAM (Indexed Sequential Access Method) index, ordered by docID.
The information stored in each entry includes a pointer into the repository,
a document length, a document checksum.
//Doc.idx  docID	document length	checksum (hash)
0	0	bc9ce846d7987c4534f53d423380ba70
1	76760	4f47a3cad91f7d35f4bb6b2a638420e5
2	141624	d019433008538f65329ae8e39b86026c
3	142350	5705b8f58110f9ad61b1321c52605795
//Doc.idx	end
The url index (url.idx) is used to convert URLs into docIDs.
//url.idx
5c36868a9c5117eadbda747cbdb0725f	0
3272e136dd90263ee306a835c6c70d77	1
6b8601bb3bb9ab80f868d549b5c5a5f3	2
3f9eba99fa788954b5ff7f35a5db6e1f	3
//url.idx	end
It is a list of URL checksums with their corresponding docIDs and is sorted by
checksum. In order to find the docID of a particular URL, the URL's checksum
is computed and a binary search is performed on the checksums file to find its
docID.
./DocIndex
got Doc.idx, Url.idx, DocId2Url.idx	// these files are found in the Data folder
//DocId2Url.idx
0	http://*.*.edu.cn/index.aspx
1	http://*.*.edu.cn/showcontent1.jsp?NewsID=118
2	http://*.*.edu.cn/0102.html
3	http://*.*.edu.cn/0103.html
//DocId2Url.idx	end
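The second column of the Doc.idx sample above grows monotonically (0, 76760, 141624, 142350), which suggests it may hold each document's byte offset into the raw file rather than its length. This is only a reading of the sample, not something the notes confirm. Under that assumption, a document's length is the gap to the next entry's offset. A minimal Python sketch, where `parse_doc_idx` and `doc_lengths` are hypothetical helper names:

```python
def parse_doc_idx(text):
    """Parse Doc.idx lines into (doc_id, offset, checksum) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        checksum = fields[2] if len(fields) > 2 else None
        entries.append((int(fields[0]), int(fields[1]), checksum))
    return entries


def doc_lengths(entries):
    """Assumption: column 2 is a byte offset, so length = next offset - own offset."""
    return {
        doc_id: next_offset - offset
        for (doc_id, offset, _), (_, next_offset, _) in zip(entries, entries[1:])
    }


sample = """0\t0\tbc9ce846d7987c4534f53d423380ba70
1\t76760\t4f47a3cad91f7d35f4bb6b2a638420e5
2\t141624\td019433008538f65329ae8e39b86026c
3\t142350\t5705b8f58110f9ad61b1321c52605795"""

entries = parse_doc_idx(sample)
lengths = doc_lengths(entries)
```

The last entry gets no length here because the sample does not show where that document ends.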
2.  sort Url.idx|uniq > Url.idx.sort_uniq	// Url.idx.sort_uniq in the Data folder
//Url.idx.sort_uniq
// sorted by hash value
000bfdfd8b2dedd926b58ba00d40986b	1111
000c7e34b653b5135a2361c6818e48dc	1831
0019d12f438eec910a06a606f570fde8	366
0033f7c005ec776f67f496cd8bc4ae0d	2103
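The lookup described in step 1 (compute the URL's checksum, then binary-search the sorted checksum list) can be sketched in Python as follows. This is an illustration, not TSE's own code: it assumes the 32-character hex checksum is an MD5 digest of the URL string, and the function names and `example.edu.cn` URLs are made up for the example.

```python
import bisect
import hashlib


def url_checksum(url):
    """Assumption: the checksum is the MD5 hex digest of the URL string."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def build_url_index(urls):
    """Return (checksum, doc_id) pairs sorted by checksum, like Url.idx.sort_uniq."""
    return sorted((url_checksum(u), doc_id) for doc_id, u in enumerate(urls))


def lookup_doc_id(index, url):
    """Binary-search the sorted checksums; return the docID or None if absent."""
    target = url_checksum(url)
    # (target,) sorts just before any (target, doc_id) pair.
    pos = bisect.bisect_left(index, (target,))
    if pos < len(index) and index[pos][0] == target:
        return index[pos][1]
    return None


# Hypothetical URLs, used only to exercise the lookup.
urls = ["http://example.edu.cn/index.aspx", "http://example.edu.cn/0102.html"]
index = build_url_index(urls)
```

Sorting by checksum is what makes the binary search valid, which is why step 2 produces a sorted, de-duplicated copy of Url.idx.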
3. Segment documents into terms (locating each document by its URL)
./DocSegment Tianwang.raw.2559638448		// Tianwang.raw.2559638448 is the crawled file; each page includes its HTTP header
got Tianwang.raw.2559638448.seg		
//Tianwang.raw.2559638448	the raw crawled pages; within the file, consecutive documents are presumably delimited by the version line, </html> and a newline
version: 1.0
url: http://***.105.138.175/Default2.asp?lang=gb
origin: http://***.105.138.175/
date: Fri, 23 May 2008 20:01:36 GMT
ip: 162.105.138.175
length: 38413
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Fri, 23 May 2008 11:17:49 GMT
Connection: keep-alive
Connection: Keep-Alive
Content-Length: 38088
Content-Type: text/html; Charset=gb2312
Expires: Fri, 23 May 2008 11:17:49 GMT
Set-Cookie: ASPSESSIONIDSSTRDCAB=IMEOMBIAIPDFCKPAEDJFHOIH; path=/
Cache-control: private
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Apabi数字资源平台</title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="DESCRIPTION" CONTENT="数字图书馆 方正数字图书馆 电子图书 电子书 ebook e书 Apabi 数字资源平台">
<link rel="stylesheet" type="text/css" href="css/common.css">
<style type="text/css">
<!--
.style4 {color: #666666}
-->
</style>
<script LANGUAGE="vbscript">
...
</script>
<Script Language="javascript">
...
</Script>
</head>
<body leftmargin="0" topmargin="0">
</body>
</html>
//Tianwang.raw.2559638448	end
//Tianwang.raw.2559638448.seg	each page is put on a single line, as below (note that there is no newline separating the content within a page)
1
...
...
...
2
...
...
...
//Tianwang.raw.2559638448.seg	end
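To make the raw-file layout from step 3 more concrete, here is a small Python sketch that collects the `key: value` lines at the start of one record (version, url, origin, date, ip, length). It assumes the record header stops at the first blank line or at the HTTP status line, which matches the sample above but is not confirmed for the real file; `parse_record_header` is a hypothetical name.

```python
def parse_record_header(lines):
    """Collect the 'key: value' lines at the start of one raw record.

    Assumption: the record header ends at the first blank line or at the
    HTTP status line, whichever comes first.
    """
    header = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("HTTP/"):
            break
        key, sep, value = line.partition(": ")
        if not sep:
            break  # not a 'key: value' line, so the header is over
        header[key] = value
    return header


record = [
    "version: 1.0",
    "url: http://***.105.138.175/Default2.asp?lang=gb",
    "origin: http://***.105.138.175/",
    "date: Fri, 23 May 2008 20:01:36 GMT",
    "ip: 162.105.138.175",
    "length: 38413",
    "HTTP/1.1 200 OK",
    "Server: Microsoft-IIS/5.0",
]
header = parse_record_header(record)
```

The HTTP response headers that follow the status line are left untouched, since they belong to the stored page rather than to the record header.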
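// The following steps are not strictly required for Tiny search
4. Create forward index (docid-->termid)		// build the forward index
./CrtForwardIdx Tianwang.raw.2559638448.seg > moon.fidx
//Tianwang.raw.2559638448.seg each page is put on a single line, as below
// a DocID line, followed by that document's segmented terms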
1
三星/  s/  手机/  论坛/  ,/  手机/  铃声/  下载/  ,/  手机/  图片/  下载/  ,/  手机/
2
...
...
...

//Tianwang.raw.2559638448.seg end
//moon.fidx
// each term segmented out of a document, paired with that document's ID:	term	DocID
都会	2391
使	2391
那些	2391
拥有	2391
它	2391
的	2391
人	2391
的	2391
视野	2391
变	2391
窄	2391
在	2180
研究生部	2180
主页	2180
培养	2180
管理	2180
栏目	2180
下载	2180
）	2180
、	2180
关于	2180
做好	2180
年	2180
国家	2180
公派	2180
研究生	2180
项目	2180
//moon.fidx	end
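The transformation in step 4, from the .seg layout (a DocID line followed by a line of terms) to one term/DocID pair per occurrence, might look like this in Python. It is a sketch of the idea rather than CrtForwardIdx itself, and it assumes every DocID line is followed by exactly one term line with `/` after each term. The second document in the sample is invented for illustration.

```python
def forward_index(seg_lines):
    """Yield (term, doc_id) pairs from alternating docID / term lines.

    Assumption: every docID line is followed by exactly one line of
    terms, each term terminated by '/'.
    """
    lines = iter(seg_lines)
    for doc_line in lines:
        doc_id = doc_line.strip()
        if not doc_id:
            continue  # tolerate stray blank lines
        term_line = next(lines, "")
        for term in term_line.split("/"):
            term = term.strip()
            if term:
                yield term, doc_id


seg = [
    "1",
    "三星/  s/  手机/  论坛/  ,/  手机/",
    "2",  # hypothetical second document
    "花卉/  花期/",
]
pairs = list(forward_index(seg))
```

Repeated terms are emitted once per occurrence, matching the duplicate lines visible in the moon.fidx sample.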
5.# set | grep "LANG"
LANG=en; export LANG;
sort moon.fidx > moon.fidx.sort
6. Create inverted index (termid-->docid)	// build the inverted index
./CrtInvertedIdx moon.fidx.sort > sun.iidx
//sun.iidx	// the file size shrinks by roughly half
花工	 236
花海	 2103
花卉	 1018 1061 1061 1061 1730 1730 1730 1730 1730 1852 949 949
花蕾	 447 447
花木	 1061
花呢	 1430
花期	 447 447 447 447 447 525
花钱	 174 236
花色	 1730 1730
花色品种	 1660
花生	 450 526
花式	 1428 1430 1430 1430
花纹	 1430 1430
花序	 447 447 447 447 447 450
花絮	 136 137
花芽	 450 450
//sun.iidx	end
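Steps 5 and 6 together amount to sorting the term/DocID pairs and then collapsing runs of the same term into one posting line. A minimal Python sketch of that grouping, under the assumption that DocIDs are compared as strings (which would explain why 949 follows 1852 in the sample above); `inverted_index` is a hypothetical name and the input pairs are a small made-up selection:

```python
from itertools import groupby


def inverted_index(sorted_pairs):
    """Collapse sorted (term, doc_id) pairs into term -> [doc_id, ...].

    Duplicates are kept, so a docID appears once per occurrence of the term.
    """
    return {
        term: [doc_id for _, doc_id in group]
        for term, group in groupby(sorted_pairs, key=lambda pair: pair[0])
    }


pairs = [("花期", "447"), ("花卉", "949"), ("花期", "525"),
         ("花卉", "1018"), ("花期", "447"), ("花卉", "949")]
index = inverted_index(sorted(pairs))
```

`groupby` only merges adjacent items, so the sort in step 5 is what guarantees each term ends up on a single line.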
TSESearch	CGI program for query
Snapshot	CGI program for page snapshot

author:http://hi.baidu.com/jrckkyy
author:http://blog.csdn.net/jrckkyy
