Building a Custom Search (Installing Elasticsearch, Syncing Data with mongo-connector, Querying from Python)


The goal is to build a search service based on Elasticsearch (es), with the data stored in MongoDB.

1: Elasticsearch

Download:

Elasticsearch download page: https://www.elastic.co/downloads/elasticsearch

Install:

Edit elasticsearch-5.5.1/config/elasticsearch.yml

# Cluster name
cluster.name: myElasticsearch
# Node name
node.name: node001
# 0.0.0.0 lets other machines reach this node
network.host: 0.0.0.0
# HTTP port
http.port: 9200

Start command: elasticsearch-5.5.1/bin/elasticsearch

In a browser, open: 127.0.0.1:9200
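
The same check can be scripted instead of done in a browser. The sketch below assumes the node answers on http://127.0.0.1:9200 as configured above and that the root endpoint returns JSON containing `name`, `cluster_name`, and `version.number`. The helper names `describe_node` and `fetch_root` are made up for this illustration and are not part of any library.

```python
import json
from urllib.request import urlopen


def describe_node(info):
    """Summarise the JSON document returned by the Elasticsearch root endpoint."""
    version = info.get("version", {}).get("number", "unknown")
    return "{} @ {} (version {})".format(
        info.get("name", "?"), info.get("cluster_name", "?"), version
    )


def fetch_root(url="http://127.0.0.1:9200", timeout=5):
    """Fetch and decode the root endpoint; raises URLError if the node is down."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# With a running node (assumption), this prints a one-line summary:
# print(describe_node(fetch_root()))
```

If the call raises a connection error, the node is probably not running or network.host / http.port differ from the values assumed here.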

2: Elasticsearch-head

Edit elasticsearch-5.5.1/config/elasticsearch.yml

# Add these settings so that the head plugin can access es
http.cors.enabled: true
http.cors.allow-origin: "*"

Download (requires git):

git clone https://github.com/mobz/elasticsearch-head.git

Install grunt (requires node and npm):

npm install -g grunt-cli
npm install -g grunt

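Modify the head source

elasticsearch-head/Gruntfile.js

connect: {
    server: {
        options: {
            port: 9100,
            hostname: '*',
            base: '.',
            keepalive: true
        }
    }
}

The change is the added line hostname: '*', which makes the head server reachable from other machines.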

elasticsearch-head/_site/app.js

// Change the address that head connects to:
this.base_uri = this.config.base_uri || this.prefs.get("app-base_uri") || "http://localhost:9200";
// Replace localhost with the address of your es server, for example:
this.base_uri = this.config.base_uri || this.prefs.get("app-base_uri") || "http://x.x.x.x:9200";

Install elasticsearch-head (requires node and npm)

cd elasticsearch-head/

npm install

Start:

grunt server

In a browser, open: 127.0.0.1:9100

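3: mongo-connector

mongo-connector requires MongoDB to run as a replica set, because it reads changes from the oplog.

Create three data directories, one per node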

# --replSet is followed by the replica set name, --port is the port, --dbpath is the data directory
# First node
sudo mongod --dbpath=/Users/zjl/mongodbdata/data1 --port 27018 --replSet rs0
# Second node
sudo mongod --dbpath=/Users/zjl/mongodbdata/data2 --port 27019 --replSet rs0
# Third node
sudo mongod --dbpath=/Users/zjl/mongodbdata/data3 --port 27020 --replSet rs0

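Open the mongo shell

mongo 127.0.0.1:27018

config = {"_id": "rs0", members: [
    {"_id": 0, "host": "127.0.0.1:27018"},
    {"_id": 1, "host": "127.0.0.1:27019"},
    {"_id": 2, "host": "127.0.0.1:27020", arbiterOnly: true}
]}
// arbiterOnly: true makes this node an arbiter, which stores no data. Reportedly an arbiter adds little and is unnecessary as long as the number of nodes is odd.
// Initialize the replica set:
rs.initiate(config);
// View the configuration:
rs.conf();
// View the status:
rs.status();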
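
When the member list changes often, the configuration document can be generated rather than typed by hand. The following is a minimal sketch in plain Python that only builds the dictionary. `build_replset_config` and its `arbiter_last` flag are hypothetical names introduced for this example, and the hosts are the three local nodes started above.

```python
def build_replset_config(name, hosts, arbiter_last=False):
    """Build the document passed to rs.initiate() for the given host:port list."""
    if not hosts:
        raise ValueError("at least one member is required")
    members = [{"_id": i, "host": host} for i, host in enumerate(hosts)]
    if arbiter_last:
        # The arbiter votes in elections but stores no data.
        members[-1]["arbiterOnly"] = True
    return {"_id": name, "members": members}


config = build_replset_config(
    "rs0",
    ["127.0.0.1:27018", "127.0.0.1:27019", "127.0.0.1:27020"],
    arbiter_last=True,
)
# Members that actually hold data (everything except the arbiter).
data_nodes = [m["host"] for m in config["members"] if not m.get("arbiterOnly")]
print(config)
print(data_nodes)
```

The printed dictionary has the same shape as the config object typed into the mongo shell, so it can be pasted there as JSON.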

Install mongo-connector

https://github.com/mongodb-labs/mongo-connector

pip install 'mongo-connector[elastic5]'

Sync command: mongo-connector -m 127.0.0.1:27018 -t 127.0.0.1:9200 -d elastic2_doc_manager

If the output shows "Logging to /xx/xx/mongo-connector.log.", it is running normally.

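Now, when we create a new database on the MongoDB primary, it is replicated to the other nodes right away and also synced to Elasticsearch.

The number after rs is the port; 27018 is the primary. I created a new database on the primary, and it was replicated to the secondaries immediately.

It was synced to Elasticsearch as well: a MongoDB database becomes an Elasticsearch index, and a collection becomes an Elasticsearch type.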
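
The mapping from a MongoDB namespace to an Elasticsearch index and type can be pictured with a small helper. This is only an illustration of the naming rule described above, not mongo-connector's own code. The function name is hypothetical, `users` is a made-up collection, and the lowercasing reflects the fact that Elasticsearch index names must be lowercase.

```python
def namespace_to_index_and_type(namespace):
    """Split 'database.collection' into (index, doc_type)."""
    # Collection names may contain dots, so split only at the first one.
    database, _, collection = namespace.partition(".")
    if not database or not collection:
        raise ValueError("expected 'database.collection', got %r" % namespace)
    # Elasticsearch index names must be lowercase.
    return database.lower(), collection


print(namespace_to_index_and_type("test99911.users"))
```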

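4: Using Elasticsearch from Python

https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html

pip install elasticsearch

Search engines such as Baidu and Google extract keywords from the user's input, remove duplicates, and correct misspelled English words.

First, add some data to MongoDB.

Elasticsearch's default string-similarity algorithm is reportedly TF-IDF (since version 5.0 the default is BM25, which is also based on TF-IDF), but it does not segment Chinese text into words. The ik analyzer plugin for Elasticsearch would save implementing segmentation by hand, but here I use jieba. The scoring inside es is already built in, and string-matching algorithms all follow much the same patterns, so post-processing the search results with another algorithm would not add much. What I can do is process the keywords before they reach es, for example keyword extraction and spelling correction. That feels quite limited, since I do not really know how a real search engine works internally.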
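
The preprocessing idea (tokenize, drop stop words, remove duplicates) can be sketched without any third-party package. Note the assumptions: a simple regular expression stands in for jieba here, so Chinese text is split into single characters rather than real words, and the stop-word set is a tiny illustrative sample. `preprocess` is a name invented for this sketch.

```python
import re

STOP_WORDS = {"the", "a", "of", "的", "了", "和"}  # tiny illustrative list


def preprocess(query, stop_words=STOP_WORDS):
    """Return the query's keywords: tokenized, stop words removed, de-duplicated."""
    # Stand-in tokenizer: runs of ASCII letters/digits, or single CJK characters.
    tokens = re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", query.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if token in stop_words or token in seen:
            continue
        seen.add(token)
        keywords.append(token)  # keep first-seen order
    return keywords


print(preprocess("The history of the Python language"))
# ['history', 'python', 'language']
```

Keeping the first-seen order matters because the keywords are later joined back into a single query string.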

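estest

|----enchant_py.py (spelling correction, found online)

|----EsQuery.py (elasticsearch operations)

|----flaskrun.py (flask service)

|----dict.txt (jieba dictionary; when segmentation results are poor, add words here and set their frequency)

|----stop_words.txt (jieba stop words, used for keyword extraction)

|----big.txt (corpus for spelling correction; it is too long to include. Use an English-English dictionary or an English novel, and add the words you need if the results are poor)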

enchant_py.py

# -*- coding: utf-8 -*-
#__author__="ZJL"
import re, collections


def words(text):
    return re.findall('[a-z]+', text.lower())


def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model


NWORDS = train(words(open('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'


def edits1(word):
    n = len(word)
    return set([word[0:i] + word[i + 1:] for i in range(n)] +  # deletion
               [word[0:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(n - 1)] +  # transposition
               [word[0:i] + c + word[i + 1:] for i in range(n) for c in alphabet] +  # alteration
               [word[0:i] + c + word[i:] for i in range(n + 1) for c in alphabet])  # insertion


def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)


def known(words):
    return set(w for w in words if w in NWORDS)


def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS[w])


# print('thew => ' + correct('thew'))
# print('spak => ' + correct('spak'))
# print('goof => ' + correct('goof'))
# print('babyu => ' + correct('babyu'))
# print('spalling => ' + correct('spalling'))
# print("Hello =>"+ correct('Hello'))
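
To try the idea without big.txt, here is a reduced variant that can be run as is. It differs from the file above in two ways that should be read as assumptions: the corpus is a single made-up sentence, and only candidates one edit away are considered (no second round of edits).

```python
import re
from collections import Counter

# Made-up miniature corpus standing in for big.txt.
CORPUS = "the spelling of a word is checked against the words in the corpus"
COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Prefer the word itself, then the most frequent known word one edit away."""
    if word in COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in COUNTS]
    return max(candidates, key=COUNTS.get) if candidates else word


print(correct("thew"), correct("spalling"), correct("zzzz"))
# the spelling zzzz
```

A word with no known neighbour is returned unchanged, which is why "zzzz" comes back as is.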

EsQuery.py

# -*- coding: utf-8 -*-
#__author__="ZJL"
from elasticsearch import Elasticsearch
import jieba
import jieba.analyse
import re
import enchant_py
import json


class ESQuery(object):
    def __init__(self):
        self.es = Elasticsearch("127.0.0.1:9200")

    def ES_Query(self, es_index, es_doc_type, query_key_list, strs, num, size_num):
        # "from" is the offset of the first hit and "size" is the number of hits
        # per page, so size_num is passed through unchanged. (The original
        # multiplied it by the page number, which returned too many hits.)
        from_num = (num - 1) * size_num
        esstrs = " ".join(strs.get("key_list", ""))
        str_key = strs.get("key_str", "")
        re_nums = re.findall(r'[0-9]+', esstrs)
        re_nums_list = []
        # Numbers in the query are also matched against the "age" field
        if re_nums:
            for re_num in re_nums:
                re_nums_list.append({"match": {"age": re_num}})
        for query_key in query_key_list:
            re_nums_list.append({"match": {query_key: esstrs}})
        print(re_nums_list)
        body = {
            "query": {
                "bool": {
                    "must": [],
                    "must_not": [],
                    "should": re_nums_list
                }
            },
            "from": from_num,
            "size": size_num,
            "sort": [],
            "aggs": {},
            # Keyword highlighting
            "highlight": {
                "fields": {
                    "school": {},
                    "name": {}
                }
            }
        }
        a = self.es.search(index=es_index, doc_type=es_doc_type, body=body)
        aa = a["hits"]
        aa["key_str"] = str_key
        data_json = json.dumps(aa)
        print(data_json)
        return data_json

    def Check_Keyword(self, key_str):
        # Dictionary file
        file_name = "dict.txt"
        # Stop words file stop_words.txt
        stop_file_name = "stop_words.txt"
        # Load the dictionary
        jieba.load_userdict(file_name)
        # Load the stop words
        jieba.analyse.set_stop_words(stop_file_name)
        key_str_copy = key_str
        # Find all English words with a regular expression
        result_list = re.findall(r'[a-zA-Z]+', key_str_copy)
        # key_str_list = list(jieba.cut(key_str.strip()))
        # print(key_str_list)
        # Fewer than 3 words (Baidu does not correct more than two either):
        # correct each word and record original -> corrected in a dict
        corr_dict = {}
        if 0 < len(result_list) < 3:
            for restr in result_list:
                strd = enchant_py.correct(restr)
                if restr != strd:
                    corr_dict[restr] = strd
            # Replace the original words with the corrected ones
            for corr in corr_dict:
                key_str_copy = key_str_copy.replace(corr, corr_dict.get(corr, ""))
            # Extract keywords with jieba's tf-idf algorithm
            tagstr = jieba.analyse.extract_tags(key_str_copy, topK=20, withWeight=False, allowPOS=())
        # Short English phrases mostly fall in this range and rarely contain stop
        # words, so mixed Chinese/English input still gets Chinese stop words removed.
        # The original condition (<3 and >5) could never be true; 3 to 5 words is
        # the assumed intent.
        elif 3 <= len(result_list) <= 5:
            tagstr = jieba.analyse.extract_tags(key_str_copy, topK=20, withWeight=False, allowPOS=())
        # Too many English words: output as is
        else:
            # Word segmentation
            key_str_list = list(jieba.cut(key_str_copy))
            # Drop special symbols that appear in all-English input
            stop_key = [" ", "(", ")", ".", ",", "\'", "\"", "*", "+", "-", "\\", "/", "`", "~", "@", "#", "$", "%", "^", "&", '[', ']', "{", "}", ";", "?", "!", "\t", "\n", ":"]
            # Filter every occurrence (list.remove only drops the first one)
            tagstr = [key for key in key_str_list if key not in stop_key]
        # Only return the original string if something was corrected
        if corr_dict:
            data_str = key_str
        else:
            data_str = ""
        data = {"key_list": tagstr, "key_str": data_str}
        print(data)
        return data

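The query body and its paging arithmetic can be looked at in isolation, without a running cluster. The sketch below builds a bool/should query over several fields. `build_search_body` is a hypothetical helper, and highlighting the same fields that are searched is an assumption made for the example.

```python
import json


def build_search_body(fields, text, page, page_size):
    """Build a bool/should query over several fields with from/size paging."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    should = [{"match": {field: text}} for field in fields]
    return {
        "query": {"bool": {"should": should}},
        # "from" is the offset of the first hit; "size" is the number of hits.
        "from": (page - 1) * page_size,
        "size": page_size,
        "highlight": {"fields": {field: {} for field in fields}},
    }


body = build_search_body(["title", "body"], "python search", page=3, page_size=10)
print(json.dumps(body, indent=2))
```

For page 3 with 10 hits per page this gives from = 20 and size = 10, that is, hits 21 to 30.
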
flaskrun.py

# -*-coding:utf-8 -*-
__author__ = "ZJL"
from flask import Flask
from flask import request
from EsQuery import ESQuery
# In Werkzeug 1.0 and later, ProxyFix lives in werkzeug.middleware.proxy_fix
from werkzeug.contrib.fixers import ProxyFix

app = Flask(__name__)

"""
@api {get} / Home page
@apiName index
@apiGroup indexx
"""
@app.route("/")
def index():
    return "hello world"

"""
@api {get} /query Query
@apiName query
@apiGroup queryxx
@apiParam {string} strs Keywords
@apiParam {string} num Page number
@apiParam {string} size_num Results per page
"""
@app.route('/query', methods=['GET'])
def es_query():
    # .get() returns None for a missing parameter instead of aborting with a 400
    if request.method == 'GET' and request.args.get('strs') and request.args.get('num') and request.args.get('size_num'):
        num = int(request.args['num'])
        size_num = int(request.args['size_num'])
        strs = request.args['strs']
        eq = ESQuery()
        key_str_dict = eq.Check_Keyword(strs)
        es_index = ["test99911"]
        es_type = []
        es_query_list = ["title", "body"]
        data_json = eq.ES_Query(es_index, es_type, es_query_list, key_str_dict, num, size_num)
        return data_json
    else:
        return "no"

app.wsgi_app = ProxyFix(app.wsgi_app)
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5123)  # ,debug=True,threaded=True

# Three ways to read request parameters: request.form, request.args, request.values
# postForm= request.form
# getArgs= request.args
# postValues= request.values
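
To call the service, the three parameters have to be URL-encoded, which matters for Chinese keywords and for spaces. A minimal sketch follows, assuming the service runs locally on port 5123 as in the code above. `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_query_url(strs, num=1, size_num=10, base="http://127.0.0.1:5123"):
    """Return the GET URL for the /query endpoint with encoded parameters."""
    params = {"strs": strs, "num": num, "size_num": size_num}
    return "{}/query?{}".format(base.rstrip("/"), urlencode(params))


print(build_query_url("python search", num=2, size_num=5))
# http://127.0.0.1:5123/query?strs=python+search&num=2&size_num=5
```

The resulting URL can be opened in a browser or fetched with any HTTP client once the flask service is running.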

dict.txt

相似度 5

stop_words.txt

的
了
和
是
就
都
而
及
與
著
或
一個
沒有
我們
你們
妳們
他們
她們
是否
与
着
一个
没有
我们
你们
他们
她们
它们

big.txt is too long to include here.
