基于文本来推荐相似酒店

2024-05-29 20:20
文章标签 推荐 酒店 相似 本来

本文主要是介绍基于文本来推荐相似酒店,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

基于文本来推荐相似酒店

查看数据集基本信息

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import random
import cufflinks
import cufflinks
from plotly.offline import iplot
df=pd.read_csv("Seattle_Hotels.csv",encoding="latin-1")
df.head()
nameaddressdesc
0Hilton Garden Seattle Downtown1821 Boren Avenue, Seattle Washington 98101 USALocated on the southern tip of Lake Union, the...
1Sheraton Grand Seattle1400 6th Avenue, Seattle, Washington 98101 USALocated in the city's vibrant core, the Sherat...
2Crowne Plaza Seattle Downtown1113 6th Ave, Seattle, WA 98101Located in the heart of downtown Seattle, the ...
3Kimpton Hotel Monaco Seattle1101 4th Ave, Seattle, WA98101What?s near our hotel downtown Seattle locatio...
4The Westin Seattle1900 5th Avenue, Seattle, Washington 98101 USASituated amid incredible shopping and iconic a...
df.shape
(152, 3)
df['desc'][0]
"Located on the southern tip of Lake Union, the Hilton Garden Inn Seattle Downtown hotel is perfectly located for business and leisure. \nThe neighborhood is home to numerous major international companies including Amazon, Google and the Bill & Melinda Gates Foundation. A wealth of eclectic restaurants and bars make this area of Seattle one of the most sought out by locals and visitors. Our proximity to Lake Union allows visitors to take in some of the Pacific Northwest's majestic scenery and enjoy outdoor activities like kayaking and sailing. over 2,000 sq. ft. of versatile space and a complimentary business center. State-of-the-art A/V technology and our helpful staff will guarantee your conference, cocktail reception or wedding is a success. Refresh in the sparkling saltwater pool, or energize with the latest equipment in the 24-hour fitness center. Tastefully decorated and flooded with natural light, our guest rooms and suites offer everything you need to relax and stay productive. Unwind in the bar, and enjoy American cuisine for breakfast, lunch and dinner in our restaurant. The 24-hour Pavilion Pantry? stocks a variety of snacks, drinks and sundries."

查看酒店描述中主要介绍信息

vec=CountVectorizer().fit(df['desc'])
vec=CountVectorizer().fit(df['desc'])
bag_of_words=vec.transform(df['desc'])
sum_words=bag_of_words.sum(axis=0)
words_freq=[(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
sum_words=sorted(words_freq,key= lambda x: x[1],reverse=True)
sum_words[1:10]
[('and', 1062),('of', 536),('seattle', 533),('to', 471),('in', 449),('our', 359),('you', 304),('hotel', 295),('with', 280)]
bag_of_words=vec.transform(df['desc'])
bag_of_words.shape
(152, 3200)
bag_of_words.toarray()
array([[0, 1, 0, ..., 0, 0, 0],[0, 0, 0, ..., 0, 0, 0],[0, 0, 0, ..., 0, 0, 0],...,[0, 0, 0, ..., 0, 0, 0],[0, 0, 0, ..., 0, 0, 0],[0, 0, 0, ..., 1, 0, 0]], dtype=int64)
sum_words=bag_of_words.sum(axis=0)
sum_words
matrix([[ 1, 11, 11, ...,  2,  6,  2]], dtype=int64)
words_freq=[(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
sum_words=sorted(words_freq,key= lambda x: x[1],reverse=True)
sum_words[1:10]
[('and', 1062),('of', 536),('seattle', 533),('to', 471),('in', 449),('our', 359),('you', 304),('hotel', 295),('with', 280)]

将以上信息整合成函数

def get_top_n_words(corpus,n=None):vec=CountVectorizer().fit(df['desc'])bag_of_words=vec.transform(df['desc'])sum_words=bag_of_words.sum(axis=0)words_freq=[(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]sum_words=sorted(words_freq,key= lambda x: x[1],reverse=True)return sum_words[:n]
common_words=get_top_n_words(df['desc'],20)
common_words
[('the', 1258),('and', 1062),('of', 536),('seattle', 533),('to', 471),('in', 449),('our', 359),('you', 304),('hotel', 295),('with', 280),('is', 271),('at', 231),('from', 224),('for', 216),('your', 186),('or', 161),('center', 151),('are', 136),('downtown', 133),('on', 129)]
df1=pd.DataFrame(common_words,columns=['desc','count'])
common_words=get_top_n_words(df['desc'],20)
df2=pd.DataFrame(common_words,columns=['desc','count'])
chart_info2=df2.groupby(['desc']).sum().sort_values('count',ascending=False)
chart_info2.plot(kind='barh',figsize=(14,10),title='top 20 before remove stopwords')
<AxesSubplot:title={'center':'top 20 before remove stopwords'}, ylabel='desc'>    

在这里插入图片描述

chart_info1=df1.groupby(['desc']).sum().sort_values('count',ascending=False)
chart_info1.plot(kind='barh',figsize=(14,10),title='top 20 before remove stopwords')
<AxesSubplot:title={'center':'top 20 before remove stopwords'}, ylabel='desc'>

在这里插入图片描述

def get_any1_top_n_words_after_stopwords(corpus,n=None):vec=CountVectorizer(stop_words='english',ngram_range=(1,1)).fit(df['desc'])bag_of_words=vec.transform(df['desc'])sum_words=bag_of_words.sum(axis=0)words_freq=[(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]sum_words=sorted(words_freq,key= lambda x: x[1],reverse=True)return sum_words[:n]
common_words=get_any1_top_n_words_after_stopwords(df['desc'],20)
df2=pd.DataFrame(common_words,columns=['desc','count'])
chart_info2=df2.groupby(['desc']).sum().sort_values('count',ascending=False)
chart_info2.plot(kind='barh',figsize=(14,10),title='top 20 before after stopwords')
<AxesSubplot:title={'center':'top 20 before after stopwords'}, ylabel='desc'>

在这里插入图片描述

def get_any2_top_n_words_after_stopwords(corpus,n=None):vec=CountVectorizer(stop_words='english',ngram_range=(2,2)).fit(df['desc'])bag_of_words=vec.transform(df['desc'])sum_words=bag_of_words.sum(axis=0)words_freq=[(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]sum_words=sorted(words_freq,key= lambda x: x[1],reverse=True)return sum_words[:n]common_words=get_any2_top_n_words_after_stopwords(df['desc'],20)
df2=pd.DataFrame(common_words,columns=['desc','count'])
chart_info2=df2.groupby(['desc']).sum().sort_values('count',ascending=False)
chart_info2.plot(kind='barh',figsize=(14,10),title='top 20 before after stopwords')
<AxesSubplot:title={'center':'top 20 before after stopwords'}, ylabel='desc'>

在这里插入图片描述

def get_any3_top_n_words_after_stopwords(corpus,n=None):vec=CountVectorizer(stop_words='english',ngram_range=(3,3)).fit(df['desc'])bag_of_words=vec.transform(df['desc'])sum_words=bag_of_words.sum(axis=0)words_freq=[(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]sum_words=sorted(words_freq,key= lambda x: x[1],reverse=True)return sum_words[:n]common_words=get_any3_top_n_words_after_stopwords(df['desc'],20)
df2=pd.DataFrame(common_words,columns=['desc','count'])
chart_info2=df2.groupby(['desc']).sum().sort_values('count',ascending=False)
chart_info2.plot(kind='barh',figsize=(14,10),title='top 20 before after stopwords')
<AxesSubplot:title={'center':'top 20 before after stopwords'}, ylabel='desc'>

在这里插入图片描述

描述的一些统计信息

df=pd.read_csv("Seattle_Hotels.csv",encoding="latin-1")
df['desc'][0]
"Located on the southern tip of Lake Union, the Hilton Garden Inn Seattle Downtown hotel is perfectly located for business and leisure. \nThe neighborhood is home to numerous major international companies including Amazon, Google and the Bill & Melinda Gates Foundation. A wealth of eclectic restaurants and bars make this area of Seattle one of the most sought out by locals and visitors. Our proximity to Lake Union allows visitors to take in some of the Pacific Northwest's majestic scenery and enjoy outdoor activities like kayaking and sailing. over 2,000 sq. ft. of versatile space and a complimentary business center. State-of-the-art A/V technology and our helpful staff will guarantee your conference, cocktail reception or wedding is a success. Refresh in the sparkling saltwater pool, or energize with the latest equipment in the 24-hour fitness center. Tastefully decorated and flooded with natural light, our guest rooms and suites offer everything you need to relax and stay productive. Unwind in the bar, and enjoy American cuisine for breakfast, lunch and dinner in our restaurant. The 24-hour Pavilion Pantry? stocks a variety of snacks, drinks and sundries."
df['word_count']=df['desc'].apply(   lambda x:len(str(x).split(' '))    )
df.head()                                    
nameaddressdescword_count
0Hilton Garden Seattle Downtown1821 Boren Avenue, Seattle Washington 98101 USALocated on the southern tip of Lake Union, the...184
1Sheraton Grand Seattle1400 6th Avenue, Seattle, Washington 98101 USALocated in the city's vibrant core, the Sherat...152
2Crowne Plaza Seattle Downtown1113 6th Ave, Seattle, WA 98101Located in the heart of downtown Seattle, the ...147
3Kimpton Hotel Monaco Seattle1101 4th Ave, Seattle, WA98101What?s near our hotel downtown Seattle locatio...151
4The Westin Seattle1900 5th Avenue, Seattle, Washington 98101 USASituated amid incredible shopping and iconic a...151
df['word_count'].plot(kind='hist',bins=50)
<AxesSubplot:ylabel='Frequency'>

在这里插入图片描述

文本处理

sub_replace=re.compile('[^0-9a-z#-]')
from nltk.corpus import stopwords
stopwords=set(stopwords.words('english'))
def clean_txt(text):text.lower()text=sub_replace.sub(' ',text)''.join(    word   for word in text.split(' ')  if word not in stopwords               )return  text
df['desc_clean']=df['desc'].apply(clean_txt)
df['desc_clean'][0]
' ocated on the southern tip of  ake  nion  the  ilton  arden  nn  eattle  owntown hotel is perfectly located for business and leisure    he neighborhood is home to numerous major international companies including  mazon   oogle and the  ill    elinda  ates  oundation    wealth of eclectic restaurants and bars make this area of  eattle one of the most sought out by locals and visitors   ur proximity to  ake  nion allows visitors to take in some of the  acific  orthwest s majestic scenery and enjoy outdoor activities like kayaking and sailing  over 2 000 sq  ft  of versatile space and a complimentary business center   tate-of-the-art     technology and our helpful staff will guarantee your conference  cocktail reception or wedding is a success   efresh in the sparkling saltwater pool  or energize with the latest equipment in the 24-hour fitness center   astefully decorated and flooded with natural light  our guest rooms and suites offer everything you need to relax and stay productive   nwind in the bar  and enjoy  merican cuisine for breakfast  lunch and dinner in our restaurant   he 24-hour  avilion  antry  stocks a variety of snacks  drinks and sundries '

相似度计算

df.index
RangeIndex(start=0, stop=152, step=1)
df.head()
nameaddressdescword_countdesc_clean
0Hilton Garden Seattle Downtown1821 Boren Avenue, Seattle Washington 98101 USALocated on the southern tip of Lake Union, the...184ocated on the southern tip of ake nion the...
1Sheraton Grand Seattle1400 6th Avenue, Seattle, Washington 98101 USALocated in the city's vibrant core, the Sherat...152ocated in the city s vibrant core the herat...
2Crowne Plaza Seattle Downtown1113 6th Ave, Seattle, WA 98101Located in the heart of downtown Seattle, the ...147ocated in the heart of downtown eattle the ...
3Kimpton Hotel Monaco Seattle1101 4th Ave, Seattle, WA98101What?s near our hotel downtown Seattle locatio...151hat s near our hotel downtown eattle locatio...
4The Westin Seattle1900 5th Avenue, Seattle, Washington 98101 USASituated amid incredible shopping and iconic a...151ituated amid incredible shopping and iconic a...
df.set_index('name' ,inplace=True)
df.index[:5]
Index(['Hilton Garden Seattle Downtown', 'Sheraton Grand Seattle','Crowne Plaza Seattle Downtown', 'Kimpton Hotel Monaco Seattle ','The Westin Seattle'],dtype='object', name='name')
tf=TfidfVectorizer(analyzer='word',ngram_range=(1,3),stop_words='english')#将原始文档集合转换为TF-IDF特性的矩阵。
tf
TfidfVectorizer(ngram_range=(1, 3), stop_words='english')
tfidf_martix=tf.fit_transform(df['desc_clean'])
tfidf_martix.shape
(152, 27694)
cosine_similarity=linear_kernel(tfidf_martix,tfidf_martix)
cosine_similarity.shape
(152, 152)
cosine_similarity[0]
array([1.        , 0.01354605, 0.02855898, 0.00666729, 0.02915865,0.01258837, 0.0190937 , 0.0152567 , 0.00689703, 0.01852763,0.01241924, 0.00919602, 0.01189826, 0.01234794, 0.01200711,0.01596218, 0.00979221, 0.04374643, 0.01138524, 0.02334485,0.02358692, 0.00829121, 0.00620275, 0.01700472, 0.0191396 ,0.02340334, 0.03193292, 0.00678849, 0.02272962, 0.0176494 ,0.0125159 , 0.03702338, 0.01569165, 0.02001584, 0.03656467,0.03189017, 0.00644231, 0.01008181, 0.02428547, 0.03327365,0.01367507, 0.00827835, 0.01722986, 0.04135263, 0.03315194,0.01529834, 0.03568623, 0.01294482, 0.03480617, 0.01447235,0.02563783, 0.01650068, 0.03328324, 0.01562323, 0.02703264,0.01315504, 0.02248426, 0.02690816, 0.00565479, 0.02899467,0.02900863, 0.00971019, 0.0439659 , 0.03020971, 0.02166199,0.01487286, 0.03182626, 0.00729518, 0.01764764, 0.01193849,0.02405471, 0.01408249, 0.02632335, 0.02027866, 0.01978292,0.04879328, 0.00244737, 0.01937539, 0.01388813, 0.02996677,0.00756079, 0.01429659, 0.0050572 , 0.00630326, 0.01496956,0.04104425, 0.00911942, 0.00259554, 0.00645944, 0.01460694,0.00794788, 0.00592598, 0.0090397 , 0.00532289, 0.01445326,0.01156657, 0.0098189 , 0.02077998, 0.0116756 , 0.02593775,0.01000463, 0.00533785, 0.0026153 , 0.02261775, 0.00680343,0.01859473, 0.03802118, 0.02078981, 0.01196228, 0.03744293,0.05164375, 0.00760035, 0.02627101, 0.01579335, 0.01852171,0.06768183, 0.01619049, 0.03544484, 0.0126264 , 0.01613638,0.00662941, 0.01184946, 0.01843151, 0.0012407 , 0.00687414,0.00873796, 0.04397665, 0.06798914, 0.00794379, 0.01098165,0.01520306, 0.01257289, 0.02087956, 0.01718063, 0.0292332 ,0.00489742, 0.03096065, 0.01163736, 0.01382631, 0.01386944,0.01888652, 0.02391748, 0.02814364, 0.01467017, 0.00332169,0.0023627 , 0.02348599, 0.00762246, 0.00390889, 0.01277579,0.00247891, 0.00854051])

求酒店的推荐

indices=pd.Series(df.index)
indices[:5]
0    Hilton Garden Seattle Downtown
1            Sheraton Grand Seattle
2     Crowne Plaza Seattle Downtown
3     Kimpton Hotel Monaco Seattle 
4                The Westin Seattle
Name: name, dtype: object
def recommendation(name,cosine_similarity):recommend_hotels=[]idx=indices[indices==name].index[0]score_series=pd.Series(cosine_similarity[idx]).sort_values(ascending=False)top_10_indexes=list(score_series[1:11].index)for i in top_10_indexes:recommend_hotels.append(list(df.index)[i])return recommend_hotels                                
recommendation('Hilton Garden Seattle Downtown',cosine_similarity)
['Staybridge Suites Seattle Downtown - Lake Union','Silver Cloud Inn - Seattle Lake Union','Residence Inn by Marriott Seattle Downtown/Lake Union','MarQueen Hotel','The Charter Hotel Seattle, Curio Collection by Hilton','Embassy Suites by Hilton Seattle Tacoma International Airport','SpringHill Suites Seattle\xa0Downtown','Courtyard by Marriott Seattle Downtown/Pioneer Square','The Loyal Inn','EVEN Hotel Seattle - South Lake Union']

这篇关于基于文本来推荐相似酒店的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/1014550

相关文章

Java SWT库详解与安装指南(最新推荐)

《JavaSWT库详解与安装指南(最新推荐)》:本文主要介绍JavaSWT库详解与安装指南,在本章中,我们介绍了如何下载、安装SWTJAR包,并详述了在Eclipse以及命令行环境中配置Java... 目录1. Java SWT类库概述2. SWT与AWT和Swing的区别2.1 历史背景与设计理念2.1.

Go语言如何判断两张图片的相似度

《Go语言如何判断两张图片的相似度》这篇文章主要为大家详细介绍了Go语言如何中实现判断两张图片的相似度的两种方法,文中的示例代码讲解详细,感兴趣的小伙伴可以跟随小编一起学习一下... 在介绍技术细节前,我们先来看看图片对比在哪些场景下可以用得到:图片去重:自动删除重复图片,为存储空间"瘦身"。想象你是一个

Java日期类详解(最新推荐)

《Java日期类详解(最新推荐)》早期版本主要使用java.util.Date、java.util.Calendar等类,Java8及以后引入了新的日期和时间API(JSR310),包含在ja... 目录旧的日期时间API新的日期时间 API(Java 8+)获取时间戳时间计算与其他日期时间类型的转换Dur

MySQL 存储引擎 MyISAM详解(最新推荐)

《MySQL存储引擎MyISAM详解(最新推荐)》使用MyISAM存储引擎的表占用空间很小,但是由于使用表级锁定,所以限制了读/写操作的性能,通常用于中小型的Web应用和数据仓库配置中的只读或主要... 目录mysql 5.5 之前默认的存储引擎️‍一、MyISAM 存储引擎的特性️‍二、MyISAM 的主

C++ HTTP框架推荐(特点及优势)

《C++HTTP框架推荐(特点及优势)》:本文主要介绍C++HTTP框架推荐的相关资料,本文通过实例代码给大家介绍的非常详细,对大家的学习或工作具有一定的参考借鉴价值,需要的朋友参考下吧... 目录1. Crow2. Drogon3. Pistache4. cpp-httplib5. Beast (Boos

Python多进程、多线程、协程典型示例解析(最新推荐)

《Python多进程、多线程、协程典型示例解析(最新推荐)》:本文主要介绍Python多进程、多线程、协程典型示例解析(最新推荐),本文通过实例代码给大家介绍的非常详细,对大家的学习或工作具有一定... 目录一、multiprocessing(多进程)1. 模块简介2. 案例详解:并行计算平方和3. 实现逻

Spring Boot集成SLF4j从基础到高级实践(最新推荐)

《SpringBoot集成SLF4j从基础到高级实践(最新推荐)》SLF4j(SimpleLoggingFacadeforJava)是一个日志门面(Facade),不是具体的日志实现,这篇文章主要介... 目录一、日志框架概述与SLF4j简介1.1 为什么需要日志框架1.2 主流日志框架对比1.3 SLF4

Springboot实现推荐系统的协同过滤算法

《Springboot实现推荐系统的协同过滤算法》协同过滤算法是一种在推荐系统中广泛使用的算法,用于预测用户对物品(如商品、电影、音乐等)的偏好,从而实现个性化推荐,下面给大家介绍Springboot... 目录前言基本原理 算法分类 计算方法应用场景 代码实现 前言协同过滤算法(Collaborativ

Maven中引入 springboot 相关依赖的方式(最新推荐)

《Maven中引入springboot相关依赖的方式(最新推荐)》:本文主要介绍Maven中引入springboot相关依赖的方式(最新推荐),本文给大家介绍的非常详细,对大家的学习或工作具有... 目录Maven中引入 springboot 相关依赖的方式1. 不使用版本管理(不推荐)2、使用版本管理(推

mysql的基础语句和外键查询及其语句详解(推荐)

《mysql的基础语句和外键查询及其语句详解(推荐)》:本文主要介绍mysql的基础语句和外键查询及其语句详解(推荐),本文给大家介绍的非常详细,对大家的学习或工作具有一定的参考借鉴价值,需要的朋... 目录一、mysql 基础语句1. 数据库操作 创建数据库2. 表操作 创建表3. CRUD 操作二、外键