Scraping NBA player information with requests + XPath

2023-11-24 20:50

This post walks through scraping NBA player information with requests and XPath; hopefully it is a useful reference for anyone working on a similar problem.

Goal:

I enjoy playing basketball, and I'm currently building a basketball-related project that needs NBA player information. I pull the data from two sites, because no single site has everything I need:
1. http://www.stat-nba.com/
2. The official NBA China website (https://china.nba.com)

Environment:

Windows 7
PyCharm 2019.3.3 (Professional Edition)
Python 3.7

Process

On the NBA China homepage, click "Active Players" (retired players can be fetched the same way; I only scrape active players here).
[screenshot]
That takes you to the player index page.
[screenshot]
Looking at this, you can work out the page's basic design (my own reading of it): each letter on the page corresponds to a container that holds the players whose names start with that letter, with its display set to none, and each letter has a JavaScript click handler that reveals the matching container. When the page loads, an AJAX request asks the backend for all the player data, and once it comes back it is placed into those containers. That is how I understand the design, and the captured requests above basically bear it out.
You can also see that playerlist.json is the JSON data the AJAX call brings back; that request gives us the AJAX URL (shown below), so we can simply send a request to that URL to get the data.
[screenshot]
The request headers show that nothing extra needs to be sent along, so a plain GET request is enough.
[screenshot]
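
For example, a minimal sketch of hitting that endpoint directly looks like the following; the playerlist.json URL is the one that also appears, commented out, in run_nba_spider in the full source at the end. The get_playerjson method below does the same thing and additionally saves the result to a file.

    import requests

    # AJAX endpoint found in DevTools (also referenced in run_nba_spider below)
    json_url = 'https://china.nba.com/static/data/league/playerlist.json'
    resp = requests.get(json_url)
    players = resp.json()['payload']['players']   # list of player records
    print(len(players))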

    def get_page(self, url):
        response = requests.get(url=url, headers=self.headers)
        data = response.content.decode('gbk')
        return data

    def get_playerjson(self, url, file):
        # While poking around I noticed the site exchanges data as JSON, so I found the URL of
        # the JSON exchange and fetch/parse it directly. The JSON here is very well-formed.
        response = requests.get(url)
        json_loads = json.loads(response.text)
        if not os.path.exists(file):
            with open(file, "w", encoding='utf8') as fp:
                fp.write(json.dumps(json_loads, indent=4, ensure_ascii=False))

The JSON is written out to a file here. Note that indent=4 pretty-prints with a four-space indent, and ensure_ascii=False is what lets Chinese text be written as-is; without it you get the escaped (encoded) form instead.
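
As a quick standalone illustration of what ensure_ascii=False changes (not code from the scraper itself):

    import json

    data = {'name': '勒布朗-詹姆斯'}
    print(json.dumps(data))                      # {"name": "\u52d2\u5e03\u6717-\u8a79\u59c6\u65af"}
    print(json.dumps(data, ensure_ascii=False))  # {"name": "勒布朗-詹姆斯"}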

After that, you just read the fields you need out of it - ordinary Python dict access - and write them to the database. I won't go into detail here; the full source is at the end.
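
A rough sketch of that dict access (field names taken from get_player_information in the full source; 'tmp.json' stands for whatever file you saved the JSON to, and the exact set of fields you keep is up to you):

    import json

    with open('tmp.json', 'r', encoding='utf8') as f:
        data = json.load(f)

    players = []
    for item in data['payload']['players']:
        profile = item['playerProfile']
        players.append({
            'playerId': profile['playerId'],
            'name': profile['displayName'],
            'position': profile['position'],
            'team': item['teamProfile']['name'],
        })
    print(len(players), 'players parsed')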

Next up: downloading the player headshots.

You can download them with urlretrieve from urllib, or do what I do below: f.write(response.content). Here response.content is the raw, undecoded content from the site - a bytes object - so writing it straight into a .png file gives you the image. The URL pattern is also easy to spot.
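
For reference, the urlretrieve variant would look roughly like this (the head-shot URL pattern is the same one used in get_playerimg below; player_id is just a placeholder, and the player_imgs directory is assumed to exist):

    from urllib import request

    def download_headshot(player_id):
        # same 260x190 head-shot URL pattern as in get_playerimg below
        url = 'https://china.nba.com/media/img/players/head/260x190/' + player_id + '.png'
        request.urlretrieve(url, 'player_imgs/' + player_id + '.png')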

    def get_playerimg(self):
        db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='nbadb')
        cursor = db.cursor()
        sql = "select playerId,name from players"
        try:
            # cursor.execute(sql) only returns the row count, so fetch the rows separately
            cursor.execute(sql)
            player_idnames = cursor.fetchall()
        except:
            print('failed to read player data from the database')
        db.close()
        for player_idname in player_idnames:
            print(player_idname)
            url = 'https://china.nba.com/media/img/players/head/260x190/' + player_idname[0] + '.png'
            if not os.path.exists('player_imgs/' + player_idname[0] + '.png'):
                response = requests.get(url, headers=self.headers)
                with open('player_imgs/' + player_idname[0] + '.png', 'wb') as f:
                    f.write(response.content)
                print(player_idname[1] + ': downloaded')
            else:
                print(player_idname[1] + ': already exists')
That completes the NBA China site. Next is the other site, which provides each player's per-season statistics; this part also relies on the data collected above.

[screenshot]
Players there are listed by the first letter of their name. Since it isn't an official site, the HTML is sometimes a bit malformed, so every time I extract with XPath I normalize the page first. The function below lets you choose what it returns: pass 'xpath' if you want to extract with XPath afterwards, or 'pyquery' if you'd rather parse with pyquery.

    def request_page(self, url, extract_function):
        response = requests.get(url)
        original_data = response.content.decode('utf-8')
        standard_html = etree.HTML(original_data)
        standard_data = etree.tostring(standard_html).decode('utf-8')
        if extract_function == 'xpath':
            return standard_html
        elif extract_function == 'pyquery':
            return standard_data  # the caller can parse this string directly with pyquery

Now for the most important part: this site lists a huge number of people - active players, retired players, and coaches - but I only want the players I already scraped from the official site, so I need to filter.

    def request_letter_page(self, text):
        # Read the per-letter page URLs back out of the file
        letter_links = []
        with open(text, 'r', encoding='utf8') as fp:
            for link in fp.readlines():
                # replace() returns a new string, it does not modify the original -- this confused me for a while
                link = link.replace('\n', '')
                letter_links.append(link)

        # Collect every (name, url) found on the site, then compare against the database afterwards
        player_names_and_urls = []
        for index, letter_link in enumerate(letter_links):
            # Watch out: the x, y and z pages have no "coaches" section, so the XPath differs there.
            # (I originally tried pyquery here but hit the UnicodeEncodeError described below and
            # switched to XPath instead.)
            htmlElement = self.request_page(letter_link, 'xpath')
            if (index < 23) and (index != 20):
                original_names = htmlElement.xpath("//div[@class='playerList'][2]//span/text()")
                name_urls = htmlElement.xpath("//div[@class='playerList'][2]//div/a/@href")
            else:
                original_names = htmlElement.xpath("//div[@class='playerList'][1]//span/text()")
                name_urls = htmlElement.xpath("//div[@class='playerList'][1]//div/a/@href")

            for index, original_name in enumerate(original_names):
                person_name_and_url = {}
                # Some players have no Chinese name...
                if re.search(r'.*?/(.*?)\n', original_name):
                    name = re.search(r'.*?/(.*?)\n', original_name).group(1)
                    name_cn = re.search(r'(.*?)/.*?\n', original_name).group(1)
                else:
                    # English name only, so just strip the trailing newline
                    name = original_name.replace('\n', '')
                    name_cn = ''
                # group() and group(0) return the whole match; group(1) is the first capture group
                name_url = re.sub(r'^.', 'http://www.stat-nba.com', name_urls[index])
                person_name_and_url['name'] = name
                person_name_and_url['name_url'] = name_url
                person_name_and_url['name_cn'] = name_cn
                player_names_and_urls.append(person_name_and_url)
            print(letter_link + ' scraped')

        self.write_dict_to_csv('csv文件/web_players.csv', player_names_and_urls)
        print('number of records found on the site: ' + str(len(player_names_and_urls)))

        # Names and URLs are in hand; now match them against the players table
        player_nameid_list = self.get_playername_fromdb()
        index = 0
        for db_playername_and_id in range(len(player_nameid_list)):
            # The loop variable itself is unused; the range only drives the iteration.
            # I can't delete entries inside a plain "for item in list" loop without index trouble,
            # so I keep my own index counter instead.
            for index1, player_name_and_url in enumerate(player_names_and_urls):
                if (player_name_and_url['name'] == player_nameid_list[index]['name']) or \
                        (player_name_and_url['name_cn'] == player_nameid_list[index]['name_cn']):
                    print('matched player ' + player_name_and_url['name'])
                    # Note: this only adds the key to this dict; it does not get added to the
                    # earlier list -- that confused me for over an hour
                    player_nameid_list[index]['name_url'] = player_name_and_url['name_url']
                    break
                elif index1 == len(player_names_and_urls) - 1:
                    # Reached the last element without a match, so drop this database player
                    print('removing player: ' + player_nameid_list[index]['name'])
                    del player_nameid_list[index]  # each entry holds url, name, name_cn and id
                    index -= 1  # decrementing here or outside the if makes no difference
            index += 1
        return player_nameid_list

The things to watch for are all noted in the comments; the main one is that the x, y and z pages have no coaches section, so the XPath expression differs slightly and needs that extra condition.
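
Instead of hard-coding which letter indexes lack the coaches section, you could also pick the block by checking how many playerList divs the page actually contains. A small sketch of that idea (my own variation on the loop above, not the original code, and assuming the pages without a coaches section have a single playerList block):

    # inside the loop over letter pages, after htmlElement = self.request_page(letter_link, 'xpath')
    player_lists = htmlElement.xpath("//div[@class='playerList']")
    players_block = player_lists[1] if len(player_lists) > 1 else player_lists[0]
    original_names = players_block.xpath(".//span/text()")
    name_urls = players_block.xpath(".//div/a/@href")
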
The following problem I never managed to solve. I originally wanted to use pyquery (purely to try it out), but I couldn't get past this encoding error, so I switched to XPath.

# standard_data = self.request_page(letter_link)
# doc = pq(standard_data)
# print(doc('title'))
# break
# Error I couldn't resolve, raised at the print(divs) call -- switched to XPath instead:
# UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 1093: illegal multibyte sequence
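
For what it's worth, that error usually means print() is writing to a GBK-encoded Windows console and hits a character (here '\xa9', the © sign) that GBK cannot represent; it isn't really a pyquery problem. A possible workaround on Python 3.7+ (untested against this exact page) would be:

    import sys

    # Re-open stdout as UTF-8, replacing anything that still can't be encoded
    sys.stdout.reconfigure(encoding='utf-8', errors='replace')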

A few other spots: `for i in list` in Python gives you the elements themselves, but I also need the index, which is why the function above uses enumerate() (or keeps its own counter where items are deleted mid-loop).
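
If you would rather avoid the manual counter, the same filtering can be written without deleting from the list while looping over it. A sketch of that alternative (my own rewrite of that part of the function above, using its player_names_and_urls and player_nameid_list variables):

    # Keep only the database players that were also found on stat-nba.com,
    # attaching the matching URL as we go
    web_by_name = {p['name']: p for p in player_names_and_urls}
    web_by_cn = {p['name_cn']: p for p in player_names_and_urls if p['name_cn']}

    matched = []
    for db_player in player_nameid_list:
        hit = web_by_name.get(db_player['name']) or web_by_cn.get(db_player['name_cn'])
        if hit:
            db_player['name_url'] = hit['name_url']
            matched.append(db_player)
    player_nameid_list = matched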

The screenshot below shows that some players have no Chinese name, so the check has to be adjusted for that case.
[screenshot]
One more issue:
[screenshot]
When a player has played for more than one team in a single season, you get duplicate rows like this. My fix is to walk the rows top to bottom and compare each one with the previous row; if the season is the same, drop the current row. The site puts the season total first, so keeping the first row and dropping the later duplicates leaves exactly the data I want in the database.
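
In list form that dedupe step looks roughly like the following (a simplified sketch of what spider_player_page in the full source does; season_rows stands for the extracted rows, with the season text in the second column and the season-total row appearing first):

    deduped = []
    for row in season_rows:
        if deduped and row[1] == deduped[-1][1]:
            continue  # same season as the previous (total) row -> a per-team split, skip it
        deduped.append(row)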

Source code:

I split the code across two files:

  1. Scraping the official NBA China site:
# -*- coding: utf-8 -*-
import requests  # using requests so the page can be normalized with etree first; I worried pyquery alone would choke on malformed pages
from lxml import etree  # to fix up some malformed pages
from pyquery import PyQuery as pq
from urllib import request  # in case I want to download the images with urlretrieve
import time
import sys
import json
import re
import pymysql
import os#判断是否已下载图片class Spider:def __init__(self,file):self.to_file = file# self.to_console = sys.stdoutself.headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36','referer':'https: // china.nba.com /'}def write_console(self,*args):sys.stdout = self.to_consolefor arg in args:print(arg)#必须要,因为改的是sys.stdoutdef write_file(self,*args):sys.stdout = open(self.to_file,'a',encoding='utf8')for arg in args:print(arg)def run_nba_spider(self):# base_url = "https://china.nba.com"# player_url = 'https://china.nba.com/playerindex/'# original_data = self.get_page(player_url)# json_url = 'https://china.nba.com/static/data/league/playerlist.json'# self.get_playerjson(json_url,'tmp.txt')#这个只需要获取数据时执行即可# players = self.get_player_information('tmp.json')# self.write_playerinfo_table('nbadb',players)self.get_playerimg()def get_page(self,url):response = requests.get(url=url,headers = self.headers)data = response.content.decode('gbk')# print(data)return datadef get_playerjson(self,url,file):# htmlElement = etree.HTML(data)#因为这里经过HTML规范化之后# data = etree.tostring(htmlElement).decode('utf8')# print(data)# doc = pq(data)# print('zzz')# print(doc('title'))#爬取过程中发现这里使用的是json形式交换数据,然后就直接找到了json数据交换的url直接获取并解析。这里json格式非常规范response = requests.get(url)json_loads = json.loads(response)# print(json.loads(response))if os.path.exists(file):passelse:with open(file, "w",encoding='utf8') as fp:fp.write(json.dumps(json_loads, indent=4,ensure_ascii=False))def get_player_information(self,file):with open(file, 'r',encoding='utf8') as f:b = f.read()json_loads = json.loads(b)players_list = json_loads['payload']['players']players = []for i in players_list:player = {}playerProfile = i['playerProfile']player['playerId'] = playerProfile["playerId"]player['code'] = playerProfile['code']player['name'] = playerProfile["displayName"].replace(" ",'-')player['displayNameEn'] = playerProfile['displayNameEn'].replace(" ",'-')player['position'] = playerProfile['position']player['height'] = playerProfile['height']player['weight'] = playerProfile['weight'].replace(" ",'')player['country'] = playerProfile["country"]player['jerseyNo'] = playerProfile['jerseyNo']player['draftYear'] = playerProfile['draftYear']player['team_abbr'] = i['teamProfile']['abbr']player['team_city'] = i['teamProfile']['city']player['team'] = i['teamProfile']['name']player['team_name'] = player['team_city']+player['team']print(player)players.append(player)return playersdef create_table(self,table_name):db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='nbadb')cursor = db.cursor()sql = 'create table if not exists {0} (playerId varchar(10) primary key,code varchar(20) not null,name varchar(100) not null,displayNameEn varchar(20) not null,position varchar(10) not null,height varchar(10) not null,weight varchar(10) not null,country varchar(20) not null,jerseyNo varchar(10) not null,draftYear varchar(10) not null,team_abbr varchar(10) not null,team_name varchar(100) not null)'.format(table_name)cursor.execute(sql)db.close()def write_playerinfo_table(self,db_name,players_info):db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db=db_name)cursor = db.cursor()for player in players_info:print(player['code'])sql = 'insert into players(playerId,code,name,displayNameEn,position,height,weight,country,jerseyNo,draftYear,team_abbr,team_name) values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'#这种稳稳地# sql = "insert into players() 
values({0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{1},)".format(player['playerId'],player['code'],player['name'],player['displayNameEn'],player['position'],player['height'],player['weight'],player['country'],player['jerseyNo'],player['draftYear'],player['team_abbr'],player['team_name'])try:# cursor.execute(sql)cursor.execute(sql,(player['playerId'],player['code'],player['name'],player['displayNameEn'],player['position'],player['height'],player['weight'],player['country'],player['jerseyNo'],player['draftYear'],player['team_abbr'],player['team_name']))db.commit()except Exception as e:print('插入数据出现异常',e)db.rollback()db.close()def get_playerimg(self):db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='nbadb')cursor = db.cursor()sql = "select playerId,name from players"try:# player_ids = cursor.execute(sql)#这样会返回个数cursor.execute(sql)player_idnames = cursor.fetchall()# print(player_ids)except:print('获取数据异常')db.close()for player_idname in player_idnames:# request = request.Request()print(player_idname)url = 'https://china.nba.com/media/img/players/head/260x190/'+player_idname[0]+'.png'# print(url)# breakif not os.path.exists('player_imgs/'+player_idname[0]+'.png'):response = requests.get(url,headers=self.headers)with open('player_imgs/'+player_idname[0]+'.png','wb') as f:f.write(response.content)print(player_idname[1]+":下载完毕")else:print(player_idname[1]+":已存在")
if __name__ == "__main__":spider = Spider('nba.txt')spider.run_nba_spider()
  2. Scraping the stat-nba.com database site:
# -*- coding: utf-8 -*-
import requests
import pymysql
from lxml import etree
from pyquery import PyQuery as pq
import re
import sys
import csv
import jsonclass Spider:def __init__(self):self.url = ''self.headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}# self.to_file = 'out/inputdb.txt'  #这里可以自己修改self.to_console = sys.stdoutdef write_console(self,*args):sys.stdout = self.to_consolefor arg in args:print(arg,end='')def write_file(self,file,*args):sys.stdout = open(file,'w',encoding='utf8')for arg in args:print(arg,end='')sys.stdout =self.to_consoledef get_playername_fromdb(self):#从player表中取出id和nameplayer_idnamelist=[]db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='nbadb')cursor = db.cursor()sql = 'select playerId,displaynameEn,name from players'cursor.execute(sql)player_idnames = cursor.fetchall()for player_idname in player_idnames:player_dict = {}# player_idnamelist.append([player_idname[0],player_idname[1].replace('-',' ')])player_dict['name'] = player_idname[1].replace('-',' ')player_dict['playerId'] = player_idname[0]player_dict['name_cn'] = player_idname[2]player_idnamelist.append(player_dict)# print(player_idnamelist)return player_idnamelistdef get_page_info(self,text,url):#请求主页,得到每个字母开头的页面的url存进文件中# base_url = 'http://www.stat-nba.com/playerList.php'# response = requests.get(base_url,headers=self.headers)# print(response.content)# original_data = response.content.decode('utf-8')# print(original_data)这里会报错,没解决,但是不影响,只是中文解析会有问题,英文不影响# standard_html = etree.HTML(original_data)# print(standard_html.xpath('//title/text()'))# standard_data = etree.tostring(standard_html).decode('utf-8')# print(standard_data)standard_data = self.request_page(url,'pyquery')doc = pq(standard_data)# print(doc('title'))#很神奇,这里可以显示中文了#个人一些见解:pyquery中pq()需要解析的是str类型,而xpath需要对htmlelement进行解析,所以也就造成了上面的结果,#tostring是转换为bytes,decode是把bytes解码成string,以后进一步学习编码!!!dom_as = doc('.pagination>div>a')letter_links = []for dom_a in dom_as.items():print(dom_a)letter_links.append(re.sub(r'^\.','',dom_a.attr('href')))#第一次直接用了去除点,出问题了#上面这个写不写反斜杠都可以的,因为这里一个.意思就是第一个,所以无关紧要letter_links = letter_links[1::]with open(text,'w',encoding='utf8') as fp:for letter_link in letter_links:fp.write('http://www.stat-nba.com'+letter_link+'\n')print(letter_links)#这里直接存到本地文件里面了,防止爬的多了被封ip,而且效率也高def request_letter_page(self,text):#在文件中拿到urlletter_links = []#从文件中拿到每个字母对应的页面地址with open(text,'r',encoding='utf8') as fp:for link in fp.readlines():link = link.replace('\n','')#这里是生成新的,不改变原来的,迷了半天letter_links.append(link)player_names_and_urls = []  # 为了存储得到的playername和url,计划所有都得到以后再与数据库中的数据加以比较for index,letter_link in enumerate(letter_links):# 这里很坑!!!!!!!!!!!!!!!!!!!!!!!!!!!!1# !!!!!!!!!!!!# 因为!!!!1这里的xyz页面中没有教练员这一栏!!!!!!!!!!,所以xpath不正确# print(letter_link)# standard_data = self.request_page(letter_link)# doc = pq(standard_data)# print(doc('title'))# break#!!!!!!!!!!!!!!!!!!!报错,解决不了,就是在print(divs)那里,直接换xpath#UnicodeEncodeError: 'gbk' codec can't encode character '\xa9' in position 1093: illegal multibyte sequence# divs = doc("#background > div:nth-child(16)")# print(divs)# print(divs.items())# for span in spans:#     print(span.html())#     break# response = requests.get(letter_link)# original_data = response.content.decode('utf8')# htmlElement = etree.HTML(original_data)htmlElement = self.request_page(letter_link,'xpath')if (index < 23)&(index!=20):original_names = htmlElement.xpath("//div[@class='playerList'][2]//span/text()")name_urls = htmlElement.xpath("//div[@class='playerList'][2]//div/a/@href")else:original_names = 
htmlElement.xpath("//div[@class='playerList'][1]//span/text()")name_urls = htmlElement.xpath("//div[@class='playerList'][1]//div/a/@href")# print(original_names[0])for index,original_name in enumerate(original_names):person_name_and_url = {}# print(original_name)#这里会有没有中文名字的…………if re.search(r'.*?/(.*?)\n',original_name):name = re.search(r'.*?/(.*?)\n',original_name).group(1)name_cn = re.search(r'(.*?)/.*?\n',original_name).group(1)else:#这里的话需要单独再去掉最后的换行。这里相当于是只有英文名name = original_name.replace('\n','')name_cn = ''# print(name.group(1))#group()和group(0)都是所有匹配到的,就是写的这个正则能匹配到的所有,1是可以显示第一个括号中的name_url = re.sub(r'^.','http://www.stat-nba.com',name_urls[index])# name_url = name_url.replace('\n','')这个不用要了# print(name)# print('----'+str(name_url)+'---')#这里为了检验是否有换行# print(name_url)person_name_and_url['name'] = nameperson_name_and_url['name_url'] = name_urlperson_name_and_url['name_cn'] = name_cnplayer_names_and_urls.append(person_name_and_url)# print(player_names_and_urls)print(letter_link+"已经爬取完毕")self.write_dict_to_csv('csv文件/web_players.csv', player_names_and_urls)print("从网站上得到的数据长度为:"+str(len(player_names_and_urls)))# print(page_names)#这里从网页中得到了球员名字和对应的urlplayer_nameid_list = spider.get_playername_fromdb()# print(player_nameid_list)# print(player_names_and_urls)index = 0for db_playername_and_id in range(len(player_nameid_list)):#这里相当于是没有用到这里的值,只是当做一个轴#这里不能删除,因为直接删除的话,会有索引值问题#直接自己加一个索引############3#这里有一个问题,因为此时如果相等需要跳出两层循环!!!!!1for index1,player_name_and_url in enumerate(player_names_and_urls):# print(player_name_and_url['name'],db_playername_and_id['name'])# self.write_file(player_name_and_url['name'],db_playername_and_id['name'])# self.write_file('\n')# print(index,index1)if (player_name_and_url['name'] == player_nameid_list[index]['name'])|(player_name_and_url['name_cn'] == player_nameid_list[index]['name_cn']):print('匹配到球员'+player_name_and_url['name'])# self.write_console(player_name_and_url['name']+'='+db_playername_and_id['name'])# player_names_and_urls[index]['playerId'] = db_playername_and_id['playerId']player_nameid_list[index]['name_url'] = player_name_and_url['name_url']#上面这句话我只是给现在的这个字典加了属性,但是并没有加到之前那个列表中!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!迷了一个多小时breakelif index1 == len(player_names_and_urls)-1:#这里判断是否是最后一个元素,如果是的话就直接删除# print(index,index1)# self.write_console(player_name_and_url['name'],db_playername_and_id['name'])# self.write_console('\n')print('删除球员:'+player_nameid_list[index]['name'])del player_nameid_list[index]#这里面有url,name,name_cn,还有idindex -=1#写外面写里面都一样index +=1# self.write_console(player_names_and_urls)# self.write_console(len(player_names_and_urls))return player_nameid_list# self.write#这个函数是获取从nba官网获得的数据,但是在nba数据库网站中没有的球员名字。def get_missing_players(self,player_db,player_web):missing_players = []for i in player_db:for index,j in enumerate(player_web):if i['playerId'] == j['playerId']:breakelif index==len(player_web)-1:missing_players.append(i)# with open('missing_players.txt','w',encoding='utf8')as fp:#     for i in missing_players:#         fp.write("name:"+i)#         fp.write(i['name']+' ')#         fp.write("playerId:" + i)#         fp.write(i['playerId'])#         fp.write('\n')return missing_playersdef request_page(self,url,extract_function):response = requests.get(url)original_data = response.content.decode('utf-8')standard_html = etree.HTML(original_data)standard_data = etree.tostring(standard_html).decode('utf-8')if extract_function=='xpath':return standard_htmlelif extract_function=='pyquery':return standard_data#返回去之后直接可以pyquery解析# def 
write_list_tofile(self,list,file):#     with open(file,'w',encoding='utf8') as fp:#         for i in list:#             fp.write(i+'\n')def write_dict_to_csv(self,file,lists):#这里index是判断是否是第一个需要加列名if re.search(r'\.csv$',file):#有的话就有返回值,没有的话就会返回none# fieldnames = []这个方法也可以但是太笨# for arg in args:#     fieldnames.append(arg)fieldnames = list(lists[0].keys())with open(file,'w',encoding='utf8')as csvfile:writer = csv.DictWriter(csvfile,lineterminator='\n',fieldnames=fieldnames)#lineterminator控制换行的,minator是结尾,终结者的意思。writer.writeheader()for i in lists:writer.writerow(i)if i['name_cn']:print('写入'+i['name_cn']+'数据成功')else:print('写入'+i['name']+'数据成功')else:self.write_console('请传入一个csv文件')def update_playersdb(self,new_player_list):db = pymysql.connect(host='localhost', user='root', password='root', port=3306, db='nbadb')cursor = db.cursor()# sql = "alter table players add  column name_url varchar(200)"# cursor.execute(sql)for i in new_player_list:try:sql = "update players set name_url = %s where playerId = %s"cursor.execute(sql,(i['name_url'],i['playerId']))db.commit()except Exception as e:print(e)db.rollback()db.close()# player_idnames = cursor.fetchall()def read_data_from_csv(self,file):if re.search(r'\.csv$',file):with open(file,'r',encoding='utf8')as csvfile:reader = list(csv.reader(csvfile))data_list = reader[1:]player_list = []for i in data_list:player_dict = {}for j in range(len(reader[0])):player_dict[reader[0][j]] = i[j]player_list.append(player_dict)return player_listelse:print('请传入一个csv文件')def spider_player_page(self):db = pymysql.connect(host='localhost',user='root',password='root',port=3306,db='nbadb')cursor = db.cursor()sql = 'select name_url,playerId from players'cursor.execute(sql)db.close()data = cursor.fetchall()player_url_id = []#保存playerId和name_urlfor i in data:player_dict = {}player_dict['name_url'] = i[0]player_dict['playerId'] = i[1]player_url_id.append(player_dict)# breakprint(player_url_id)print('从数据库中提取数据完毕')for index,player in enumerate(player_url_id):#这里需要修改原列表print('正在处理:'+player['playerId']+'的数据')player_url_id[index]['playerSeasonData'] = []player_url_id[index]['playerCareerData'] = []name_url = player['name_url']standard_data = self.request_page(name_url,'xpath')oringinal_season_title = standard_data.xpath('//*[@id="stat_box_avg"]/thead/tr//text()')# print(oringinal_season_title)#这里会有换行空格什么的需要清除一下,直接定义一个函数拉倒title = self.clear_n_inlist(oringinal_season_title)# self.write_file('out/player_data.txt')player_url_id[index]['playerSeasonData'].append(title)oringinal_season_datas = standard_data.xpath('//*[@id="stat_box_avg"]/tbody/tr[@class="sort"]')#这个for循环是为了去除重复数据,因为有的球员一个赛季换两个以上球队的话,这个数据库中会把每个球队的数据分开存储,然后再写一个总计的数据,我只需要总计就行,总计在最上面,所以这种方式就可以index1 = 0for i in range(len(oringinal_season_datas)):if (index1 != 0) & (oringinal_season_datas[index1].xpath('./td[2]//text()')[0] == oringinal_season_datas[index1 - 1].xpath('./td[2]//text()')[0]):print('删除'+player_url_id[index]['playerId']+'的'+oringinal_season_datas[index1].xpath('./td[2]//text()')[0]+'数据')del oringinal_season_datas[index1]index1 -= 1index1 +=1print('还剩:'+str(index1)+'个赛季的数据')for i in oringinal_season_datas:oringinal_season_data = i.xpath('.//text()')season_data = self.clear_n_inlist(oringinal_season_data)player_url_id[index]['playerSeasonData'].append(season_data)# print(player['playerSeasonData'])# print(player_url_id)oringinal_career_datas = standard_data.xpath('//*[@id="stat_box_avg"]/tbody/tr[position()>last()-2][position()<last()+1]')for j in oringinal_career_datas:oringinal_season_data = 
j.xpath('.//text()')self.clear_n_inlist(oringinal_season_data)player_url_id[index]['playerCareerData'].append(oringinal_season_data)print(player['playerId']+'的数据处理完毕')# print(player_url_id)#输出在player_page.txt中# breakprint('所有数据爬取完毕')self.write_file(player_url_id,'out/inputdb.txt')#这里最好是直接存进json中,但是也无妨出来之后json.loads()直接转换为list类型可以进行数据库插入操作NB!!!因为这样可以节约一次次爬取网站消耗的事件,爬的其实不快,估计得两分钟return player_url_iddef clear_n_inlist(self,list):#这里是去掉提取信息中的换行index = 0for i in range(len(list)):# if list[index]=="\n":#后来发现还有' \n'和'\n 'if re.search(r'.*?\n.*?',list[index]):del list[index]index -=1index +=1return listdef write_player_season_data_todb(self):#先从文件中提取数据with open('out/inputdb.txt','r',encoding='utf8')as fp:player_data_str = fp.read()#此时是字符串型的而且都是单引号,直接转json会有报错player_data_str = player_data_str.replace("'",'"')#单引号转双引号player_data_json = json.loads(player_data_str)# print(len(player_data_json))#写入playerdata表中db = pymysql.connect(host='localhost',user='root',password='root',port=3306,db='nbadb')cursor = db.cursor()# 下面是球员的场均数据,我觉得这个可以写到同一个表里try:sql_create_table = 'CREATE TABLE if not exists playerCareerdata (playerId varchar(20) primary key,season VARCHAR(20),team VARCHAR(20),chuchang_times VARCHAR(20),starting_times VARCHAR(20),play_time VARCHAR(20),hit_rate VARCHAR(20),hit_times VARCHAR(20),shoot_times VARCHAR(20),three_hit_rate VARCHAR(20),three_hit_times VARCHAR(20),three_shoot_times VARCHAR(20),free_hit_rate VARCHAR(20),free_hit_times VARCHAR(20),free_shoot_times VARCHAR(20),rebound VARCHAR(20),offensive_rebound VARCHAR(20),defensive_rebound VARCHAR(20),assist VARCHAR(20),steal VARCHAR(20),block VARCHAR(20),fault VARCHAR(20),foul VARCHAR(20),score VARCHAR(20),win VARCHAR(20),lose VARCHAR(20)) character set utf8'cursor.execute(sql_create_table)print('球员生涯场均表建立完成!')for player_info in player_data_json:#这里是挨个取出每个球员的信息,每一个球员都有一个表# print(player_info)sql_create_table = 'CREATE TABLE if not exists `%s` (years VARCHAR(10),team_num VARCHAR(10),chuchang_times VARCHAR(20),starting_times VARCHAR(20),play_time VARCHAR(20),hit_rate VARCHAR(20),hit_times VARCHAR(20),shoot_times VARCHAR(20),three_hit_rate VARCHAR(20),three_hit_times VARCHAR(20),three_shoot_times VARCHAR(20),free_hit_rate VARCHAR(20),free_hit_times VARCHAR(20),free_shoot_times VARCHAR(20),rebound VARCHAR(20),offensive_rebound VARCHAR(20),defensive_rebound VARCHAR(20),assist VARCHAR(20),steal VARCHAR(20),block VARCHAR(20),fault VARCHAR(20),foul VARCHAR(20),score VARCHAR(20),win VARCHAR(20),lose VARCHAR(20)) character set utf8'%player_info["playerId"]cursor.execute(sql_create_table)print(player_info['playerId']+'赛季数据库已创建好')for player_season_data in player_info['playerSeasonData'][1:]:print(len(player_info['playerSeasonData'][1:]))print(player_season_data)if len(player_season_data) != 25:print(" "+player_info['playerId']+'缺少'+player_season_data[0]+'赛季数据')breaksql_insert = 'insert into `%s`'%(player_info["playerId"]) +' values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'# print('sds')cursor.execute(sql_insert, (player_season_data[0], player_season_data[1], player_season_data[2], player_season_data[3],player_season_data[4], player_season_data[5], player_season_data[6], player_season_data[7],player_season_data[8], player_season_data[9], player_season_data[10], player_season_data[11],player_season_data[12], player_season_data[13], player_season_data[14], player_season_data[15],player_season_data[16], player_season_data[17], player_season_data[18], player_season_data[19],player_season_data[20], player_season_data[21], 
player_season_data[22], player_season_data[23],player_season_data[24]))print(" "+player_season_data[0]+'赛季插入数据库完成')for player_career_data in player_info['playerCareerData'][1:]:if len(player_career_data) != 25:print(" "+player_info['playerId'] + '缺少场均数据')breaksql_insert = 'insert into playerCareerdata values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'cursor.execute(sql_insert, (player_info['playerId'],player_career_data[0], player_career_data[1], player_career_data[2], player_career_data[3],player_career_data[4], player_career_data[5], player_career_data[6], player_career_data[7],player_career_data[8], player_career_data[9], player_career_data[10], player_career_data[11],player_career_data[12], player_career_data[13], player_career_data[14], player_career_data[15],player_career_data[16], player_career_data[17], player_career_data[18], player_career_data[19],player_career_data[20], player_career_data[21], player_career_data[22], player_career_data[23],player_career_data[24]))print(" "+player_info['playerId']+"插入场均数据库完毕")db.commit()except Exception as e:db.rollback()db.close()if __name__ == "__main__":spider = Spider()# print(player_list)# spider.get_page_info('letter_link.txt','http://www.stat-nba.com/playerList.php')# db_players = spider.get_playername_fromdb()# print(len(db_players))# player_list = spider.request_letter_page('txt文件/letter_link.txt')#得到的是最终的球员姓名,和url# print(len(player_list))# spider.write_dict_to_csv('csv文件/player_list.csv',player_list)# print(len(player_list))#这个实际上是两个库中都有的球员信息#单纯的为了检验有没有数据错误# missing_players = spider.get_missing_players(db_players,player_list)# spider.write_dict_to_csv('csv文件/missing_players.csv',missing_players)# print(missing_players)#因为这里有一些拿不到,通过中文名加英文名的组合依旧拿不到,无能为力,去除这些,还剩488个现役球员# print(len(missing_players))# player_data_list = spider.read_data_from_csv('csv文件/player_list.csv')# spider.update_playersdb(player_data_list)#数据库更新完毕,接下来要爬取球员数据# print(len(player_data_list))# player_game_data = spider.spider_player_page()#这里从网站上获得了所有的数据,spider.write_player_season_data_todb()

That wraps up scraping NBA player information with requests and XPath - hopefully it is of some help.



