Scraping Poems with Scrapy: Code Notes

These notes record the code for scraping classical Chinese poems from gushiwen.org with Scrapy.

Create the project

scrapy startproject poems

Create the spider

cd poems\poems\spiders

The general form is scrapy genspider <name> <domain>:
scrapy genspider poem_spider www.gushiwen.org

In poem_spider.py, change the start URL:
start_urls = ['https://www.gushiwen.org/default_1.aspx']

Define the data structure in items.py

import scrapy


class PoemsItem(scrapy.Item):
    title = scrapy.Field()    # title
    dynasty = scrapy.Field()  # dynasty
    author = scrapy.Field()   # author
    content = scrapy.Field()  # poem text
    tags = scrapy.Field()     # tags

Configure settings.py

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3648.400 QQBrowser/10.4.3319.400"
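Besides the user agent, a few other settings are often adjusted for a crawl like this one. The fragment below is only a sketch: the setting names are standard Scrapy settings, but the values are assumptions and are not taken from the original project.

```python
# settings.py (fragment); the values below are illustrative assumptions

# Export feeds as UTF-8 so Chinese text is written as readable characters
# instead of escaped code points in the JSON output.
FEED_EXPORT_ENCODING = "utf-8"

# Pause between requests to the same site, in seconds, to keep the crawl polite.
DOWNLOAD_DELAY = 1

# Respect the rules published in the site's robots.txt.
ROBOTSTXT_OBEY = True
```

Without FEED_EXPORT_ENCODING, Scrapy's JSON exporter escapes non-ASCII characters by default, which makes a file of Chinese poems hard to read by eye.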

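Add a launcher file, main.py

import os
import sys

from scrapy.cmdline import execute

# Make the directory containing main.py importable.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# Equivalent to running "scrapy crawl poem_spider" on the command line.
execute(["scrapy", "crawl", "poem_spider"])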
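The launcher hands Scrapy the same argument list a shell would. As a small sketch, that list can be built by a helper so that Scrapy's -o output option can also be switched on from main.py. The helper name build_crawl_args and its parameters are hypothetical and not part of Scrapy.

```python
def build_crawl_args(spider_name, output=None):
    """Return the argv list for `scrapy crawl`, optionally with `-o <file>`.

    Hypothetical helper: it only builds the list and does not start a crawl.
    """
    args = ["scrapy", "crawl", spider_name]
    if output:
        args += ["-o", output]
    return args


if __name__ == "__main__":
    # In main.py this list would be passed to scrapy.cmdline.execute(...).
    print(build_crawl_args("poem_spider", output="test.json"))
```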

Write the spider

Try out the selectors in the Scrapy shell first:
scrapy shell https://www.gushiwen.org/default_1.aspx

# -*- coding: utf-8 -*-
import scrapy

from poems.items import PoemsItem


class PoemSpiderSpider(scrapy.Spider):
    name = 'poem_spider'  # spider name
    allowed_domains = ['www.gushiwen.org']  # allowed domains
    start_urls = ['https://www.gushiwen.org/default_1.aspx']  # entry URL

    def parse(self, response):
        docs = response.css(".left .sons")
        for doc in docs:
            title = doc.css("b::text").extract_first()
            source = doc.css(".source a::text").extract()
            # Skip blocks missing a title or a dynasty/author pair instead of raising.
            if title is None or len(source) != 2:
                continue
            poem_item = PoemsItem()
            poem_item['title'] = title
            poem_item['dynasty'], poem_item['author'] = source
            poem_item['content'] = "".join(doc.css(".contson::text").extract()).strip()
            poem_item['tags'] = ",".join(doc.css(".tag a::text").extract())
            yield poem_item

        next_link = response.css(".pagesright .amore::attr(href)").extract_first()
        if next_link:
            # urljoin handles both relative and absolute hrefs.
            yield scrapy.Request(response.urljoin(next_link))
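The field handling inside parse can be checked without touching the network by pulling it out into plain functions. The sketch below uses only the standard library. The function names are hypothetical, the dynasty-then-author order is assumed from the spider above, and the example href "/default_2.aspx" is an assumed value, not one observed on the site.

```python
from urllib.parse import urljoin


def split_source(parts):
    """Return (dynasty, author) from the link texts under .source.

    Assumes the order used by the spider (dynasty first, then author).
    Returns None when there are not exactly two non-empty parts.
    """
    cleaned = [p.strip() for p in parts if p.strip()]
    if len(cleaned) != 2:
        return None
    return cleaned[0], cleaned[1]


def clean_content(fragments):
    """Join the text fragments of a poem and trim surrounding whitespace."""
    return "".join(fragments).strip()


def next_page_url(current_url, href):
    """Resolve the next-page href against the current page URL."""
    if not href:
        return None
    return urljoin(current_url, href)


if __name__ == "__main__":
    print(split_source(["唐代", "李白"]))
    print(clean_content(["\n床前明月光，", "疑是地上霜。\n"]))
    # "/default_2.aspx" is a made-up href used only for illustration.
    print(next_page_url("https://www.gushiwen.org/default_1.aspx", "/default_2.aspx"))
```

Resolving the href against the current URL, rather than concatenating it onto the domain, keeps working if the site ever emits an absolute link.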

Save the results to a JSON file:
scrapy crawl poem_spider -o test.json
Save the results to a CSV file:
scrapy crawl poem_spider -o test.csv
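Once a crawl has finished, the exported files can be read back with the standard library to check the result. This is a minimal sketch that assumes the file names test.json and test.csv from the commands above and the field names from PoemsItem. The function names are hypothetical.

```python
import csv
import json


def load_json_poems(path):
    """Load the list of poem records from a JSON export."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_csv_poems(path):
    """Load poem records from a CSV export as a list of dicts."""
    # utf-8-sig also accepts a file that starts with a byte order mark.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))


def count_by_dynasty(poems):
    """Count how many records fall under each dynasty value."""
    counts = {}
    for poem in poems:
        dynasty = poem.get("dynasty", "")
        counts[dynasty] = counts.get(dynasty, 0) + 1
    return counts
```

For example, count_by_dynasty(load_json_poems("test.json")) gives a quick view of how the scraped poems are distributed across dynasties.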
