AI Web Scraping: Automatically Fetching Baidu's Real-Time Hot Search List

This article shows how to use an AI assistant to write a web scraper that automatically fetches Baidu's real-time hot search list.

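Task and goal: automatically fetch the title and hot search index of each entry on Baidu's real-time hot search list.

Title element: <div class="c-single-text-ellipsis"> 东部战区台岛战巡演练模拟动画 </div>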
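To make the target concrete, here is a minimal sketch of pulling the text out of such a div using only Python's standard-library html.parser. The class ClassTextExtractor and the helper extract_texts are hypothetical names introduced for illustration; the script later in this article uses selenium instead.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside a matching div
        self.parts = []     # text fragments of the div currently being read
        self.results = []   # stripped text of every matching div seen so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # A nested div inside a matching div: track it so we close correctly.
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_texts(html, target_class):
    """Return the text of all divs carrying target_class, in document order."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


sample = '<div class="c-single-text-ellipsis"> 东部战区台岛战巡演练模拟动画 </div>'
print(extract_texts(sample, "c-single-text-ellipsis"))
```

The surrounding whitespace inside the div is stripped, which is also what the title text should look like once it is written to the spreadsheet.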

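Step 1: enter the following prompt into DeepSeek:

You are a Python web scraping expert. Write a Python script that completes the following scraping task:

Create a new Excel file, topbaidu.xlsx, in the folder F:\aivideo.

Set the chromedriver path to: "D:\Program Files\chromedriver125\chromedriver.exe"

Use selenium to open the page: https://top.baidu.com/board?tab=realtime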

The request headers are:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding: gzip, deflate, br, zstd
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cache-Control: max-age=0
Connection: keep-alive
Host: top.baidu.com
Referer: https://top.baidu.com/board?platform=pc&tab=homepage&sa=pc_index_homepage_all
Sec-Ch-Ua: "Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36

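Parse the page source and print it.

Locate the div tags with class="c-single-text-ellipsis", extract their text as the hot search titles, and save them to column 1 of topbaidu.xlsx.

Locate the div tags with class="hot-index_1Bl1a", extract their text as the hot search indices, and save them to column 2 of topbaidu.xlsx.

Notes:

Print progress information to the screen at every step.

Pause for a random 1-10 seconds after parsing each page.

Set request headers to cope with anti-scraping measures.

DataFrame.append was deprecated in pandas 1.4.0 and removed in pandas 2.0, so use the concat function instead of append.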
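The last note in the prompt is worth a quick illustration. The sketch below shows the concat replacement for the old append call; the column names and row values are made-up placeholders, not real hot search data.

```python
import pandas as pd

# Existing table with one placeholder row.
df = pd.DataFrame([{"title": "example title A", "hot_index": "100"}])

# New rows collected by the scraper (placeholder values).
new_rows = pd.DataFrame([
    {"title": "example title B", "hot_index": "90"},
    {"title": "example title C", "hot_index": "80"},
])

# Old style, no longer available in pandas 2.0 and later:
#   df = df.append(new_rows, ignore_index=True)
# Replacement: concatenate the frames and renumber the index from 0.
df = pd.concat([df, new_rows], ignore_index=True)

print(df)
```

Collecting rows in a plain list first and concatenating once at the end, as the generated script does, avoids rebuilding the DataFrame on every row.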

Step 2: review the Python code that DeepSeek generated:

import os
import time
import random

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

# Path to chromedriver
chromedriver_path = "D:\\Program Files\\chromedriver125\\chromedriver.exe"

# Request headers as generated. Selenium has no general option for setting
# request headers, so only the user-agent entry is passed to Chrome below;
# the other entries are not sent by this script.
headers = {
    'authority': 'top.baidu.com',
    'method': 'GET',
    'path': '/board?tab=realtime',
    'scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-encoding': 'gzip, deflate, br, zstd',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': '__root_domain_v=.baidu.com; _qddaz=QD.484716194472545; _ntes_origin_from=sogou; _ntes_nuid=af525edefbd3d363cd5876dbe902d85b; s_n_f_l_n3=01cb466d244b7db51716517702302; _antanalysis_s_id=1716517702995; UserProvince=%u5168%u56FD; ne_analysis_trace_id=1716518441938; vinfo_n_f_l_n3=01cb466d244b7db5.1.0.1716517702302.0.1716518672850',
    'if-modified-since': 'Fri, 24 May 2024 02:40:05 GMT',
    'priority': 'u=0, i',
    'referer': 'https://www.sogou.com/link?url=hedJjaC291M4oaTBlXc5yCiioa5eDVgw',
    'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

# Chrome options: apply the user-agent string from the headers above
chrome_options = Options()
chrome_options.add_argument(f'user-agent={headers["user-agent"]}')

# Chrome service
service = Service(chromedriver_path)

# Start the Chrome browser
driver = webdriver.Chrome(service=service, options=chrome_options)

# Open the page
url = 'https://top.baidu.com/board?tab=realtime'
driver.get(url)

# Wait for the page to finish loading
time.sleep(5)

# Get the page source
html = driver.page_source
print("Page source:")
print(html)

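# Create the Excel file (and the target folder, if it does not exist yet)
os.makedirs('F:\\aivideo', exist_ok=True)
excel_path = os.path.join('F:\\aivideo', 'topbaidu.xlsx')
df = pd.DataFrame(columns=['热搜标题', '热搜指数'])
df.to_excel(excel_path, index=False)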

# Locate the hot search titles and indices
titles = driver.find_elements(By.CLASS_NAME, 'c-single-text-ellipsis')
indices = driver.find_elements(By.CLASS_NAME, 'hot-index_1Bl1a')

# Collect the titles and indices row by row
data = []
for title, index in zip(titles, indices):
    title_text = title.text
    index_text = index.text

    # Print progress
    print(f"Hot search title: {title_text}")
    print(f"Hot search index: {index_text}")

    # Add the row to the list
    data.append({'热搜标题': title_text, '热搜指数': index_text})

# Random pause once the page has been parsed (per page, as the prompt asks)
time.sleep(random.randint(1, 10))

# Use concat to add the rows to the DataFrame, then save to Excel
df = pd.concat([df, pd.DataFrame(data)], ignore_index=True)
df.to_excel(excel_path, index=False)

# Close the browser
driver.quit()
print("Hot search data saved to the Excel file.")
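Note that in the script above only the user-agent entry of headers reaches Chrome, through add_argument. One possible way to send further headers, sketched here under assumptions, is the Chrome DevTools Protocol command Network.setExtraHTTPHeaders, which Selenium 4 exposes on Chromium drivers through driver.execute_cdp_cmd. The helper name apply_extra_headers and the SKIPPED_KEYS set are hypothetical; which keys to leave out is a judgment call of this sketch, not something the original script specifies.

```python
# Keys this sketch chooses not to forward (an assumption): HTTP/2 pseudo-header
# names and values the browser normally manages on its own.
SKIPPED_KEYS = {
    "authority", "method", "path", "scheme",
    "host", "connection", "cookie", "accept-encoding",
}


def apply_extra_headers(driver, headers):
    """Send the remaining headers with every request via the DevTools Protocol.

    `driver` is expected to be a Chromium-based Selenium 4 driver, which
    provides execute_cdp_cmd(command, params). Returns the headers that
    were actually forwarded.
    """
    extra = {
        name: value
        for name, value in headers.items()
        if name.lower().lstrip(":") not in SKIPPED_KEYS
    }
    # The Network domain has to be enabled before extra headers take effect.
    driver.execute_cdp_cmd("Network.enable", {})
    driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": extra})
    return extra
```

In the script it would be called once, after webdriver.Chrome(...) and before driver.get(url), for example apply_extra_headers(driver, headers).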

Step 3: open Visual Studio Code, create a new .py file, paste the Python code into it, and press F5 to run the program:

Program output:
