A Web Scraper's Guide to Kimono

2024-01-26 16:30
Tags: web, guide, scraper, Kimono

Being a frequent reader of Hacker News, I noticed an item on the front page earlier this year which read, “Kimono – Never write a web scraper again.” Although it got a great number of upvotes, the tech junta was quick to note issues, especially developers who know how to write scrapers themselves. The biggest concern was a non-intuitive UX, followed by the first beta version's inability to extract data items from websites as smoothly as the demo video suggested.

I decided to give it a few months before I tested it out, and I finally got the chance to do so recently.

Kimono is a Y Combinator-backed startup trying to do something in a field where others have failed. Kimono is focused on creating APIs for websites which don't have one; in other words, web scraping. Imagine you have a website which shows some data you would like to dynamically process in your website or application. If the website doesn't have an API, you can create one using Kimono by extracting the data items from the website.

Kimono provides an FAQ section, which says that web scraping from public websites “is 100% legal” as long as you check the robots.txt file to see which URL patterns they have disallowed. However, I would advise you to proceed with caution because some websites can pose a problem.

A robots.txt is a file that gives directions to crawlers (usually those of search engines) visiting the website. If webmasters want a page to be available on search engines like Google, they won't disallow robots in the robots.txt file. If they'd prefer that no one scrape their content, they'll specifically mention it in their Terms of Service. You should always look at the terms before creating an API through Kimono.
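
If you want to perform that check programmatically, Python's standard library can parse robots.txt for you. Here's a minimal sketch; the robots.txt location is an assumption based on the site we scrape later in this article:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (assumed location).
robots = RobotFileParser("http://theblogbowl.in/robots.txt")
robots.read()

# Ask whether a generic crawler ("*") may fetch the feeds page.
url = "http://theblogbowl.in/feeds/"
print(url, "is", "allowed" if robots.can_fetch("*", url) else "disallowed")
```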

An example of this is Medium. Their robots.txt file doesn’t mention anything about their public posts, but the following quote from their TOS page shows you shouldn’t scrape them (since it involves extracting data from their HTML/CSS).

For the remainder of the site, you may not duplicate, copy, or reuse any portion of the HTML/CSS, JavaScript, logos, or visual design elements without express written permission from Medium unless otherwise permitted by law.

If you check the #BuiltWithKimono section of their website, you’d notice a few straightforward applications. For instance, there is a price comparison API, which is built by extracting the prices from product pages on different websites.

Let us move on and see how we can use this service.

What are we about to do?

Let's try to accomplish a task while exploring Kimono. The Blog Bowl is a blog directory where you can share and discover blogs. The posts that users have shared are available on the feeds page. Let's try to get a list of blog posts from that page.

The simple thought process when scraping the data is parsing the HTML (or searching through it, in simpler terms) and extracting the information we require. In this case, let’s try to get the title of the post, its link, and the blogger’s name and profile page.
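
To make that concrete, here's a rough sketch of the manual equivalent in Python, using requests and BeautifulSoup. The CSS selectors are hypothetical, since they depend on The Blog Bowl's actual markup:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the feeds page and parse its HTML.
html = requests.get("http://theblogbowl.in/feeds/").text
soup = BeautifulSoup(html, "html.parser")

# The class names below are hypothetical placeholders.
for post in soup.select(".post"):
    title = post.select_one(".title a")    # post title and link
    author = post.select_one(".author a")  # blogger name and profile page
    if title and author:
        print(title.get_text(strip=True), title["href"])
        print(author.get_text(strip=True), author["href"])
```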

Getting Started

[Screenshot: Sign Up Form]

The first step is, of course, to register. Once you’ve signed up, choose either of two options to run Kimono: through a Chrome extension or a bookmarklet.

Stage items to be scraped

We’ll start by using the bookmarklet, where we start with our base URL (http://theblogbowl.in/feeds/). The next step is to select items we would like to store. In our case, we just store the titles of posts and the names of bloggers. The respective links (or any other HTML attributes) associated with these texts are automatically picked up by Kimono. Once you have selected the data you want, you can check the advanced view or the sample data output by changing the views.

[Screenshot: Selecting the Data]

This is where you'll start to notice some hiccups in the process. Making selections isn't very intuitive, but you should be able to figure out the correct process eventually. Once you click on an item, all other similar items on the page are highlighted, and you need to confirm whether each selection is correct by clicking one of the pair of small icons (a tick and a cross) that appears next to it. If you need to add more items to your list, click on the “+” icon at the top and repeat the process.

Kimono gives you the ability to create collections, and group similar data items into one. Although it doesn’t make a difference from the point of view of the scraper, it helps in simplifying the data conceptually, which might help others understand what you did.

Pagination

For any web scraper, managing pagination is a very important issue. As a developer, you'll either check the URL pattern of the pages (http://theblogbowl.in/feeds/?p=[page_no] in our case) and iterate through them, or save the pagination links and open them one by one. Naturally, the former way is better. Kimono supports pagination, and you need to click the icon at the top right to activate the feature.
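
As a rough illustration of the first approach, here's a short Python sketch that iterates the URL pattern directly (the 92-page count appears later in this article; parsing each page is left as in the earlier sketch):

```python
import requests

# Iterate the known URL pattern instead of following "next" links.
BASE = "http://theblogbowl.in/feeds/?p={}"

for page_no in range(1, 93):  # 92 pages at the time of writing
    html = requests.get(BASE.format(page_no)).text
    # ... parse `html` and extract the data items from each page
```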

Click on the button or link that takes you to the next page. On this page, the “>” link does this job, so we select that item after activating the pagination feature.

Click the tick icon as shown in the screenshot below once you are done selecting the next page link.

[Screenshot: Implementing Pagination Data]

When you are all done, click the “Done” button to proceed.

Although it looked like Kimono understood what to look for, I'll explain a limitation of its pagination feature later in the post.

Running the scraper

Once we save the scraper, we can either set it to run at regular intervals, or run it on demand. In our case, we chose the latter. Although there were 92 pages, I set the limit to 500 to see how it goes.

[Screenshot: Running the Scraper]

Results

Once the scraping task is complete, let us look at the results.

[Screenshot: Scraping Results]

Although I put a limit of 500 pages, I stopped it at approximately 18 pages to see the results. Here they are.

[Screenshot: Output]

We were successfully able to extract the data that we required. But is it a perfect service?

When can it go wrong?

In our task, we conveniently selected the next button for the pagination. The official documentation mentions that we must feed the Next link to the API for Kimono to understand the pagination.

This means that Kimono's pagination works only in the presence of a “next” button; websites that don't have one can't be paginated through Kimono.

For instance, this website contains a lot of information and the list of pages is displayed below. However, a “Next” or “Previous” button is absent.
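
If you were scripting this yourself, one workaround is to collect every numbered page link from the pagination block and visit each one. A rough Python sketch; both the URL and the ".pagination a" selector are hypothetical placeholders:

```python
import requests
from bs4 import BeautifulSoup

# Load the first page and collect all numbered page links.
start_url = "http://example.com/listing"  # placeholder
soup = BeautifulSoup(requests.get(start_url).text, "html.parser")
page_urls = {a["href"] for a in soup.select(".pagination a") if a.get("href")}

# Visit each page in turn; no "next" button required.
for url in sorted(page_urls):
    page = BeautifulSoup(requests.get(url).text, "html.parser")
    # ... extract the data items from each page
```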

What's the future for Kimono?

Kimono is great for building APIs for single-page applications. If you require it to scrape multiple pages with different structures, Kimono might not be able to accomplish it.

If you need to scrape complex web pages with a lot of logic in between, Kimono is not advanced enough to meet your needs. Kimono is constantly evolving (after all, it's backed by YC!), and there may come a day when “you don't have to write a web scraper again.”

Until then, you just have to depend on your regex skills and an HTML parser if you want to undertake these complex tasks!

Have you had any experience with Kimono? What do you think of the service?

Translated from: https://www.sitepoint.com/web-scrapers-guide-kimono/
