使用 XPATH 和 HTML Cleaner 解析 HTML/XML

2024-01-08 20:48

本文主要是介绍使用 XPATH 和 HTML Cleaner 解析 HTML/XML,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

使用 XPATH 和 HTML Cleaner 解析 HTML/XML
(Using XPATH and HTML Cleaner to parse HTML / XML)

太阳火神的美丽人生 (http://blog.csdn.net/opengl_es)

本文遵循“署名-非商业用途-保持一致”创作公用协议

转载请保留此句:太阳火神的美丽人生 -  本博客专注于 敏捷开发及移动和物联设备研究:iOS、Android、Html5、Arduino、pcDuino否则,出自本博客的文章拒绝转载或再转载,谢谢合作。



使用 XPATH 和 HTML Cleaner 解析 HTML/XML
(Using XPATH and HTML Cleaner to parse HTML / XML)

JANUARY 5, 2010
tags: android, examples, HTML, parse, scraping, XML, XPATH

大家好
Hey everyone,

有时我发现有一种能力十分有用,尤其在 Web 相关的应用中,那就是从 web 站点获取 HTML 并且从 HTML 解析数据,或是任何你要想得到的内容(对于我的情况大多总是数据)。
So something that I’ve found to be extremely useful (especially in web related applications) is the ability to retrieve HTML from websites and parse their HTML for data or whatever you may be looking for (in my case it is almost always data).


I actually use this technique to do the real time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you’re looking for an example on how to retrieve and parse HTML and run “queries” over it using, say, XPATH, then this post is for you.

Now, before we begin, in order to do this you will have to reference an external JAR in your project’s build path. The JAR that I use comes from HtmlCleaner which even gives you an example of how they use it here HtmlCleaner Example, but in addition to that I’ll show you an example of how I use it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
public class OptionScraper {
     // EXAMPLE XPATH QUERIES IN THE FORM OF STRINGS - WILL BE USED LATER
     private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2" ;
     private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']" ;
     private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span" ;
     // TAGNODE OBJECT, ITS USE WILL COME IN LATER
     private static TagNode node;
     // A METHOD THAT HELPS ME RETRIEVE THE STOCK OPTION'S DATA BASED OFF THE NAME (I.E. GOUAA IS ONE OF GOOGLE'S STOCK OPTIONS)
     public static Option getOptionFromName(String name) throws XPatherException, ParserConfigurationException,SAXException, IOException, XPatherException {
         // THE URL WHOSE HTML I WANT TO RETRIEVE AND PARSE
         String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();
         // THIS IS WHERE THE HTMLCLEANER COMES IN, I INITIALIZE IT HERE
         HtmlCleaner cleaner = new HtmlCleaner();
         CleanerProperties props = cleaner.getProperties();
         props.setAllowHtmlInsideAttributes( true );
         props.setAllowMultiWordAttributes( true );
         props.setRecognizeUnicodeChars( true );
         props.setOmitComments( true );
         // OPEN A CONNECTION TO THE DESIRED URL
         URL url = new URL(option_url);
         URLConnection conn = url.openConnection();
         //USE THE CLEANER TO "CLEAN" THE HTML AND RETURN IT AS A TAGNODE OBJECT
         node = cleaner.clean( new InputStreamReader(conn.getInputStream()));
         // ONCE THE HTML IS CLEANED, THEN YOU CAN RUN YOUR XPATH EXPRESSIONS ON THE NODE, WHICH WILL THEN RETURN AN ARRAY OF TAGNODE OBJECTS (THESE ARE RETURNED AS OBJECTS BUT GET CASTED BELOW)
         Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
         Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
         Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);
         // HERE I JUST DO A SIMPLE CHECK TO MAKE SURE THAT MY XPATH WAS CORRECT AND THAT AN ACTUAL NODE(S) WAS RETURNED
         if (info_nodes.length > 0 ) {
             // CASTED TO A TAGNODE
             TagNode info_node = (TagNode) info_nodes[ 0 ];
             // HOW TO RETRIEVE THE CONTENTS AS A STRING
             String info = info_node.getChildren().iterator().next().toString().trim();
             // SOME METHOD THAT PROCESSES THE STRING OF INFORMATION (IN MY CASE, THIS WAS THE STOCK QUOTE, ETC)
             processInfoNode(o, info);
         }
         if (time_nodes.length > 0 ) {
             TagNode time_node = (TagNode) time_nodes[ 0 ];
             String date = time_node.getChildren().iterator().next().toString().trim();
             // DATE RETURNED IN 15-JAN-10 FORMAT, SO THIS IS SOME METHOD I WROTE TO JUST PARSE THAT STRING INTO THE FORMAT THAT I USE
             processDateNode(o, date);
         }
         if (price_nodes.length > 0 ) {
             TagNode price_node = (TagNode) price_nodes[ 0 ];
             double price = Double.parseDouble(price_node.getChildren().iterator().next().toString().trim());
             o.setPremium(price);
         }
         return o;
     }
}

So that’s it! Once you include the JAR in your build path, everything else is pretty easy! It’s a great tool to use. However, it does require knowledge of XPATH but XPATH isn’t too hard to pick up and is useful to know so if you don’t know it then take a look at the link.

Now, a warning to everyone. It’s documented that the XPATH expressions recognized by HtmlCleaner is not complete in the sense that only “basic” XPATH is recognized. What’s excluded? For instance, you can’t use any of the “axes” operators (i.e. parent, ancestor, following, following-sibling, etc), but in my experience everything else is fair game. Yes, it sucks, and many times it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull the desired information.

And of course, this technique works for XML documents as well!

Hope this was helpful to everyone. Let me know if you’re confused anywhere.

- jwei



这篇关于使用 XPATH 和 HTML Cleaner 解析 HTML/XML的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!


原文地址:
本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.chinasem.cn/article/584772

相关文章

Python多进程、多线程、协程典型示例解析(最新推荐)

《Python多进程、多线程、协程典型示例解析(最新推荐)》:本文主要介绍Python多进程、多线程、协程典型示例解析(最新推荐),本文通过实例代码给大家介绍的非常详细,对大家的学习或工作具有一定... 目录一、multiprocessing(多进程)1. 模块简介2. 案例详解:并行计算平方和3. 实现逻

Spring Boot拦截器Interceptor与过滤器Filter深度解析(区别、实现与实战指南)

《SpringBoot拦截器Interceptor与过滤器Filter深度解析(区别、实现与实战指南)》:本文主要介绍SpringBoot拦截器Interceptor与过滤器Filter深度解析... 目录Spring Boot拦截器(Interceptor)与过滤器(Filter)深度解析:区别、实现与实

使用Vue-ECharts实现数据可视化图表功能

《使用Vue-ECharts实现数据可视化图表功能》在前端开发中,经常会遇到需要展示数据可视化的需求,比如柱状图、折线图、饼图等,这类需求不仅要求我们准确地将数据呈现出来,还需要兼顾美观与交互体验,所... 目录前言为什么选择 vue-ECharts?1. 基于 ECharts,功能强大2. 更符合 Vue

如何合理使用Spring的事务方式

《如何合理使用Spring的事务方式》:本文主要介绍如何合理使用Spring的事务方式,具有很好的参考价值,希望对大家有所帮助,如有错误或未考虑完全的地方,望不吝赐教... 目录1、介绍1.1、底层构造1.1.事务管理器1.2.事务定义信息1.3.事务状态1.4.联系1.2、特点1.3、原理2. Sprin

Vue中插槽slot的使用示例详解

《Vue中插槽slot的使用示例详解》:本文主要介绍Vue中插槽slot的使用示例详解,本文通过实例代码给大家介绍的非常详细,对大家的学习或工作具有一定的参考借鉴价值,需要的朋友参考下吧... 目录一、插槽是什么二、插槽分类2.1 匿名插槽2.2 具名插槽2.3 作用域插槽三、插槽的基本使用3.1 匿名插槽

springboot+vue项目怎么解决跨域问题详解

《springboot+vue项目怎么解决跨域问题详解》:本文主要介绍springboot+vue项目怎么解决跨域问题的相关资料,包括前端代理、后端全局配置CORS、注解配置和Nginx反向代理,... 目录1. 前端代理(开发环境推荐)2. 后端全局配置 CORS(生产环境推荐)3. 后端注解配置(按接口

使用WPF实现窗口抖动动画效果

《使用WPF实现窗口抖动动画效果》在用户界面设计中,适当的动画反馈可以提升用户体验,尤其是在错误提示、操作失败等场景下,窗口抖动作为一种常见且直观的视觉反馈方式,常用于提醒用户注意当前状态,本文将详细... 目录前言实现思路概述核心代码实现1、 获取目标窗口2、初始化基础位置值3、创建抖动动画4、动画完成后

Vue 2 项目中配置 Tailwind CSS 和 Font Awesome 的最佳实践举例

《Vue2项目中配置TailwindCSS和FontAwesome的最佳实践举例》:本文主要介绍Vue2项目中配置TailwindCSS和FontAwesome的最... 目录vue 2 项目中配置 Tailwind css 和 Font Awesome 的最佳实践一、Tailwind CSS 配置1. 安

PyQt5 QDate类的具体使用

《PyQt5QDate类的具体使用》QDate是PyQt5中处理日期的核心类,本文主要介绍了PyQt5QDate类的具体使用,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价... 目录核心功能常用方法及代码示例​1. 创建日期对象​2. 获取日期信息​3. 日期计算与比较​4. 日

Go语言使用slices包轻松实现排序功能

《Go语言使用slices包轻松实现排序功能》在Go语言开发中,对数据进行排序是常见的需求,Go1.18版本引入的slices包提供了简洁高效的排序解决方案,支持内置类型和用户自定义类型的排序操作,本... 目录一、内置类型排序:字符串与整数的应用1. 字符串切片排序2. 整数切片排序二、检查切片排序状态: