goquery: a third-party Go package for parsing web pages

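This article introduces goquery, a third-party Go package for parsing web pages and a staple of web scrapers. It covers the main types the package provides, the methods on them, and a short example.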

goquery is an HTML parsing library written in Go. It lets you query and manipulate a DOM document in a jQuery-like style, which makes it very convenient to use.

1. Official repository and download address

https://github.com/PuerkitoBio/goquery

2. Main structs and methods provided by goquery

2.1. Document represents an HTML document:

type Document struct {
	*Selection
	Url      *url.URL
	rootNode *html.Node
}

1) Document embeds the Selection type (Go has embedding rather than inheritance), so every Selection method can be called directly on a Document.

2) Five ways to initialize a Document

1) From a root node

func NewDocumentFromNode(root *html.Node) *Document {
	return newDocument(root, nil)
}

2) From a URL (commonly used)

func NewDocument(url string) (*Document, error) {
	// Load the URL
	res, e := http.Get(url) // fetch the page at url; the response is res
	if e != nil {
		return nil, e
	}
	return NewDocumentFromResponse(res)
}

3) From an io.Reader

func NewDocumentFromReader(r io.Reader) (*Document, error) {
	root, e := html.Parse(r)
	if e != nil {
		return nil, e
	}
	return newDocument(root, nil), nil
}

4) From an http.Response (also commonly used)

func NewDocumentFromResponse(res *http.Response) (*Document, error) {
	if res == nil { // res is nil: return an error
		return nil, errors.New("Response is nil")
	}
	defer res.Body.Close() // close the body on return, whether parsing failed or succeeded
	if res.Request == nil { // res.Request is nil: return an error
		return nil, errors.New("Response.Request is nil")
	}

	// Parse the HTML into nodes
	root, e := html.Parse(res.Body) // parse the HTML and get the root node
	if e != nil {
		return nil, e
	}

	// Create and fill the document
	return newDocument(root, res.Request.URL), nil
}

5) By cloning an existing Document

func CloneDocument(doc *Document) *Document {
	return newDocument(cloneNode(doc.rootNode), doc.Url)
}

2.2. Selection represents the set of nodes (Nodes) that matched some criteria:

type Selection struct {
	Nodes    []*html.Node
	document *Document
	prevSel  *Selection
}

2.3. Methods provided by the Selection type. These are the most important, core methods for parsing a page.

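1) Array-like positional operations

- Eq(index int) *Selection // reduces the set to the node at the given index

- First() *Selection // reduces the set to its first node

- Last() *Selection // reduces the set to its last node

- Next() *Selection // gets the next sibling of each node

- NextAll() *Selection // gets all following siblings of each node

- Prev() *Selection // gets the previous sibling of each node

- Get(index int) *html.Node // returns the node at the given index

- Index() int // returns the position of the first node in the Selection relative to its siblings

- Slice(start, end int) *Selection // reduces the set to the nodes in the range [start, end)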

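2) Expanding the Selection (adding nodes to it)

- Add(selector string) *Selection // adds the nodes matching the selector to the current set

- AndSelf() *Selection // adds the previous set of elements on the stack to the current set

- Union(sel *Selection) *Selection // alias for AddSelection()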

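3) Filtering methods, which reduce the node set

- End() *Selection // ends the most recent filtering operation and returns the previous Selection

- Filter…() // keeps only the nodes that match

- Has…()

- Intersection() // alias for FilterSelection()

- Not…()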

4) Iterating over the selected nodes

- Each(f func(int, *Selection)) *Selection // visits every node

- EachWithBreak(f func(int, *Selection) bool) *Selection // visits nodes until f returns false

- Map(f func(int, *Selection) string) (result []string) // returns a slice of strings

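5) Modifying the document

- After…()            // inserts content after each matched element
- Append…()           // appends the given content to the end of each element in the matched set
- Before…()           // inserts content before each matched element
- Clone()             // creates a copy of the matched nodes
- Empty()             // removes all child nodes
- Prepend…()
- Remove…()
- ReplaceWith…()
- Unwrap()
- Wrap…()
- WrapAll…()
- WrapInner…()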

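6) Inspecting or reading node properties

- Attr(), RemoveAttr(), SetAttr()  // get, remove, and set an attribute value
- AddClass(), HasClass(), RemoveClass(), ToggleClass()
- Html()  // gets the inner HTML of the first node
- Length() // returns the number of elements in the Selection
- Size() // alias for Length()
- Text()  // gets the combined text of the nodes and their descendants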

7) Querying a node's identity

- Contains() // reports whether the given node lies inside one of the Selection's nodes
- Is…()

8) Moving around the document tree (the commonly used node lookup methods)

- Children…()
- Contents()
- Find…()
- Next…()
- Parent[s]…()
- Prev…()
- Siblings…()

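3. Example

The URL and the two selectors below are placeholders; replace them with the page and elements you want to scrape. The standard log package is used here in place of log4go so that the program has no extra dependency.

package main

import (
	"fmt"
	"log"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	url := "https://example.com" // placeholder URL
	doc, err := goquery.NewDocument(url)
	if err != nil {
		log.Fatal(err)
	}
	doc.Find("div.item").Each(func(i int, s *goquery.Selection) { // find the node set and iterate over it
		text := s.Find("a").Text() // text of the matching nodes inside each item
		fmt.Println(text)
	})
}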