Using the Python lxml module

Parsing HTML documents with XPath

Parse an HTML document from a string and return the root node

lxml.etree.HTML(text, parser=None, base_url=None)

Parses an HTML document from a string constant. Returns the root node (or the result returned by a parser target). This function can be used to embed “HTML literals” in Python code.

To override the parser with a different HTMLParser you can pass it to the parser keyword argument.

The base_url keyword argument lets you set the original base URL of the document, which supports relative paths when looking up external entities (DTD, XInclude, …).

Documentation: https://lxml.de/apidoc/lxml.etree.html#lxml.etree.HTML
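The parser and base_url keyword arguments described above can be sketched as follows. This is a minimal illustration, not the only way to use them: the input string, the choice of remove_comments=True, and the example.com address are all assumptions made for the example.

```python
from lxml import etree

# Hypothetical input: a fragment containing a comment and a relative link.
text = '<p>Hello<!-- note --> <a href="page.html">link</a></p>'

# A custom HTMLParser that drops comments, passed via the parser keyword.
parser = etree.HTMLParser(remove_comments=True)

# base_url is an assumed example address, recorded as the document's URL.
root = etree.HTML(text, parser=parser, base_url="http://example.com/dir/")

print(root.tag)                        # tag of the root element
print(root.xpath('//comment()'))       # empty: the parser removed comments
print(root.xpath('//a/@href'))         # the href attribute as written
print(root.getroottree().docinfo.URL)  # URL stored on the parsed document
```

Note that etree.HTML wraps the fragment in html and body elements, so the returned root is the html element rather than the p element.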

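# Typical use case: the HTML string returned by requests, i.e. resp.text
from lxml import etree

html = '<html><head><title>a-test-page</title></head><body><li>line1</li><li>line22</li></body></html>'
tree = etree.HTML(html)
tree.xpath('//title/text()')   # ['a-test-page']
tree.xpath('//li')             # a list of two <li> Element objects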
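Since an XPath query such as //li returns Element objects rather than strings, a common next step is to pull the text out of them. The sketch below reuses the same sample page; the variable names are illustrative only.

```python
from lxml import etree

html = ('<html><head><title>a-test-page</title></head>'
        '<body><li>line1</li><li>line22</li></body></html>')
tree = etree.HTML(html)

# Option 1: select the elements, then read each element's .text attribute.
items = tree.xpath('//li')
texts = [li.text for li in items]
print(texts)        # ['line1', 'line22']

# Option 2: select the text nodes directly in the XPath expression.
same_texts = tree.xpath('//li/text()')
print(same_texts)   # ['line1', 'line22']

# Option 3: string() returns a single string instead of a list.
title = tree.xpath('string(//title)')
print(title)        # a-test-page
```

Selecting elements first is convenient when attributes or child nodes are also needed; selecting text() directly is shorter when only the text matters.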

Parsing XML documents with XPath
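A minimal sketch of the XML counterpart, assuming the document is available as a string and is parsed with etree.fromstring. The catalog/book sample data is made up for illustration.

```python
from lxml import etree

# Hypothetical XML document used only for this example.
xml = (b'<catalog>'
       b'<book id="b1"><title>First</title></book>'
       b'<book id="b2"><title>Second</title></book>'
       b'</catalog>')

# fromstring parses XML strictly and returns the root element.
root = etree.fromstring(xml)

titles = root.xpath('//book/title/text()')               # text of every title
ids = root.xpath('//book/@id')                           # every id attribute
second = root.xpath('//book[@id="b2"]/title/text()')     # filter by attribute

print(titles)   # ['First', 'Second']
print(ids)      # ['b1', 'b2']
print(second)   # ['Second']
```

Unlike etree.HTML, the XML parser does not repair malformed markup, so a badly formed document raises etree.XMLSyntaxError instead of being silently fixed.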

 

 
