Before we can use lxml, we need to be able to use XPath. With XPath, an HTML document can be processed and parsed as if it were an XML document.

One. Simple use of XPath

XPath (XML Path Language) is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document.

1. Installing the development tools

In the Chrome browser you can install the XPath Helper plug-in. If you download the plug-in from the Internet, the file you get ends in .crx and cannot be added directly to the browser's extensions. Rename the file so that it ends in .zip, create a new folder, and unzip the .zip file into it. Then open the browser's extensions page, choose "Load unpacked extension", and select that folder to install the plug-in.

2. Syntax

XPath uses path expressions to select a node or a set of nodes in an XML document. Nodes are selected by following paths or steps. These path expressions look very much like the paths we use in ordinary computer file systems.

  • Example XML document

    <?xml version="1.0" encoding="ISO-8859-1"?>

    <bookstore>

      <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
      </book>

      <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
      </book>

    </bookstore>

    This document is used for demonstration in the examples below.

  • Selecting nodes

XPath uses path expressions to select nodes in an XML document. A node is selected by following a path or a sequence of steps.

Common path expressions:

Expression    Description
nodename      Selects all child nodes of the named node.
/             Selects from the root node.
//            Selects nodes in the document from the current node that match the selection, no matter where they are.
.             Selects the current node.
..            Selects the parent of the current node.
@             Selects attributes.

Examples:

Path expression    Result
bookstore          Selects all child nodes of the bookstore element.
/bookstore         Selects the root element bookstore.
bookstore/book     Selects all book elements that are children of bookstore.
//book             Selects all book elements, no matter where they are in the document.
bookstore//book    Selects all book elements that are descendants of the bookstore element, no matter where they appear under bookstore.
//@lang            Selects all attributes named lang.

Note: if a path starts with a forward slash (/), it always represents an absolute path to an element!
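To see these path expressions in action, here is a minimal sketch using the lxml library (introduced in part Two below) run against the bookstore document above; the counts and values in the comments follow from that document:

    from lxml import etree

    # The bookstore document from above, as a byte string
    xml = b"""<bookstore>
      <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
      <book><title lang="eng">Learning XML</title><price>39.95</price></book>
    </bookstore>"""

    root = etree.fromstring(xml)                 # the bookstore element

    print(root.xpath("/bookstore")[0].tag)       # 'bookstore' -- absolute path from the root
    print(len(root.xpath("/bookstore/book")))    # 2 -- book children of bookstore
    print(len(root.xpath("//book")))             # 2 -- book elements anywhere in the document
    print(root.xpath("//@lang"))                 # ['eng', 'eng'] -- all lang attributes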
  • Predicates

Predicates are used to find a specific node, or a node that contains a specific value. They are embedded in square brackets.

Examples:

Path expression                      Result
/bookstore/book[1]                   Selects the first book element that is a child of bookstore.
/bookstore/book[last()]              Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]            Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]        Selects the first two book elements that are children of bookstore.
//title[@lang]                       Selects all title elements that have an attribute named lang.
//title[@lang='eng']                 Selects all title elements that have a lang attribute with the value 'eng'.
/bookstore/book[price>35.00]         Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title   Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.
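As a quick check of the predicate syntax, the following sketch (again using lxml on the bookstore document above) prints the results the table predicts:

    from lxml import etree

    xml = b"""<bookstore>
      <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
      <book><title lang="eng">Learning XML</title><price>39.95</price></book>
    </bookstore>"""

    root = etree.fromstring(xml)

    print(root.xpath("/bookstore/book[1]/title/text()"))            # ['Harry Potter'] -- first book
    print(root.xpath("/bookstore/book[last()]/title/text()"))       # ['Learning XML'] -- last book
    print(root.xpath("//title[@lang='eng']/text()"))                # ['Harry Potter', 'Learning XML']
    print(root.xpath("/bookstore/book[price>35.00]/title/text()"))  # ['Learning XML'] -- price above 35.00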
  • Selecting unknown nodes and attributes

XPath wildcards can be used to select unknown XML elements and attributes.

Wildcards:

Wildcard   Description
*          Matches any element node.
@*         Matches any attribute node.

Examples:

Path expression   Result
/bookstore/*      Selects all child elements of the bookstore element.
//*               Selects all elements in the document.
//title[@*]       Selects all title elements that have at least one attribute.
  • Selecting several paths

By using the “|” operator in a path expression, you can select several paths.

Examples:

Path expression                  Result
//book/title | //book/price      Selects all title and price elements of all book elements.
//title | //price                Selects all title and price elements in the document.
/bookstore/book/title | //price  Selects all title elements of the book elements of the bookstore element, as well as all price elements in the document.
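The wildcard and union expressions can be tried the same way. This small sketch assumes the same two-book bookstore document and shows that lxml returns the selected nodes in document order:

    from lxml import etree

    xml = b"""<bookstore>
      <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
      <book><title lang="eng">Learning XML</title><price>39.95</price></book>
    </bookstore>"""

    root = etree.fromstring(xml)

    print([e.tag for e in root.xpath("/bookstore/*")])   # ['book', 'book'] -- all children of bookstore
    print(len(root.xpath("//*")))                        # 7 -- every element in the document
    print(len(root.xpath("//title[@*]")))                # 2 -- title elements with at least one attribute
    print([e.tag for e in root.xpath("//book/title | //book/price")])
    # ['title', 'price', 'title', 'price'] -- the union, in document order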

3. Operators

The operators available in XPath expressions are listed below:

Operator   Description
|          Computes the union of two node-sets
+          Addition
-          Subtraction
*          Multiplication
div        Division
=          Equal to
!=         Not equal to
<          Less than
<=         Less than or equal to
>          Greater than
>=         Greater than or equal to
or         Logical or
and        Logical and
mod        Modulus (division remainder)
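Comparison, boolean, and arithmetic operators can all appear inside XPath expressions. A minimal sketch, again assuming the two-book bookstore document from above:

    from lxml import etree

    xml = b"""<bookstore>
      <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
      <book><title lang="eng">Learning XML</title><price>39.95</price></book>
    </bookstore>"""

    root = etree.fromstring(xml)

    # Comparison and boolean operators inside a predicate
    print(root.xpath("//book[price > 20 and price < 35]/title/text()"))   # ['Harry Potter']

    # Arithmetic operators work on numbers; xpath() returns a float here
    print(root.xpath("sum(//price) div count(//book)"))                   # about 34.97 -- the average price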

Two. The lxml library

lxml is an HTML/XML parser whose main job is to parse HTML/XML and extract data from it.

Like the regular expression engine, lxml is implemented in C. It is a high-performance Python HTML/XML parser, and with XPath syntax we can quickly locate specific elements and node information.

1. Installation

  • It depends on C language libraries and can be installed with pip:

    sudo pip3 install lxml
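  • Optionally, you can confirm that the C extension imports correctly by printing lxml's bundled version constant (just a quick sanity check):

        python3 -c "from lxml import etree; print(etree.LXML_VERSION)"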

2. Basic usage (only some common operations are listed)

  • etree

    • To parse HTML data, we mainly use etree from the lxml library.
  • etree.HTML()

    • The parameter is a string. It parses the string and returns an html element, automatically correcting the HTML code; for example, if the html and body tags are missing, they are added automatically.
  • etree.parse()

    • The parameter is a file name. It reads from the file and returns an _ElementTree.
  • etree.tostring()

    • The parameter is an element or an element tree. It serializes it to bytes.
  • Element.xpath() or _ElementTree.xpath()

    • The parameter is an XPath expression string; it returns a list. If the expression selects elements, the list contains elements; if the expression selects an attribute, the list contains the attribute values.
  • Element.tag

    • The element's tag attribute; returns the element's tag name.
  • Element.text

    • The element's text attribute; returns the element's text content.
  • Example:

    In [1]: from lxml import etree  # import etree

    In [2]: text = '''
       ...: <div>
       ...: <ul>
       ...: <li class="item-0"><a href="link1.html">first item</a></li>
       ...: <li class="item-1"><a href="link2.html">second item</a></li>
       ...: <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
       ...: <li class="item-1"><a href="link4.html">fourth item</a></li>
       ...: <li class="item-0"><a href="link5.html">fifth item</a></li>
       ...: </ul>
       ...: </div>
       ...: '''

    In [3]: html = etree.HTML(text)  # parse the string

    In [4]: html  # an html element is returned
    Out[4]: <Element html at 0x7f3ad0bb8340>

    In [5]: etree.tostring(html)  # serialize to bytes; the html and body tags were added automatically
    Out[5]: b'<html><body><div>\n <ul>\n <li class="item-0"><a href="link1.html">first item</a></li>\n <li class="item-1"><a href="link2.html">second item</a></li>\n <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>\n <li class="item-1"><a href="link4.html">fourth item</a></li>\n <li class="item-0"><a href="link5.html">fifth item</a></li>\n </ul>\n</div>\n</body></html>'

    In [6]: html2 = etree.parse('./test.html')  # read from a file

    In [7]: html2  # an element tree is returned
    Out[7]: <lxml.etree._ElementTree at 0x7fc54d818d00>

    In [8]: etree.tostring(html2)
    Out[8]: b'<body>\n <div>\n <ul>\n <li class="item-0"><a href="link1.html">first item</a></li>\n <li class="item-1"><a href="link2.html">second item</a></li>\n <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>\n <li class="item-1"><a href="link4.html">fourth item</a></li>\n <li class="item-0"><a href="link5.html">fifth item</a></li>\n </ul>\n </div>\n</body>'

    In [9]: element_list = html.xpath('//a')  # call the element's xpath method to select all a elements in the document

    In [10]: element_list  # a list of all a elements is returned
    Out[10]:
    [<Element a at 0x7fc54d849ec0>,
     <Element a at 0x7fc54d91b080>,
     <Element a at 0x7fc54d86fc80>,
     <Element a at 0x7fc54d878e40>,
     <Element a at 0x7fc54d878040>]

    In [11]: element_list[0].tag  # the element's tag attribute returns its tag name
    Out[11]: 'a'

    In [12]: element_list[0].text  # the element's text attribute returns its content
    Out[12]: 'first item'

    In [13]: attr_value_list = html.xpath('//a/@href')  # select the href attribute of every a element in the document

    In [14]: attr_value_list  # a list of href attribute values is returned
    Out[14]: ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
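  • Putting it together: the following is a small standalone sketch (using the same HTML snippet as the session above) that pairs each link's text with its href attribute. Note that a.text is None for the third link because its text sits inside a span, so string(.) is used as a fallback.

    from lxml import etree

    text = '''
    <div>
        <ul>
            <li class="item-0"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    html = etree.HTML(text)            # parse the (possibly incomplete) HTML string

    for a in html.xpath('//a'):        # every a element in the document
        # a.text is None when the element's text sits inside a child element (the span)
        label = a.text if a.text is not None else a.xpath('string(.)')
        print(label, '->', a.get('href'))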
