
Front-end crawler framework - Introduction to Puppeteer (1)

Puppeteer

Preface

I started learning this technology because I wanted to build a movie resource website similar to Renren Film and Television, so I set out to learn web crawling in order to collect the relevant movie resources for download.

Most of what I had heard about crawlers involved Python. Because I am busy at work and do not have much time to learn a new language, I searched online to see whether there was a front-end crawler framework.

Most of the recommendations online pointed to a Node library: Puppeteer.

What is Puppeteer?

Puppeteer is a Node library that provides a set of APIs for controlling Chrome. Put simply, it is a headless Chrome browser (it can also be configured to run with a UI; by default it runs without one). Since it is a browser, Puppeteer can do whatever we can do by hand in a browser, such as mouse and keyboard operations.

What can Puppeteer do?

1.  Generate screenshots and PDFs of web pages
2.  Crawling (the most common use): it can crawl pages that load content asynchronously (almost anything can be crawled)
3.  Simulate user operations (for example: mouse clicks and key presses, submitting forms, opening / closing / logging in to web pages)
4.  Automate UI testing and help analyze site performance
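As a rough illustration of item 2, the sketch below waits for asynchronously loaded elements to appear and then collects their text. The helper name collectTexts and the selector passed to it are my own placeholders; page.waitForSelector and page.$$eval are methods of Puppeteer's Page object.

```javascript
// Hypothetical helper: waits until at least one element matching `selector`
// is present in the DOM, then returns the trimmed text of every match.
// `page` is expected to be a Puppeteer Page (for example from browser.newPage()).
async function collectTexts(page, selector, timeout = 30000) {
  // Resolves once the selector appears, even when the element is rendered
  // by an asynchronous request after the initial page load.
  await page.waitForSelector(selector, { timeout });
  // The callback runs inside the browser page, not in Node,
  // so it must not reference variables from this file.
  return page.$$eval(selector, (elements) =>
    elements.map((el) => el.textContent.trim())
  );
}
```

After navigating a page, it could be called as `const titles = await collectTexts(newPage, ".movie-title");`, where ".movie-title" is only a placeholder selector.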

Operating environment and installation

Because most operations in Puppeteer are asynchronous, you will see async and await throughout the documentation. This syntax was standardized in ES2017 (it is often loosely called ES7).
The official requirements at the time of writing are:

 Before Puppeteer v1.18.1, Node v6.4.0 or later is required.
 From v1.18.1 to v2.1.0, Node v8.9.0 or later is required.
 From v3.0.0 onward, Node v10.18.1 or later is required.
 To use async/await, Node v7.6.0 or later is required.
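The requirements above can be turned into a small check that runs before the crawler starts. This is only a sketch: the names REQUIREMENTS, compareVersions, minimumNodeFor and nodeIsSupported are my own, the table covers only the Puppeteer releases listed above, and treating versions between v2.1.0 and v3.0.0 like v2.1.0 is an assumption.

```javascript
// Minimum Node versions taken from the list above, keyed by the first
// Puppeteer version each requirement applies to (checked newest first).
const REQUIREMENTS = [
  { from: "3.0.0", node: "10.18.1" },
  { from: "1.18.1", node: "8.9.0" },
  { from: "0.0.0", node: "6.4.0" },
];

// Compares two dotted version strings numerically; returns -1, 0, or 1.
function compareVersions(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return diff > 0 ? 1 : -1;
  }
  return 0;
}

// Returns the minimum Node version for the given Puppeteer version.
function minimumNodeFor(puppeteerVersion) {
  return REQUIREMENTS.find(
    (r) => compareVersions(puppeteerVersion, r.from) >= 0
  ).node;
}

// True when nodeVersion (by default the Node running this script) is new enough.
function nodeIsSupported(puppeteerVersion, nodeVersion = process.versions.node) {
  return compareVersions(nodeVersion, minimumNodeFor(puppeteerVersion)) >= 0;
}
```

For example, `nodeIsSupported("3.0.0")` reports whether the current Node is at least v10.18.1.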

Puppeteer drives a recent build of Chromium: when you run npm install puppeteer, it automatically downloads a recent Chromium version that is known to work with the installed API.

Install Puppeteer with npm, cnpm, or yarn:
npm install puppeteer --save
cnpm install puppeteer --save
yarn add puppeteer (installing with yarn may fail in some cases)

Basic usage (taking a screenshot)

Once Puppeteer is installed, we can write a simple example and start learning.

demo1
// 1. Import puppeteer
const puppeteer = require("puppeteer");

// 2. Launch puppeteer, which starts the browser
puppeteer
  .launch({
    ignoreHTTPSErrors: true,
    headless: false,
    slowMo: 250,
    defaultViewport: {
      width: 1920,
      height: 1080,
    },
    timeout: 0,
  })
  .then(async (browser) => {
    // 3. Create a new browser page
    let newPage = await browser.newPage();
    // 4. Navigate this page to the target URL
    await newPage.goto("https://www.chapaofan.com/");
    // 5. Take a screenshot of this page
    await newPage.screenshot({
      type: "jpeg",
      path: "../index.jpg",
      fullPage: true,
    });
    // 6. Close the browser
    await browser.close();
  });
demo result

The screenshot we wanted has been saved at the top level of the project.

Code walkthrough (based on the source code above)

1. puppeteer.launch(options)

 This method launches the Chrome browser. It returns a Promise; use its then method to obtain the browser instance, which you can then use to control the browser.

 Parameters in options (some common ones):
(1) ignoreHTTPSErrors <Boolean>: whether to ignore HTTPS errors during navigation. Defaults to false.
(2) headless <Boolean>: whether to run the browser in headless mode. Defaults to true. Put simply, headless mode means running without a browser window, so the operations are not shown in a UI.
(3) slowMo <Number>: slows down each Puppeteer operation by the specified number of milliseconds, so you can see what each operation does. This is very useful.
(4) defaultViewport <Object>:
        width: page width in pixels. Defaults to 800.
        height: page height in pixels. Defaults to 600.
(5) timeout <Number>: maximum time in milliseconds to wait for the Chrome instance to start. Defaults to 30000 (30 seconds). Pass 0 to disable the timeout.
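To show how these options fit together, here is a minimal sketch that switches between a visible, slowed-down debugging run and a plain headless run. The helper names buildLaunchOptions and launchBrowser are hypothetical, and the concrete values are only examples; the option names are the ones listed above.

```javascript
// Hypothetical helper: returns launch options for either a visible,
// slowed-down debugging session (debug = true) or a fast headless run.
function buildLaunchOptions(debug = false) {
  return {
    ignoreHTTPSErrors: true,
    headless: !debug, // show the browser window only while debugging
    slowMo: debug ? 250 : 0, // delay each operation by 250 ms while debugging
    defaultViewport: { width: 1920, height: 1080 },
    timeout: 30000, // give Chrome up to 30 seconds to start
  };
}

// Launches the browser with the options built above.
async function launchBrowser(debug = false) {
  // Required here rather than at the top of the file so that
  // buildLaunchOptions can be used without Puppeteer installed.
  const puppeteer = require("puppeteer");
  return puppeteer.launch(buildLaunchOptions(debug));
}
```

Calling `launchBrowser(true)` would open a visible window, which is handy while working out selectors; `launchBrowser()` keeps everything headless.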

2. browser.newPage()

 This method returns a Promise that resolves to a new Page object, that is, a new page created in the browser.

3. newPage.goto(url, options)

 This method sets the URL in the new page's address bar and navigates to it.

 Parameters:
(1) url <String>: the address to navigate to. It should include the protocol (http or https), for example: https://
(2) options:
        timeout <Number>: maximum navigation time in milliseconds. Defaults to 30 seconds. Pass 0 to wait indefinitely.
        {...restOptions}
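A navigation that exceeds the timeout rejects the Promise, so a crawler usually wants to catch that case. The sketch below is one way to do it; ensureProtocol and safeGoto are hypothetical helper names, and checking the error by its name "TimeoutError" is an assumption about how Puppeteer labels timeout errors.

```javascript
// Hypothetical helper: prepends https:// when the address has no http(s) protocol.
function ensureProtocol(url) {
  return /^https?:\/\//i.test(url) ? url : `https://${url}`;
}

// Navigates `page` (a Puppeteer Page) to `url`.
// Returns true on success and false when the navigation timed out.
async function safeGoto(page, url, timeout = 30000) {
  try {
    await page.goto(ensureProtocol(url), { timeout });
    return true;
  } catch (err) {
    // Assumption: Puppeteer's timeout errors carry the name "TimeoutError".
    if (err.name === "TimeoutError") return false;
    throw err; // anything else (for example a DNS failure) is passed on
  }
}
```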

4. newPage.screenshot(options)

 This method takes a screenshot of the open page. It returns a Promise that resolves to a buffer containing the screenshot.

 Parameters in options:
    (1) path <String>: the file path to save the screenshot to. The image type is inferred from the file extension. A relative path is resolved relative to the current working directory (a relative path is recommended here). If no path is given, the image is not saved to disk.
    (2) type <String>: the screenshot type, jpeg or png. Defaults to png.
    (3) quality <Number>: image quality, from 0 to 100. Not applicable to png.
    (4) fullPage <Boolean>: if true, captures the full page (including the parts that require scrolling). Defaults to false.
    (5) clip <Object>:
            x <Number>: x coordinate of the clip region, relative to the top-left corner (0, 0)
            y <Number>: y coordinate of the clip region, relative to the top-left corner (0, 0)
            width <Number>: width of the clip region
            height <Number>: height of the clip region
    (6) omitBackground <Boolean>: hides the default white background so the screenshot can have a transparent background (very useful for png). Defaults to false.
    (7) encoding <String>: image encoding, either base64 or binary. Defaults to binary. Converting the encoding is very helpful when uploading or downloading images.
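Here is a small sketch of how these options combine. It picks the type from the file extension, sets quality only for jpeg, and shows clip together with base64 encoding. The helper names screenshotOptionsFor and captureRegionAsBase64 and the default quality of 80 are my own choices, not part of Puppeteer.

```javascript
const path = require("path");

// Hypothetical helper: builds full-page screenshot options from a target file path.
function screenshotOptionsFor(filePath, quality = 80) {
  const ext = path.extname(filePath).toLowerCase();
  const type = ext === ".jpg" || ext === ".jpeg" ? "jpeg" : "png";
  const options = { path: filePath, type, fullPage: true };
  // quality is only set for jpeg, because it does not apply to png
  if (type === "jpeg") options.quality = quality;
  return options;
}

// Captures one rectangular region of `page` (a Puppeteer Page) and returns it
// as a base64 string instead of writing a file.
// `clip` has the shape { x, y, width, height }.
async function captureRegionAsBase64(page, clip) {
  return page.screenshot({ type: "png", clip, encoding: "base64" });
}
```

For example, `await newPage.screenshot(screenshotOptionsFor("../index.jpg"))` would behave like step 5 of the demo, with an explicit quality added.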

5. browser.close()

 Closes Chromium and all of its pages (if any were opened). The Browser object itself is then considered disposed and cannot be used again; to continue, launch a new browser.
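If an operation throws before browser.close() is reached, the Chromium process can be left running. One common way around this, sketched here with async/await and try/finally, is to wrap the work in a helper. The name withBrowser and its signature are hypothetical.

```javascript
// Hypothetical helper: runs `task(browser)` and always closes the browser.
// `launch` is any function that returns a Promise of a browser, for example
//   () => require("puppeteer").launch({ headless: true })
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // Runs whether the task resolved or threw,
    // so no Chromium process is left behind.
    await browser.close();
  }
}
```

With it, the demo's steps 3 to 5 could go inside the task function, and step 6 would no longer need to be written by hand.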

Copyright notice
This article was written by [Treat you as before]. Please include a link to the original when reposting. Thank you.
