
Crawling Vipshop Mooncake Details

I'm taking part in the Mid-Autumn Festival creative submission contest; see the contest page for details. This is my first crawler, and discussion is welcome.

Preface

This article uses Python to fetch the price and detail data of Vipshop mooncakes. If you haven't used Python before, please install it first. Beginners usually start crawling with the requests library, but all the major e-commerce platforms have anti-scraping mechanisms, so requests alone cannot fetch the page data. This article therefore uses Selenium.

Third-party libraries

This section introduces how to install the third-party libraries and how they are used in the program.

selenium

Selenium runs tests directly in the browser, just like a real user would, and supports most mainstream browsers, including IE (7, 8, 9, 10, 11), Firefox, Safari, Chrome, Opera, etc. We can use it to simulate a user clicking through a website and to get past some complex authentication scenarios. Because Selenium drives a real browser, JavaScript is rendered and executed directly, which sidesteps most request-parameter construction and anti-scraping measures.

1. Install: once Python is installed, pip is available. Install with pip by running: pip install selenium

2. Download the chromedriver driver: download chromedriver.exe (download address: click here). Note: the version must match your browser version, and after downloading you need to put it in your Python directory.

BeautifulSoup

BeautifulSoup is mainly used to structure the web page data, making it easy for us to extract the fields we need, such as the product name, price, article number, and so on.

Install with pip by running: pip install beautifulsoup4
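To get a feel for how BeautifulSoup is used later in this article, here is a minimal, self-contained sketch. The HTML fragment is made up, but the `div[data-product-id]` selector and the `c-goods-item__name` class are the same ones used against the real listing page below:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML fragment mimicking one product block on the listing page
html = """
<div data-product-id="123">
  <a href="//example.com/detail-123.html"></a>
  <div class="c-goods-item__name">Mooncake Gift Box</div>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
item = bs.select('div[data-product-id]')[0]   # CSS attribute selector
print(item.attrs['data-product-id'])          # 123
print(item.find("div", class_="c-goods-item__name").get_text())  # Mooncake Gift Box
```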

Hands-on: crawling the mooncake data

Accessing the page

Open Vipshop's official website, search for mooncakes, and copy the URL, which looks like: category.vip.com/suggest.php…

After installing Python, open cmd, type python, then enter the following code to check whether the browser launches and opens the page correctly:

from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://category.vip.com/suggest.php?keyword=%E6%9C%88%E9%A5%BC&ff=235|12|1|1"  # Vipshop search URL
driver = webdriver.Chrome()  # requires chromedriver in your Python directory or on PATH
driver.get(url)  # returns None; it navigates the browser to the page

Parsing the page and getting the product data

Press F12 to open the developer tools and hover over a complete product block to inspect the page structure. Then use BeautifulSoup's select method to grab all the mooncake products on the current page, and print the result to check it.

html = driver.page_source
bs = BeautifulSoup(html, "lxml")  # the lxml parser requires: pip install lxml
course_data = bs.select('div[data-product-id]')  # one div per product
print(course_data)

Analyze product details

To obtain the article number, we need to open each product's detail page. So we loop over each product with a for loop, extract the data with BeautifulSoup, and then collect the useful fields into a dict.

The branching logic for detecting the size can be ignored; I wrote it down so I can review how it's written in Python later. The author is a front-end developer and rarely gets to write Python.

list_data = []  # collected product dicts
for each_item in course_data:
    detail_link = each_item.find("a")
    goods_id = each_item.attrs['data-product-id']
    goods_name = each_item.find("div", class_="c-goods-item__name")  # locate the mooncake name
    driver.get('https:' + detail_link.attrs["href"])  # open the product detail page
    bs1 = BeautifulSoup(driver.page_source, "lxml")
    size_arr = []  # mooncake specification data
    sizes = bs1.find_all("li", class_="size-list-item J-sizeID")
    if sizes:
        for size_item in sizes:
            size = size_item.find('span', class_="size-list-item-name")
            size_arr.append(size.get_text())
    elif bs1.find_all("li", class_="selector_opt"):
        sizes = bs1.find_all("li", class_="selector_opt")
        for size_item in sizes:
            size_arr.append(size_item.find('a').get_text())
    elif bs1.find_all("li", class_="size-list-item J-sizeID sli-selected size-list-item-small"):
        sizes = bs1.find_all("li", class_="size-list-item J-sizeID sli-selected size-list-item-small")
        for size_item in sizes:
            size = size_item.find('span', class_="size-list-item-name")
            size_arr.append(size.get_text())
    elif bs1.find_all("li", class_="size-list-item J-sizeID sli-selected"):
        sizes = bs1.find_all("li", class_="size-list-item J-sizeID sli-selected")
        for size_item in sizes:
            size = size_item.find('span', class_="size-list-item-name")
            size_arr.append(size.get_text())
    else:
        size_arr = ['Out of stock']
    info_code = bs1.find("p", class_="other-infoCoding").get_text()
    size_str = ','.join(size_arr)  # join the list into a string (don't shadow the builtin str)
    goods_dict = {"goodsName": goods_name.get_text(), "detailUrl": detail_link.attrs["href"],
                  "id": goods_id, "infoCode": info_code, "size": size_str}
    list_data.append(goods_dict)
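The four nearly identical branches above can be condensed by trying a list of candidate class strings in order. Here is a minimal sketch: the class names come from the original code, but the sample HTML fragment is made up for demonstration, since the real detail page varies:

```python
from bs4 import BeautifulSoup

# Candidate <li> class strings, tried in the same order as the original branching
CANDIDATE_CLASSES = [
    "size-list-item J-sizeID",
    "selector_opt",
    "size-list-item J-sizeID sli-selected size-list-item-small",
    "size-list-item J-sizeID sli-selected",
]

def extract_sizes(bs):
    """Return the list of specification names, or a placeholder if none match."""
    for cls in CANDIDATE_CLASSES:
        items = bs.find_all("li", class_=cls)
        if items:
            result = []
            for li in items:
                span = li.find("span", class_="size-list-item-name")
                # "selector_opt" items keep the text in an <a> instead of a <span>
                result.append((span or li.find("a")).get_text())
            return result
    return ["Out of stock"]

# Made-up detail-page fragment for demonstration
sample = BeautifulSoup(
    '<ul><li class="selector_opt"><a>500g</a></li>'
    '<li class="selector_opt"><a>750g</a></li></ul>',
    "html.parser",
)
print(extract_sizes(sample))  # ['500g', '750g']
```

Note that BeautifulSoup matches a class_ string containing spaces against the full class attribute value, which is why the original exact-string branches work.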

Output data

Normally we would store the crawled data in a database, but since this is just a demo, we'll write the mooncake data to a text file for now.


print(list_data)
with open('mooncake_classes.txt', "a+", encoding="utf-8") as f:  # write the product info to a text file
    for text in list_data:
        print(text)
        f.write('Product name: ' + text['goodsName'] + '  goods id: ' + text['id'] +
                '  Article No: ' + text['infoCode'] + '  Size: ' + text['size'] + '\n')


Summary

If you have followed this article through once, you can try fetching the price data, which works the same way as fetching the name data. This demo only covers the data on the first page; you can click the page buttons at the bottom to get all pages, in the same way the detail pages were opened. You're welcome to learn and discuss together.
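As a starting point for the price exercise, here is a sketch that works exactly like the name extraction. The HTML fragment is made up, and the `c-goods-item__price` class name is an assumption; inspect the real listing page with F12 to find the actual class:

```python
from bs4 import BeautifulSoup

# Made-up listing fragment; check the real class names in the browser (F12)
html = """
<div data-product-id="456">
  <div class="c-goods-item__name">Lotus Paste Mooncake</div>
  <div class="c-goods-item__price">¥88.00</div>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
for item in bs.select('div[data-product-id]'):
    name = item.find("div", class_="c-goods-item__name").get_text()
    # hypothetical price class, same pattern as the name lookup
    price = item.find("div", class_="c-goods-item__price").get_text()
    print(name, price)  # Lotus Paste Mooncake ¥88.00
```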

Copyright notice
This article was written by [Unsweetened cocoa]. Please include the original link when reposting. Thanks.
https://cdmana.com/2021/09/20210909124112714r.html
