Scrapy is a fast high-level web crawling and web scraping framework.
ref:
https://doc.scrapy.org/en/latest/
Install
Usage
Spider
The program that crawls the data; use parse() to define which data you want to extract.
Item
Defines the fields of the scraped data; think of it as the counterpart of a Django model.
Pipeline
Post-processes the scraped data, for example stripping HTML or checking for duplicates.
Under the hood, Scrapy is built on lxml and Twisted.
ref:
https://github.com/vinta/BlackWidow
Tips
Debugging
These two lines invoke the interactive shell.
Relative XPath
Access Django Model in Scrapy
Once the Django project is on sys.path and DJANGO_SETTINGS_MODULE is set, you can import Django's modules in Scrapy, like this:
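The original snippet is not shown in these notes. A minimal sketch of one common way to do it follows; the helper name, the path, the settings module, and the app and model names are all hypothetical.

```python
import os
import sys


def point_at_django(project_dir, settings_module):
    """Make a Django project importable from Scrapy code (hypothetical helper)."""
    if project_dir not in sys.path:
        sys.path.insert(0, project_dir)
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)


# Example use near the top of the Scrapy project's settings.py
# (placeholder path, settings module, app and model names):
#
#   point_at_django("/path/to/django_project", "mysite.settings")
#   import django
#   django.setup()  # needed on Django 1.7+ before models are imported
#   from blog.models import Post
```

The order matters: the path and the environment variable have to be in place before `django.setup()` runs and before any model is imported.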
State
http://doc.scrapy.org/en/latest/topics/jobs.html#keeping-persistent-state-between-batches
Close Spider
Logging in from a Spider
Others
XPath node selection syntax
http://mi.hosp.ncku.edu.tw/km/index.php/dotnet/48-netdisk/57-xml-xpath
Avoiding getting banned
http://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned
Python scraping framework: the architecture of Scrapy
http://biaodianfu.com/scrapy-architecture.html
Download images
https://scrapy.readthedocs.org/en/latest/topics/images.html