{"id":115,"date":"2015-01-11T06:42:16","date_gmt":"2015-01-10T22:42:16","guid":{"rendered":"http:\/\/vinta.ws\/code\/?p=115"},"modified":"2026-03-17T01:17:44","modified_gmt":"2026-03-16T17:17:44","slug":"scrapy-web-scraping-framework-for-python","status":"publish","type":"post","link":"https:\/\/vinta.ws\/code\/scrapy-web-scraping-framework-for-python.html","title":{"rendered":"Scrapy: The Web Scraping Framework for Python"},"content":{"rendered":"<p>Scrapy is a fast, high-level web crawling and web scraping framework.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/doc.scrapy.org\/en\/latest\/\">https:\/\/doc.scrapy.org\/en\/latest\/<\/a><\/p>\n<h2>Install<\/h2>\n<pre class=\"line-numbers\"><code class=\"language-bash\"># on Ubuntu\n$ sudo apt-get install libxml2-dev libxslt1-dev libffi-dev\n\n# on Mac\n$ brew install libffi\n\n$ pip install scrapy service_identity<\/code><\/pre>\n<h2>Usage<\/h2>\n<pre class=\"line-numbers\"><code class=\"language-bash\"># interactive shell\n# http:\/\/doc.scrapy.org\/en\/latest\/intro\/tutorial.html#trying-selectors-in-the-shell\n$ scrapy shell \"http:\/\/www.wendyslookbook.com\/2013\/09\/the-frame-a-digital-glossy\/\"\n# or\n$ scrapy shell --spider=seemodel\n&gt;&gt;&gt; view(response)\n&gt;&gt;&gt; fetch(req_or_url)\n\n# create a project\n$ scrapy startproject blackwidow\n\n# create a spider\n$ scrapy genspider fancy www.fancy.com\n\n# run spider\n$ scrapy crawl fancy\n$ scrapy crawl pinterest -L ERROR<\/code><\/pre>\n<p>Spider<br \/>\nThe program that crawls pages; its parse() method defines which data to extract.<\/p>\n<p>Item<br \/>\nDefines the fields of the scraped data; think of it as a Django model.<\/p>\n<p>Pipeline<br \/>\nPost-processes the scraped data, for example stripping HTML or checking for duplicates.<\/p>\n<p>Under the hood, Scrapy is built on lxml and Twisted.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/github.com\/vinta\/BlackWidow\">https:\/\/github.com\/vinta\/BlackWidow<\/a><\/p>\n<h2>Tips<\/h2>\n<h3>Debugging<\/h3>\n<pre class=\"line-numbers\"><code class=\"language-py\">from scrapy.shell import inspect_response\ninspect_response(response, self)<\/code><\/pre>\n<p>Put these two lines inside a spider callback to open the interactive shell on the current response.<\/p>\n<h3>Relative XPath<\/h3>\n<pre class=\"line-numbers\"><code class=\"language-py\">divs = response.xpath('\/\/div')\nfor p in divs.xpath('.\/\/p'):  # extracts all &lt;p&gt; inside\n    print(p.extract())<\/code><\/pre>\n<h3>Access Django Model in Scrapy<\/h3>\n<pre class=\"line-numbers\"><code class=\"language-py\">import os\nimport sys\n\nimport django\n\ndef setup_django_env(django_settings_dir):\n    # make the project root and the settings directory importable\n    django_project_path = os.path.abspath(os.path.join(django_settings_dir, '..'))\n    sys.path.append(django_project_path)\n    sys.path.append(django_settings_dir)\n\n    # setup_environ() was removed in Django 1.6; on Django 1.7+,\n    # point DJANGO_SETTINGS_MODULE at settings.py and call django.setup()\n    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'settings')\n    django.setup()\n\n# the directory that contains Django's settings.py\nDJANGO_SETTINGS_DIR = '\/all_projects\/heelsfetishism\/heelsfetishism'\nsetup_django_env(DJANGO_SETTINGS_DIR)<\/code><\/pre>\n<p>Then you can import Django modules in Scrapy, like this:<\/p>\n<pre class=\"line-numbers\"><code class=\"language-py\">from django.contrib.auth.models import User\n\nfrom app.models import SomeModel<\/code><\/pre>\n<h3>State<\/h3>\n<p><a href=\"http:\/\/doc.scrapy.org\/en\/latest\/topics\/jobs.html#keeping-persistent-state-between-batches\">http:\/\/doc.scrapy.org\/en\/latest\/topics\/jobs.html#keeping-persistent-state-between-batches<\/a><\/p>\n<pre class=\"line-numbers\"><code class=\"language-py\">def parse_item(self, response):\n    # parse item here\n    self.state['items_count'] = self.state.get('items_count', 0) + 
1<\/code><\/pre>\n<h3>Close Spider<\/h3>\n<pre class=\"line-numbers\"><code class=\"language-py\">from scrapy.exceptions import CloseSpider\n\n# can only be raised inside a spider callback, not in a pipeline\nraise CloseSpider('Stop')<\/code><\/pre>\n<h3>Login in Spider<\/h3>\n<pre class=\"line-numbers\"><code class=\"language-py\"># Scrapy 1.0+ import paths; scrapy.contrib and SgmlLinkExtractor are gone from current releases\nfrom scrapy.http import Request, FormRequest\nfrom scrapy.linkextractors import LinkExtractor\nfrom scrapy.spiders import CrawlSpider, Rule\n\nfrom blackwidow.items import HeelsItem\n\nclass SeeModelSpider(CrawlSpider):\n    name = 'seemodel'\n    allowed_domains = ['www.seemodel.com', ]\n    login_page = 'http:\/\/www.seemodel.com\/member.php?mod=logging&amp;action=login'\n    start_urls = [\n        'http:\/\/www.seemodel.com\/forum.php?mod=forumdisplay&amp;fid=41&amp;filter=heat&amp;orderby=heats',\n        'http:\/\/www.seemodel.com\/forum.php?mod=forumdisplay&amp;fid=42&amp;filter=heat&amp;orderby=heats',\n    ]\n\n    rules = (\n        Rule(\n            LinkExtractor(allow=r'forum\\.php\\?mod=viewthread&amp;tid=\\d+'),\n            callback='parse_item',\n            follow=False,\n        ),\n    )\n\n    def start_requests(self):\n        self.username = self.settings['SEEMODEL_USERNAME']\n        self.password = self.settings['SEEMODEL_PASSWORD']\n\n        # fetch the login page first instead of start_urls\n        yield Request(\n            url=self.login_page,\n            callback=self.login,\n            dont_filter=True,\n        )\n\n    def login(self, response):\n        # fill in and submit the form named 'login' found on the page\n        return FormRequest.from_response(\n            response,\n            formname='login',\n            formdata={\n                'username': self.username,\n                'password': self.password,\n                'cookietime': 'on',\n            },\n            callback=self.check_login_response,\n        )\n\n    def check_login_response(self, response):\n        # response.text is the decoded body (response.body is bytes on Python 3)\n        if self.username not in response.text:\n            self.log(\"Login failed\")\n            return\n\n        self.log(\"Successfully logged in\")\n\n        # no callback given, so these responses go through the CrawlSpider rules\n        return [Request(url=url, dont_filter=True) for url in self.start_urls]\n\n    def parse_item(self, response):\n        item = HeelsItem()\n        item['comment'] = response.xpath('\/\/*[@id=\"thread_subject\"]\/text()').extract()\n        item['image_urls'] = response.xpath('\/\/ignore_js_op\/\/img\/@zoomfile').extract()\n        item['source_url'] = response.url\n\n        return item<\/code><\/pre>\n<p>ref:<br \/>\n<a href=\"https:\/\/doc.scrapy.org\/en\/latest\/topics\/request-response.html#topics-request-response-ref-request-userlogin\">https:\/\/doc.scrapy.org\/en\/latest\/topics\/request-response.html#topics-request-response-ref-request-userlogin<\/a><\/p>\n<h3>Others<\/h3>\n<p>XPath node selection syntax<br \/>\n<a href=\"http:\/\/mi.hosp.ncku.edu.tw\/km\/index.php\/dotnet\/48-netdisk\/57-xml-xpath\">http:\/\/mi.hosp.ncku.edu.tw\/km\/index.php\/dotnet\/48-netdisk\/57-xml-xpath<\/a><\/p>\n<p>Avoiding getting banned<br \/>\n<a href=\"http:\/\/doc.scrapy.org\/en\/latest\/topics\/practices.html#avoiding-getting-banned\">http:\/\/doc.scrapy.org\/en\/latest\/topics\/practices.html#avoiding-getting-banned<\/a><\/p>\n<p>Python scraping framework: the architecture of Scrapy<br \/>\n<a href=\"http:\/\/biaodianfu.com\/scrapy-architecture.html\">http:\/\/biaodianfu.com\/scrapy-architecture.html<\/a><\/p>\n<p>Download images<br \/>\n<a href=\"https:\/\/scrapy.readthedocs.org\/en\/latest\/topics\/images.html\">https:\/\/scrapy.readthedocs.org\/en\/latest\/topics\/images.html<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Scrapy is a fast, high-level web crawling and web scraping framework.<\/p>\n","protected":false},"author":1,"featured_media":764,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,116],"tags":[2,54],"class_list":["post-115","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-about-python","category-about-web-development","tag-python","tag-web-crawler"],"_links":{"self":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts\/115","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/comments?post=115"}],"version-history":[{"count":0,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts\/115\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/media\/764"}],"wp:attachment":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/media?parent=115"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/categories?post=115"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/tags?post=115"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}