# Scrapy: The Web Scraping Framework for Python

Scrapy is a fast, high-level web crawling and web scraping framework.

ref:
https://doc.scrapy.org/en/latest/

## Install

```bash
# on Ubuntu
$ sudo apt-get install libxml2-dev libxslt1-dev libffi-dev

# on Mac
$ brew install libffi

$ pip install scrapy service_identity
```

## Usage

```bash
# interactive shell
# http://doc.scrapy.org/en/latest/intro/tutorial.html#trying-selectors-in-the-shell
$ scrapy shell "http://www.wendyslookbook.com/2013/09/the-frame-a-digital-glossy/"
# or
$ scrapy shell --spider=seemodel
>>> view(response)
>>> fetch(req_or_url)

# create a project
$ scrapy startproject blackwindow

# create a spider
$ scrapy genspider fancy www.fancy.com

# run spider
$ scrapy crawl fancy
$ scrapy crawl pinterest -L ERROR
```

Spider
The program that actually crawls; its `parse()` method defines which data you want to extract.

Item
Defines the fields of the scraped data; you can think of it as a Django model.

Pipeline
Post-processes the scraped items, e.g. cleaning HTML or checking for duplicates.

Under the hood, Scrapy is built on lxml and Twisted.

ref:
https://github.com/vinta/BlackWidow
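
The Item/Pipeline relationship can be sketched without Scrapy installed. Below, `DropItem` is a stand-in for the real `scrapy.exceptions.DropItem` (stubbed so the snippet runs standalone), and the field name `source_url` is just an example:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""
    pass


class DuplicatesPipeline(object):
    """Drops items whose source_url has already been seen."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Scrapy calls process_item() for every item the spider yields;
        # returning the item passes it on to the next pipeline.
        url = item['source_url']
        if url in self.seen_urls:
            raise DropItem('Duplicate item: %s' % url)
        self.seen_urls.add(url)
        return item
```

In a real project you would enable the pipeline through the `ITEM_PIPELINES` setting.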

## Tips

### Debugging

```py
from scrapy.shell import inspect_response
inspect_response(response, self)
```

These two lines invoke the interactive shell from inside a spider callback, with `response` already loaded; exit the shell to resume the crawl.

### Relative XPath

```py
divs = response.xpath('//div')
for p in divs.xpath('.//p'):  # extracts all <p> inside each <div>
    print(p.extract())
```
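
The relative-vs-absolute distinction can be demonstrated with the standard library's ElementTree (not Scrapy selectors, but the `.//` semantics are the same): a leading `.` scopes the search to the current node, while searching from the root covers the whole document.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<html><div><p>a</p></div><p>b</p></html>')
div = root.find('div')

# relative: only <p> elements under this <div>
inner = [p.text for p in div.findall('.//p')]

# from the root: every <p> in the document
all_ps = [p.text for p in root.findall('.//p')]
```

Here `inner` contains only the `<p>` nested in the `<div>`, while `all_ps` contains both.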

### Access Django Model in Scrapy

```py
import imp
import os
import sys

from django.core.management import setup_environ


def setup_django_env(django_settings_dir):
    django_project_path = os.path.abspath(os.path.join(django_settings_dir, '..'))
    sys.path.append(django_project_path)
    sys.path.append(django_settings_dir)

    f, filename, desc = imp.find_module('settings', [django_settings_dir, ])
    project = imp.load_module('settings', f, filename, desc)

    setup_environ(project)

# the directory where Django's settings.py is placed
DJANGO_SETTINGS_DIR = '/all_projects/heelsfetishism/heelsfetishism'
setup_django_env(DJANGO_SETTINGS_DIR)
```

Then you can import Django's modules in Scrapy, like this:

```py
from django.contrib.auth.models import User

from app.models import SomeModel
```

### State

http://doc.scrapy.org/en/latest/topics/jobs.html#keeping-persistent-state-between-batches

```py
def parse_item(self, response):
    # parse item here
    self.state['items_count'] = self.state.get('items_count', 0) + 1
```
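
`self.state` behaves like a plain dict that Scrapy persists between runs when a `JOBDIR` is set, so the counter pattern above is ordinary dict arithmetic (sketched here with a stand-in dict):

```python
state = {}  # stand-in for self.state


def record_item(state):
    # same pattern as parse_item above: default to 0, then increment
    state['items_count'] = state.get('items_count', 0) + 1


for _ in range(3):
    record_item(state)
```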

### Close Spider

```py
from scrapy.exceptions import CloseSpider

# can only be raised from inside a spider, not from a pipeline
raise CloseSpider('Stop')
```
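
A common use is to stop the crawl after collecting enough items. A dependency-free sketch (the real `CloseSpider` lives in `scrapy.exceptions`; it is stubbed here so the snippet runs standalone, and `max_items` is an illustrative name):

```python
class CloseSpider(Exception):
    """Stand-in for scrapy.exceptions.CloseSpider."""

    def __init__(self, reason='cancelled'):
        self.reason = reason


class CountingSpider(object):
    max_items = 2  # illustrative limit

    def __init__(self):
        self.items_seen = 0

    def parse_item(self, response):
        # once the limit is exceeded, tell Scrapy to shut the spider down
        self.items_seen += 1
        if self.items_seen > self.max_items:
            raise CloseSpider('max_items reached')
        return {'source_url': response}
```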

### Login in Spider

```py
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request, FormRequest

from blackwidow.items import HeelsItem


class SeeModelSpider(CrawlSpider):
    name = 'seemodel'
    allowed_domains = ['www.seemodel.com', ]
    login_page = 'http://www.seemodel.com/member.php?mod=logging&action=login'
    start_urls = [
        'http://www.seemodel.com/forum.php?mod=forumdisplay&fid=41&filter=heat&orderby=heats',
        'http://www.seemodel.com/forum.php?mod=forumdisplay&fid=42&filter=heat&orderby=heats',
    ]

    rules = (
        Rule(
            SgmlLinkExtractor(allow=r'forum\.php\?mod=viewthread&tid=\d+'),
            callback='parse_item',
            follow=False,
        ),
    )

    def start_requests(self):
        self.username = self.settings['SEEMODEL_USERNAME']
        self.password = self.settings['SEEMODEL_PASSWORD']

        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True,
        )

    def login(self, response):
        return FormRequest.from_response(
            response,
            formname='login',
            formdata={
                'username': self.username,
                'password': self.password,
                'cookietime': 'on',
            },
            callback=self.check_login_response,
        )

    def check_login_response(self, response):
        if self.username not in response.body:
            self.log("Login failed")
            return

        self.log("Successfully logged in")

        return [Request(url=url, dont_filter=True) for url in self.start_urls]

    def parse_item(self, response):
        item = HeelsItem()
        item['comment'] = response.xpath('//*[@id="thread_subject"]/text()').extract()
        item['image_urls'] = response.xpath('//ignore_js_op//img/@zoomfile').extract()
        item['source_url'] = response.url

        return item
```

ref:
https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-userlogin

### Others

XPath node selection syntax
http://mi.hosp.ncku.edu.tw/km/index.php/dotnet/48-netdisk/57-xml-xpath

Avoiding getting banned
http://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned
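
Most of the anti-ban practices in that page map to `settings.py` knobs; an illustrative (not prescriptive) fragment, with values you would tune per site:

```python
# settings.py -- illustrative values, tune per target site
DOWNLOAD_DELAY = 2                  # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay (0.5x - 1.5x)
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # be gentle per host
COOKIES_ENABLED = False             # some sites track crawlers via cookies
USER_AGENT = 'my-crawler (+http://example.com/bot)'  # identify your bot
```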

Python scraping framework: Scrapy's architecture
http://biaodianfu.com/scrapy-architecture.html

Download images
https://scrapy.readthedocs.org/en/latest/topics/images.html
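
To actually download the files an `image_urls` field points at (as in the spider above), enable the built-in images pipeline; a minimal sketch using the `scrapy.contrib` path from this note's era (newer Scrapy moved it to `scrapy.pipelines.images.ImagesPipeline`), with an illustrative storage path:

```python
# settings.py
ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/path/to/images'  # illustrative path; must be writable
```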
