Database Routers in Django

Move some models / tables to a separate database

Database Router

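You can't declare on a Model class that it may only use a particular database.
Use a database router instead:
it checks things like model._meta.app_label == 'xxx' and picks a database accordingly.
"Database" here means one of the aliases defined in settings.DATABASES.

Django does not support model relations across databases, though:
a foreign key or m2m field cannot point to a model in another database.
In practice you can simply store plain integer columns such as user_id or song_id instead.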

in settings.py

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'default',
        'USER': 'whatever',
        'PASSWORD': 'whatever',
        'HOST': '',
        'PORT': '',
    },
    'warehouse': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'warehouse',
        'USER': 'whatever',
        'PASSWORD': 'whatever',
        'HOST': '',
        'PORT': '',
    },
}

DATABASE_ROUTERS = [
    'app.db_routers.SVDatabaseRouter',
]

in app/db_routers.py

from oauth2_provider.models import AccessToken as OauthAccessToken
from oauth2_provider.oauth2_backends import get_oauthlib_core
from rest_framework2.authentication import BaseAuthentication

class SVOauthAuthentication(BaseAuthentication):
    www_authenticate_realm = 'api'

    def authenticate(self, request):
        oauthlib_core = get_oauthlib_core()
        valid, result = oauthlib_core.verify_request(request, scopes=[])

        if valid:
            return (result.user, result.access_token)
        else:
            access_token = request.GET.get('access_token', None)
            if access_token:
                try:
                    access_token_obj = OauthAccessToken.objects.get(token=access_token)
                except OauthAccessToken.DoesNotExist:
                    pass
                else:
                    return (access_token_obj.user, access_token_obj)

        return None

    def authenticate_header(self, request):
        return 'Bearer realm="{0}"'.format(self.www_authenticate_realm)
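
A router that follows the description earlier in this section (route by model._meta.app_label) might look roughly like the sketch below. The class name WarehouseRouter and the 'warehouse' app label are assumptions for illustration, not the actual SVDatabaseRouter. The allow_migrate signature is the one used by Django 1.8 and later; South-era projects used allow_syncdb(db, model) instead.

```python
class WarehouseRouter(object):
    """Hypothetical router: send the `warehouse` app to the `warehouse` database."""

    app_label = 'warehouse'  # assumed app label
    db_alias = 'warehouse'   # must match a key in settings.DATABASES

    def db_for_read(self, model, **hints):
        if model._meta.app_label == self.app_label:
            return self.db_alias
        return None  # no opinion: Django falls back to the next router or 'default'

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)

    def allow_relation(self, obj1, obj2, **hints):
        # Relations are only allowed when both objects live in the same database.
        in_warehouse = [o._meta.app_label == self.app_label for o in (obj1, obj2)]
        if any(in_warehouse):
            return all(in_warehouse)
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label == self.app_label:
            return db == self.db_alias
        if db == self.db_alias:
            return False  # keep other apps out of the warehouse database
        return None
```

Returning None means "no opinion", which lets any other router listed in DATABASE_ROUTERS decide.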

ref:
https://docs.djangoproject.com/en/dev/topics/db/multi-db/

Models

Put all the models that belong in the other database under a single app;
it is easier to manage.

in warehouse/models.py

class PlayRecord(SVWarehouseDBMixin, models.Model):
    song_id = models.IntegerField()
    user_id = models.IntegerField(null=True, blank=True)
    is_full = models.BooleanField(default=False)
    ip_address = models.IPAddressField()
    location = models.CharField(max_length=2)
    created_at = models.DateTimeField()

A class that inherits from the SVWarehouseDBMixin mixin is stored in the warehouse database!
So be careful when migrating, and remember to add this to the migration file:

from south.db import dbs
warehouse_db = dbs['warehouse']

That way South creates the table in the warehouse database.
Otherwise you have to run CREATE TABLE by hand.

Migration

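If you want to move the models of an existing app to another database
while leaving the models where they are (still under the original app),
you may need to reset the app's entire migration history
and build a new migration from scratch,
because the schema gets out of sync.
So it is still better to create a new app for the models that move to the other database;
both the database router and the migrations stay simpler that way.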

in app/migrations/0001_initial.py

$ pip install south==1.0.2

# syncdb only applies to the default database unless you name the target database explicitly
$ ./manage.py syncdb --noinput
$ ./manage.py syncdb --noinput --database=warehouse

# migrate, however, can apply to other databases,
# because the target database is decided by the `db` object used in the migration file
$ ./manage.py migrate

ref:
http://stackoverflow.com/questions/7029228/is-using-multiple-databases-and-south-together-possible

Unit Tests

Django only flushes the default database at the start of each test run. If your setup contains multiple databases, and you have a test that requires every database to be clean, you can use the multi_db attribute on the test suite to request a full flush.

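With --keepdb, if a test run is interrupted halfway by an error, data that was never flushed may be left in the database and make the next run fail. Without --keepdb the databases are rebuilt on every run, so the problem does not occur.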

from django.test import TestCase

class YourBTestCase(TestCase):
    multi_db = True

    def setUp(self):
        do_shit()

ref:
https://docs.djangoproject.com/en/dev/topics/testing/tools/
https://docs.djangoproject.com/en/dev/topics/testing/advanced/#topics-testing-advanced-multidb

elasticsearch-dsl-py: The Official Elasticsearch ORM in Python

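Query DSL is Elasticsearch's domain-specific language (DSL) for queries; in practice it is a pile of JSON. elasticsearch-dsl is the officially released Python package for working with Query DSL, and you can think of it as an ORM for Elasticsearch.

Hopefully SQL queries will be supported directly someday, because Query DSL is really painful to write.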

ref:
https://github.com/elastic/elasticsearch-dsl-py

Installation

$ pip install "elasticsearch-dsl>=5.0.0,<6.0.0"

ref:
https://elasticsearch-dsl.readthedocs.org/en/latest/index.html

Schema

in app/mappings.py

from elasticsearch_dsl import DocType, String, Boolean
from elasticsearch_dsl.connections import connections
connections.create_connection(hosts=['127.0.0.1', ])

class AlbumDoc(DocType):
    upc = String(index='not_analyzed')
    title = String(analyzer='ik', fields={'raw': String(index='not_analyzed')})
    artist = String(analyzer='ik')
    is_ready = Boolean()

    class Meta:
        index = 'dps'
        doc_type = 'album'

    @classmethod
    def sync(cls, album):
        album_doc = AlbumDoc(meta={'id': album.id})
        album_doc.upc = album.get_upcs(output_str=False)
        album_doc.title = album.name
        album_doc.artist = album.artist.name
        album_doc.is_ready = album.is_ready
        album_doc.save()

    def save(self, *args, **kwargs):
        return super(AlbumDoc, self).save(*args, **kwargs)

    def get_model_obj(self):
        from svapps.dps.models import Album
        return Album.objects.get(id=self.meta.id)

# to create mappings
AlbumDoc.init()

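You must run YourDocType.init() once so that Elasticsearch creates the mapping that corresponds to your DocType. Otherwise Elasticsearch builds a mapping from the data types of the first documents you index, so settings such as analyzer fall back to the default (standard). You can check the result with the _mapping API.

Fields that need full-text search should be analyzed (string fields are analyzed by default). Fields that don't, i.e. fields that require exact matching such as username, email, or zip code, can be set to not_analyzed. Note that you can no longer use term against an analyzed field, unless you add a separate raw sub-field for it.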
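
To see why term and analyzed fields don't mix, here is a toy model in plain Python. It is only an illustration: toy_analyze is a hypothetical stand-in that lowercases and splits on non-word characters, not Elasticsearch's actual analyzer.

```python
import re


def toy_analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split on non-word characters."""
    return [token for token in re.split(r'\W+', text.lower()) if token]


def term_matches(indexed_tokens, term):
    """A term query compares the search term, unanalyzed, against the indexed tokens."""
    return term in indexed_tokens


value = 'Hello World'
analyzed_tokens = toy_analyze(value)  # what an analyzed field would index
not_analyzed_tokens = [value]         # a not_analyzed field indexes the value as-is

print(term_matches(analyzed_tokens, 'Hello World'))      # exact value vs. analyzed tokens
print(term_matches(not_analyzed_tokens, 'Hello World'))  # exact value vs. raw value
print(term_matches(analyzed_tokens, 'hello'))            # a single lowercased token
```

The exact value only matches the not_analyzed copy, which is why the title field above keeps a raw sub-field next to the analyzed one.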

ref:
https://elasticsearch-dsl.readthedocs.org/en/latest/persistence.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html#CO59-2

Store Data

album_doc = AlbumDoc(meta={'id': 42})
album_doc.upc = ['887375000619', '887375502069']
album_doc.title = 'abc'
album_doc.artist = 'xyz'
album_doc.is_ready = True
album_doc.save()

# query as usual; it doesn't matter that the field holds a list
search = AlbumDoc.search().filter('term', upc='887375000619')
response = search.execute()

Elasticsearch has no dedicated array type and any field can hold multiple values, so you can store a list even in a field declared as String.

Search Data

  • must: every clause must match
  • should: matching any one clause is enough

from elasticsearch_dsl import Q, Search

search = TrackDoc.search() \
    .filter('term', is_ready=True) \
    .query('match', title=u'沒有的啊')

search = TrackDoc.search() \
    .filter('term', is_ready=True) \
    .query(
        Q('match', title='沒有的啊') & \
        Q('match', artist='那我懂你意思了') & \
        Q('match', album='沒有的, 啊!?')
    )

q = Q(
    'bool',
    must=[
        Q('match', title={'query': track_name, 'fuzziness': 'AUTO'}),
    ],
    should=[
        Q('match', album={'query': album_name, 'minimum_should_match': '60%'}),
        Q('match', artist={'query': artist_name, 'minimum_should_match': '80%'}),
    ],
    minimum_should_match=1
)
search = TrackDoc.search().filter('term', is_ready=True).query(q)

# keyword is the search string
keyword_query = Q(
    'bool',
    should=[
        Q('term', isrc=keyword),
        Q('term', upc=keyword),
        Q('match', **{'title.raw': {'query': keyword}}),
        Q('multi_match', query=keyword, fields=['title', 'artist', 'album']),
    ],
)
search = Search(index='dps', doc_type=['track', 'album']).query(keyword_query)
search = search[:20]

# print the raw Query DSL
import uniout
from pprint import pprint
pprint(search.to_dict())

response = search.execute()

print(response.hits.total)
print(response[0].title)
print(response[0].artist)
print(response[0].album)
print(response[0].is_ready)
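
Since Query DSL is just JSON, it can help to see the shape a bool query takes as a plain dict. The sketch below rebuilds the must / should example by hand, leaving out the is_ready filter. The helpers match and bool_query and the sample search strings are hypothetical and not part of elasticsearch-dsl.

```python
import json


def match(field, query, **options):
    """Build a match clause: {"match": {field: {"query": ..., <options>}}}."""
    body = {'query': query}
    body.update(options)
    return {'match': {field: body}}


def bool_query(must=None, should=None, minimum_should_match=None):
    """Build a bool query from lists of clauses."""
    body = {}
    if must:
        body['must'] = must
    if should:
        body['should'] = should
    if minimum_should_match is not None:
        body['minimum_should_match'] = minimum_should_match
    return {'bool': body}


query = bool_query(
    must=[match('title', 'some track', fuzziness='AUTO')],
    should=[
        match('album', 'some album', minimum_should_match='60%'),
        match('artist', 'some artist', minimum_should_match='80%'),
    ],
    minimum_should_match=1,
)
print(json.dumps({'query': query}, indent=2))
```

Comparing this output with search.to_dict() is a quick way to check what the Q objects actually generate.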

ref:
https://elasticsearch-dsl.readthedocs.org/en/latest/search_dsl.html

Read and Write Files in Django and Python

File and ImageFile accept a Python file or StringIO object,
while ContentFile accepts a string.

ref:
https://docs.djangoproject.com/en/dev/ref/files/file/#the-file-object

Django Form

image_file = request.FILES['file']

# Option 1
profile.mugshot.save(image_file.name, image_file)

# Option 2
profile.mugshot = image_file

profile.save()

open('/path/to/file.png')

from django.core.files import File

with open('/home/vinta/image.png', 'rb') as f:
    profile.mugshot = File(f)
    profile.save()

Django ContentFile

import os
import uuid

from django.core.files.base import ContentFile

import requests

url = 'http://vinta.ws/static/photo.jpg'
r = requests.get(url)
file_url, file_ext = os.path.splitext(r.url)
file_name = '%s%s' % (str(uuid.uuid4()).replace('-', ''), file_ext)

profile.mugshot.save(file_name, ContentFile(r.content), save=False)

# If profile.mugshot is an ImageField,
# you can check whether it is a valid image like this
try:
    profile.mugshot.width
except TypeError:
    raise RuntimeError('Invalid image file')

profile.save()
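
One caveat with calling os.path.splitext on a full URL: a query string ends up in the extension. A small helper that parses the URL first could look like this. The name random_file_name and the '.jpg' fallback are assumptions for illustration.

```python
import os
import uuid

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def random_file_name(url, default_ext='.jpg'):
    """Return a random hex name that keeps the extension found in the URL path."""
    path = urlparse(url).path  # drops the query string and fragment
    _, ext = os.path.splitext(path)
    return '{0}{1}'.format(uuid.uuid4().hex, ext.lower() or default_ext)


print(random_file_name('http://vinta.ws/static/photo.JPG?size=large'))
```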

Data URI, Base64

from binascii import a2b_base64

from django.core.files.base import ContentFile

data_uri = 'data:image/jpeg;base64,/9j/4AAQSkZJRg....'
head, data = data_uri.split(',')
binary_data = a2b_base64(data)

# Option 1
profile.mugshot.save('whatever.jpg', ContentFile(binary_data), save=False)
profile.save()

# This does not work because the file name is missing
profile.mugshot = ContentFile(binary_data)
profile.save()

# Option 2
with open('image.png', 'wb') as f:
    f.write(binary_data)

# Option 3
from StringIO import StringIO  # Python 2; on Python 3 use io.BytesIO
from PIL import Image
img = Image.open(StringIO(binary_data))
print(img.size)
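
The split(',') step above throws away the header, but the header carries the MIME type, which is handy for picking a file extension. A slightly more defensive parser, as a sketch (parse_data_uri is a hypothetical helper and only handles base64-encoded data URIs):

```python
import base64


def parse_data_uri(data_uri):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    head, _, payload = data_uri.partition(',')
    if not head.startswith('data:') or not head.endswith(';base64'):
        raise ValueError('not a base64-encoded data URI')
    mime_type = head[len('data:'):-len(';base64')]
    return mime_type, base64.b64decode(payload)


encoded = base64.b64encode(b'hello').decode('ascii')
mime_type, raw = parse_data_uri('data:image/png;base64,' + encoded)
print(mime_type, len(raw))
```

The returned bytes can be passed to ContentFile in the same way as binary_data above.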

ref:
https://stackoverflow.com/questions/19395649/python-pil-create-and-save-image-from-data-uri

StringIO, PIL image

Think of StringIO as the file object you would get from open('/home/vinta/some_file.txt', 'rb').

from StringIO import StringIO

from PIL import Image
import requests

r = requests.get('http://vinta.ws/static/photo.jpg')
img = Image.open(StringIO(r.content))
print(img.size)
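
To make the "file object in memory" idea concrete, here is a small sketch using io.BytesIO, which is the binary counterpart of StringIO on Python 3 (it also exists on Python 2.6+). The byte strings are made-up sample data.

```python
import io

# An in-memory buffer behaves like a file opened in binary mode.
buf = io.BytesIO(b'first line\nsecond line\n')
first = buf.readline()   # reads up to and including the newline
buf.seek(0)              # rewind, exactly as with a real file
everything = buf.read()

# It can also be written to, then read back as bytes.
out = io.BytesIO()
out.write(b'abc')
out.write(b'def')
written = out.getvalue()

print(first, everything, written)
```

This is why libraries such as PIL, which expect a file object, accept such a buffer without touching the disk.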

StringIO, PIL image, Django

from StringIO import StringIO

from django.core.files.base import ContentFile

from PIL import Image

img = Image.open(instance.file)
# or
raw_img_io = StringIO(binary_data)
img = Image.open(raw_img_io)
img = img.resize((524, 328), Image.ANTIALIAS)
img_io = StringIO()
img.save(img_io, 'PNG', quality=100)

profile.image.save('whatever.png', ContentFile(img_io.getvalue()), save=False)
profile.save()

ref:
https://stackoverflow.com/questions/3723220/how-do-you-convert-a-pil-image-to-a-django-file

Download file from URL, tempfile

import os
import tempfile
import requests
import xlrd

try:
    file_path = report.file.path
    temp = None
except NotImplementedError:
    url = report.file.url
    r = requests.get(url, stream=True)
    file_url, file_ext = os.path.splitext(r.url)

    # with delete=True the file is deleted automatically after temp.close()
    temp = tempfile.NamedTemporaryFile(prefix='report_file_', suffix=file_ext, dir='/tmp', delete=False)
    file_path = temp.name

    with open(file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()

wb = xlrd.open_workbook(file_path)

# Because we used tempfile.NamedTemporaryFile(delete=False),
# we have to delete the file ourselves
try:
    os.remove(temp.name)
except AttributeError:
    pass
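
The same write-then-clean-up pattern can be reduced to the standard library alone, which makes the lifetime of the temp file easier to see. In this sketch write_temp_file is a hypothetical helper and the byte chunks stand in for r.iter_content().

```python
import os
import tempfile


def write_temp_file(chunks, suffix=''):
    """Write an iterable of byte chunks to a named temp file and return its path."""
    with tempfile.NamedTemporaryFile(prefix='report_file_', suffix=suffix, delete=False) as temp:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                temp.write(chunk)
    return temp.name


path = write_temp_file([b'abc', b'', b'def'], suffix='.xls')
try:
    with open(path, 'rb') as f:
        content = f.read()
finally:
    # delete=False means nobody else will remove the file for us
    os.remove(path)
```

Putting the os.remove() call in a finally block means the file is removed even when reading it raises.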

ref:
https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests

HTTP Cache Headers in Django

Conditional View Processing

from django.views.decorators.http import condition

def latest_entry(request, blog_id):
    return Entry.objects.filter(blog=blog_id).latest("published").published

@condition(last_modified_func=latest_entry)
def front_page(request, blog_id):
    ...
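
Conceptually, the decorator compares the value returned by last_modified_func with the If-Modified-Since request header and answers 304 Not Modified when the client's copy is still current. A simplified, framework-free sketch of that comparison (Python 3; conditional_status is a hypothetical helper, not Django's implementation):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return 304 if the client's copy is current, else 200.

    last_modified is assumed to be a timezone-aware UTC datetime.
    """
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: treat as absent
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution.
            if last_modified.replace(microsecond=0) <= since:
                return 304
    return 200


published = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
header = format_datetime(published, usegmt=True)
print(conditional_status(published, header))
```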

ref:
https://docs.djangoproject.com/en/dev/topics/conditional-view-processing/

Cache Middlewares

If Django's cache middleware is enabled (that is, UpdateCacheMiddleware and FetchFromCacheMiddleware),
every response is tagged with Cache-Control: max-age=600.
The 600 comes from the CACHE_MIDDLEWARE_SECONDS setting.

Whenever max-age > 0 is set,
the Cache-Control, Expires, and Last-Modified headers are added to the response automatically.
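
To make the relationship between those headers concrete, the sketch below computes them from a max-age value using only the standard library. It mirrors the behaviour described above, not Django's source; cache_headers is a hypothetical helper.

```python
import time
from email.utils import formatdate


def cache_headers(max_age, now=None):
    """Build the caching headers for a response generated at `now` (epoch seconds)."""
    now = time.time() if now is None else now
    return {
        'Cache-Control': 'max-age={0}'.format(max_age),
        'Expires': formatdate(now + max_age, usegmt=True),  # now + max-age
        'Last-Modified': formatdate(now, usegmt=True),
    }


for name, value in sorted(cache_headers(600).items()):
    print('{0}: {1}'.format(name, value))
```

Expires is simply the generation time plus max-age, expressed as an HTTP date.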

ref:
https://docs.djangoproject.com/en/dev/ref/settings/#std:setting-CACHE_MIDDLEWARE_SECONDS

Never Cache Decorator

from django.views.decorators.cache import never_cache

@never_cache
def myview(request):
    pass

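If you simply don't want the response to be cached,
use this approach.
@never_cache only sets Cache-Control: max-age=0 (max-age=0 means it expires immediately); newer Django releases also add no-cache, no-store, and must-revalidate.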

Cache Control Decorator

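from django.utils.decorators import method_decorator
from django.views.decorators.cache import cache_control

class SongDetail(SVAPIDetailView):
    serializer_class = api_serializers.SongDetailSerializer

    # cache_control decorates plain view functions, so wrap it for use on a method
    @method_decorator(cache_control(no_store=True, no_cache=True, max_age=0))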
    def get(self, request, song_id):
        do_something()

        return Response(data)

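For some reason, when only no-store and no-cache are set,
AFNetworking on iOS still caches the response,
even though no-store should in theory take the highest priority.
The current workaround is to send Cache-Control: max-age=0 as well.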

ref:
https://docs.djangoproject.com/en/dev/topics/cache/#controlling-cache-using-other-headers

Scrapy: The Web Scraping Framework for Python

Scrapy is a fast high-level web crawling and web scraping framework.

ref:
https://doc.scrapy.org/en/latest/

Install

# on Ubuntu
$ sudo apt-get install libxml2-dev libxslt1-dev libffi-dev

# on Mac
$ brew install libffi

$ pip install scrapy service_identity

Usage

# interactive shell
# http://doc.scrapy.org/en/latest/intro/tutorial.html#trying-selectors-in-the-shell
$ scrapy shell "http://www.wendyslookbook.com/2013/09/the-frame-a-digital-glossy/"
# or
$ scrapy shell --spider=seemodel
>>> view(response)
>>> fetch(req_or_url)

# create a project
$ scrapy startproject blackwindow

# create a spider
$ scrapy genspider fancy www.fancy.com

# run spider
$ scrapy crawl fancy
$ scrapy crawl pinterest -L ERROR

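Spider
The program that does the crawling; parse() defines which data to extract

Item
Defines the fields of the scraped data; think of it as a Django model

Pipeline
Post-processes the scraped data, e.g. stripping HTML or checking for duplicates

Scrapy is built on lxml and Twisted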
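
As an illustration of the Pipeline role, a duplicate filter could be sketched like this. The class name DuplicatesPipeline is hypothetical and keying on source_url is an assumption. The fallback DropItem class only exists so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy
    class DropItem(Exception):
        pass


class DuplicatesPipeline(object):
    """Drop any item whose source_url has already been seen in this run."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item['source_url']
        if url in self.seen_urls:
            raise DropItem('Duplicate item: %s' % url)
        self.seen_urls.add(url)
        return item
```

In a real project the class is enabled through the ITEM_PIPELINES setting.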

ref:
https://github.com/vinta/BlackWidow

Tips

Debugging

from scrapy.shell import inspect_response
inspect_response(response, self)

These 2 lines will invoke the interactive shell.

Relative XPath

divs = response.xpath('//div')
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print(p.extract())

Access Django Model in Scrapy

def setup_django_env(django_settings_dir):
    import imp
    import os
    import sys

    # setup_environ only exists in Django < 1.6
    from django.core.management import setup_environ

    django_project_path = os.path.abspath(os.path.join(django_settings_dir, '..'))
    sys.path.append(django_project_path)
    sys.path.append(django_settings_dir)

    f, filename, desc = imp.find_module('settings', [django_settings_dir, ])
    project = imp.load_module('settings', f, filename, desc)

    setup_environ(project)

# where Django settings.py placed
DJANGO_SETTINGS_DIR = '/all_projects/heelsfetishism/heelsfetishism'
setup_django_env(DJANGO_SETTINGS_DIR)

then you can import Django's modules in scrapy, like this:

from django.contrib.auth.models import User

from app.models import SomeModel

State

http://doc.scrapy.org/en/latest/topics/jobs.html#keeping-persistent-state-between-batches

def parse_item(self, response):
    # parse item here
    self.state['items_count'] = self.state.get('items_count', 0) + 1

Close Spider

from scrapy.exceptions import CloseSpider

# can only be raised from within a spider, not from a pipeline
raise CloseSpider('Stop')

Login in Spider

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request, FormRequest

from blackwidow.items import HeelsItem

class SeeModelSpider(CrawlSpider):
    name = 'seemodel'
    allowed_domains = ['www.seemodel.com', ]
    login_page = 'http://www.seemodel.com/member.php?mod=logging&action=login'
    start_urls = [
        'http://www.seemodel.com/forum.php?mod=forumdisplay&fid=41&filter=heat&orderby=heats',
        'http://www.seemodel.com/forum.php?mod=forumdisplay&fid=42&filter=heat&orderby=heats',
    ]

    rules = (
        Rule(
            SgmlLinkExtractor(allow=r'forum\.php\?mod=viewthread&tid=\d+'),
            callback='parse_item',
            follow=False,
        ),
    )

    def start_requests(self):
        self.username = self.settings['SEEMODEL_USERNAME']
        self.password = self.settings['SEEMODEL_PASSWORD']

        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True,
        )

    def login(self, response):
        return FormRequest.from_response(
            response,
            formname='login',
            formdata={
                'username': self.username,
                'password': self.password,
                'cookietime': 'on',
            },
            callback=self.check_login_response,
        )

    def check_login_response(self, response):
        if self.username not in response.body:
            self.log("Login failed")
            return

        self.log("Successfully logged in")

        return [Request(url=url, dont_filter=True) for url in self.start_urls]

    def parse_item(self, response):
        item = HeelsItem()
        item['comment'] = response.xpath('//*[@id="thread_subject"]/text()').extract()
        item['image_urls'] = response.xpath('//ignore_js_op//img/@zoomfile').extract()
        item['source_url'] = response.url

        return item

ref:
https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-userlogin

Others

XPath node selection syntax
http://mi.hosp.ncku.edu.tw/km/index.php/dotnet/48-netdisk/57-xml-xpath

Avoiding getting banned
http://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned

Python 抓取框架:Scrapy 的架构
http://biaodianfu.com/scrapy-architecture.html

Download images
https://scrapy.readthedocs.org/en/latest/topics/images.html