Archive for January 2015

Cache headers in Django

ref:
https://blog.othree.net/log/2012/12/22/cache-control-and-etag/ << recommended reading
http://zhangxiaofei576342.blog.163.com/blog/static/2086199020101026113241602/
http://blog.toright.com/posts/3414/%E5%88%9D%E6%8E%A2-http-1-1-cache-%E6%A9%9F%E5%88%B6.html

Cache-Control header

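Priority (highest to lowest):

no-store >> Nothing is stored at all, so there is no cache whatsoever
no-cache >> The response may still be cached, but the cache asks the server every time whether there is newer content
max-age=60 >> No new request is made within 60 seconds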
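
As a rough illustration of the ordering above, here is a minimal Python sketch of how a cache might interpret these three directives. The names parse_cache_control and cache_decision are made up for this note, and real caches follow many more rules than this.

```python
def parse_cache_control(header):
    """Parse a Cache-Control header value into a dict of directives."""
    directives = {}
    for part in header.split(','):
        part = part.strip().lower()
        if not part:
            continue
        name, _, value = part.partition('=')
        directives[name] = value or True
    return directives


def cache_decision(header, age_seconds):
    """Return 'bypass', 'revalidate' or 'reuse' for a response of the given age."""
    directives = parse_cache_control(header)
    if 'no-store' in directives:
        return 'bypass'        # never stored, so there is nothing to reuse
    if 'no-cache' in directives:
        return 'revalidate'    # may be stored, but must ask the server first
    max_age = int(directives.get('max-age', 0))
    if age_seconds < max_age:
        return 'reuse'         # still fresh, no request needed
    return 'revalidate'


print(cache_decision('no-store, max-age=60', 10))
print(cache_decision('no-cache, max-age=60', 10))
print(cache_decision('max-age=60', 10))
print(cache_decision('max-age=60', 90))
```

no-store wins over everything else, no-cache wins over max-age, and max-age only matters when neither of the other two is present.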

ref:
http://tools.ietf.org/html/rfc2616#section-14.9
http://www.php-oa.com/2008/12/03/http-head.html

Cache Middleware

If Django's cache middleware is enabled (e.g. UpdateCacheMiddleware and FetchFromCacheMiddleware),
every response is tagged with Cache-Control: max-age=600.
The 600 comes from CACHE_MIDDLEWARE_SECONDS.

Whenever max-age > 0 is set in Cache-Control,
two more fields, Expires and Last-Modified, are automatically added to the response headers.
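
A minimal settings.py sketch of the middleware setup described above, assuming a recent Django (older versions name the list MIDDLEWARE_CLASSES). CommonMiddleware is only there to show the ordering; the 600 matches the example value.

```python
# settings.py (fragment)
MIDDLEWARE = [
    # Must come first: it runs last on the response and stores it in the cache.
    'django.middleware.cache.UpdateCacheMiddleware',
    'django.middleware.common.CommonMiddleware',
    # Must come last: it runs last on the request and serves cached pages.
    'django.middleware.cache.FetchFromCacheMiddleware',
]

# Cache lifetime in seconds; also used as max-age in the Cache-Control header.
CACHE_MIDDLEWARE_SECONDS = 600
```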

ref:
https://docs.djangoproject.com/en/dev/ref/settings/#std:setting-CACHE_MIDDLEWARE_SECONDS

@never_cache decorator

from django.views.decorators.cache import never_cache

@never_cache
def myview(request):
    pass

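If you simply don't want a view to be cached,
use this approach.

In the Django version used here, this decorator only sets Cache-Control: max-age=0
(newer Django releases add further directives such as no-cache and no-store).
max-age=0 means the response expires immediately.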

@cache_control decorator

from django.views.decorators.cache import cache_control

class SongDetail(SVAPIDetailView):
    serializer_class = api_serializers.SongDetailSerializer

    @cache_control(no_store=True, no_cache=True, max_age=0)
    def get(self, request, song_id):
        do_something()

        return Response(data)

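For some reason, when only no-store and no-cache are set,
AFNetworking on iOS still caches the response.
In theory no-store should have the highest priority.
The current workaround is to also send Cache-Control: max-age=0.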
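
To see which header a call like the one above ends up sending, here is a small stand-alone sketch of how keyword arguments map to Cache-Control directives: underscores become hyphens and True values become bare directives. build_cache_control is a hypothetical helper written for this note, not Django's actual implementation.

```python
def build_cache_control(**kwargs):
    """Turn keyword arguments into a Cache-Control header value."""
    parts = []
    for name, value in sorted(kwargs.items()):
        directive = name.replace('_', '-')
        if value is True:
            parts.append(directive)                     # no_store=True -> no-store
        else:
            parts.append('%s=%s' % (directive, value))  # max_age=0 -> max-age=0
    return ', '.join(parts)


print(build_cache_control(no_store=True, no_cache=True, max_age=0))
```

The directives are sorted only to make the output deterministic; their order in the header does not matter.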

ref:
https://docs.djangoproject.com/en/dev/topics/cache/#controlling-cache-using-other-headers

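Last-Modified and ETag

Last-Modified works together with the request's If-Modified-Since header
ETag works together with the request's If-None-Match header

Last-Modified says when this URI was last modified,
expressed in GMT.

ETag stands for Entity Tag.
It is a hash of the resource at this URI (what you hash is up to you, e.g. the file contents).
Because hashing can be an expensive operation,
YSlow recommends not using ETag;
Last-Modified, Expires or Cache-Control: max-age=xxx is enough.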
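
A minimal sketch of how a server could answer these conditional requests, using only the standard library. The functions make_etag and conditional_response are hypothetical, and hashing the body with SHA-256 is just one possible way to build an ETag.

```python
import hashlib
from email.utils import formatdate, parsedate_to_datetime


def make_etag(body):
    """Hash the response body; any stable fingerprint would do."""
    return '"%s"' % hashlib.sha256(body).hexdigest()


def conditional_response(request_headers, body, last_modified_ts):
    """Return (status, headers) for a GET, honouring the validators."""
    etag = make_etag(body)
    headers = {
        'ETag': etag,
        'Last-Modified': formatdate(last_modified_ts, usegmt=True),
    }
    # If-None-Match is compared with the ETag and takes precedence.
    if_none_match = request_headers.get('If-None-Match')
    if if_none_match is not None:
        return (304 if if_none_match == etag else 200), headers
    # If-Modified-Since is compared with Last-Modified.
    since = request_headers.get('If-Modified-Since')
    if since is not None:
        since_ts = parsedate_to_datetime(since).timestamp()
        if int(last_modified_ts) <= since_ts:
            return 304, headers
    return 200, headers


body = b'hello'
status, headers = conditional_response({}, body, 1420070400)
print(status, headers['Last-Modified'])

status, _ = conditional_response({'If-None-Match': headers['ETag']}, body, 1420070400)
print(status)
```

The first call has no validators, so it gets a full 200 response; the second repeats the ETag it received and gets a 304 with no body to transfer.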

Vary header

Cache proxies such as Varnish and Squid usually build a hash from the URL and the request headers named in the Vary header,
and use that hash to decide whether there is a cache hit.
A Vary header may look like this:

Vary: Accept-Encoding
Vary: Accept-Encoding,User-Agent
Vary: X-Some-Custom-Header,Host
Vary: *

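Suppose two requests ask for the same file (same URL)
but carry different User-Agent headers (say, two different browsers).
With Vary: User-Agent in effect, the second request will not hit the cache,
because the resulting hash is different.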
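
The idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how Varnish or Squid actually compute their keys; cache_key and the example URL are made up.

```python
import hashlib


def cache_key(url, request_headers, vary):
    """Hash the URL plus the request headers named in the Vary header."""
    if vary.strip() == '*':
        return None  # Vary: * never matches, so the response is not reusable
    names = sorted(h.strip().lower() for h in vary.split(',') if h.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    parts = [url] + ['%s=%s' % (n, lowered.get(n, '')) for n in names]
    return hashlib.sha256('\n'.join(parts).encode('utf-8')).hexdigest()


url = 'http://example.com/app.js'
firefox = {'Accept-Encoding': 'gzip', 'User-Agent': 'Firefox'}
chrome = {'Accept-Encoding': 'gzip', 'User-Agent': 'Chrome'}

# Different User-Agent + Vary: User-Agent -> different keys -> cache miss
print(cache_key(url, firefox, 'Accept-Encoding,User-Agent') ==
      cache_key(url, chrome, 'Accept-Encoding,User-Agent'))

# Vary: Accept-Encoding only -> same key -> cache hit
print(cache_key(url, firefox, 'Accept-Encoding') ==
      cache_key(url, chrome, 'Accept-Encoding'))
```

This is also why Vary: User-Agent tends to hurt the hit rate: almost every browser version ends up with its own cache entry.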

ref:
http://shunter.blog.51cto.com/2183398/1076521

Build an OAuth 2.0 provider with django-oauth-toolkit

What is OAuth?
http://coding.anyun.tw/2012/03/13/oauth-2/
http://blog.yorkxin.org/posts/2013/09/30/oauth2-1-introduction

Install

$ pip install djangorestframework django-oauth-toolkit

Configuration

in settings.py

REST_FRAMEWORK = {
    'DEFAULT_AUTHENTICATION_CLASSES': [
        'api.authentications.SVOauthAuthentication',
        'api.authentications.SVTokenAuthentication',
        'api.authentications.SVAppAuthentication',
        'rest_framework.authentication.SessionAuthentication',
    ],
    'DEFAULT_RENDERER_CLASSES': [
        'rest_framework.renderers.JSONRenderer',
    ],
    'PAGINATE_BY': 10,
    'PAGINATE_BY_PARAM': 'page_size',
    'MAX_PAGINATE_BY': 100,
    'EXCEPTION_HANDLER': 'api.exceptions.sv_exception_handler',
}

OAUTH2_PROVIDER_APPLICATION_MODEL = 'api.Application'

OAUTH2_PROVIDER = {
    'AUTHORIZATION_CODE_EXPIRE_SECONDS': 60 * 60,
    'ACCESS_TOKEN_EXPIRE_SECONDS': 60 * 60 * 24 * 7,
    'SCOPES': {
        'read': 'Read scope',
        'write': 'Write scope',
    }
}

in models.py

You will probably want your own Application model

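from django.db import models
from django.utils.translation import ugettext_lazy as _
from oauth2_provider.models import AbstractApplication

# utils, svmedia_storage and api_settings are project-specific helpers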

class Application(AbstractApplication):
    logo = models.ImageField(_(u'LOGO'), upload_to=utils.unique_path('app_logo/'), storage=svmedia_storage, null=True, blank=True)
    identity = models.PositiveSmallIntegerField(choices=api_settings.IDENTITY_CHOICES, db_index=True)

    class Meta:
        verbose_name = _('Application')
        verbose_name_plural = _('Applications')
        unique_together = (('client_id', 'client_secret'), )

    def __unicode__(self):
        return self.name

ref:
https://django-oauth-toolkit.readthedocs.org/
http://tomchristie.github.io/rest-framework-2-docs/api-guide/authentication#oauthauthentication

in urls.py

url(r'^oauth/', include('oauth2_provider.urls', namespace='oauth2_provider')),

Usage

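Registering an application (also called a client)

client type:

confidential: The client can keep its own credentials secret (e.g. it runs on a server where access to the credentials can be restricted), or it can secure the authentication process by other means.
public: The client cannot keep its credentials secret (a native app, or an app running in the browser) and has no other means of protecting client authentication.

grant type:

authorization code: The most common flow. Obtain an authorization code first, then exchange it for an access token and a refresh token.
implicit: The authorization server issues the access token directly to the client, instead of issuing a grant first and exchanging it for an access token as in the authorization code grant flow.
resource owner password-based: Exchange the user's username and password directly for an access token.
client credentials: Use only the client id and client secret to obtain an access token.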

ref:
http://blog.yorkxin.org/posts/2013/09/30/oauth2-2-cilent-registration/

Obtaining an access token

Step 1:

GET:
http://local.streetvoice.com:8001/oauth/authorize/?client_id=6d01bea01ab7e46eace3&response_type=code&scope=read%20write&redirect_uri=http://local.packer.streetvoice.com:8003/process_oauth/

client_id=6d01bea01ab7e46eace3&
response_type=code&
scope=read%20write&
state=optional&
redirect_uri=http://local.packer.streetvoice.com:8003/process_oauth/

Multiple scopes are separated by a space (URL-encoded as %20)

Response:
http://local.packer.streetvoice.com:8003/process_oauth/?code=lJvivguZOGFghSxjIzpRHdnwH2opwP
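
A small sketch of step 1 with the standard library: build the authorization URL from the parameters listed above, then pull the code out of the redirect. The values are the example ones from this post; nothing is sent over the network here.

```python
from urllib.parse import urlencode, quote, urlsplit, parse_qs

AUTHORIZE_URL = 'http://local.streetvoice.com:8001/oauth/authorize/'

params = {
    'client_id': '6d01bea01ab7e46eace3',
    'response_type': 'code',
    'scope': 'read write',
    'state': 'optional',
    'redirect_uri': 'http://local.packer.streetvoice.com:8003/process_oauth/',
}
# quote_via=quote encodes the space in the scope as %20 instead of +
authorize_url = AUTHORIZE_URL + '?' + urlencode(params, quote_via=quote)
print(authorize_url)

# After the user approves, the provider redirects back with ?code=...
callback = ('http://local.packer.streetvoice.com:8003/process_oauth/'
            '?code=lJvivguZOGFghSxjIzpRHdnwH2opwP')
code = parse_qs(urlsplit(callback).query)['code'][0]
print(code)
```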

Step 2:

POST:
http://local.streetvoice.com:8001/oauth/token/?client_id=6d01bea01ab7e46eace3&grant_type=authorization_code&code=lJvivguZOGFghSxjIzpRHdnwH2opwP&redirect_uri=http://local.packer.streetvoice.com:8003/process_oauth/

client_id=6d01bea01ab7e46eace3&
grant_type=authorization_code&
code=lJvivguZOGFghSxjIzpRHdnwH2opwP&
redirect_uri=http://local.packer.streetvoice.com:8003/process_oauth/

Response:

{
    "access_token": "KwB2XADuQVzcGYCFC7TzfP67NBn9Ud",
    "token_type": "Bearer",
    "expires_in": 604800,
    "refresh_token": "2kiVHEAKWNn7U57YFIw0TqXgVN1TQW",
    "scope": "read write"
}
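
A sketch of step 2 in Python. It only builds the POST request, so it runs without the provider. Note one deliberate difference: the example above puts the parameters in the URL, while this sketch sends them as a form-encoded body, which is the more usual shape for a token request. build_token_request is a hypothetical helper, and a confidential client would also have to authenticate with its client secret.

```python
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_URL = 'http://local.streetvoice.com:8001/oauth/token/'


def build_token_request(client_id, code, redirect_uri):
    """Build (but do not send) the POST that swaps a code for tokens."""
    body = urlencode({
        'client_id': client_id,
        'grant_type': 'authorization_code',
        'code': code,
        'redirect_uri': redirect_uri,
    }).encode('ascii')
    # Passing data makes this a form-encoded POST.
    return Request(TOKEN_URL, data=body, method='POST')


req = build_token_request(
    '6d01bea01ab7e46eace3',
    'lJvivguZOGFghSxjIzpRHdnwH2opwP',
    'http://local.packer.streetvoice.com:8003/process_oauth/',
)
print(req.get_method(), req.full_url)
print(req.data.decode('ascii'))

# Sending it would look like this (requires the provider to be running):
# import json
# from urllib.request import urlopen
# token = json.loads(urlopen(req).read().decode('utf-8'))
# access_token, refresh_token = token['access_token'], token['refresh_token']
```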

ref:
http://blog.yorkxin.org/posts/2013/09/30/oauth2-4-1-auth-code-grant-flow/

Refreshing the access token

POST:
http://local.streetvoice.com:8001/oauth/token/?client_id=6d01bea01ab7e46eace3&grant_type=refresh_token&refresh_token=2kiVHEAKWNn7U57YFIw0TqXgVN1TQW&scope=read%20write

client_id=6d01bea01ab7e46eace3&
grant_type=refresh_token&
refresh_token=2kiVHEAKWNn7U57YFIw0TqXgVN1TQW&
scope=read%20write

Response:

{
    "access_token": "NGJ29T95qonMRKO91at6Oroke1d0J6",
    "token_type": "Bearer",
    "expires_in": 604800,
    "refresh_token": "WujMQX8GU4dd1obXDpG5quDxiIbbV7",
    "scope": "read write"
}
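
On the client side you usually want to refresh shortly before expires_in runs out rather than after a request has already failed. A minimal sketch of that bookkeeping, assuming a hypothetical TokenStore class and a 60-second safety margin:

```python
import time


class TokenStore(object):
    """Hypothetical holder that tracks when the access token expires."""

    def __init__(self, token_response, now=None):
        now = time.time() if now is None else now
        self.access_token = token_response['access_token']
        self.refresh_token = token_response['refresh_token']
        self.expires_at = now + token_response['expires_in']

    def needs_refresh(self, now=None, leeway=60):
        """True once the token is within `leeway` seconds of expiring."""
        now = time.time() if now is None else now
        return now >= self.expires_at - leeway

    def refresh_params(self, client_id):
        """Form fields for the refresh_token grant."""
        return {
            'client_id': client_id,
            'grant_type': 'refresh_token',
            'refresh_token': self.refresh_token,
        }


store = TokenStore({
    'access_token': 'NGJ29T95qonMRKO91at6Oroke1d0J6',
    'refresh_token': 'WujMQX8GU4dd1obXDpG5quDxiIbbV7',
    'expires_in': 604800,
}, now=0)
print(store.needs_refresh(now=3600))
print(store.needs_refresh(now=604800))
print(store.refresh_params('6d01bea01ab7e46eace3'))
```

The now argument exists only so the example is deterministic; in real use you would leave it out.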

Using the access token

$ curl -H "Authorization: Bearer NGJ29T95qonMRKO91at6Oroke1d0J6" -X GET http://local.streetvoice.com:8001/api/v1/auth/me/

import requests

url = 'http://local.streetvoice.com:8001/api/v1/auth/me/'
headers = {
    'Authorization': 'Bearer NGJ29T95qonMRKO91at6Oroke1d0J6',
}
r = requests.get(url, headers=headers)
print(r.content)

Scrapy: web scraping framework for Python

http://doc.scrapy.org/
http://scrapy-chs.readthedocs.org/

Install

# on Ubuntu
$ sudo apt-get install libxml2-dev libxslt1-dev libffi-dev

# on Mac
$ brew install libffi

$ pip install scrapy service_identity

Usage

# interactive shell
# http://doc.scrapy.org/en/latest/intro/tutorial.html#trying-selectors-in-the-shell
$ scrapy shell "http://www.wendyslookbook.com/2013/09/the-frame-a-digital-glossy/"
# or
$ scrapy shell --spider=seemodel
>>> view(response)
>>> fetch(req_or_url)

# create a project
$ scrapy startproject blackwindow

# create a spider
$ scrapy genspider fancy www.fancy.com

# run spider
$ scrapy crawl fancy
$ scrapy crawl pinterest -L ERROR

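Spider
The program that crawls the pages; its parse() method defines which data to extract

Item
Defines the fields of the scraped data; think of it as a Django model

Pipeline
Post-processes the scraped data, e.g. stripping HTML or checking for duplicates

Under the hood, Scrapy is built on lxml and Twisted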
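
To make the pipeline part concrete, here is a minimal sketch of a pipeline that does both jobs mentioned above: strip HTML and drop duplicates. A Scrapy pipeline is a plain class with a process_item(item, spider) method; the DropItem class below is a stand-in for scrapy.exceptions.DropItem so the example runs without Scrapy, and the field names url and title are assumptions.

```python
import re


class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so this runs without Scrapy."""


class CleanAndDedupePipeline(object):
    """Strip HTML tags from the title and drop items seen before."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        if item['url'] in self.seen_urls:
            raise DropItem('Duplicate item: %s' % item['url'])
        self.seen_urls.add(item['url'])
        item['title'] = re.sub(r'<[^>]+>', '', item['title']).strip()
        return item


pipeline = CleanAndDedupePipeline()
item = {'url': 'http://example.com/1', 'title': '<b>Red heels</b> '}
print(pipeline.process_item(item, spider=None))
```

In a real project the class would live in pipelines.py and be enabled through the ITEM_PIPELINES setting.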

ref:
https://github.com/vinta/BlackWidow

Tips

Debugging

from scrapy.shell import inspect_response
inspect_response(response, self)

These two lines will invoke the interactive shell.

Relative XPath

Start the expression with a dot (.)

divs = response.xpath('//div')
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print(p.extract())
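
The same idea can be tried without Scrapy, using the limited XPath support in the standard library's ElementTree as a stand-in for Scrapy selectors. The markup below is made up. In Scrapy, writing '//p' on a nested selector would match every <p> in the document (including the one outside the divs), which is why the leading dot matters.

```python
import xml.etree.ElementTree as ET

html = ('<html><body>'
        '<div><p>a</p><span><p>b</p></span></div>'
        '<div><p>c</p></div>'
        '<p>outside</p>'
        '</body></html>')
root = ET.fromstring(html)

results = []
for div in root.iter('div'):
    # './/p' searches only below the current <div>, at any depth
    results.append([p.text for p in div.findall('.//p')])

print(results)
```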

Access Django Model in Scrapy

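def setup_django_env(django_settings_dir):
    import imp
    import os
    import sys

    # NOTE: setup_environ only exists in Django < 1.6; newer versions use
    # DJANGO_SETTINGS_MODULE together with django.setup() instead
    from django.core.management import setup_environ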

    django_project_path = os.path.abspath(os.path.join(django_settings_dir, '..'))
    sys.path.append(django_project_path)
    sys.path.append(django_settings_dir)

    f, filename, desc = imp.find_module('settings', [django_settings_dir, ])
    project = imp.load_module('settings', f, filename, desc)

    setup_environ(project)

# where Django settings.py placed
DJANGO_SETTINGS_DIR = '/all_projects/heelsfetishism/heelsfetishism'
setup_django_env(DJANGO_SETTINGS_DIR)

Then you can import Django's modules in Scrapy, like this:

from django.contrib.auth.models import User

from app.models import SomeModel

State

http://doc.scrapy.org/en/latest/topics/jobs.html#keeping-persistent-state-between-batches

def parse_item(self, response):
    # parse item here
    self.state['items_count'] = self.state.get('items_count', 0) + 1

Close Spider

from scrapy.exceptions import CloseSpider

# can only be raised from inside a spider, not from a pipeline
raise CloseSpider('Stop')

Others

XPath node-selection syntax
http://mi.hosp.ncku.edu.tw/km/index.php/dotnet/48-netdisk/57-xml-xpath

Avoiding getting banned
http://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned

Python scraping framework: the architecture of Scrapy
http://biaodianfu.com/scrapy-architecture.html

Download images
https://scrapy.readthedocs.org/en/latest/topics/images.html

Login in Spider
http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-userlogin