Lazy evaluation in Django middlewares

Lazy evaluation in Django middlewares

Attach a lazy evaluated function as a property of request in a middleware.

from django.contrib.gis.geoip import GeoIP
from django.utils.functional import SimpleLazyObject


def get_country_code(request):
    g = GeoIP()
    location = g.country(request.META['REMOTE_ADDR'])
    country_code = location.get('country_code', 'TW')

    return country_code


class CountryAndSiteMiddleware(object):

    def process_request(self, request):
        request.COUNTRY_CODE = SimpleLazyObject(lambda: get_country_code(request))

Then you could use request.COUNTRY_CODE whenever you want.

Integrate with webpages using CasperJS (built on top of PhantomJS)

Integrate with webpages using CasperJS (built on top of PhantomJS)

PhantomJS is a headless and scriptable WebKit runtime (aka browser) with JavaScript API.

Usage

in script.js

Login and delete spare movie tags on Douban.

var casper = require('casper').create({
  pageSettings: {
    loadImages: true,
    loadPlugins: false
  },
  logLevel: 'debug',
  verbose: true
});

// save session cookies
var fs = require('fs');
var page = require('webpage').create();

var cookieFile = 'cookies.json';

var saveSessionCookie = function() {
  try {
    fs.statSync(cookieFile);
  } catch (e) {
    fs.write(cookieFile, JSON.stringify(phantom.cookies), 'w');
  }
}

if (fs.isFile(cookieFile)) {
  Array.prototype.forEach.call(JSON.parse(fs.read(cookieFile)), function(x) {
    phantom.addCookie(x);
  });
}

// script
var loginUrl = 'https://accounts.douban.com/login';
var startUrl = 'https://movie.douban.com/people/vinta/all';

var tags_do_not_delete = [
  '丹麦', '新西兰', '新加坡', '以色列', '印度', '意大利', '瑞典', '墨西哥', '俄罗斯', '西班牙', '比利时'
];

casper.start(loginUrl, function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  this.capture('login.png');

  var data = {
    form_email: 'xxx',
    form_password: 'xxx'
  };

  // 可能會被豆瓣要求輸入驗證碼
  // 可以用 casperjs script.js --remote-debugger-port=9000
  // 先打開 login.png 看驗證碼是什麼
  // 到 http://127.0.0.1:9000/ 的 console 手動輸入驗證碼
  // data['captcha-solution'] = '123';

  this.waitForSelector('form#lzform');
  this.fill('form#lzform', data, true);
});

casper.then(function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  saveSessionCookie();

  this.capture('all.png');

  this.open(startUrl).then(function() {
    this.waitForSelector('#open_tags', function() {
      this.click('#open_tags');
    });

    this.waitWhileSelector('#open_tags');
  });
});

casper.then(function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  var links = this.evaluate(function() {
    var tagList = document.querySelectorAll('ul.tag-list li a');
    var theLinks = Array.prototype.map.call(tagList, function(elem) {
        return {
          tag: elem.textContent.trim(),
          href: elem.getAttribute('href'),
          count: parseInt(elem.nextElementSibling.textContent, 10)
        };
    });

    return theLinks;
  });

  var filteredLinks = links.filter(function(link) {
    if (link.count < 5 && tags_do_not_delete.indexOf(link.tag) == -1) {
      return true;
    }
    return false;
  });

  this.each(filteredLinks, function(self, link) {
    this.echo(link.tag + ', ' + link.count);

    self.thenOpen(link.href, function() {
      this.echo(this.getCurrentUrl());
      this.echo(this.getTitle());

      this.waitForSelector('#tag-del', function() {
        this.click('#tag-del');
      });

      this.waitForSelector('input[name="del_submit"]', function() {
        this.click('input[name="del_submit"]');
      });
    });
  });
});

casper.run();

To evaluate JavaScript code in the context of the webpage, you must use evaluate() function. The context is a sandbox.

ref:
http://docs.casperjs.org/en/latest/modules/index.html

ref:
https://github.com/vinta/playground/blob/master/casperjs/script.js

Save session cookies

--cookies-file=xxx.txt only store non-session cookies (which remain your logged-in or authenticated status). You have to save every cookie manually.

var casper = require('casper').create();

// save session cookies
var fs = require('fs');
var page = require('webpage').create();

var cookieFile = 'cookies.json';

var saveSessionCookie = function() {
  try {
    fs.statSync(cookieFile);
  } catch (e) {
    fs.write(cookieFile, JSON.stringify(phantom.cookies), 'w');
  }
}

if (fs.isFile(cookieFile)) {
  Array.prototype.forEach.call(JSON.parse(fs.read(cookieFile)), function(x) {
    phantom.addCookie(x);
  });
}

casper.start('yourUrl', function() {
  // do your shit
});

ref:
http://stackoverflow.com/questions/18739354/how-can-i-use-persisted-cookies-from-a-file-using-phantomjs

Run

$ docker run --rm -v `pwd`:/data vinta/casperjs:1.1.3 script.js

# or

$ brew install casperjs
$ casperjs script.js --disk-cache=true

ref:
https://hub.docker.com/r/vinta/casperjs/
https://hub.docker.com/r/zopanix/casperjs/

ref:
http://phantomjs.org/api/command-line.html

Run in debugging mode

$ casperjs script.js --remote-debugger-port=9000
$ open http://127.0.0.1:9000/
  • Click the first link (something like "file:///usr/local/Cellar/xxx").
  • In Sources tab, press "Enable Debugging" button.
  • In Console tab, type "__run();" to start.
  • Once breakpoints worked, you could go to Console tab to debug.

ref:
http://phantomjs.org/troubleshooting.html

Django's get_or_create() may raise IntegrityError but subsequent get() raises DoesNotExist

Django's get_or_create() may raise IntegrityError but subsequent get() raises DoesNotExist

Django 的 SomeModel.objects.get_or_create() 實際上的步驟是:

get, if-problem: save+some_trickery, if-still-problem: get again, if-still-problem: surrender and raise.

get_or_create() 拋出 IntegrityError 但是接著 get() 一次卻會是 DoesNotExist
這個錯誤常常會發生在 MySQL 的 AUTOCOMMIT=OFF 且 isolation level 是 REPEATABLE READ 的情況下
MySQL 默認採用 REPEATABLE READ(而 PostgreSQL 預設是 READ COMMITTED)
而 REPEATABLE READ 的定義是在同一個 transaction 之內 執行相同的 SELECT 一定會拿到相同的結果

with transaction.commit_on_success():
    try:
        user, created = User.objects.get_or_create(id=self.id, username=self.username)
    except IntegrityError:
        # 這個 get() 可能會發生 DoesNotExist
        user = User.objects.get(id=self.id)

會發生那個 DoesNotExist 通常是因為:

  • thread a 執行 get(),但是 DoesNotExist
  • 所以 thread a 接著 create(),但是在資料建立之前
  • thread b 幾乎在同一個時間點也執行 get(),也是 DoesNotExist
  • thread a 成功地 create()
  • 因為在 3 拿不到資料,所以 thread b 也執行 create(),但是因為 UNIQUE CONSTRAINT,導致失敗了
  • 所以 thread b 又會 get() 一次,但是因為 REPEATABLE READ,thread b 不會拿到 thread a 建立的資料,所以 DoesNotExist
  • 所以那個 User.DoesNotExist 就是 thread b 拋出來的錯誤
  • thread a 是成功的,你事後去資料庫裡查詢,會看到那個 User 的資料好端端地在那裡

ref:
http://stackoverflow.com/questions/2235318/how-do-i-deal-with-this-race-condition-in-django
https://blog.ionelmc.ro/2014/12/28/terrible-choices-mysql/

解決的方法 除了就乾脆不要在同一個 transaction 裡之外
基本上就只能把 isolation level 改成 READ COMMITTED 了
似乎沒有其他辦法了
因為這就是 isolation level 的定義

As of MariaDB/MySQL 5.1, if you use READ COMMITTED or enable innodb_locks_unsafe_for_binlog, you must use row-based binary logging.

SHOW VARIABLES LIKE 'binlog_format';
SET SESSION binlog_format = 'ROW';
SET GLOBAL binlog_format = 'ROW';

ref:
http://dba.stackexchange.com/questions/2678/how-do-i-show-the-binlog-format-on-a-mysql-server

Django 和 Celery 都建議使用 READ COMMITTED
https://docs.djangoproject.com/en/dev/ref/models/querysets/#get-or-create
https://code.djangoproject.com/ticket/13906
https://code.djangoproject.com/ticket/6641
http://docs.celeryproject.org/en/latest/faq.html#mysql-is-throwing-deadlock-errors-what-can-i-do

方法一

SHOW VARIABLES LIKE 'tx_isolation';
SET SESSION tx_isolation='READ-COMMITTED';
SET GLOBAL tx_isolation='READ-COMMITTED';

# or

SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;

ref:
http://dev.mysql.com/doc/refman/5.5/en/set-transaction.html

方法二

/etc/mysql/my.cnf

[mysqld]
transaction-isolation = READ-COMMITTED

方法三

in settings.py

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        ...
        'OPTIONS': {
            'init_command': 'SET SESSION tx_isolation="READ-COMMITTED"',
        },
    }
}

方法四

自己寫一個 get_or_create()

from django.db import transaction

@transaction.atomic
# or
@transaction.commit_on_success
def my_get_or_create(...):
    try:
        obj = MyObj.objects.create(...)
    except IntegrityError:
        transaction.commit()
        obj = MyObj.objects.get(...)
    return obj

ref:
http://stackoverflow.com/questions/2235318/how-do-i-deal-with-this-race-condition-in-django

--

MySQL 默認是 AUTOCOMMIT
http://dev.mysql.com/doc/refman/5.5/en/commit.html

Django 默認也是 AUTOCOMMIT
除非你使用了 TransactionMiddleware(Django 1.5 以前)或是設置 ATOMIC_REQUESTS = False(Django 1.6 以後)
TransactionMiddleware 的效果類似 @transaction.commit_on_success@transaction.atomic
都是把 AUTOCOMMIT 關掉
差別在於範圍不同
http://django.readthedocs.io/en/1.5.x/topics/db/transactions.html#django-s-default-transaction-behavior
http://django.readthedocs.io/en/1.6.x/topics/db/transactions.html#managing-autocommit
http://django.readthedocs.io/en/1.6.x/ref/settings.html#std:setting-DATABASE-ATOMIC_REQUESTS