CodeTengu Weekly Issue 99 @vinta - Apache Spark, Python, Machine Learning, Feature Engineering, Testing, Linux

This post is also published at CodeTengu Weekly - Issue 99.

Spark SQL cookbook (Python)

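I've recently been building a music recommender system for StreetVoice using Apache Spark. Because I keep forgetting how this or that DataFrame feature works, I followed the example of O'Reilly's well-known Cookbook series and wrote myself a Spark SQL cookbook, handy for both review and quick lookup.

Spark supports Scala, Java, Python, and R, and I originally planned to use Scala to sharpen my skills. But this is a company project, and considering that other people will contribute to and maintain it later, it seemed better to go with a language the whole team already knows (mature-adult.jpg).

Further reading: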

How to Size Executors, Cores and Memory for a Spark application running in memory

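When you use spark-submit, you can pass parameters such as --driver-memory, --executor-memory, --executor-cores, and --num-executors to configure the compute resources your Spark app can use. This article points out several things to watch for, and explains what goes wrong with the two approaches of One executor per core and One executor per node.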
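To make the two layouts concrete, here is a minimal sketch that builds the spark-submit argument list for each of them. The cluster size (4 nodes, 16 cores and 64 GB each), the 4g driver memory, the app name, and the helper names are all assumptions made up for illustration; only the four flags come from the text above. The arithmetic is deliberately naive and reserves nothing for the OS or the cluster manager.

```python
from typing import List

# Hypothetical cluster, chosen only for illustration.
NODES = 4
CORES_PER_NODE = 16
MEMORY_GB_PER_NODE = 64


def spark_submit_args(app: str, driver_memory: str, executor_memory: str,
                      executor_cores: int, num_executors: int) -> List[str]:
    """Build a spark-submit argument list from the four resource flags."""
    return [
        "spark-submit",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--num-executors", str(num_executors),
        app,
    ]


def one_executor_per_core(app: str) -> List[str]:
    """Many tiny executors: one core each, node memory split evenly."""
    memory_gb = MEMORY_GB_PER_NODE // CORES_PER_NODE
    return spark_submit_args(app, "4g", f"{memory_gb}g", 1,
                             NODES * CORES_PER_NODE)


def one_executor_per_node(app: str) -> List[str]:
    """A few fat executors: each takes every core and all memory of a node."""
    return spark_submit_args(app, "4g", f"{MEMORY_GB_PER_NODE}g",
                             CORES_PER_NODE, NODES)


print(" ".join(one_executor_per_core("app.py")))
print(" ".join(one_executor_per_node("app.py")))
```

Printing both command lines side by side shows the two extremes the article warns about: 64 executors with a single core each versus 4 executors that each claim a whole node.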

P.S. Besides the Standalone and YARN modes, Spark now also has experimental support for Kubernetes: apache-spark-on-k8s. It looks like k8s really is on its way to ruling them all.

Mastering Feature Engineering

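A recommender system's pipeline can be roughly split into two parts, candidate generation and ranking, and one of the models commonly used for ranking is the simple, brute-force Logistic Regression (usually combined with GBDT or Deep Neural Networks). Because LR needs a lot of Feature Engineering, I went looking for a book dedicated to the topic. I finally finished it while getting a haircut last weekend, so now is a good time to recommend it.

That said, the book covers the more basic material (no shortcuts to the top), such as Binning or standardization for numeric features, TF-IDF for text features, and One-hot encoding or Feature hashing for categorical features. It says little about Feature Construction for creating nonlinear features. It pairs well with 「机器学习中的数据清洗与特征处理综述」 (a survey of data cleaning and feature processing in machine learning), recommended in an earlier issue.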
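As a rough illustration of three of the techniques named above (Binning, One-hot encoding, and Feature hashing), here is a minimal pure-Python sketch. The function names, bucket edges, and sample values are hypothetical and not taken from the book; a real pipeline would more likely use the transformers that ship with Spark ML or scikit-learn.

```python
import hashlib
from bisect import bisect_right
from typing import List, Sequence


def bin_index(value: float, edges: Sequence[float]) -> int:
    """Binning: return the index of the bucket that value falls into.

    edges must be sorted; bucket 0 is everything below edges[0].
    """
    return bisect_right(edges, value)


def one_hot(value: str, vocabulary: Sequence[str]) -> List[int]:
    """One-hot encoding: a vector with a single 1 at the value's position."""
    return [1 if value == item else 0 for item in vocabulary]


def hash_bucket(value: str, num_buckets: int) -> int:
    """Map a category to a bucket with a stable hash (md5, not hash())."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def hashed_vector(values: Sequence[str], num_buckets: int) -> List[int]:
    """Feature hashing: count categories per bucket in a fixed-size vector."""
    vector = [0] * num_buckets
    for value in values:
        vector[hash_bucket(value, num_buckets)] += 1
    return vector


age_edges = [18, 30, 50]           # hypothetical bucket boundaries
genres = ["pop", "rock", "jazz"]   # hypothetical vocabulary

print(bin_index(25, age_edges))
print(one_hot("rock", genres))
print(hashed_vector(["rock", "pop", "rock"], num_buckets=8))
```

One-hot encoding needs the full vocabulary up front, while feature hashing trades that requirement for possible collisions between categories that land in the same bucket.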

Write Explicit Tests

Sometimes, normal programming good practices don’t apply to software tests. DRY in particular I don’t subscribe to for test code, because I want my tests to read like a story. - so says Kent Beck

You reduce duplication, but you introduce coupling. Programming really is hard.
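A small sketch of what "tests that read like a story" can look like with Python's unittest. The apply_discount function and the test names are hypothetical and exist only for this example. Each test spells out its own input, action, and expectation, even though that repeats a few lines.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical function under test."""
    return round(price * (100 - percent) / 100, 2)


class ExplicitDiscountTests(unittest.TestCase):
    # No shared fixtures or helper methods: every test can be read
    # top to bottom without jumping elsewhere in the file.

    def test_ten_percent_off_a_100_dollar_item_costs_90(self):
        price = 100
        discounted = apply_discount(price, percent=10)
        self.assertEqual(discounted, 90)

    def test_zero_percent_leaves_the_price_unchanged(self):
        price = 60
        discounted = apply_discount(price, percent=0)
        self.assertEqual(discounted, 60)
```

Run it with python -m unittest. Factoring the repeated lines into a shared helper would be DRY-er, but every test would then depend on that helper, which is the coupling mentioned above.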

Strace - The SysAdmin's Microscope

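strace is a command that lets you observe what a script or process is actually doing at the system call level. It's a great troubleshooting aid, especially for the classic Linux crowd-pleaser: "why the hell won't xxx run?! (20 minutes later) oh, I set the permissions wrong".

Further reading: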
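For the permissions case in particular, a minimal sketch of the workflow might look like this: build an strace command line that follows child processes (-f) and writes to a file (-o), then scan the trace for system calls that failed with EACCES. The helper names are hypothetical, and the SAMPLE lines are hand-written in strace's usual output format purely for illustration; they are not a real trace.

```python
import re
from typing import Iterable, List, Tuple

# Matches e.g.: openat(AT_FDCWD, "/path", O_RDONLY) = -1 EACCES (...)
# With -f and -o, strace prefixes each line with a pid, so that prefix
# is treated as optional here.
_EACCES = re.compile(r'^(?:\d+\s+)?(\w+)\([^"]*"([^"]+)".*= -1 EACCES')


def strace_command(command: List[str], output_file: str) -> List[str]:
    """Build an strace invocation that traces command and its children."""
    return ["strace", "-f", "-o", output_file] + command


def permission_denied(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (syscall, path) pairs for calls that failed with EACCES."""
    hits = []
    for line in lines:
        match = _EACCES.match(line)
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


# Hand-written sample lines, NOT real strace output.
SAMPLE = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    '1234 openat(AT_FDCWD, "/var/log/app.log", O_WRONLY|O_CREAT|O_APPEND, 0666)'
    ' = -1 EACCES (Permission denied)',
    'access("/srv/app/run.sh", X_OK) = -1 EACCES (Permission denied)',
]

print(" ".join(strace_command(["./run.sh"], "trace.txt")))
for syscall, path in permission_denied(SAMPLE):
    print(syscall, path)
```

In practice you would run the generated command, then feed the lines of trace.txt to permission_denied() instead of SAMPLE.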

Speed up Python and Node.js builds on Travis CI

Travis CI's caching archives all directories listed in the configuration and uploads them to Amazon S3. Cached contents are available to any build on the repository, including Pull Requests. For Python and Node.js projects, you could cache both site-packages and node_modules directories in every Travis CI build.

Here is an example of .travis.yml:

sudo: false

language: python

python:
  - "2.7"

node_js: 4

cache:
  directories:
    - $HOME/.cache/pip
    - $HOME/virtualenv/python2.7.9/lib/python2.7/site-packages
    - node_modules

before_install:
  - pip install -U pip

install:
  - pip install -r requirements.txt
  - pip install coverage --ignore-installed
  - npm install

script:
  - coverage run manage.py test

In my case, after applying these changes, the pip and npm installation time dropped from 180 seconds to 5 seconds.

One thing worth mentioning: since we didn't specify any bin folder in the configuration (and I don't think that's necessary), any executable installed by pip, such as coverage or django-admin.py, will not exist in subsequent builds. If you need those commands, you can force-install them by adding pip install some_package --ignore-installed.

References:

Caching Dependencies and Directories
https://docs.travis-ci.com/user/caching/

How to cache requirements for a Django project on Travis-CI?
http://stackoverflow.com/questions/19422229/how-to-cache-requirements-for-a-django-project-on-travis-ci

如何在 Travis CI 加快 Python 單元測試速度 (How to speed up Python unit tests on Travis CI)
https://tzangms.com/how-to-speed-up-python-unit-test-on-travis-ci/

Integrate with webpages using CasperJS (built on top of PhantomJS)

PhantomJS is a headless, scriptable WebKit runtime (aka browser) with a JavaScript API. CasperJS is a navigation scripting and testing utility built on top of it.

Usage

in script.js

Log in and delete spare movie tags on Douban.

var casper = require('casper').create({
  pageSettings: {
    loadImages: true,
    loadPlugins: false
  },
  logLevel: 'debug',
  verbose: true
});

// save session cookies
var fs = require('fs');

var cookieFile = 'cookies.json';

// Write the cookies only if the cookie file does not exist yet.
// PhantomJS's fs module has no statSync(), so check with isFile() instead.
var saveSessionCookie = function() {
  if (!fs.isFile(cookieFile)) {
    fs.write(cookieFile, JSON.stringify(phantom.cookies), 'w');
  }
};

if (fs.isFile(cookieFile)) {
  Array.prototype.forEach.call(JSON.parse(fs.read(cookieFile)), function(x) {
    phantom.addCookie(x);
  });
}

// script
var loginUrl = 'https://accounts.douban.com/login';
var startUrl = 'https://movie.douban.com/people/vinta/all';

var tags_do_not_delete = [
  '丹麦', '新西兰', '新加坡', '以色列', '印度', '意大利', '瑞典', '墨西哥', '俄罗斯', '西班牙', '比利时'
];

casper.start(loginUrl, function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  this.capture('login.png');

  var data = {
    form_email: 'xxx',
    form_password: 'xxx'
  };

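  // Douban may ask for a captcha.
  // In that case, run casperjs script.js --remote-debugger-port=9000,
  // open login.png to read the captcha,
  // then enter it manually in the console at http://127.0.0.1:9000/
  // data['captcha-solution'] = '123';

  // fill the form only after it has actually appeared
  this.waitForSelector('form#lzform', function() {
    this.fill('form#lzform', data, true);
  });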
});

casper.then(function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  saveSessionCookie();

  this.capture('all.png');

  this.open(startUrl).then(function() {
    this.waitForSelector('#open_tags', function() {
      this.click('#open_tags');
    });

    this.waitWhileSelector('#open_tags');
  });
});

casper.then(function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  var links = this.evaluate(function() {
    var tagList = document.querySelectorAll('ul.tag-list li a');
    var theLinks = Array.prototype.map.call(tagList, function(elem) {
        return {
          tag: elem.textContent.trim(),
          href: elem.getAttribute('href'),
          count: parseInt(elem.nextElementSibling.textContent, 10)
        };
    });

    return theLinks;
  });

  var filteredLinks = links.filter(function(link) {
    if (link.count < 5 && tags_do_not_delete.indexOf(link.tag) == -1) {
      return true;
    }
    return false;
  });

  this.each(filteredLinks, function(self, link) {
    this.echo(link.tag + ', ' + link.count);

    self.thenOpen(link.href, function() {
      this.echo(this.getCurrentUrl());
      this.echo(this.getTitle());

      this.waitForSelector('#tag-del', function() {
        this.click('#tag-del');
      });

      this.waitForSelector('input[name="del_submit"]', function() {
        this.click('input[name="del_submit"]');
      });
    });
  });
});

casper.run();

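To evaluate JavaScript code in the context of the webpage, you must use the evaluate() function. That context is sandboxed: code inside it cannot see variables from the Casper script, and only serializable values can be passed in or returned.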

ref:
http://docs.casperjs.org/en/latest/modules/index.html

ref:
https://github.com/vinta/playground/blob/master/casperjs/script.js

Save session cookies

--cookies-file=xxx.txt only stores non-session cookies, not the session cookies that keep you logged in (authenticated). You have to save every cookie manually.

var casper = require('casper').create();

// save session cookies
var fs = require('fs');

var cookieFile = 'cookies.json';

// Write the cookies only if the cookie file does not exist yet.
// PhantomJS's fs module has no statSync(), so check with isFile() instead.
var saveSessionCookie = function() {
  if (!fs.isFile(cookieFile)) {
    fs.write(cookieFile, JSON.stringify(phantom.cookies), 'w');
  }
};

if (fs.isFile(cookieFile)) {
  Array.prototype.forEach.call(JSON.parse(fs.read(cookieFile)), function(x) {
    phantom.addCookie(x);
  });
}

casper.start('yourUrl', function() {
  // do your shit
});

ref:
http://stackoverflow.com/questions/18739354/how-can-i-use-persisted-cookies-from-a-file-using-phantomjs

Run

$ docker run --rm -v `pwd`:/data vinta/casperjs:1.1.3 script.js

# or

$ brew install casperjs
$ casperjs script.js --disk-cache=true

ref:
https://hub.docker.com/r/vinta/casperjs/
https://hub.docker.com/r/zopanix/casperjs/

ref:
http://phantomjs.org/api/command-line.html

Run in debugging mode

$ casperjs script.js --remote-debugger-port=9000
$ open http://127.0.0.1:9000/
  • Click the first link (something like "file:///usr/local/Cellar/xxx").
  • In the Sources tab, press the "Enable Debugging" button.
  • In the Console tab, type "__run();" to start.
  • Once a breakpoint is hit, you can use the Console tab to debug.

ref:
http://phantomjs.org/troubleshooting.html

testify: Testing in Go

Assume your project lives under $GOPATH/src/github.com/your_username/

test

in xxx_test.go

package pangu_test

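import (
    "testing"

    "github.com/stretchr/testify/suite"
    // Go refuses to compile unused imports, so keep this as a blank import
    // until the tests actually reference the package.
    _ "github.com/your_username/your_project"
)

type YourTestSuite struct {
    suite.Suite
}

func (suite *YourTestSuite) TestFunction1() {
    // placeholder values: as written, this assertion fails
    suite.Equal("expected", "actual")
}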

// In order for 'go test' to run this suite, we need to create
// a normal test function and pass our suite to suite.Run
func TestYourTestSuite(t *testing.T) {
    suite.Run(t, new(YourTestSuite))
}

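go test automatically picks up every file in the current directory whose name ends in _test.go.
More precisely, it runs every function in those files whose name starts with Test.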

# run tests in the current directory
$ go test

# run tests in the current directory and sub-directories
$ go test ./...

ref:
http://golang.org/pkg/testing/
https://github.com/stretchr/testify

How to specify test resources?
https://groups.google.com/forum/#!topic/golang-nuts/VPVlIiO5yXw

example

in example_test.go

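The file name is a convention rather than a requirement: examples can live in any file ending in _test.go.
The contents show up under the Example entry in godoc.
To be precise,
the body of ExampleFunction1() appears in the Example entry for Function1().
You can run godoc -http=:6060 to view it locally.

Because the file also ends in _test.go,
go test runs it as well
and, thanks to the // Output: comment, helpfully checks that your example code actually works.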

package pangu_test

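import (
    "fmt"

    // blank import: Go rejects unused imports; drop the underscore
    // once the example actually uses the package
    _ "github.com/your_username/your_project"
)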

func ExampleFunction1() {
    fmt.Println(`The knights who say "Ni"!`)
    // Output:
    // The knights who say "Ni"!
}

benchmark

in benchmark_test.go

Again, the file name is only a convention: benchmarks can live in any file ending in _test.go.

package pangu_test

import (
    "testing"
)

func BenchmarkFunction1(b *testing.B) {
    for i := 0; i < b.N; i++ {
        doShit() // placeholder for the code you want to benchmark
    }
}

# This will run tests and benchmarks
$ go test -bench=.

# If you want to skip the tests, you can do so by passing a regex to the -run flag that will not match anything. 
$ go test -run=XXX -bench=.

coverage

$ go test -cover -coverprofile=cover.out -covermode=count
$ go tool cover -func=cover.out
$ go tool cover -html=cover.out

# or

$ go test -cover -covermode=count ./...

Integrate with Travis CI and Coveralls

language: go

go:
  - 1.4.2
  - 1.4.1
  - 1.4
  - 1.3.3
  - 1.3.2
  - 1.3.1
  - 1.3

before_install:
  - go get github.com/axw/gocov/gocov
  - go get github.com/mattn/goveralls
  - go get golang.org/x/tools/cmd/cover

script:
  - goveralls -service=travis-ci

ref:
http://docs.travis-ci.com/user/languages/go/
https://coveralls.zendesk.com/hc/en-us/articles/201342809-Go

Karma: JavaScript test runner

ref:

http://karma-runner.github.io/

I recommend using it together with grunt
and keeping all the configuration in Gruntfile.js.

Install

$ npm install karma --save-dev

# generate karma.conf.js so you can take a look at the default configuration
$ ./node_modules/.bin/karma init

Configuration

If you run jasmine through karma,
you don't need SpecRunner.html;
the files option in Gruntfile.js takes its place.

in Gruntfile.js

karma: {
    test: {
        options: {
            // base path, that will be used to resolve files and exclude
            basePath: '',
            // start these browsers
            browsers: [
                'PhantomJS'
            ],
            coverageReporter: {
                // multiple reporters
                reporters: [
                    {type: 'lcov', dir:'coverage/'},
                    {type: 'text'}
                ]
            },
            // list of files to exclude
            exclude: [],
            // list of files / patterns to load in the browser
            files: [
                'dist/pangu.js',
                'tests/lib/jquery/jquery-1.10.2.min.js',
                'tests/lib/jasmine-jquery/jasmine-jquery.js',
                'tests/spec/**/*.js',
                {
                    pattern: 'tests/fixtures/**/*.html',
                    included: false,
                    served: true
                }
            ],
            // frameworks to use
            frameworks: [
                'jasmine'
            ],
            plugins: [
                'karma-*'
            ],
            // preprocessors allow you to do some work with your files before they get served to the browser.
            preprocessors: {
                '**/*.html': [],
                // source files, that you wanna generate coverage for
                'dist/pangu.js': [
                    'coverage'
                ]
            },
            // test results reporter to use, possible values: 'dots', 'progress', 'junit', 'growl', 'coverage'
            reporters: [
                'progress',
                'coverage'
            ],
            // continuous integration mode: if true, karma captures browsers, runs the tests, and exits
            singleRun: true
        }
    }
},

ref:
http://karma-runner.github.io/0.10/config/configuration-file.html

Issues

Fixture could not be loaded

This error shows up when using jasmine-jquery's loadFixtures.
The culprit is karma's preprocessors.
Changing the configuration like this fixes it:

in Gruntfile.js

karma: {
  unit: {
    options: {
      ...
      preprocessors: {
        '**/*.html': []
      },
      ...
    }
  }
},

ref:
http://karma-runner.github.io/0.10/config/preprocessors.html
https://github.com/karma-runner/karma/issues/736
https://github.com/karma-runner/karma/issues/788

Also, karma serves files under the base/ path,
so jasmine-jquery needs some extra setup:

in SpecHelper.js

beforeEach(function() {
  // https://github.com/karma-runner/karma/issues/481
  var path = '';
  if (typeof window.__karma__ !== 'undefined') {
    path += 'base/';
  }
  jasmine.getFixtures().fixturesPath = path + 'test/fixture/';
});