Parallel tasks in Python: concurrent.futures

Install

concurrent.futures is part of the standard library in Python 3.2+. If you're using an older version of Python, you need to install the futures package.

$ pip install futures

ref:
https://docs.python.org/3/library/concurrent.futures.html

executor.map()

You should use ProcessPoolExecutor for CPU-intensive tasks, while ThreadPoolExecutor is better suited for network operations or other I/O-bound work. ProcessPoolExecutor is built on the multiprocessing module, which means it is not affected by the GIL (Global Interpreter Lock), but also that only picklable objects can be executed and returned.

In Python 3.5+, map() accepts an optional chunksize argument. For very long iterables, a large chunksize can significantly improve performance compared to the default of 1. With ThreadPoolExecutor, chunksize has no effect.

from concurrent.futures import ThreadPoolExecutor
import time

import requests

def fetch(a):
    url = 'http://httpbin.org/get?a={0}'.format(a)
    r = requests.get(url)
    return r.json()['args']

start = time.time()

# if max_workers is None or not given, it will default to the number of processors, multiplied by 5
with ThreadPoolExecutor(max_workers=None) as executor:
    for result in executor.map(fetch, range(30)):
        print('response: {0}'.format(result))

print('Use requests + ThreadPoolExecutor cost: {}'.format(time.time() - start))
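
For CPU-bound work, here is a minimal ProcessPoolExecutor sketch; fib() is just a hypothetical CPU-heavy workload to illustrate chunksize:

from concurrent.futures import ProcessPoolExecutor

def fib(n):
    # a deliberately slow recursive Fibonacci to keep the CPU busy
    return n if n < 2 else fib(n - 1) + fib(n - 2)

if __name__ == '__main__':
    # chunksize batches the items sent to each worker process,
    # which reduces inter-process communication overhead
    with ProcessPoolExecutor(max_workers=4) as executor:
        for result in executor.map(fib, range(30), chunksize=5):
            print(result)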

ref:
https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures
https://www.blog.pythonlibrary.org/2016/08/03/python-3-concurrency-the-concurrent-futures-module/
http://masnun.com/2016/03/29/python-a-quick-introduction-to-the-concurrent-futures-module.html

executor.submit() and as_completed()

executor.submit() returns a Future object. A Future is essentially an object that encapsulates the asynchronous execution of a function that will finish (or raise an exception) at some point in the future.

The main difference between map() and as_completed() is that map() returns results in the same order as the input iterable, while as_completed() yields the result of whichever future completes first. Also, iterating over map() gives you the results of the futures; iterating over as_completed(futures) gives you the Future objects themselves.

from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch(url, timeout):
    r = requests.get(url, timeout=timeout)
    data = r.json()['args']
    return data

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {}
    for i in range(42):
        url = 'https://httpbin.org/get?i={0}'.format(i)
        future = executor.submit(fetch, url, 60)
        futures[future] = url

    for future in as_completed(futures):
        url = futures[future]
        try:
            data = future.result()
        except Exception as exc:
            print(exc)
        else:
            print('fetch {0}, get {1}'.format(url, data))

ref:
https://docs.python.org/3/library/concurrent.futures.html#future-objects

Machine Learning Glossary: Common Terms Explained

Anomaly Detection

Picking out abnormal values (outliers) from a dataset.

Anscombe's Quartet

Four datasets whose basic statistical properties are nearly identical, yet which look completely different when plotted.
The main point is that statistical summaries have their limits and outliers can heavily distort them.
Also: always plot your data before analyzing it.

ref:
https://www.wikiwand.com/zh-tw/%E5%AE%89%E6%96%AF%E5%BA%93%E5%A7%86%E5%9B%9B%E9%87%8D%E5%A5%8F

Association Rule

Discovering hidden relationships between items in the data,
e.g. the famous beer-and-diapers example.

Best Subset Selection

A model selection method: fit a model for every possible combination of features and keep the best-performing one.

Cost Function / Loss Function

The smaller the value computed by the cost function (the so-called cost), the better the model fits the data.
A commonly used cost function is Mean Squared Error (MSE).
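
For instance, a minimal MSE implementation in plain Python:

def mse(y_true, y_pred):
    # the average of the squared differences between
    # predictions and ground truth
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(mse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # 0.8333...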

ref:
https://ml.berkeley.edu/blog/2016/11/06/tutorial-1/

Cross Validation

Split the dataset into a training set and a testing set.
Use the training set to train the model.
Use the testing set to measure its accuracy.
Both sets must be sampled uniformly (i.e. at random) from the original dataset.

Some people split the data into a training set, a validation set, and a testing set instead:
the training set is used to train the model, the validation set for model selection, and the testing set for the final evaluation of the learning method.

Another approach is k-Fold, sketched below.
Suppose k is 3:
split the dataset into three equal parts,
train the model on parts 1 + 2 and validate on part 3,
then train on parts 1 + 3 and validate on part 2,
and finally train on parts 2 + 3 and validate on part 1.
Generally, the larger k is, the smaller the bias and the larger the variance.
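
A minimal k-Fold sketch, assuming scikit-learn is available:

from sklearn.model_selection import KFold

X = list(range(9))  # stands in for any indexable dataset

kf = KFold(n_splits=3, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X):
    # train on the samples at train_index, validate on test_index
    print('train: {0}, test: {1}'.format(train_index, test_index))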

Curse of Dimensionality

When the number of dimensions (features) in the data grows beyond a certain point,
computation takes too long and memory usage becomes too large (both grow exponentially).
The data also inevitably becomes sparse,
and too many features can lead to overfitting.

ref:
https://www.quora.com/What-is-the-curse-of-dimensionality
http://stats.stackexchange.com/questions/169156/explain-curse-of-dimensionality-to-a-child

Decision Boundary

A smoother boundary corresponds to a simpler model.

Each feature is represented as one dimension of the feature space.
The decision boundary is the boundary that correctly partitions the dataset within that feature space.
This boundary can be linear or non-linear.

Dimensionality Reduction

A kind of unsupervised learning (transformations of the dataset).
It can be divided into feature selection and feature extraction.
The goal is to reduce the number of feature dimensions without losing too much information,
or put differently, to represent the same dataset with fewer dimensions.
Fewer dimensions improve computational efficiency and make visualization easier.

ref:
https://www.wikiwand.com/en/Dimensionality_reduction

Selecting the most useful features out of a larger set of features
is called feature selection.
A common method is greedy forward selection.

Transforming the original high-dimensional features into fewer dimensions
is called feature extraction.
The transformed features are no longer the original ones.
Common methods include Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF).

ref:
https://www.wikiwand.com/en/Feature_engineering
https://www.wikiwand.com/en/Feature_selection

Ensemble Learning

Machine learning that combines multiple algorithms,
e.g. Random Forest (decision trees + bagging).

Common ensemble methods include:
bagging (aka bootstrap aggregating)
boosting

Error / Bias / Variance

Error describes the overall accuracy of the model; for squared error it decomposes as Error = Bias² + Variance (+ irreducible noise).

Bias is the gap between the predicted values and the true values, i.e. the precision of the model (it reflects the difference between the model's outputs on the samples and the ground truth).
The larger the bias, the further the model deviates from the real data.
Inaccurate predictions caused by a model that is too simple >> high bias

Variance is the spread of the predicted values, i.e. the stability of the model (it reflects the difference between each individual output and the model's expected output).
The larger the variance, the more scattered the predictions.
Greater variability and uncertainty caused by a model that is too complex >> high variance

ref:
https://www.zhihu.com/question/20448464 (with diagrams)
https://www.zhihu.com/question/27068705

Feature Engineering

The process of finding (or creating) features that make the algorithm work better.
It can also mean combining or transforming several related features into a new feature.
You usually want to avoid feeding too many features to the algorithm.

Feature Scaling

Part of preprocessing:
bringing the values of all features into a comparable numeric range.
For many algorithms this step is required.

Rescaling maps values of wildly different scales (123, 10000, 8) into numbers between 0 and 1 (or -1 and 1).
It is also called normalization.

Standardization is another method, commonly used in statistics:
converting the values into z-scores.
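
A minimal sketch of both methods in plain Python (scikit-learn's MinMaxScaler and StandardScaler do the same jobs):

values = [123.0, 10000.0, 8.0]

# rescaling (min-max normalization): map values into 0 ~ 1
lo, hi = min(values), max(values)
rescaled = [(v - lo) / (hi - lo) for v in values]

# standardization: convert values into z-scores
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_scores = [(v - mean) / std for v in values]

print(rescaled)  # [0.0115..., 1.0, 0.0]
print(z_scores)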

ref:
https://www.quora.com/What-is-the-difference-between-normalization-standardization-and-regularization-for-data
http://sobuhu.com/ml/2012/12/29/normalization-regularization.html

Forward Stepwise Selection

Add one feature at a time to train the model,
measuring the accuracy at every step,
until all features have been used.

Backward Stepwise Selection is the same process in reverse.

Generalization

If a model is able to make accurate predictions on unseen data, we say it is able to generalize from the training set to the test set.

It refers to a model's ability to predict unseen data.
An overfitted model, for example, generalizes poorly.

ref:
https://www.quora.com/What-is-generalization-in-machine-learning

Gradient Descent

An algorithm that minimizes the cost function,
i.e. finds the best model parameters,
by repeatedly updating the parameters in the direction opposite to the gradient of the cost function.
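
A minimal sketch: fitting y = w * x on a toy dataset by minimizing MSE with gradient descent (the true w is 2.0):

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0
learning_rate = 0.01
for step in range(1000):
    # gradient of MSE with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # move w a small step in the opposite direction of the gradient
    w -= learning_rate * grad

print(w)  # converges to roughly 2.0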

Greedy Feature Selection

Train the model with only one feature at a time.

In greedy feature selection we choose one feature, train a model and evaluate the performance of the model on a fixed evaluation metric. We keep adding and removing features one-by-one and record performance of the model at every step. We then select the features which have the best evaluation score.
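
A minimal sketch of that procedure; evaluate() is a hypothetical callback that trains a model on the given feature subset and returns its score:

def greedy_forward_selection(all_features, evaluate):
    selected = []
    best_score = float('-inf')
    improved = True
    while improved:
        improved = False
        for feature in all_features:
            if feature in selected:
                continue
            score = evaluate(selected + [feature])
            if score > best_score:
                best_score = score
                best_feature = feature
                improved = True
        if improved:
            # keep the single feature that improved the score the most
            selected.append(best_feature)
    return selected, best_score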

Hyperparameter

The parameters you pass in when constructing a machine learning algorithm,
as opposed to the parameters the algorithm learns during training.

Linear Separability

When you have a bunch of data points
and you can draw a straight line that separates them correctly,
the data is said to be linearly separable.
Otherwise, it is linearly inseparable.

Logistic Curve

A curve shaped like an S whose head and tail have been stretched out and flattened.
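
The standard logistic function in a minimal sketch:

import math

def logistic(x):
    # squashes any real number into the range 0 ~ 1,
    # producing the S-shaped curve
    return 1 / (1 + math.exp(-x))

for x in (-6, -2, 0, 2, 6):
    print(x, logistic(x))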

ref:
https://www.stat.ubc.ca/~rollin/teach/643w04/lec/node46.html

Missing Value Imputation

Filling in the fields that have no value,
typically with something like the median, the mean, or the most frequent value.
Sometimes also referred to as interpolation.
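
A minimal sketch with pandas, filling missing values with the column median:

import pandas as pd

df = pd.DataFrame({'a': [1.0, None, 3.0, None]})

# replace the NaN cells with the median of the column
df['a'] = df['a'].fillna(df['a'].median())
print(df)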

Manifold Learning

A family of non-linear dimensionality reduction methods.
They can turn a high-dimensional dataset into a lower-dimensional one,
mainly for visualization.
A commonly used one is t-SNE.

Manifold learning is usually used for exploratory data analysis;
unlike PCA, its output is rarely fed into supervised learning as input.
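
A minimal t-SNE sketch, assuming scikit-learn; it projects toy 4-dimensional points down to 2 dimensions for plotting:

from sklearn.manifold import TSNE

X = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.9, 1.0, 0.9],
    [0.0, 1.0, 0.0, 1.0],
    [0.1, 1.0, 0.1, 1.0],
]

# perplexity must be smaller than the number of samples
X_2d = TSNE(n_components=2, perplexity=3).fit_transform(X)
print(X_2d)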

ref:
http://scikit-learn.org/stable/modules/manifold.html
https://www.wikiwand.com/en/Nonlinear_dimensionality_reduction

Predictors

Just another name for features.

Principal Component Analysis (PCA)

A technique for analyzing and simplifying a dataset.
It is used to reduce the dimensionality (the number of dimensions) of a dataset
while keeping the features that contribute most to its variance.
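
A minimal PCA sketch, assuming scikit-learn; it projects toy 3-dimensional points onto the 2 components that explain the most variance:

from sklearn.decomposition import PCA

X = [
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.5],
    [2.2, 2.9, 0.3],
    [1.9, 2.2, 0.7],
    [3.1, 3.0, 0.2],
]

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# the proportion of the total variance each component explains
print(pca.explained_variance_ratio_)
print(X_2d)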

Overfitting / Underfitting

Overfitting often happens when the model is very complex and has many parameters,
or when the dataset contains a lot of noise or outliers.
It shows up as high accuracy on the training set but low accuracy on the testing set.
Complex model >> high variance / low bias >> overfitting

Underfitting usually happens when the model is too simple.
It shows up as a high error rate even on the training set.
Simple model >> high bias / low variance >> underfitting

ref:
http://www.csuldw.com/2016/02/26/2016-02-26-choosing-a-machine-learning-classifier/

Regularization

Regularization means explicitly restricting a model to avoid overfitting.

A technique for preventing overfitting.
Regularization keeps all the features,
but reduces (penalizes) the influence certain features have on the model's predictions.
Common methods are L1 and L2.
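
A minimal sketch, assuming scikit-learn; Ridge applies an L2 penalty and Lasso an L1 penalty, with alpha controlling the strength:

from sklearn.linear_model import Lasso, Ridge

X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
y = [0.0, 1.0, 2.0]

# larger alpha >> larger penalty on big coefficients
ridge = Ridge(alpha=1.0).fit(X, y)  # L2
lasso = Lasso(alpha=0.1).fit(X, y)  # L1

print(ridge.coef_)
print(lasso.coef_)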

Trellis Plotting

Also known as lattice plots or small multiples:
a grid of panels, each showing the same kind of plot for a different subset of the data.

ref:
https://www.zhihu.com/question/37189447

SSL23_GET_SERVER_HELLO: tlsv1 alert internal error (SNI) in Python 2.7.6

This happens when Python 2.7.6 connects to an HTTPS site behind CloudFlare.
It seems to be because 2.7.6 does not support Server Name Indication (SNI);
support was only added in Python 2.7.9.

Related errors:

SSL routines: SSL23_GET_SERVER_HELLO: tlsv1 alert internal error
SSLError: bad handshake: SysCallError(32, 'EPIPE')

Solution:

$ sudo apt-get install libffi-dev libssl-dev
$ pip install -U urllib3[secure] ndg-httpsclient pyasn1

ref:
https://stackoverflow.com/questions/18578439/using-requests-with-tls-doesnt-give-sni-support/
http://docs.python-requests.org/en/master/community/faq/
https://github.com/kennethreitz/requests/issues/749
https://github.com/kennethreitz/requests/pull/1347