Calculate the similarity of two vectors

Calculate the similarity of two vectors

scipy.spatial.distance
https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

sklearn.metrics
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Distance

Euclidean distance 歐幾里德距離

from sklearn.metrics.pairwise import euclidean_distances

euclidean_distances([0, 0, 0, 0], [0, 0, 0, 0])
# array([[ 0.]])

euclidean_distances([1, 0, 1, 0], [1, 0, 1, 0])
# array([[ 0.]])

euclidean_distances([0, 1, 0, 1], [1, 0, 1, 0])
# array([[ 2.]])

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html

Manhattan Distance 曼哈頓距離

from sklearn.metrics.pairwise import manhattan_distances

manhattan_distances([0, 0, 0, 0], [0, 0 , 0, 0])
# array([[ 0.]])

manhattan_distances([1, 1, 1, 0], [1, 0, 0, 0])
# array([[ 2.]])

manhattan_distances([0, 1, 0, 1], [1, 0, 1, 0])
# array([[ 4.]])

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.manhattan_distances.html

Similarity

Cosine similarity 餘弦相似度

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import cosine_distances
from sklearn.metrics.pairwise import pairwise_distances
from scipy.spatial.distance import pdist, squareform

cosine_similarity(matrix) == \
1 - cosine_distances(matrix) == \
1 - pairwise_distances(matrix, metric='cosine') == \
1 - squareform(pdist(matrix, 'cosine'))

cosine_similarity([0, 0, 0, 0], [0, 0, 0, 0])
# array([[ 0.]])

cosine_similarity([1, 0, 0, 0], [1, 0, 0, 0])
# array([[ 1.]])

cosine_similarity([1, 0, 1, 0], [0, 1, 0, 1])
# array([[ 0.]])

cosine_similarity([1, 0, 0, 1], [1, 0, 0, 0])
# array([[ 0.70710678]])

cosine_similarity([1, 0, 0, 1], [1, 0, 1, 0])
# array([[ 0.5]])

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html

Jaccard similarity coefficient score

from sklearn.metrics import jaccard_similarity_score

jaccard_similarity_score([0, 0, 0, 0], [0, 0, 0, 0])
# 1.0

jaccard_similarity_score([0, 0, 0, 0], [1, 0, 0, 0])
# 0.75

jaccard_similarity_score([1, 0, 0, 0], [1, 0, 0, 0])
# 1.0

jaccard_similarity_score([1, 0, 1, 0], [0, 1, 0, 1])
# 0.0

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html

http://datascience.stackexchange.com/questions/5121/applications-and-differences-for-jaccard-similarity-and-cosine-similarity

Log-Likelihood similarity

TODO

Pearson correlation coefficient 皮爾森相關係數

It has a value between +1 and −1 inclusive, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. You should only calculate Pearson Correlations when the number of items in common between two users is > 1, preferably greater than 5/10. Only calculate the Pearson Correlation for two users where they have commonly rated items.

For hign-dimensional binary attributes, the performances of Pearson correlation coefficient and Cosine similarity
are better than Jaccard similarity coefficient score.

from scipy.stats import pearsonr

pearsonr([1, 0, 1, 1], [0, 0, 0, 0])
# (nan, 1.0)

pearsonr([1, 0, 1, 1], [1, 0, 0, 0])
# (0.33333333333333331, 0.66666666666666607)

pearsonr([1, 0, 1, 0], [0, 1, 0, 1])
# (-1.0, 0.0)

ref:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
http://stackoverflow.com/questions/11429604/how-is-nan-handled-in-pearson-correlation-user-user-similarity-matrix-in-a-recom

Dissimilarity

Dice dissimilarity

from scipy.spatial.distance import dice
import numpy as np

v1 = np.array([0, 0, 0, 0])
v2 = np.array([0, 0, 0, 0])

try:
    sim = 1.0 - dice(v1.astype(bool), v2.astype(bool))
except ZeroDivisionError:
    sim = 0

ref:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.dice.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.kulsinski.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.sokalsneath.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.yule.html

Recommender System: Collaborative Filtering 協同過濾推薦演算法

Recommender System: Collaborative Filtering 協同過濾推薦演算法

dataset 會是 m 個用戶對 n 個物品的評分 utility matrix
因為通常只有部分用戶和部份物品會有評分資料
所以是一個 sparse matrix(稀疏矩陣)
目標是利用這些稀疏的資料去預測出用戶對他還沒評分過的物品的評分
除了評分之外,也可能是喜歡(和不喜歡)、購買、瀏覽之類的數據
又分成主動評分和被動評分

CF 的缺點:

  • 如果沒有用戶的歷史數據就沒辦法做任何推薦
  • 以及無論 user-based 或 item-based 都需要消耗大量的運算資源
  • 大部分用戶有評分紀錄的資料都只佔所有資料中的很小一部分,matrix 相當稀疏,很難找到相似的資料
  • 會有馬太效應,越熱門的物品越容易被推薦,所以通常都會降低熱門物品的權重

CF 主要分為 memory-based 和 model-based 兩大類
user-based 和 item-based collaborative filtering 屬於 memory-based
memory-based 基本上就是純粹的計算,沒有什麼 Machine Learning 的成分
model-based 才是 Machine Learning 的範疇

User-based Collaborative Filtering

        item_a  item_b  item_c
user_1  2       -       3
user_2  5       2       -
user_3  3       3       1
user_4  -       2       2
# the algorithm from "Mahout in Action"
for every other user w
  compute a similarity s between u and w
  retain the top users, ranked by similarity, as a neighborhood n

for every item i that some user in n has a preference for,
      but that u  has no preference for yet
  for every other user v in n that has a preference for i
    compute a similarity s between u and  v
    incorporate v's preference for i, weighted by s, into a running average

user-based 考慮的是 user 和 user 之間的相似程度

給定一個用戶 A
計算用戶 A 跟其他所有用戶的相似度
找出最相似的 m 個用戶
再找出這些用戶有評分但是用戶 A 沒有評分的物品(也可以額外限制至少要幾個用戶有評分過)
以「相似用戶的相似度」和「該用戶對該物品的評分」來加權算出用戶 A 對這些未評分物品的評分
最後推薦給 A 評分最高的 n 個物品

預測 user_4 對 item_a 的評分 =
(user_4_user_1_sim x user_1_item_a_rating + user_4_user_3_sim x user_3_item_a_rating) / (user_4_user_1_sim + user_4_user_3_sim)

user-based 的特點:

  • 適合 user 遠少於 item 的系統,相似度的計算量會較少
  • item 的時效性強、更多樣的系統,例如新聞、社交網站,適合用 user-based CF
  • 不容易給出推薦理由
  • 驚喜度較高

常用的相似度演算法:

  • Pearson Correlation Coefficient
  • Cosine Similarity
  • Adjusted Cosine Similarity(有些用戶傾向於對所有物品評高分或低分,這個計算方式可以消除這樣的影響)

ref:
https://www.safaribooksonline.com/library/view/mahout-in-action/9781935182689/kindle_split_013.html

Item-based Collaborative Filtering

        user_1  user_2  user_3  user_4
item_a  2       5       3       -
item_b  -       2       3       2
item_c  3       -       1       2
# the algorithm from "Mahout in Action"
for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average

item-based 考慮的是 item 和 item 之間的相似程度
item-based 用的還是跟 user-based CF 一模一樣的資料
而不是使用 item 本身的特徵(那個叫 content-based)

如果物品數比用戶數還少得多的話
可以事先計算好所有物品之間的相似度
給定一個用戶 A
找出用戶 A 的所有未評分物品
以「用戶 A 的已評分物品對該未評分物品的相似度」和「用戶 A 對已評分物品的評分」來加權算出用戶 A 對這些未評分物品的評分
最後推薦給用戶 A 評分最高的 n 個物品

預測 user_4 對 item_a 的評分 =
(item_b_item_a_sim x user_4_item_b_rating + item_c_item_a_sim x user_4_item_c_rating) / (item_b_item_a_sim + item_c_item_a_sim)

也可以無視用戶 A 的歷史評分資料(或是根本沒有用戶 A 的歷史資料)
直接推薦跟某個物品最相似的 n 個物品

item-based 的特點:

  • 適合 item 遠少於 user 的系統,相似度的計算量會較少
  • 購物、電影、音樂、書籍等系統,用戶的興趣相對固定,適合用 item-based CF
  • 只會推薦類似的東西,驚喜度和多樣性較低
  • 通常只有在用戶量比較小的時候才需要頻繁地重新計算物品之間的相似度,隨著用戶量越大,物品的相似度會趨於穩定

ref:
https://ashokharnal.wordpress.com/2014/12/18/worked-out-example-item-based-collaborative-filtering-for-recommenmder-engine/
http://blog.csdn.net/huagong_adu/article/details/7362908

Slope One Recommender

        item_a  item_b  item_c
user_1  5       3       2
user_2  3       4       -
user_3  -       2       5
# the algorithm from "Mahout in Action"
for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages

因為 memory-based collaborative filtering 的其中一個問題是數據量很大時計算量也會很可觀
所有就有人提出 Slope One 這種簡單粗暴的演算法來
雖然 Slope One 還是得計算所有物品兩兩之間的平均差異

Slope One 假設任兩個物品之間的評分都是一個 y = mx + b 而且 m = 1(斜率為 1)的線性關係
item_a 平均比 item_b 多 (2 + (-1)) / 2 = 0.5
item_a 平均比 item_c 多 (5 - 2) / 1 = 3
如果用 user_3 對 item_b 的評分來預測他對 item_a 的評分會是 2 + 0.5 = 2.5
如果用 user_3 對 item_c 的評分來預測他對 item_a 的評分會是 5 + 3 = 8
通常會用有多少人同時評分來加權多個評分

預測 user_3 對 item_a 的評分 =
((同時對 item_a 和 item_b 評分的人數 x user_3 用 item_b 對 item_a 的預測評分) + (同時對 item_a 和 item_c 評分的人數 x user_3 用 item_c 對 item_a 的預測評分)) / (同時對 item_a 和 item_b 評分的人數 + 同時對 item_a 和 item_c 評分的人數)
((2 x 2.5) + (1 x 8)) / (2 + 1) = 4.33

ref:
https://en.wikipedia.org/wiki/Slope_One

Parallel tasks in Python: concurrent.futures

Parallel tasks in Python: concurrent.futures

TL;DR: concurrent.futures is well suited to Embarrassingly Parallel tasks. You could write concurrent code with a simple for loop.

executor.map() runs the same function multiple times with different parameters and executor.submit() accepts any function with arbitrary parameters.

Install

concurrent.futures is part of the standard library in Python 3.2+. If you're using an older version of Python, you need to install the futures package.

$ pip install futures

ref:
https://docs.python.org/3/library/concurrent.futures.html

executor.map()

You should use the ProcessPoolExecutor for CPU intensive tasks and the ThreadPoolExecutor is suited for network operations or I/O. The ProcessPoolExecutor uses the multiprocessing module, which is not affected by GIL (Global Interpreter Lock) but also means that only picklable objects can be executed and returned.

In Python 3.5+, executor.map() receives an optional argument: chunksize. For very long iterables, using a large value for chunksize can significantly improve performance compared to the default size of 1. With ThreadPoolExecutor, chunksize has no effect.

from concurrent.futures import ThreadPoolExecutor
import time

import requests

def fetch(a):
    url = 'http://httpbin.org/get?a={0}'.format(a)
    r = requests.get(url)
    result = r.json()['args']
    return result

start = time.time()

# if max_workers is None or not given, it will default to the number of processors, multiplied by 5
with ThreadPoolExecutor(max_workers=None) as executor:
    for result in executor.map(fetch, range(42)):
        print('response: {0}'.format(result))

print('time: {0}'.format(time.time() - start))

You might want to change the value of max_workers to 1 and observe the difference.

ref:
https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures
https://www.blog.pythonlibrary.org/2016/08/03/python-3-concurrency-the-concurrent-futures-module/
http://masnun.com/2016/03/29/python-a-quick-introduction-to-the-concurrent-futures-module.html

executor.submit()

executor.submit() returns a Future object. A Future is basically an object that encapsulates an asynchronous execution of a function that will finish (or raise an exception) in the future.

The main difference between map and as_completed is that map returns the results in the order in which you pass iterables. On the other hand, the first result from the as_completed function is from whichever future completed first. Besides, iterating a map() returns results of futures; iterating a as_completed(futures) returns futures themselves.

from concurrent.futures import ThreadPoolExecutor, as_completed
import time

import requests

def fetch(url, timeout):
    r = requests.get(url, timeout=timeout)
    data = r.json()['args']
    return data

start = time.time()

with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {}
    for i in range(42):
        url = 'https://httpbin.org/get?i={0}'.format(i)
        future = executor.submit(fetch, url, 60)
        futures[future] = url

    for future in as_completed(futures):
        url = futures[future]
        try:
            data = future.result()
        except Exception as exc:
            print(exc)
        else:
            print('fetch {0}, get {1}'.format(url, data))

print('time: {0}'.format(time.time() - start))

ref:
https://docs.python.org/3/library/concurrent.futures.html#future-objects

Discussion

ref:
https://news.ycombinator.com/item?id=16737129

Machine Learning glossary 常見名詞解釋

Machine Learning glossary 常見名詞解釋

Anomaly Detection 異常偵測

把一些異常值從 dataset 中挑出來

Anscombe's Quartet 安斯庫姆四重奏

四張圖表表示四組基本的統計特性一致的數據,但是各自畫出來的圖表完全不同
主要是在說統計方法有其侷限和離群值對統計的影響之大
還有就是分析數據前應該要先畫圖表

ref:
https://www.wikiwand.com/zh-tw/%E5%AE%89%E6%96%AF%E5%BA%93%E5%A7%86%E5%9B%9B%E9%87%8D%E5%A5%8F

Association Rule 關聯規則

找出資料之間的隱含關係
例如知名的啤酒與尿布

Best Subset Selection

是一種 model selection 的方法

Cost Function / Loss Function 損失函數

大部分的 machine learning 模型都是在計算 cost function
想辦法求出讓 cost function 最小化或最大化的各項參數
常用的 cost function 有 Mean Squared Error (MSE), Root Mean Squared Error (RMSE) 等

ref:
https://ml.berkeley.edu/blog/2016/11/06/tutorial-1/

Cross-validation 交叉驗證

cross-validation 常常用來做 hyperparameter tuning
最主流的方式是 k-fold
假設 k 是 3
你先把整個 dataset 拆分成 training set 和 test set
通常你會有很多組想要測試的超參數
則每一組超參數都會經歷以下過程:

  • 把 training set 分成三等份
  • 先用 1 + 2 訓練模型,用 3 來評估
  • 再用 1 + 3 訓練模型,用 2 來評估
  • 再用 2 + 3 訓練模型,用 1 來評估
  • 然後對三組評估結果取平均作為這組超參數的分數

等到測試過所有超參數的組合之後
用表現最好的那一組超參數對整個 training set 再訓練一次
得到最終的模型
這時候再用 test set 來做最終模型的評估

ref:
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation

不過如果你的 dataset 真的沒那麼大
也是可以對整個 dataset 對 cross-validation
就不要先拆分 training set 和 test set 了

ref:
https://stats.stackexchange.com/questions/148688/cross-validation-with-test-data-set

還有一種方式是 leave-one-out 或 leave-n-out
每次只用一個 sample 來驗證
其餘的都用來訓練模型
直到每個 sample 都被用來驗證過

Curse of Dimensionality 維度災難

當數據的維度(feature 數)超過某個程度之後
導致計算的時間過久、記憶體用量過大(因為是指數型增加)
也必然會造成數據稀疏
特徵越多也可能造成 overfitting

ref:
https://www.quora.com/What-is-the-curse-of-dimensionality
http://stats.stackexchange.com/questions/169156/explain-curse-of-dimensionality-to-a-child

Decision Boundary 決策邊界

A smoother boundary corresponds to a simpler model.

每個特徵表示為一個維度
decision boundary 就是能夠把整個特徵空間裡的 dataset 正確劃分的一條邊界
這個邊界可能是 linear 或 non-linear

Dimensionality Reduction 降維

算是 unsupervised learning 的一種(transformations of the dataset)
可以分成 feature selection 和 feature extraction
在不喪失太多資訊的前提下減少 features 的維度
換個說法是嘗試用更少的維度來表示這個 dataset
維度減少的好處是提升計算效率和更容易進行 visualization

ref:
https://www.wikiwand.com/en/Dimensionality_reduction

從一堆 features 中選擇最有用的 features
稱為 feature selection
常見的方法有 Greedy forward selection

把原本高維度的 features 轉換成較少維度的 features
稱為 feature extraction
轉換之後已經不是原本的那些 features 了
常見的方法有 Principal Component Analysis (PCA)、Non-negative Matrix Factorization (NMF)

ref:
https://www.wikiwand.com/en/Feature_engineering
https://www.wikiwand.com/en/Feature_selection

Ensemble Learning 組合式學習

就是指結合多種演算法的 machine learning
例如 Random Forest(decision trees + bagging)

常見的 ensemble methods 有:
bagging (aka bootstrap aggregating)
boosting

Error 誤差 / Bias 偏差 / Variance 方差

Error = Bias + Variance 是指整個模型的準確度

Bias 是指預測值和真實值之間的差距,表示模型的精準度(反映的是模型在樣本上的輸出與真實值之間的誤差)
偏差越大,越偏離真實數據
因為模型太簡單而帶來的預測不準確 >> high bias

Variance 是指預測值的變化範圍,表示模型的穩定性(反映的是模型每一次輸出結果與模型輸出期望之間的誤差)
方差越大,數據的分佈越分散
因為模型太複雜而帶來的更大的空間變化和不確定性 >> high variance

ref:
https://www.zhihu.com/question/20448464 有圖
https://www.zhihu.com/question/27068705

Feature Engineering 特徵工程

就是找出(或是創造出)能夠讓演算法運作得更好的 features 的過程
也可能是整合、轉換多個相關的 features 變成一個新的 feature
通常會避免使用過多的 features 餵給演算法

Forward Stepwise Selection

一次增加一個 feature 來訓練 model
每次都計算準確率
直到所有 features 都用到

Backwards Stepwise Selection 就是反過來

Generalization 泛化

If a model is able to make accurate predictions on unseen data, we say it is able to generalize from the training set to the test set.

就是指 model 預測 unseen data 的能力
例如一個 overfitting 的 model,它的泛化能力就不好

ref:
https://www.quora.com/What-is-generalization-in-machine-learning

Gradient Descent 梯度下降

是一種找出最小的 cost function 的演算法
也就是找出最好的 model parameters

Greedy Feature Selection

一次只用一個 feature 來訓練 model

In greedy feature selection we choose one feature, train a model and evaluate the performance of the model on a fixed evaluation metric. We keep adding and removing features one-by-one and record performance of the model at every step. We then select the features which have the best evaluation score.

Hyperparameter 超參數

就是在訓練 model 時輸入的參數,那些 model 沒辦法自己學到,必須人工指定的參數。通常會透過 grid search 和 cross-validation 的方式選出最合適的參數。

Kernel Methods

kernel function 會是一個距離函數

linear kernel 是最簡單的一種 kernel function
其實就是兩個 input 的 dot product

ref:
https://www.zhihu.com/question/30371867

Linear Separability 線性可分

當你有一堆 data points
你能夠畫出一條「直線」來區分這些點時
就可以說是 linearly separable
反而則是 linearly inseparable

Logistic Curve

就是一條長得像頭尾被拉長拉扁的 S 的曲線

ref:
https://www.stat.ubc.ca/~rollin/teach/643w04/lec/node46.html

Missing Value Imputation(缺失值填充)

針對那些沒有值的欄位,可能是用中位數、平均值或是最常見的值之類的資料填進去
也稱為 interpolation

Manifold Learning

是一種 non-linear dimensionality reduction 的方式
可以用在把高維度的 dataset 變成較低維度
主要用來做 visualization
常用的有 t-SNE

manifold learning 通常用在 exploratory data analysis
不像 PCA 那樣,會把結果用於 supervised learning 的輸入

ref:
http://scikit-learn.org/stable/modules/manifold.html
https://www.wikiwand.com/en/Nonlinear_dimensionality_reduction

Normalization 歸一化、Standarization 標準化

屬於 preprocessing 的一部分
統一各個特徵的數值範圍
對很多演算法來說這個步驟是必要的

例如:
特徵一是距離,單位是公尺,值的範圍是 10 ~ 3000
特徵二是樓層,值的範圍是 1 ~ 14
為了避免尺度不同造成誤導
需要 rescaling
把各種尺度的數值統一表示成 0 ~ 1 之間的數字
稱為 normalization 歸一化

還有另一種統計學常用的方法,是把數值轉換成 z-scores
使所有數據的平均值為 0、標準差為 1
稱為 standarization 標準化

ref:
https://www.quora.com/What-is-the-difference-between-normalization-standardization-and-regularization-for-data
http://sobuhu.com/ml/2012/12/29/normalization-regularization.html

Predictors

就是 features

Principal component analysis (PCA) 主成份分析

主成分分析,是一种分析、简化数据集的技术。用于减少数据集的维数,同时保持数据集中的对方差贡献最大的特征。

用來 reduce dimensionality(減少 dataset 的維度數)
可以找出對 Variance 貢獻最大的特徵

Overfitting 過度擬合(過擬合)/ Underfitting 擬合不足(欠擬合)

overfitting 常常發生在 model 很複雜、有很多參數的時候
或是 dataset 裡有很多 noise 或 outlier
表現為在 training set 的準確率很高,但是在 testing set 的準確率卻很低
複雜模型 >> high variance / low bias >> overfitting

underfitting 通常發生在 model 太簡單的時候
表現為就算是在 training set 上的錯誤率就很高
簡單模型 >> high bias / low variance >> underfitting

ref:
http://www.csuldw.com/2016/02/26/2016-02-26-choosing-a-machine-learning-classifier/

Regularization 正規化、正則化

Regularization means explicitly restricting a model to avoid overfitting.

是一種防止 overfitting 的技巧
regularization 保留所有 features
但是降低或懲罰某些 features 對 model 預測值的影響
常見的方法有 L1 和 L2

L1 正則化是指權重向量 w 中各個元素的絕對值之和

L2
正則化是指權重向量 w 中各個元素的平方和然後再求平方根

ref:
https://zhuanlan.zhihu.com/p/25707761
http://blog.csdn.net/jinping_shi/article/details/52433975
http://blog.csdn.net/zouxy09/article/details/24971995

Resampling

在 classification 問題中
每一種 class 的數量差距很大
例如正樣本佔了 98%、負樣本佔了 2%
這就是所謂的不平衡的 dataset
解決的辦法之一是 resampling
主要可以分成 oversampling 和 undersampling(過採樣和欠採樣)

undersampling 是指減少多數類樣本的數量
例如隨機拿掉部分多數類樣本
直到正負樣本的數量相同
缺點是你可能也拿掉了 dataset 裡潛在的資訊

oversampling 指的是增加少數類樣本的數量
例如複製少數類樣本
讓正負樣本的數量盡可能相同
缺點顯而易見就是容易 overfitting
其他 oversampling 的方法還有 SMOTE (Synthetic Minority Over-sampling Technique)
合成新的少數類樣本
合成的策略是對每個少數類樣本 a
從它的最近鄰中隨機選一個樣本 b
然後在 a、b 之間的連線上隨機選一點作為新合成的少數類樣本

ref:
http://www.jiqizhixin.com/article/2499
http://www.algorithmdog.com/unbalance
https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis

Training set / Test set

把 dataset 分成 training set 和 test set
用 training set 來訓練模型
用 test set 來評估結果
這兩組數據必須是從原始的 dataset 裡「均勻取樣」(隨機)
常見的比例是 70/30
這種方式稱為 holdout

也可以分成 training set、validation set、test set
training set 用來訓練模型,validation set 用來選擇模型(調整超參數),testing set 用在最終模型的評估
常見的比例是 50/25/25

基本上你的 test set 只能用來評估最終模型
不能用 test set 去訓練模型或是交叉驗證
對你的 model 來說 test set 就是一組 unseen 的資料
所以 test set 的評估結果才可以視為 model 上線後對真實資料的預測能力
如果你的 model 在 validation set 表現不錯
但是在 test set 的表現很差
那就是 overfitting 了

ref:
https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set
https://www.jiqizhixin.com/articles/a62fc871-6366-402b-b32f-f9a3f17a566b
https://mp.weixin.qq.com/s/W7wpxHoC2F5DHCUO7ES1cw

Convert utf8 tables to utf8mb4 in MySQL

Convert utf8 tables to utf8mb4 in MySQL

Change Schema

# For each database:
ALTER DATABASE
    database_name
    CHARACTER SET = utf8mb4
    COLLATE = utf8mb4_unicode_ci;

# For each table:
ALTER TABLE
    table_name
    CONVERT TO CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;

# For each column:
ALTER TABLE
    table_name
    CHANGE column_name column_name
    VARCHAR(191)
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;

ALTER TABLE
    svcomments_svcomment
    CHANGE comment comment 
    longtext
    CHARACTER SET utf8mb4
    COLLATE utf8_general_ci;

SHOW VARIABLES WHERE Variable_name LIKE 'character%' OR Variable_name LIKE 'collation%';

# Don’t blindly copy-paste this!
# The exact statement depends on the column type, maximum length, and other properties.
# The above line is just an example for a `VARCHAR` column.

ref:
https://coderwall.com/p/pns4pa/setting-up-unicode-defaults-for-mariadb-or-mysql
https://mathiasbynens.be/notes/mysql-utf8mb4
http://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci
http://dba.stackexchange.com/questions/8239/how-to-easily-convert-utf8-tables-to-utf8mb4-in-mysql-5-5

Application Settings

Take a Django project as an example.

in settings.py

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        ...
        'OPTIONS': {
            'charset': 'utf8mb4',
        },
        'TEST': {
            'CHARSET': 'utf8mb4',
            'COLLATION': 'utf8mb4_general_ci',
        },
    },
}

ref:
https://tzangms.com/use-emoji-in-mysql-with-django/