MySQL system error codes

Print all OS error codes and MySQL error codes using the perror command.

$ for i in {1..201}; do perror "$i"; done

OS error code   1:  Operation not permitted
OS error code   2:  No such file or directory
OS error code   3:  No such process
OS error code   4:  Interrupted system call
OS error code   5:  Input/output error
OS error code   6:  No such device or address
OS error code   7:  Argument list too long
OS error code   8:  Exec format error
OS error code   9:  Bad file descriptor
OS error code  10:  No child processes
OS error code  11:  Resource temporarily unavailable
OS error code  12:  Cannot allocate memory
OS error code  13:  Permission denied
OS error code  14:  Bad address
OS error code  15:  Block device required
OS error code  16:  Device or resource busy
OS error code  17:  File exists
OS error code  18:  Invalid cross-device link
OS error code  19:  No such device
OS error code  20:  Not a directory
OS error code  21:  Is a directory
OS error code  22:  Invalid argument
OS error code  23:  Too many open files in system
OS error code  24:  Too many open files
OS error code  25:  Inappropriate ioctl for device
OS error code  26:  Text file busy
OS error code  27:  File too large
OS error code  28:  No space left on device
OS error code  30:  Read-only file system
OS error code  31:  Too many links
OS error code  32:  Broken pipe
OS error code  33:  Numerical argument out of domain
OS error code  34:  Numerical result out of range
OS error code  35:  Resource deadlock avoided
OS error code  36:  File name too long
OS error code  37:  No locks available
OS error code  38:  Function not implemented
OS error code  39:  Directory not empty
OS error code  40:  Too many levels of symbolic links
OS error code  42:  No message of desired type
OS error code  43:  Identifier removed
OS error code  44:  Channel number out of range
OS error code  45:  Level 2 not synchronized
OS error code  46:  Level 3 halted
OS error code  47:  Level 3 reset
OS error code  48:  Link number out of range
OS error code  49:  Protocol driver not attached
OS error code  50:  No CSI structure available
OS error code  51:  Level 2 halted
OS error code  52:  Invalid exchange
OS error code  53:  Invalid request descriptor
OS error code  54:  Exchange full
OS error code  55:  No anode
OS error code  56:  Invalid request code
OS error code  57:  Invalid slot
OS error code  59:  Bad font file format
OS error code  60:  Device not a stream
OS error code  61:  No data available
OS error code  62:  Timer expired
OS error code  63:  Out of streams resources
OS error code  64:  Machine is not on the network
OS error code  65:  Package not installed
OS error code  66:  Object is remote
OS error code  67:  Link has been severed
OS error code  68:  Advertise error
OS error code  69:  Srmount error
OS error code  70:  Communication error on send
OS error code  71:  Protocol error
OS error code  72:  Multihop attempted
OS error code  73:  RFS specific error
OS error code  74:  Bad message
OS error code  75:  Value too large for defined data type
OS error code  76:  Name not unique on network
OS error code  77:  File descriptor in bad state
OS error code  78:  Remote address changed
OS error code  79:  Can not access a needed shared library
OS error code  80:  Accessing a corrupted shared library
OS error code  81:  .lib section in a.out corrupted
OS error code  82:  Attempting to link in too many shared libraries
OS error code  83:  Cannot exec a shared library directly
OS error code  84:  Invalid or incomplete multibyte or wide character
OS error code  85:  Interrupted system call should be restarted
OS error code  86:  Streams pipe error
OS error code  87:  Too many users
OS error code  88:  Socket operation on non-socket
OS error code  89:  Destination address required
OS error code  90:  Message too long
OS error code  91:  Protocol wrong type for socket
OS error code  92:  Protocol not available
OS error code  93:  Protocol not supported
OS error code  94:  Socket type not supported
OS error code  95:  Operation not supported
OS error code  96:  Protocol family not supported
OS error code  97:  Address family not supported by protocol
OS error code  98:  Address already in use
OS error code  99:  Cannot assign requested address
OS error code 100:  Network is down
OS error code 101:  Network is unreachable
OS error code 102:  Network dropped connection on reset
OS error code 103:  Software caused connection abort
OS error code 104:  Connection reset by peer
OS error code 105:  No buffer space available
OS error code 106:  Transport endpoint is already connected
OS error code 107:  Transport endpoint is not connected
OS error code 108:  Cannot send after transport endpoint shutdown
OS error code 109:  Too many references: cannot splice
OS error code 110:  Connection timed out
OS error code 111:  Connection refused
OS error code 112:  Host is down
OS error code 113:  No route to host
OS error code 114:  Operation already in progress
OS error code 115:  Operation now in progress
OS error code 116:  Stale NFS file handle
OS error code 117:  Structure needs cleaning
OS error code 118:  Not a XENIX named type file
OS error code 119:  No XENIX semaphores available
OS error code 120:  Is a named type file
OS error code 121:  Remote I/O error
OS error code 122:  Disk quota exceeded
OS error code 123:  No medium found
OS error code 124:  Wrong medium type
OS error code 125:  Operation canceled
OS error code 126:  Required key not available
OS error code 127:  Key has expired
OS error code 128:  Key has been revoked
OS error code 129:  Key was rejected by service
OS error code 130:  Owner died
OS error code 131:  State not recoverable
OS error code 132:  Operation not possible due to RF-kill
OS error code 133:  Memory page has hardware error
MySQL error code 120: Did not find key on read or update
MySQL error code 121: Duplicate key on write or update
MySQL error code 122: Internal (unspecified) error in handler
MySQL error code 123: Someone has changed the row since it was read (while the table was locked to prevent it)
MySQL error code 124: Wrong index given to function
MySQL error code 125: Undefined handler error 125
MySQL error code 126: Index file is crashed
MySQL error code 127: Record file is crashed
MySQL error code 128: Out of memory in engine
MySQL error code 129: Undefined handler error 129
MySQL error code 130: Incorrect file format
MySQL error code 131: Command not supported by database
MySQL error code 132: Old database file
MySQL error code 133: No record read before update
MySQL error code 134: Record was already deleted (or record file crashed)
MySQL error code 135: No more room in record file
MySQL error code 136: No more room in index file
MySQL error code 137: No more records (read after end of file)
MySQL error code 138: Unsupported extension used for table
MySQL error code 139: Too big row
MySQL error code 140: Wrong create options
MySQL error code 141: Duplicate unique key or constraint on write or update
MySQL error code 142: Unknown character set used in table
MySQL error code 143: Conflicting table definitions in sub-tables of MERGE table
MySQL error code 144: Table is crashed and last repair failed
MySQL error code 145: Table was marked as crashed and should be repaired
MySQL error code 146: Lock timed out; Retry transaction
MySQL error code 147: Lock table is full;  Restart program with a larger locktable
MySQL error code 148: Updates are not allowed under a read only transactions
MySQL error code 149: Lock deadlock; Retry transaction
MySQL error code 150: Foreign key constraint is incorrectly formed
MySQL error code 151: Cannot add a child row
MySQL error code 152: Cannot delete a parent row
MySQL error code 153: No savepoint with that name
MySQL error code 154: Non unique key block size
MySQL error code 155: The table does not exist in engine
MySQL error code 156: The table already existed in storage engine
MySQL error code 157: Could not connect to storage engine
MySQL error code 158: Unexpected null pointer found when using spatial index
MySQL error code 159: The table changed in storage engine
MySQL error code 160: There is no partition in table for the given value
MySQL error code 161: Row-based binlogging of row failed
MySQL error code 162: Index needed in foreign key constraint
MySQL error code 163: Upholding foreign key constraints would lead to a duplicate key error in some other table
MySQL error code 164: Table needs to be upgraded before it can be used
MySQL error code 165: Table is read only
MySQL error code 166: Failed to get next auto increment value
MySQL error code 167: Failed to set row auto increment value
MySQL error code 168: Unknown (generic) error from engine
MySQL error code 169: Record is the same
MySQL error code 170: It is not possible to log this statement
MySQL error code 171: The event was corrupt, leading to illegal data being read
MySQL error code 172: The table is of a new format not supported by this version
MySQL error code 173: The event could not be processed no other hanlder error happened
MySQL error code 174: Got a fatal error during initialzaction of handler
MySQL error code 175: File to short; Expected more data in file
MySQL error code 176: Read page with wrong checksum
MySQL error code 177: Too many active concurrent transactions
MySQL error code 178: Record not matching the given partition set
MySQL error code 179: Index column length exceeds limit
MySQL error code 180: Index corrupted
MySQL error code 181: Undo record too big
MySQL error code 182: Invalid InnoDB FTS Doc ID
MySQL error code 183: Table is being used in foreign key check
MySQL error code 184: Tablespace already exists
MySQL error code 185: Too many columns
MySQL error code 186: Row in wrong partition
MySQL error code 187: InnoDB is in read only mode
MySQL error code 188: FTS query exceeds result cache memory limit
MySQL error code 189: Temporary file write failure
MySQL error code 190: Operation not allowed when innodb_forced_recovery > 0
MySQL error code 191: Too many words in a FTS phrase or proximity search
MySQL error code 192: Foreign key cascade delete/update exceeds max depth
MySQL error code 193: Required Create option missing
MySQL error code 194: Out of memory in storage engine
MySQL error code 195: Table corrupted
MySQL error code 196: Query interrupted
MySQL error code 197: Tablespace cannot be accessed
MySQL error code 198: Tablespace is not empty
MySQL error code 199: Incorrect file name
MySQL error code 200: Operation is not allowed
MySQL error code 201: Compute generate value failed

ref:
http://man7.org/linux/man-pages/man3/perror.3.html
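
The OS portion of perror's output comes from the C library's strerror(). As a cross-check, a minimal Python sketch (assuming a Linux system, where the messages match the list above):

import os

# print the OS error message for codes 1 through 133, mirroring the
# OS portion of perror's output
for code in range(1, 134):
    print('OS error code {0}: {1}'.format(code, os.strerror(code)))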

Calculate the similarity of two vectors

scipy.spatial.distance
https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

sklearn.metrics
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Distance

Euclidean distance

from sklearn.metrics.pairwise import euclidean_distances

# the pairwise functions expect 2D arrays: one row per sample
euclidean_distances([[0, 0, 0, 0]], [[0, 0, 0, 0]])
# array([[ 0.]])

euclidean_distances([[1, 0, 1, 0]], [[1, 0, 1, 0]])
# array([[ 0.]])

euclidean_distances([[0, 1, 0, 1]], [[1, 0, 1, 0]])
# array([[ 2.]])

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html

Manhattan Distance

from sklearn.metrics.pairwise import manhattan_distances

manhattan_distances([[0, 0, 0, 0]], [[0, 0, 0, 0]])
# array([[ 0.]])

manhattan_distances([[1, 1, 1, 0]], [[1, 0, 0, 0]])
# array([[ 2.]])

manhattan_distances([[0, 1, 0, 1]], [[1, 0, 1, 0]])
# array([[ 4.]])

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.manhattan_distances.html

Similarity

Cosine similarity

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import cosine_distances
from sklearn.metrics.pairwise import pairwise_distances
from scipy.spatial.distance import pdist, squareform
import numpy as np

# these four expressions are equivalent (elementwise)
matrix = np.random.rand(5, 4)
assert np.allclose(cosine_similarity(matrix), 1 - cosine_distances(matrix))
assert np.allclose(cosine_similarity(matrix), 1 - pairwise_distances(matrix, metric='cosine'))
assert np.allclose(cosine_similarity(matrix), 1 - squareform(pdist(matrix, 'cosine')))

cosine_similarity([[0, 0, 0, 0]], [[0, 0, 0, 0]])
# array([[ 0.]])

cosine_similarity([[1, 0, 0, 0]], [[1, 0, 0, 0]])
# array([[ 1.]])

cosine_similarity([[1, 0, 1, 0]], [[0, 1, 0, 1]])
# array([[ 0.]])

cosine_similarity([[1, 0, 0, 1]], [[1, 0, 0, 0]])
# array([[ 0.70710678]])

cosine_similarity([[1, 0, 0, 1]], [[1, 0, 1, 0]])
# array([[ 0.5]])

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html

Jaccard similarity coefficient score

from sklearn.metrics import jaccard_similarity_score
# note: newer scikit-learn versions replace jaccard_similarity_score with jaccard_score

jaccard_similarity_score([0, 0, 0, 0], [0, 0, 0, 0])
# 1.0

jaccard_similarity_score([0, 0, 0, 0], [1, 0, 0, 0])
# 0.75

jaccard_similarity_score([1, 0, 0, 0], [1, 0, 0, 0])
# 1.0

jaccard_similarity_score([1, 0, 1, 0], [0, 1, 0, 1])
# 0.0

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html

http://datascience.stackexchange.com/questions/5121/applications-and-differences-for-jaccard-similarity-and-cosine-similarity

Log-Likelihood similarity

TODO

Pearson correlation coefficient

It has a value between +1 and −1 inclusive, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. Only calculate the Pearson correlation between two users on the items they have both rated, and only when the number of items in common is greater than 1, preferably more than 5 or 10.

For high-dimensional binary attributes, the Pearson correlation coefficient and cosine similarity
perform better than the Jaccard similarity coefficient score.

from scipy.stats import pearsonr

pearsonr([1, 0, 1, 1], [0, 0, 0, 0])
# (nan, 1.0)

pearsonr([1, 0, 1, 1], [1, 0, 0, 0])
# (0.33333333333333331, 0.66666666666666607)

pearsonr([1, 0, 1, 0], [0, 1, 0, 1])
# (-1.0, 0.0)

ref:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
http://stackoverflow.com/questions/11429604/how-is-nan-handled-in-pearson-correlation-user-user-similarity-matrix-in-a-recom
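
A minimal sketch of the "commonly rated items" rule above, assuming two users' ratings are stored in plain dicts (the data is made up for illustration):

from scipy.stats import pearsonr

user_a = {'item_1': 5, 'item_2': 3, 'item_3': 4, 'item_4': 4}
user_b = {'item_2': 4, 'item_3': 2, 'item_4': 5, 'item_5': 1}

# only use the items both users have rated
common = sorted(set(user_a) & set(user_b))
if len(common) > 1:
    r, p_value = pearsonr(
        [user_a[i] for i in common],
        [user_b[i] for i in common],
    )
    print(r)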

Dissimilarity

Dice dissimilarity

from scipy.spatial.distance import dice
import numpy as np

v1 = np.array([0, 0, 0, 0])
v2 = np.array([0, 0, 0, 0])

try:
    # dice() returns the dissimilarity, so 1 - dice() is the similarity
    sim = 1.0 - dice(v1.astype(bool), v2.astype(bool))
except ZeroDivisionError:
    # two all-zero vectors share no set bits to compare
    sim = 0.0

ref:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.dice.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.kulsinski.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.sokalsneath.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.yule.html

Recommender System: Collaborative Filtering

The dataset is a utility matrix of m users' ratings on n items.
Since usually only some users and some items have ratings,
it is a sparse matrix.
The goal is to use this sparse data to predict users' ratings for the items they have not rated yet.
Besides ratings, the data may also be likes (and dislikes), purchases, page views, and so on,
and ratings further divide into explicit (active) and implicit (passive) ones.

Drawbacks of CF:

  • Without a user's historical data, no recommendation can be made at all
  • Both user-based and item-based CF consume a lot of computing resources
  • Most users' ratings cover only a tiny fraction of all the data, so the matrix is very sparse and similar users or items are hard to find
  • There is a Matthew effect: the more popular an item, the more easily it gets recommended, so popular items are usually down-weighted

CF mainly falls into two categories: memory-based and model-based.
User-based and item-based collaborative filtering are memory-based.
Memory-based CF is essentially pure computation, with little machine learning in it;
model-based CF is where machine learning actually comes in.

User-based Collaborative Filtering

        item_a  item_b  item_c
user_1  2       -       3
user_2  5       2       -
user_3  3       3       1
user_4  -       2       2

# the algorithm from "Mahout in Action"
for every other user w
  compute a similarity s between u and w
  retain the top users, ranked by similarity, as a neighborhood n

for every item i that some user in n has a preference for,
      but that u has no preference for yet
  for every other user v in n that has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average

user-based CF considers the similarity between users.

Given a user A:
compute the similarity between A and every other user
pick the m most similar users
find the items those users have rated but A has not (optionally requiring that at least a few users have rated each item)
estimate A's rating for each of those unrated items as an average of the similar users' ratings, weighted by their similarity to A
finally, recommend the n highest-predicted items to A

Predicted rating of user_4 for item_a =
(user_4_user_1_sim x user_1_item_a_rating + user_4_user_3_sim x user_3_item_a_rating) / (user_4_user_1_sim + user_4_user_3_sim)
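
A minimal sketch of this weighted average, with made-up similarity values for illustration:

# (similarity of user_4 to the neighbor, the neighbor's rating for item_a)
neighbors = [
    (0.8, 2),  # user_1
    (0.5, 3),  # user_3
]

prediction = (sum(sim * rating for sim, rating in neighbors) /
              sum(sim for sim, _ in neighbors))
print(prediction)  # (0.8 * 2 + 0.5 * 3) / (0.8 + 0.5) ≈ 2.38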

Characteristics of user-based CF:

  • Suited to systems with far fewer users than items, since computing similarities is cheaper
  • Systems with time-sensitive, diverse items, such as news or social networking sites, fit user-based CF
  • Hard to give a reason for a recommendation
  • Higher serendipity

Commonly used similarity measures:

  • Pearson Correlation Coefficient
  • Cosine Similarity
  • Adjusted Cosine Similarity (some users tend to rate everything high or everything low; this variant removes that effect)

ref:
https://www.safaribooksonline.com/library/view/mahout-in-action/9781935182689/kindle_split_013.html

Item-based Collaborative Filtering

        user_1  user_2  user_3  user_4
item_a  2       5       3       -
item_b  -       2       3       2
item_c  3       -       1       2

# the algorithm from "Mahout in Action"
for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average

item-based CF considers the similarity between items.
It uses exactly the same data as user-based CF,
not the items' own features (using those is called content-based).

If there are far fewer items than users,
all item-item similarities can be precomputed.
Given a user A:
find all items A has not rated yet
estimate A's rating for each unrated item as an average of A's ratings on rated items, weighted by each rated item's similarity to the unrated one
finally, recommend the n highest-predicted items to A

Predicted rating of user_4 for item_a =
(item_b_item_a_sim x user_4_item_b_rating + item_c_item_a_sim x user_4_item_c_rating) / (item_b_item_a_sim + item_c_item_a_sim)
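
The same weighted average for the item-based case, again with made-up similarity values:

# (similarity of the rated item to item_a, user_4's rating for that item)
rated_items = [
    (0.6, 2),  # item_b
    (0.3, 2),  # item_c
]

prediction = (sum(sim * rating for sim, rating in rated_items) /
              sum(sim for sim, _ in rated_items))
print(prediction)  # (0.6 * 2 + 0.3 * 2) / (0.6 + 0.3) = 2.0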

You can also ignore user A's rating history (or have none at all)
and simply recommend the n items most similar to a given item.

Characteristics of item-based CF:

  • Suited to systems with far fewer items than users, since computing similarities is cheaper
  • Shopping, movie, music, and book systems, where users' interests are relatively stable, fit item-based CF
  • Only recommends similar things, so serendipity and diversity are lower
  • Item-item similarities usually need frequent recomputation only while the user base is small; as it grows, the similarities stabilize

ref:
https://ashokharnal.wordpress.com/2014/12/18/worked-out-example-item-based-collaborative-filtering-for-recommenmder-engine/
http://blog.csdn.net/huagong_adu/article/details/7362908

Slope One Recommender

        item_a  item_b  item_c
user_1  5       3       2
user_2  3       4       -
user_3  -       2       5

# the algorithm from "Mahout in Action"
for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages

Because one problem with memory-based collaborative filtering is that the amount of computation becomes considerable on large datasets,
Slope One was proposed as a simple, blunt algorithm,
although it still has to compute the average rating difference between every pair of items.

Slope One assumes the ratings of any two items follow a linear relationship y = mx + b with m = 1 (a slope of 1).
item_a is rated on average (2 + (-1)) / 2 = 0.5 higher than item_b
item_a is rated on average (5 - 2) / 1 = 3 higher than item_c
Predicting user_3's rating for item_a from their rating for item_b gives 2 + 0.5 = 2.5
Predicting it from their rating for item_c gives 5 + 3 = 8
The per-item predictions are usually weighted by how many users rated both items.

Predicted rating of user_3 for item_a =
((number of users who rated both item_a and item_b x the prediction via item_b) + (number of users who rated both item_a and item_c x the prediction via item_c)) / (number of users who rated both item_a and item_b + number of users who rated both item_a and item_c)
((2 x 2.5) + (1 x 8)) / (2 + 1) = 4.33
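
A minimal sketch of this worked example, with the average differences and co-rating counts from the table hard-coded:

# (average difference item_a - item_x, users who rated both, user_3's rating for item_x)
via_items = [
    (0.5, 2, 2),  # via item_b
    (3.0, 1, 5),  # via item_c
]

prediction = (sum(count * (rating + diff) for diff, count, rating in via_items) /
              sum(count for _, count, _ in via_items))
print(prediction)  # ((2 x 2.5) + (1 x 8)) / (2 + 1) ≈ 4.33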

ref:
https://en.wikipedia.org/wiki/Slope_One

Parallel tasks in Python: concurrent.futures

TL;DR: concurrent.futures is well suited to embarrassingly parallel tasks: you can write concurrent code with little more than a simple for loop.

executor.map() runs the same function multiple times with different parameters and executor.submit() accepts any function with arbitrary parameters.

Install

concurrent.futures is part of the standard library in Python 3.2+. If you're using an older version of Python, you need to install the futures package.

$ pip install futures

ref:
https://docs.python.org/3/library/concurrent.futures.html

executor.map()

Use ProcessPoolExecutor for CPU-intensive tasks and ThreadPoolExecutor for network operations or I/O. ProcessPoolExecutor uses the multiprocessing module, which is not affected by the GIL (Global Interpreter Lock), but it also means that only picklable objects can be executed and returned.

In Python 3.5+, executor.map() accepts an optional chunksize argument. For very long iterables, a large chunksize can significantly improve performance compared to the default size of 1; with ThreadPoolExecutor, chunksize has no effect. A chunksize sketch follows the example below.

from concurrent.futures import ThreadPoolExecutor
import time

import requests

def fetch(a):
    url = 'http://httpbin.org/get?a={0}'.format(a)
    r = requests.get(url)
    result = r.json()['args']
    return result

start = time.time()

# if max_workers is None or not given, it will default to the number of processors, multiplied by 5
with ThreadPoolExecutor(max_workers=None) as executor:
    for result in executor.map(fetch, range(42)):
        print('response: {0}'.format(result))

print('time: {0}'.format(time.time() - start))

You might want to change the value of max_workers to 1 and observe the difference.
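
A minimal sketch of chunksize with ProcessPoolExecutor, assuming a trivially CPU-bound function; note that the function must be defined at module level so it can be pickled:

from concurrent.futures import ProcessPoolExecutor

def square(n):
    return n * n

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        # hand work to each worker in batches of 1000 instead of one item at a time
        results = list(executor.map(square, range(100000), chunksize=1000))
    print(results[:5])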

ref:
https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures
https://www.blog.pythonlibrary.org/2016/08/03/python-3-concurrency-the-concurrent-futures-module/
http://masnun.com/2016/03/29/python-a-quick-introduction-to-the-concurrent-futures-module.html

executor.submit()

executor.submit() returns a Future object. A Future is basically an object that encapsulates an asynchronous execution of a function that will finish (or raise an exception) in the future.

The main difference between map() and as_completed() is that map() returns results in the order in which you passed the iterables, while as_completed() yields whichever future completes first. Besides, iterating over map()'s return value gives the results of the futures, while iterating over as_completed(futures) gives the futures themselves.

from concurrent.futures import ThreadPoolExecutor, as_completed
import time

import requests

def fetch(url, timeout):
    r = requests.get(url, timeout=timeout)
    data = r.json()['args']
    return data

start = time.time()

with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {}
    for i in range(42):
        url = 'https://httpbin.org/get?i={0}'.format(i)
        future = executor.submit(fetch, url, 60)
        futures[future] = url

    for future in as_completed(futures):
        url = futures[future]
        try:
            data = future.result()
        except Exception as exc:
            print(exc)
        else:
            print('fetch {0}, get {1}'.format(url, data))

print('time: {0}'.format(time.time() - start))

ref:
https://docs.python.org/3/library/concurrent.futures.html#future-objects

Discussion

ref:
https://news.ycombinator.com/item?id=16737129

Machine Learning glossary: common terms explained

Anomaly Detection

Picking anomalous values (outliers) out of a dataset.

Anscombe's Quartet

Four datasets whose basic statistical properties are nearly identical, yet whose plots look completely different.
The point is that statistical summaries have their limits, outliers can distort them badly,
and you should plot the data before analyzing it.

ref:
https://www.wikiwand.com/zh-tw/%E5%AE%89%E6%96%AF%E5%BA%93%E5%A7%86%E5%9B%9B%E9%87%8D%E5%A5%8F

Association Rule

Finding hidden relationships in the data,
e.g. the famous beer-and-diapers story.

Best Subset Selection

A method of model selection: fit a separate model for every possible subset of features and keep the one that performs best.

Cost Function / Loss Function

Most machine learning models come down to computing a cost function
and finding the parameters that minimize (or maximize) it.
Common cost functions include Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).

ref:
https://ml.berkeley.edu/blog/2016/11/06/tutorial-1/

Cross-validation

Cross-validation is often used for hyperparameter tuning.
The most common scheme is k-fold.
Suppose k is 3:
you first split the whole dataset into a training set and a test set.
You usually have many hyperparameter combinations you want to try,
and each combination goes through the following:

  • split the training set into three equal folds
  • train the model on folds 1 + 2, evaluate on fold 3
  • then train on folds 1 + 3, evaluate on fold 2
  • then train on folds 2 + 3, evaluate on fold 1
  • average the three evaluation results as this combination's score

After every combination has been tried,
retrain on the whole training set with the best-performing combination
to get the final model,
and only then evaluate that final model on the test set.

ref:
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation

That said, if your dataset really isn't that big,
you can also run cross-validation on the whole dataset,
skipping the initial training/test split.

ref:
https://stats.stackexchange.com/questions/148688/cross-validation-with-test-data-set

Another scheme is leave-one-out (or leave-n-out):
use a single sample for validation
and all the rest for training,
until every sample has been used for validation once. A sketch of the k-fold flow follows.
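
A minimal sketch of the whole flow with scikit-learn, assuming the iris dataset and a logistic regression purely for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# try each hyperparameter combination with 3-fold cross-validation
search = GridSearchCV(LogisticRegression(), {'C': [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_train, y_train)  # refits the best combination on the whole training set

print(search.best_params_)
print(search.score(X_test, y_test))  # final evaluation on the held-out test set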

Curse of Dimensionality

Once the number of dimensions (features) in the data passes a certain point,
computation takes too long and memory usage becomes excessive (both grow exponentially),
the data inevitably becomes sparse,
and having too many features can also cause overfitting.

ref:
https://www.quora.com/What-is-the-curse-of-dimensionality
http://stats.stackexchange.com/questions/169156/explain-curse-of-dimensionality-to-a-child

Decision Boundary

A smoother boundary corresponds to a simpler model.

With each feature as one dimension,
the decision boundary is a boundary that correctly partitions the dataset in feature space;
it may be linear or non-linear.

Dimensionality Reduction

A kind of unsupervised learning (transformations of the dataset),
divided into feature selection and feature extraction.
The idea is to reduce the number of feature dimensions without losing too much information,
or, put differently, to try to represent the dataset with fewer dimensions.
Fewer dimensions mean better computational efficiency and easier visualization.

ref:
https://www.wikiwand.com/en/Dimensionality_reduction

Choosing the most useful features out of a set of features
is called feature selection;
a common method is greedy forward selection.

Transforming high-dimensional features into fewer dimensions
is called feature extraction.
The transformed features are no longer the original ones;
common methods include Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF).

ref:
https://www.wikiwand.com/en/Feature_engineering
https://www.wikiwand.com/en/Feature_selection

Ensemble Learning

Machine learning that combines multiple algorithms,
e.g. Random Forest (decision trees + bagging).

Common ensemble methods include:
bagging (aka bootstrap aggregating)
boosting

Error / Bias / Variance

Error = Bias + Variance describes the overall accuracy of the model.

Bias is the gap between the predictions and the true values, i.e. the model's accuracy (it reflects the error between the model's outputs on the samples and the ground truth).
The larger the bias, the further from the real data.
Prediction inaccuracy caused by a model that is too simple >> high bias.

Variance is the spread of the predictions, i.e. the model's stability (it reflects the error between each of the model's outputs and the model's expected output).
The larger the variance, the more scattered the predictions.
Greater variation and uncertainty caused by a model that is too complex >> high variance.

ref:
https://www.zhihu.com/question/20448464 (with figures)
https://www.zhihu.com/question/27068705

Feature Engineering

The process of finding (or creating) features that make the algorithm work better.
It may also mean combining or transforming several related features into a new one.
Feeding too many features to the algorithm is usually avoided.

Forward Stepwise Selection

Add one feature at a time while training the model,
measuring the accuracy at each step,
until every feature has been used.

Backwards Stepwise Selection is the same process in reverse.

Generalization

If a model is able to make accurate predictions on unseen data, we say it is able to generalize from the training set to the test set.

In other words, a model's ability to predict unseen data;
an overfitted model, for example, generalizes poorly.

ref:
https://www.quora.com/What-is-generalization-in-machine-learning

Gradient Descent

An algorithm that finds the minimum of the cost function,
and with it the best model parameters, by repeatedly stepping opposite the gradient.
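
A minimal sketch, assuming the one-dimensional cost function f(x) = (x - 3)^2, whose derivative is 2 * (x - 3):

x = 0.0
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (x - 3)
    x -= learning_rate * gradient  # step against the gradient

print(x)  # converges toward 3, the minimizer of the cost function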

Greedy Feature Selection

Train the model with only one feature at a time.

In greedy feature selection we choose one feature, train a model and evaluate the performance of the model on a fixed evaluation metric. We keep adding and removing features one-by-one and record performance of the model at every step. We then select the features which have the best evaluation score.

Hyperparameter

The parameters you set when training a model: the ones the model cannot learn by itself and that must be specified manually. The most suitable values are usually chosen via grid search and cross-validation.

Kernel Methods

A kernel function is a distance function of sorts.

The linear kernel is the simplest kernel function:
it is just the dot product of the two inputs.

ref:
https://www.zhihu.com/question/30371867

Linear Separability

When you have a set of data points
and you can draw a "straight line" that separates them correctly,
they are said to be linearly separable;
otherwise they are linearly inseparable.

Logistic Curve

A curve shaped like an S that has been stretched out and flattened at both ends.

ref:
https://www.stat.ubc.ca/~rollin/teach/643w04/lec/node46.html

Missing Value Imputation

For fields that have no value, fill in something like the median, the mean, or the most frequent value.
Also referred to as interpolation.

Manifold Learning

A non-linear dimensionality reduction technique
that maps a high-dimensional dataset down to fewer dimensions,
mainly for visualization;
t-SNE is a common choice.

Manifold learning is usually used for exploratory data analysis;
unlike PCA, its results are seldom fed into supervised learning.

ref:
http://scikit-learn.org/stable/modules/manifold.html
https://www.wikiwand.com/en/Nonlinear_dimensionality_reduction

Normalization / Standardization

Part of preprocessing:
putting every feature's values on a common scale,
a step that many algorithms require.

For example:
feature one is a distance in meters, with values ranging from 10 to 3000;
feature two is a floor number, with values ranging from 1 to 14.
To avoid being misled by the different scales,
the values need rescaling,
e.g. expressing every scale as numbers between 0 and 1,
which is called normalization.

Another method, common in statistics, converts the values into z-scores
so that all the data has mean 0 and standard deviation 1,
which is called standardization.
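
A minimal sketch of both with scikit-learn, reusing the distance/floor example above as a tiny matrix:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0, 1.0], [1500.0, 7.0], [3000.0, 14.0]])

print(MinMaxScaler().fit_transform(X))    # normalization: rescale each feature to 0 ~ 1
print(StandardScaler().fit_transform(X))  # standardization: mean 0, standard deviation 1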

ref:
https://www.quora.com/What-is-the-difference-between-normalization-standardization-and-regularization-for-data
http://sobuhu.com/ml/2012/12/29/normalization-regularization.html

Predictors

Just another name for features.

Principal Component Analysis (PCA)

A technique for analyzing and simplifying a dataset:
it reduces the dataset's dimensionality
while keeping the features that contribute most to its variance.
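
A minimal sketch with scikit-learn, assuming the iris dataset (4 features) reduced to 2 components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance each component accounts for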

Overfitting / Underfitting

Overfitting often happens when the model is complex and has many parameters,
or when the dataset contains a lot of noise or outliers.
It shows up as high accuracy on the training set but low accuracy on the test set.
Complex model >> high variance / low bias >> overfitting.

Underfitting usually happens when the model is too simple.
It shows up as a high error rate even on the training set.
Simple model >> high bias / low variance >> underfitting.

ref:
http://www.csuldw.com/2016/02/26/2016-02-26-choosing-a-machine-learning-classifier/

Regularization

Regularization means explicitly restricting a model to avoid overfitting.

It is a technique for preventing overfitting: regularization keeps all the features,
but shrinks or penalizes the influence of certain features on the model's predictions.
Common methods are L1 and L2.

L1 regularization is the sum of the absolute values of the elements of the weight vector w;
L2 regularization is the square root of the sum of the squares of the elements of w.
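
A minimal NumPy sketch of the two penalty terms (they are just the L1 and L2 norms of w):

import numpy as np

w = np.array([1.0, -2.0, 3.0])

l1 = np.abs(w).sum()          # |1| + |-2| + |3| = 6.0
l2 = np.sqrt((w ** 2).sum())  # sqrt(1 + 4 + 9) ≈ 3.742

print(l1, l2)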

ref:
https://zhuanlan.zhihu.com/p/25707761
http://blog.csdn.net/jinping_shi/article/details/52433975
http://blog.csdn.net/zouxy09/article/details/24971995

Resampling

In classification problems,
the class counts can differ hugely,
e.g. positive samples make up 98% and negative samples 2%;
this is what's called an imbalanced dataset.
One remedy is resampling,
which mainly comes in two flavors: oversampling and undersampling.

Undersampling reduces the number of majority-class samples,
e.g. randomly dropping some of them
until the positive and negative counts are equal.
The drawback is that you may also throw away information latent in the dataset.

Oversampling increases the number of minority-class samples,
e.g. duplicating them
so the class counts become as equal as possible.
The obvious drawback is that it overfits easily.
Another oversampling method is SMOTE (Synthetic Minority Over-sampling Technique),
which synthesizes new minority-class samples.
The strategy: for each minority-class sample a,
pick a random sample b among its nearest neighbors,
then pick a random point on the line segment between a and b as the newly synthesized sample, as in the sketch below.
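
A minimal sketch of the SMOTE interpolation step just described, assuming a and b are a minority sample and one of its nearest neighbors:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 1.0])

t = np.random.rand()         # random position on the segment between a and b
synthetic = a + t * (b - a)  # the newly synthesized minority sample
print(synthetic)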

ref:
http://www.jiqizhixin.com/article/2499
http://www.algorithmdog.com/unbalance
https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis

Training set / Test set

Split the dataset into a training set and a test set:
train the model on the training set,
then evaluate the result on the test set.
Both sets must be sampled uniformly (that is, randomly) from the original dataset.
A common split is 70/30; this scheme is called holdout.

You can also split into a training set, a validation set, and a test set:
the training set trains the model, the validation set selects the model (tunes the hyperparameters), and the test set evaluates the final model.
A common split is 50/25/25.

Fundamentally, your test set may only be used to evaluate the final model;
never train on it or cross-validate with it.
To your model, the test set is a batch of unseen data,
which is why its evaluation results can be taken as the model's predictive power on real data after going live.
If your model does well on the validation set
but badly on the test set,
it has overfitted.
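
A minimal holdout sketch with scikit-learn, assuming the iris dataset; test_size=0.3 gives the common 70/30 split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

print(len(X_train), len(X_test))  # 105 45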

ref:
https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set
https://www.jiqizhixin.com/articles/a62fc871-6366-402b-b32f-f9a3f17a566b
https://mp.weixin.qq.com/s/W7wpxHoC2F5DHCUO7ES1cw