Linux commands cookbook

50 Most Frequently Used UNIX / Linux Commands
http://www.thegeekstuff.com/2010/11/50-linux-commands/

Write a Simple Script in Shell

Write a for loop.

$ for pod_name in $(kubectl get pods -l app=swag-worker-test -o jsonpath='{..metadata.name}'); do kubectl delete pod "$pod_name"; done
pod "swag-worker-test-67fffcdd5-5hgf3" deleted
pod "swag-worker-test-67fffcdd5-h8jgg" deleted

# you can also split the loop across multiple lines
for pod_name in $(kubectl get pods -l app=swag-worker-test -o jsonpath='{..metadata.name}'); do
    kubectl delete pod "$pod_name"
done

Write a while true loop.

# using trap and wait will make your container react immediately to a stop request
$ bash -c "trap : TERM INT; sleep infinity & wait"
# or
$ bash -c "while true; do echo 'I am alive!'; sleep 3600; done"

or

#!/bin/bash
while true; do echo 'I am alive!'; sleep 3600; done

ref:
https://stackoverflow.com/questions/31870222/how-can-i-keep-container-running-on-kubernetes

Set Environment Variables from a File

$ export $(cat .env | grep -v ^# | xargs)

ref:
https://stackoverflow.com/questions/19331497/set-environment-variables-from-file

Switch to Another User

# with "-", you get a login environment as if the other user had just logged in
$ sudo su - ubuntu

Clear the Content of a File

$ echo -n > /var/log/nginx/error.log

ref:
https://unix.stackexchange.com/questions/88808/empty-the-contents-of-a-file

Pipe stdout into xargs

$ find . -type f -name "*.yaml" | xargs echo
./configmap.yaml ./pvc.yaml ./service.yaml ./statefulset.yaml

$ find . -type f -name "*.yaml" | xargs -n 1 echo
./configmap.yaml
./pvc.yaml
./service.yaml
./statefulset.yaml

$ find . -type f -name "*.yaml" | xargs -n 2 echo
./configmap.yaml ./pvc.yaml
./service.yaml ./statefulset.yaml

$ redis-cli KEYS "*-*-*-*.reply.celery.pidbox" | xargs -n 100 redis-cli DEL

ref:
https://shapeshed.com/unix-xargs/

Set a Timeout for any Command

# GNU coreutils syntax; some older BusyBox builds use "timeout -t 15 ..." instead
$ timeout 15 celery inspect ping -A app:celery -d celery@$(hostname)

Run a One-time Command at a Specific Time

at executes commands at a specified time. You may need to install the "at" package manually.

# install
$ sudo apt-get install at

# start
$ sudo atd

# list jobs
$ atq

$ at 00:05
at> echo "123" > /tmp/test.txt

$ at 00:00 18.1.2017
at> DPS_ENV=production /home/ubuntu/.virtualenvs/dps/bin/python /home/ubuntu/dps/manage.py send_emails > /tmp/send_emails.log

Press Control + D to exit at shell.

ref:
https://www.lifewire.com/linux-command-at-4091646
https://tecadmin.net/one-time-task-scheduling-using-at-commad-in-linux/

Pass Arguments to bash when Executing a script Fetched by curl

$ curl -L http://bit.ly/open-the-pod-bay-doors | bash -s -- --tags docker 

ref:
https://stackoverflow.com/a/25563308/885524
https://github.com/vinta/HAL-9000

Change a File's Modify Time

$ touch -m -d '1 Jan 2006 12:34' tmp
$ touch -m -d '1 Jan 2006 12:34' tmp/old_file.txt

ref:
https://www.unixtutorial.org/2008/11/how-to-update-atime-and-mtime-for-a-file-in-unix/

Delete Old Files under a Directory

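# avoid the /* glob: the shell expands it and can hit "Argument list too long"
$ find /data/storage/tmp/ -mindepth 1 -mtime +2 -print0 | xargs -0 rm -Rf
$ find /data/storage/tmp/ -type f -mtime +2 -exec rm {} \;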

ref:
http://stackoverflow.com/questions/14731133/how-to-delete-all-files-older-than-3-days-when-argument-list-too-long

Append String to a File

# append
$ echo "the last line" >> README.md

# replace
$ echo "replace all" > README.md

Rename Sub-folders

$ for f in */migrations/; do mv -v "$f" "${f%???????????}south_migrations"; done

ref:
http://unix.stackexchange.com/questions/220176/rename-specific-level-of-sub-folders

List History Commands

$ export HISTTIMEFORMAT="%Y%m%d %T  "
$ history

Make a File's Permissions the Same as Another File's

$ chmod --reference=file1 file2

Find Computer's Public IP

$ wget -qO- http://ipecho.net/plain ; echo

ref:
http://askubuntu.com/questions/95910/command-for-determining-my-public-ip

Compress and Uncompress Files

$ tar czf media-20151010.tar.gz media/
$ s3cmd put media-20151010.tar.gz s3://goeasytaxi/

# decompress
$ tar -xzf media.tar.gz

$ sudo apt-get install zip unzip
$ zip -j -r deps.zip spark_app/src/deps/
$ zip -r hourmasters.zip hourmasters/
$ scp -r -i ~/hourmasters.pem [email protected]:/home/ubuntu/hourmasters.zip ~/Desktop/

# decompress
$ unzip stork.1.4.zip
$ gzip -d uwsgi.log.*.gz

$ gzip dps.201701171200.sql
$ gunzip dps.201701171200.sql.gz

Count File Lines

$ wc -l filename.txt

$ wc -l *.py

Find Files by Name or Content

$ find / -name virtualenvwrapper.sh

# search for a string in every file under the current directory, including subdirectories
$ find . | xargs grep 'string'

$ find . -iname '*something*'

$ find *.html | xargs grep 'share_server.html'

# find files containing the string print() in the current directory and its subdirectories
$ grep -rnw "." -e "print()"

$ grep -rnw "." -e "print()" --include=\*.py

ref:
https://stackoverflow.com/questions/16956810/how-do-i-find-all-files-containing-specific-text-on-linux

Find Directories by Name

$ find . -type d -name "*native*" -print

ref:
https://askubuntu.com/questions/153144/how-can-i-recursively-search-for-directory-names-with-a-particular-string-where

List Files by Date

$ ls -lrt

List Files Opened by a Process

$ lsof | grep uwsgi

$ lsof -i | grep LISTEN
$ lsof -i -n -P | grep LISTEN

Extract Content from a File

$ cat uwsgi.log | grep error

Display Contents of All Files in the Current Directory

$ grep . *
$ grep . *.html

List Used Ports

$ netstat -a

# TCP
$ netstat -ntlp | grep uwsgi

# UDP
$ netstat -nulp

Ping a Port

$ curl -I "10.148.70.84:9200"
$ curl -I "192.168.100.10:80"

$ sudo apt-get install nmap
$ nmap -p 4505 54.250.5.176
$ nmap -p 8000 10.10.100.70
$ nmap -p 5672 10.10.100.82

$ telnet 54.250.5.176 4505

ref:
http://stackoverflow.com/questions/12792856/what-ports-does-rabbitmq-use

Show Network Traffic and Bandwidth

$ tcpdump -i eth0

$ sudo apt-get install tcptrack
$ tcptrack -i eth0

ref:
http://askubuntu.com/questions/257263/how-to-display-network-traffic-in-terminal

List Running Processes

# show all processes
$ pstree -a

# also show pid
$ pstree -ap

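# list the top 10 processes by memory usage
$ ps aux | sort -nk 4 | tail

# list processes whose command line matches a string
$ ps aux | grep 'worker process'
$ ps aux | grep uwsgi

# show processes as a tree
$ ps auxf

# search for processes by command name and show them with their parent processes as a tree
$ ps f -opid,cmd -C python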

Kill Processes

# list all processes currently in memory
$ ps aux

# match a string
$ ps aux | grep "mongo"

# kill it
$ kill PID

# kill all processes matching a name
$ sudo pkill -f runserver

Store User's Input as a Variable

$ read YOUR_VARIABLE_NAME

$ read name
# you type: vinta

$ echo $name
vinta

ref:
https://canred.blogspot.tw/2013/03/read.html

Show Terminal Size

$ stty size
59 273

$ echo $LINES && echo $COLUMNS
59
273

ref:
https://stackoverflow.com/questions/263890/how-do-i-find-the-width-height-of-a-terminal-window

Functional Programming in Python

lambda

square_func = lambda x: x**2
square_func(2)

# equivalent to

def square_func(x):
    return x**2

Python's lambda is roughly the counterpart of JavaScript's arrow function, except that its body is limited to a single expression.

map

Each element of the list is passed through func(x) on its own.
The result is a new list with the same number of elements.

def double(number):
    return number * 2

print(list(map(double, [1, 2, 3, 4])))
# [2, 4, 6, 8]

# equivalent to

print(list(map(lambda number: number * 2, [1, 2, 3, 4])))

reduce

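The elements of the list are combined two at a time through func(x, y).
The result is a single value.

# in Python 3, reduce lives in functools
from functools import reduce

def add(x, y):
    return x + y

print(reduce(add, [1, 2, 3, 4]))
# (((1+2)+3)+4) = 10

# equivalent to

print(reduce(lambda x, y: x + y, [1, 2, 3, 4]))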

filter

func(x) is applied to each element of the list.
The result is a new list containing only the elements for which func(x) returns True.
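
A minimal sketch of filter, in the same style as the map example above. The helper name is_even is made up for illustration.

```python
def is_even(number):
    # the predicate: keep the element when this returns True
    return number % 2 == 0

evens = list(filter(is_even, [1, 2, 3, 4]))
print(evens)
# [2, 4]

# equivalent, using a lambda
evens_lambda = list(filter(lambda number: number % 2 == 0, [1, 2, 3, 4]))
print(evens_lambda)
# [2, 4]
```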

ref:
http://www.vinta.com.br/blog/2015/functional-programming-python.html
http://www.bogotobogo.com/python/python_fncs_map_filter_reduce.php

zip

number_list = [1, 2, 3]
str_list = ['one', 'two', 'three']
list(zip(number_list, str_list))
# [(1, 'one'), (2, 'two'), (3, 'three')]

ref:
https://www.programiz.com/python-programming/methods/built-in/zip

MySQL system error codes

Print all OS error codes and MySQL error codes using the perror command.

$ for i in {1..190..1}; do perror "$i"; done

OS error code   1:  Operation not permitted
OS error code   2:  No such file or directory
OS error code   3:  No such process
OS error code   4:  Interrupted system call
OS error code   5:  Input/output error
OS error code   6:  No such device or address
OS error code   7:  Argument list too long
OS error code   8:  Exec format error
OS error code   9:  Bad file descriptor
OS error code  10:  No child processes
OS error code  11:  Resource temporarily unavailable
OS error code  12:  Cannot allocate memory
OS error code  13:  Permission denied
OS error code  14:  Bad address
OS error code  15:  Block device required
OS error code  16:  Device or resource busy
OS error code  17:  File exists
OS error code  18:  Invalid cross-device link
OS error code  19:  No such device
OS error code  20:  Not a directory
OS error code  21:  Is a directory
OS error code  22:  Invalid argument
OS error code  23:  Too many open files in system
OS error code  24:  Too many open files
OS error code  25:  Inappropriate ioctl for device
OS error code  26:  Text file busy
OS error code  27:  File too large
OS error code  28:  No space left on device
OS error code  30:  Read-only file system
OS error code  31:  Too many links
OS error code  32:  Broken pipe
OS error code  33:  Numerical argument out of domain
OS error code  34:  Numerical result out of range
OS error code  35:  Resource deadlock avoided
OS error code  36:  File name too long
OS error code  37:  No locks available
OS error code  38:  Function not implemented
OS error code  39:  Directory not empty
OS error code  40:  Too many levels of symbolic links
OS error code  42:  No message of desired type
OS error code  43:  Identifier removed
OS error code  44:  Channel number out of range
OS error code  45:  Level 2 not synchronized
OS error code  46:  Level 3 halted
OS error code  47:  Level 3 reset
OS error code  48:  Link number out of range
OS error code  49:  Protocol driver not attached
OS error code  50:  No CSI structure available
OS error code  51:  Level 2 halted
OS error code  52:  Invalid exchange
OS error code  53:  Invalid request descriptor
OS error code  54:  Exchange full
OS error code  55:  No anode
OS error code  56:  Invalid request code
OS error code  57:  Invalid slot
OS error code  59:  Bad font file format
OS error code  60:  Device not a stream
OS error code  61:  No data available
OS error code  62:  Timer expired
OS error code  63:  Out of streams resources
OS error code  64:  Machine is not on the network
OS error code  65:  Package not installed
OS error code  66:  Object is remote
OS error code  67:  Link has been severed
OS error code  68:  Advertise error
OS error code  69:  Srmount error
OS error code  70:  Communication error on send
OS error code  71:  Protocol error
OS error code  72:  Multihop attempted
OS error code  73:  RFS specific error
OS error code  74:  Bad message
OS error code  75:  Value too large for defined data type
OS error code  76:  Name not unique on network
OS error code  77:  File descriptor in bad state
OS error code  78:  Remote address changed
OS error code  79:  Can not access a needed shared library
OS error code  80:  Accessing a corrupted shared library
OS error code  81:  .lib section in a.out corrupted
OS error code  82:  Attempting to link in too many shared libraries
OS error code  83:  Cannot exec a shared library directly
OS error code  84:  Invalid or incomplete multibyte or wide character
OS error code  85:  Interrupted system call should be restarted
OS error code  86:  Streams pipe error
OS error code  87:  Too many users
OS error code  88:  Socket operation on non-socket
OS error code  89:  Destination address required
OS error code  90:  Message too long
OS error code  91:  Protocol wrong type for socket
OS error code  92:  Protocol not available
OS error code  93:  Protocol not supported
OS error code  94:  Socket type not supported
OS error code  95:  Operation not supported
OS error code  96:  Protocol family not supported
OS error code  97:  Address family not supported by protocol
OS error code  98:  Address already in use
OS error code  99:  Cannot assign requested address
OS error code 100:  Network is down
OS error code 101:  Network is unreachable
OS error code 102:  Network dropped connection on reset
OS error code 103:  Software caused connection abort
OS error code 104:  Connection reset by peer
OS error code 105:  No buffer space available
OS error code 106:  Transport endpoint is already connected
OS error code 107:  Transport endpoint is not connected
OS error code 108:  Cannot send after transport endpoint shutdown
OS error code 109:  Too many references: cannot splice
OS error code 110:  Connection timed out
OS error code 111:  Connection refused
OS error code 112:  Host is down
OS error code 113:  No route to host
OS error code 114:  Operation already in progress
OS error code 115:  Operation now in progress
OS error code 116:  Stale NFS file handle
OS error code 117:  Structure needs cleaning
OS error code 118:  Not a XENIX named type file
OS error code 119:  No XENIX semaphores available
OS error code 120:  Is a named type file
OS error code 121:  Remote I/O error
OS error code 122:  Disk quota exceeded
OS error code 123:  No medium found
OS error code 124:  Wrong medium type
OS error code 125:  Operation canceled
OS error code 126:  Required key not available
OS error code 127:  Key has expired
OS error code 128:  Key has been revoked
OS error code 129:  Key was rejected by service
OS error code 130:  Owner died
OS error code 131:  State not recoverable
OS error code 132:  Operation not possible due to RF-kill
OS error code 133:  Memory page has hardware error
MySQL error code 120: Did not find key on read or update
MySQL error code 121: Duplicate key on write or update
MySQL error code 122: Internal (unspecified) error in handler
MySQL error code 123: Someone has changed the row since it was read (while the table was locked to prevent it)
MySQL error code 124: Wrong index given to function
MySQL error code 125: Undefined handler error 125
MySQL error code 126: Index file is crashed
MySQL error code 127: Record file is crashed
MySQL error code 128: Out of memory in engine
MySQL error code 129: Undefined handler error 129
MySQL error code 130: Incorrect file format
MySQL error code 131: Command not supported by database
MySQL error code 132: Old database file
MySQL error code 126: Index file is crashed
MySQL error code 127: Record-file is crashed
MySQL error code 128: Out of memory
MySQL error code 130: Incorrect file format
MySQL error code 131: Command not supported by database
MySQL error code 132: Old database file
MySQL error code 133: No record read before update
MySQL error code 134: Record was already deleted (or record file crashed)
MySQL error code 135: No more room in record file
MySQL error code 136: No more room in index file
MySQL error code 137: No more records (read after end of file)
MySQL error code 138: Unsupported extension used for table
MySQL error code 139: Too big row
MySQL error code 140: Wrong create options
MySQL error code 141: Duplicate unique key or constraint on write or update
MySQL error code 142: Unknown character set used in table
MySQL error code 143: Conflicting table definitions in sub-tables of MERGE table
MySQL error code 144: Table is crashed and last repair failed
MySQL error code 145: Table was marked as crashed and should be repaired
MySQL error code 146: Lock timed out; Retry transaction
MySQL error code 147: Lock table is full;  Restart program with a larger locktable
MySQL error code 148: Updates are not allowed under a read only transactions
MySQL error code 149: Lock deadlock; Retry transaction
MySQL error code 150: Foreign key constraint is incorrectly formed
MySQL error code 151: Cannot add a child row
MySQL error code 152: Cannot delete a parent row
MySQL error code 153: No savepoint with that name
MySQL error code 154: Non unique key block size
MySQL error code 155: The table does not exist in engine
MySQL error code 156: The table already existed in storage engine
MySQL error code 157: Could not connect to storage engine
MySQL error code 158: Unexpected null pointer found when using spatial index
MySQL error code 159: The table changed in storage engine
MySQL error code 160: There is no partition in table for the given value
MySQL error code 161: Row-based binlogging of row failed
MySQL error code 162: Index needed in foreign key constraint
MySQL error code 163: Upholding foreign key constraints would lead to a duplicate key error in some other table
MySQL error code 164: Table needs to be upgraded before it can be used
MySQL error code 165: Table is read only
MySQL error code 166: Failed to get next auto increment value
MySQL error code 167: Failed to set row auto increment value
MySQL error code 168: Unknown (generic) error from engine
MySQL error code 169: Record is the same
MySQL error code 170: It is not possible to log this statement
MySQL error code 171: The event was corrupt, leading to illegal data being read
MySQL error code 172: The table is of a new format not supported by this version
MySQL error code 173: The event could not be processed no other hanlder error happened
MySQL error code 174: Got a fatal error during initialzaction of handler
MySQL error code 175: File to short; Expected more data in file
MySQL error code 176: Read page with wrong checksum
MySQL error code 177: Too many active concurrent transactions
MySQL error code 178: Record not matching the given partition set
MySQL error code 179: Index column length exceeds limit
MySQL error code 180: Index corrupted
MySQL error code 181: Undo record too big
MySQL error code 182: Invalid InnoDB FTS Doc ID
MySQL error code 183: Table is being used in foreign key check
MySQL error code 184: Tablespace already exists
MySQL error code 185: Too many columns
MySQL error code 186: Row in wrong partition
MySQL error code 187: InnoDB is in read only mode
MySQL error code 188: FTS query exceeds result cache memory limit
MySQL error code 189: Temporary file write failure
MySQL error code 190: Operation not allowed when innodb_forced_recovery > 0
MySQL error code 191: Too many words in a FTS phrase or proximity search
MySQL error code 192: Foreign key cascade delete/update exceeds max depth
MySQL error code 193: Required Create option missing
MySQL error code 194: Out of memory in storage engine
MySQL error code 195: Table corrupted
MySQL error code 196: Query interrupted
MySQL error code 197: Tablespace cannot be accessed
MySQL error code 198: Tablespace is not empty
MySQL error code 199: Incorrect file name
MySQL error code 200: Operation is not allowed
MySQL error code 201: Compute generate value failed

ref:
http://man7.org/linux/man-pages/man3/perror.3.html

Calculate the similarity of two vectors

scipy.spatial.distance
https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

sklearn.metrics
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Distance

Euclidean distance

from sklearn.metrics.pairwise import euclidean_distances

# inputs must be 2D: each row is one sample
euclidean_distances([[0, 0, 0, 0]], [[0, 0, 0, 0]])
# array([[ 0.]])

euclidean_distances([[1, 0, 1, 0]], [[1, 0, 1, 0]])
# array([[ 0.]])

euclidean_distances([[0, 1, 0, 1]], [[1, 0, 1, 0]])
# array([[ 2.]])

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html

Manhattan distance

from sklearn.metrics.pairwise import manhattan_distances

# inputs must be 2D: each row is one sample
manhattan_distances([[0, 0, 0, 0]], [[0, 0, 0, 0]])
# array([[ 0.]])

manhattan_distances([[1, 1, 1, 0]], [[1, 0, 0, 0]])
# array([[ 2.]])

manhattan_distances([[0, 1, 0, 1]], [[1, 0, 1, 0]])
# array([[ 4.]])
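
To make the two distance metrics above concrete, here is a dependency-free sketch that writes out the formulas by hand. The function names euclidean_distance and manhattan_distance are made up for illustration and are not part of scikit-learn.

```python
import math

def euclidean_distance(a, b):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean_distance([0, 1, 0, 1], [1, 0, 1, 0]))
# 2.0
print(manhattan_distance([1, 1, 1, 0], [1, 0, 0, 0]))
# 2
print(manhattan_distance([0, 1, 0, 1], [1, 0, 1, 0]))
# 4
```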

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.manhattan_distances.html

Similarity

Cosine similarity

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import cosine_distances
from sklearn.metrics.pairwise import pairwise_distances
from scipy.spatial.distance import pdist, squareform

# for a 2D matrix whose rows all have a non-zero norm,
# the following give the same result (up to floating point error):
# cosine_similarity(matrix)
# 1 - cosine_distances(matrix)
# 1 - pairwise_distances(matrix, metric='cosine')
# 1 - squareform(pdist(matrix, 'cosine'))

# inputs must be 2D: each row is one sample
cosine_similarity([[0, 0, 0, 0]], [[0, 0, 0, 0]])
# array([[ 0.]])

cosine_similarity([[1, 0, 0, 0]], [[1, 0, 0, 0]])
# array([[ 1.]])

cosine_similarity([[1, 0, 1, 0]], [[0, 1, 0, 1]])
# array([[ 0.]])

cosine_similarity([[1, 0, 0, 1]], [[1, 0, 0, 0]])
# array([[ 0.70710678]])

cosine_similarity([[1, 0, 0, 1]], [[1, 0, 1, 0]])
# array([[ 0.5]])
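
As a rough illustration of what the numbers above mean, cosine similarity is the dot product divided by the product of the two vector norms. The sketch below is plain Python, not scikit-learn's implementation. The name cosine_sim is hypothetical, and returning 0 for an all-zero vector is an assumption chosen to match the first output listed above.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # assumption: an all-zero vector has similarity 0 with everything
        return 0.0
    return dot / (norm_a * norm_b)

print(round(cosine_sim([1, 0, 0, 1], [1, 0, 0, 0]), 4))
# 0.7071
print(round(cosine_sim([1, 0, 0, 1], [1, 0, 1, 0]), 4))
# 0.5
print(cosine_sim([1, 0, 1, 0], [0, 1, 0, 1]))
# 0.0
```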

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html

Jaccard similarity coefficient score

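# note: jaccard_similarity_score was deprecated in scikit-learn 0.21 and removed in 0.23
# for 1D label vectors like the ones below it returns the fraction of matching positions (the same as accuracy_score)
from sklearn.metrics import jaccard_similarity_score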

jaccard_similarity_score([0, 0, 0, 0], [0, 0, 0, 0])
# 1.0

jaccard_similarity_score([0, 0, 0, 0], [1, 0, 0, 0])
# 0.75

jaccard_similarity_score([1, 0, 0, 0], [1, 0, 0, 0])
# 1.0

jaccard_similarity_score([1, 0, 1, 0], [0, 1, 0, 1])
# 0.0
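
The outputs above (for example 0.75 for [0, 0, 0, 0] against [1, 0, 0, 0]) are the fraction of positions where the two vectors agree, which differs from the set-based Jaccard index |A ∩ B| / |A ∪ B|. The sketch below shows both side by side in plain Python. The names matching_fraction and jaccard_index are made up for illustration, and treating two empty sets as identical (1.0) is an assumption.

```python
def matching_fraction(a, b):
    # fraction of positions where the two label vectors agree
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def jaccard_index(a, b):
    # |A ∩ B| / |A ∪ B| over the positions that hold a 1
    set_a = {i for i, x in enumerate(a) if x}
    set_b = {i for i, y in enumerate(b) if y}
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(union)

print(matching_fraction([0, 0, 0, 0], [1, 0, 0, 0]))
# 0.75
print(jaccard_index([0, 0, 0, 0], [1, 0, 0, 0]))
# 0.0
print(jaccard_index([1, 0, 1, 0], [1, 0, 0, 0]))
# 0.5
```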

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html

http://datascience.stackexchange.com/questions/5121/applications-and-differences-for-jaccard-similarity-and-cosine-similarity

Log-Likelihood similarity

TODO

Pearson correlation coefficient

It has a value between +1 and −1 inclusive, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. In a recommender setting, only calculate the Pearson correlation between two users over the items they have both rated, and only when they have more than 1 item in common, preferably more than 5 to 10.

For high-dimensional binary attributes, Pearson correlation coefficient and cosine similarity perform better than the Jaccard similarity coefficient score.

from scipy.stats import pearsonr

pearsonr([1, 0, 1, 1], [0, 0, 0, 0])
# (nan, 1.0)

pearsonr([1, 0, 1, 1], [1, 0, 0, 0])
# (0.33333333333333331, 0.66666666666666607)

pearsonr([1, 0, 1, 0], [0, 1, 0, 1])
# (-1.0, 0.0)
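
For reference, the correlation coefficient (the first element of each tuple above) can be computed by hand as the covariance divided by the product of the standard deviations. This is a plain-Python sketch, not SciPy's implementation, and it does not compute the p-value. The name pearson is hypothetical. Returning nan for a constant vector mirrors the first output above.

```python
import math

def pearson(a, b):
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # undefined when one of the vectors is constant
        return float("nan")
    return cov / math.sqrt(var_a * var_b)

print(round(pearson([1, 0, 1, 1], [1, 0, 0, 0]), 4))
# 0.3333
print(pearson([1, 0, 1, 0], [0, 1, 0, 1]))
# -1.0
```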

ref:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
http://stackoverflow.com/questions/11429604/how-is-nan-handled-in-pearson-correlation-user-user-similarity-matrix-in-a-recom

Dissimilarity

Dice dissimilarity

from scipy.spatial.distance import dice
import numpy as np

v1 = np.array([0, 0, 0, 0])
v2 = np.array([0, 0, 0, 0])

try:
    sim = 1.0 - dice(v1.astype(bool), v2.astype(bool))
except ZeroDivisionError:
    sim = 0

ref:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.dice.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.kulsinski.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.sokalsneath.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.yule.html

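Recommender System: Collaborative Filtering

The dataset is a utility matrix of ratings given by m users to n items.
Usually only some users have rated some items,
so it is a sparse matrix.
The goal is to use this sparse data to predict the ratings a user would give to items they have not rated yet.
Besides ratings, the data can also be likes (and dislikes), purchases, page views, and so on.
Feedback is further divided into explicit (active) ratings and implicit (passive) ratings.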
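
One simple way to hold such a utility matrix in memory is a dict of dicts where a missing key means "not rated yet". The sketch below uses the small ratings table from the user-based example further down. The variable names are made up for illustration, and a real system would more likely use a sparse matrix type.

```python
# rows are users, columns are items; a missing key means "not rated yet"
ratings = {
    "user_1": {"item_a": 2, "item_c": 3},
    "user_2": {"item_a": 5, "item_b": 2},
    "user_3": {"item_a": 3, "item_b": 3, "item_c": 1},
    "user_4": {"item_b": 2, "item_c": 2},
}

items = sorted({item for user_ratings in ratings.values() for item in user_ratings})

# share of the user x item cells that actually hold a rating
known = sum(len(user_ratings) for user_ratings in ratings.values())
density = known / (len(ratings) * len(items))

# the cells a recommender would try to predict
unrated = {
    user: [item for item in items if item not in user_ratings]
    for user, user_ratings in ratings.items()
}

print(density)
# 0.75
print(unrated["user_4"])
# ['item_a']
```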

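Drawbacks of CF:

  • No recommendations can be made for a user who has no historical data
  • Both user-based and item-based CF consume a lot of computing resources
  • The ratings most users have given cover only a tiny fraction of all items, so the matrix is very sparse and similar users or items are hard to find
  • There is a Matthew effect: the more popular an item is, the more likely it is to be recommended, so the weight of popular items is usually reduced

CF falls into two main categories: memory-based and model-based.
User-based and item-based collaborative filtering are memory-based.
Memory-based methods are basically pure computation, with little Machine Learning involved.
Model-based methods are where Machine Learning comes in.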

User-based Collaborative Filtering

        item_a  item_b  item_c
user_1  2       -       3
user_2  5       2       -
user_3  3       3       1
user_4  -       2       2

# the algorithm from "Mahout in Action"
for every other user w
  compute a similarity s between u and w
  retain the top users, ranked by similarity, as a neighborhood n

for every item i that some user in n has a preference for,
      but that u has no preference for yet
  for every other user v in n that has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average

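user-based CF looks at how similar users are to each other.

Given a user A:
compute the similarity between user A and every other user
find the m most similar users
find the items those users have rated but user A has not (optionally requiring that at least a certain number of users have rated the item)
compute user A's predicted rating for each of those items as the average of the similar users' ratings for the item, weighted by each user's similarity to A
finally, recommend the n items with the highest predicted ratings to A

Predicted rating of user_4 for item_a, assuming user_1 and user_3 are its most similar users =
(user_4_user_1_sim x user_1_item_a_rating + user_4_user_3_sim x user_3_item_a_rating) / (user_4_user_1_sim + user_4_user_3_sim)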
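
The weighted average above can be sketched in plain Python on the ratings table from this section. This is only an illustration under stated assumptions: similarity is cosine similarity with unrated items counted as 0, the neighborhood is the top 2 users, and the names cosine_sim, nearest_neighbors and predict_user_based are made up. With these choices the neighborhood of user_4 happens to be user_3 and user_1.

```python
import math

ratings = {
    "user_1": {"item_a": 2, "item_c": 3},
    "user_2": {"item_a": 5, "item_b": 2},
    "user_3": {"item_a": 3, "item_b": 3, "item_c": 1},
    "user_4": {"item_b": 2, "item_c": 2},
}
ITEMS = ["item_a", "item_b", "item_c"]

def cosine_sim(u, v):
    # assumption: unrated items count as 0
    a = [ratings[u].get(item, 0) for item in ITEMS]
    b = [ratings[v].get(item, 0) for item in ITEMS]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_neighbors(user, k):
    # the k other users most similar to the given user
    others = [other for other in ratings if other != user]
    return sorted(others, key=lambda other: cosine_sim(user, other), reverse=True)[:k]

def predict_user_based(user, item, k=2):
    # similarity-weighted average of the neighbors' ratings for the item
    raters = [v for v in nearest_neighbors(user, k) if item in ratings[v]]
    total = sum(cosine_sim(user, v) for v in raters)
    if total == 0:
        return None  # no usable neighbor has rated this item
    return sum(cosine_sim(user, v) * ratings[v][item] for v in raters) / total

print(nearest_neighbors("user_4", 2))
# ['user_3', 'user_1']
print(round(predict_user_based("user_4", "item_a"), 2))
# 2.52
```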

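Characteristics of user-based CF:

  • Suited to systems with far fewer users than items, since fewer similarities need to be computed
  • Suited to systems whose items are time-sensitive and more diverse, such as news and social networking sites
  • It is hard to explain why an item was recommended
  • Recommendations are more serendipitous

Commonly used similarity measures:

  • Pearson Correlation Coefficient
  • Cosine Similarity
  • Adjusted Cosine Similarity (some users tend to rate everything high or low; this measure removes that bias)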

ref:
https://www.safaribooksonline.com/library/view/mahout-in-action/9781935182689/kindle_split_013.html

Item-based Collaborative Filtering

        user_1  user_2  user_3  user_4
item_a  2       5       3       -
item_b  -       2       3       2
item_c  3       -       1       2

# the algorithm from "Mahout in Action"
for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average

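item-based CF looks at how similar items are to each other.
It uses exactly the same data as user-based CF,
not the features of the items themselves (that would be content-based).

If there are far fewer items than users,
the similarities between all items can be computed in advance.
Given a user A:
find all items user A has not rated
compute user A's predicted rating for each unrated item as the average of user A's ratings for the items they have rated, weighted by the similarity between each rated item and the unrated item
finally, recommend the n items with the highest predicted ratings to user A

Predicted rating of user_4 for item_a =
(item_b_item_a_sim x user_4_item_b_rating + item_c_item_a_sim x user_4_item_c_rating) / (item_b_item_a_sim + item_c_item_a_sim)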
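
A plain-Python sketch of the same weighted average, using the item-by-user table from this section. As before, the similarity choice (cosine with missing ratings counted as 0) is an assumption, and the names item_sim and predict_item_based are made up for illustration.

```python
import math

item_ratings = {
    "item_a": {"user_1": 2, "user_2": 5, "user_3": 3},
    "item_b": {"user_2": 2, "user_3": 3, "user_4": 2},
    "item_c": {"user_1": 3, "user_3": 1, "user_4": 2},
}
USERS = ["user_1", "user_2", "user_3", "user_4"]

def item_sim(i, j):
    # assumption: missing ratings count as 0
    a = [item_ratings[i].get(user, 0) for user in USERS]
    b = [item_ratings[j].get(user, 0) for user in USERS]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict_item_based(user, target):
    # average of the user's own ratings, weighted by item-to-item similarity
    weighted_sum = 0.0
    total = 0.0
    for other, by_user in item_ratings.items():
        if other == target or user not in by_user:
            continue
        similarity = item_sim(other, target)
        weighted_sum += similarity * by_user[user]
        total += similarity
    return weighted_sum / total if total else None

# user_4 rated both item_b and item_c as 2, so the weighted average is 2 as well
print(round(predict_item_based("user_4", "item_a"), 2))
# 2.0
print(round(predict_item_based("user_1", "item_b"), 2))
# 2.38
```

Because the item similarities do not depend on the user being scored, they could be computed once and cached, which is the precomputation the text describes.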

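You can also ignore user A's rating history (or have no history for user A at all)
and simply recommend the n items most similar to a given item.

Characteristics of item-based CF:

  • Suited to systems with far fewer items than users, since fewer similarities need to be computed
  • Suited to shopping, movie, music, and book systems, where users' interests are relatively stable
  • Only recommends similar things, so serendipity and diversity are lower
  • Item similarities usually need frequent recomputation only while the user base is small; as the user base grows, item similarities tend to stabilize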

ref:
https://ashokharnal.wordpress.com/2014/12/18/worked-out-example-item-based-collaborative-filtering-for-recommenmder-engine/
http://blog.csdn.net/huagong_adu/article/details/7362908

Slope One Recommender

        item_a  item_b  item_c
user_1  5       3       2
user_2  3       4       -
user_3  -       2       5

# the algorithm from "Mahout in Action"
for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages

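One problem with memory-based collaborative filtering is that the amount of computation becomes considerable when the dataset is large,
so Slope One was proposed as a simple, brute-force alternative,
although Slope One still has to compute the average difference between every pair of items.

Slope One assumes that the ratings of any two items are related linearly by y = mx + b with m = 1 (a slope of 1).
On average, item_a is rated (2 + (-1)) / 2 = 0.5 higher than item_b
On average, item_a is rated (5 - 2) / 1 = 3 higher than item_c
Predicting user_3's rating for item_a from their rating for item_b gives 2 + 0.5 = 2.5
Predicting user_3's rating for item_a from their rating for item_c gives 5 + 3 = 8
The individual predictions are usually weighted by how many users rated both items

Predicted rating of user_3 for item_a =
((number of users who rated both item_a and item_b x user_3's prediction for item_a from item_b) + (number of users who rated both item_a and item_c x user_3's prediction for item_a from item_c)) / (number of users who rated both item_a and item_b + number of users who rated both item_a and item_c)
((2 x 2.5) + (1 x 8)) / (2 + 1) = 4.33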
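
The arithmetic above can be reproduced with a short plain-Python sketch of weighted Slope One on the same table. The function names average_diff and predict_slope_one are made up for illustration. A production implementation would precompute the pairwise differences instead of recomputing them per prediction.

```python
ratings = {
    "user_1": {"item_a": 5, "item_b": 3, "item_c": 2},
    "user_2": {"item_a": 3, "item_b": 4},
    "user_3": {"item_b": 2, "item_c": 5},
}

def average_diff(target, other):
    # average of (target rating - other rating) over users who rated both,
    # together with the number of such users
    diffs = [
        user_ratings[target] - user_ratings[other]
        for user_ratings in ratings.values()
        if target in user_ratings and other in user_ratings
    ]
    if not diffs:
        return None, 0
    return sum(diffs) / len(diffs), len(diffs)

def predict_slope_one(user, target):
    weighted_sum = 0.0
    total_count = 0
    for other, rating in ratings[user].items():
        diff, count = average_diff(target, other)
        if count == 0:
            continue  # nobody rated both items
        # each single-item prediction is weighted by how many users rated both items
        weighted_sum += count * (rating + diff)
        total_count += count
    return weighted_sum / total_count if total_count else None

print(average_diff("item_a", "item_b"))
# (0.5, 2)
print(round(predict_slope_one("user_3", "item_a"), 2))
# 4.33
```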

ref:
https://en.wikipedia.org/wiki/Slope_One