Setup Spark on macOS

Install

First, you need Java 8.
http://www.oracle.com/technetwork/java/javase/downloads/index.html

$ brew update
$ brew install scala
$ brew install apache-spark

ref:
https://gist.github.com/ololobus/4c221a0891775eaa86b0

Build

You can also build Spark from source on your own machine.

$ brew install scala maven
$ mkdir -p /usr/local/share/spark && \
  cd /usr/local/share/spark && \
  wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0.tgz && \
  tar -xvzf spark-2.1.0.tgz && \
  cd spark-2.1.0
$ ./build/mvn -Pyarn -Phadoop-2.7 -Pnetlib-lgpl -Dhadoop.version=2.7.0 -DskipTests clean package
# check which BLAS implementation MLlib picks up
$ spark-shell --packages "com.github.fommil.netlib:all:1.1.2"
scala> import com.github.fommil.netlib.BLAS
scala> println(BLAS.getInstance().getClass().getName())

ref:
http://spark.apache.org/docs/latest/ml-guide.html#dependencies
http://spark.apache.org/docs/latest/building-spark.html
http://www.spark.tc/blas-libraries-in-mllib/
http://blog.prabeeshk.com/blog/2016/12/07/install-apache-spark-2-on-ubuntu-16-dot-04-and-mac-os/

Docker

$ docker run -it --rm -p 8888:8888 -v $PWD:/home/jovyan/work jupyter/all-spark-notebook

ref:
https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook
https://github.com/gettyimages/docker-spark
https://github.com/epahomov/docker-spark

Configuration

In .zshrc:

if which java > /dev/null; then
    export JAVA_HOME=$(/usr/libexec/java_home -v 1.8);
fi

if which pyspark > /dev/null; then
  # build version:
  # export SPARK_HOME="/usr/local/share/spark/spark-2.1.0"
  export SPARK_HOME="/usr/local/Cellar/apache-spark/2.1.0/libexec"
  export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
fi
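
To check that SPARK_HOME and PYTHONPATH are picked up correctly, a quick sanity check (a minimal sketch; assumes the paths above match your install):

$ python
>>> from pyspark import SparkContext
>>> sc = SparkContext("local[*]", "sanity-check")
>>> print(sc.version)                          # e.g. 2.1.0
>>> print(sc.parallelize(range(100)).sum())    # 4950
>>> sc.stop()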

ref:
https://spark.apache.org/docs/latest/programming-guide.html
https://spark.apache.org/docs/latest/configuration.html

Commands

Local mode (Interactive shell)

$ export PYSPARK_DRIVER_PYTHON=ipython
# or
$ export PYSPARK_DRIVER_PYTHON="jupyter" && \
  export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip 0.0.0.0"

$ pyspark \
--packages "org.xerial:sqlite-jdbc:3.16.1,mysql:mysql-connector-java:5.1.41,com.github.fommil.netlib:all:1.1.2" \
--driver-memory 4g \
--executor-memory 2g \
--master "local[*]"

# Spark Web UI of the driver
$ open http://localhost:4040/

ref:
https://spark.apache.org/docs/latest/programming-guide.html

Standalone mode

$ cd $SPARK_HOME
$ cp conf/slaves.template conf/slaves
$ cp conf/spark-env.sh.template conf/spark-env.sh

$ ./sbin/start-master.sh
# replace TechnoCore.local with your machine's hostname
$ ./sbin/start-slave.sh spark://TechnoCore.local:7077

# Spark Web UI of the cluster manager
$ open http://localhost:8080/

$ pyspark \
--driver-memory 4g \
--master spark://TechnoCore.local:7077
# or
$ spark-submit \
--master spark://TechnoCore.local:7077 \
examples/src/main/python/pi.py 10

# Spark Web UI of the driver
$ open http://localhost:4040/

ref:
http://spark.apache.org/docs/latest/spark-standalone.html
http://spark.apache.org/docs/latest/submitting-applications.html

Recommender System: Model-based approaches

A summary of collaborative filtering recommendation algorithms
http://www.cnblogs.com/pinard/p/6349233.html

Matrix Factorization

Singular Value Decomposition (SVD)

        item_a  item_b  item_c
user_1  2       -       -
user_2  -       1       -
user_3  3       3       -
user_4  -       2       2

Matrix factorization decomposes the m x n rating matrix. One of the most common methods is SVD (Singular Value Decomposition), which factorizes the rating matrix into three matrices U S V*.
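
As a toy illustration, plain SVD on the small rating matrix above with NumPy (a sketch only; missing ratings are filled with 0 here, which is exactly the density problem the next section addresses):

import numpy as np

# the rating matrix above, with missing entries filled as 0 just for illustration
R = np.array([
    [2, 0, 0],
    [0, 1, 0],
    [3, 3, 0],
    [0, 2, 2],
], dtype=float)

# SVD: R = U * S * V^T
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# keep only the top-k singular values to get a rank-k approximation of R
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_approx, 2))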

Applications of SVD in recommender systems
http://yanyiwu.com/work/2012/09/10/SVD-application-in-recsys.html

SVD: principles and application to dimensionality reduction
http://www.cnblogs.com/pinard/p/6251584.html

Alternating Least Squares (ALS)

Because SVD requires a dense matrix, while the rating matrix in a recommender system is usually a huge sparse matrix, in practice a variant of SVD is used instead: Funk-SVD, also known as the Latent Factor Model (LFM). It decomposes the m x n rating matrix R into two matrices U and I, an m x k users matrix and a k x n items matrix, each with k latent features, and only the non-zero ratings are considered during training.

ALS is one method for solving for the U and I matrices. Another option is Stochastic Gradient Descent (SGD).
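
A minimal explicit-feedback ALS sketch with pyspark.ml; the (userId, itemId, rating) triples and parameter values below are only illustrative:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("explicit-als").getOrCreate()

# only the observed ratings are stored, as (user, item, rating) triples
ratings = spark.createDataFrame(
    [(1, 1, 2.0), (2, 2, 1.0), (3, 1, 3.0), (3, 2, 3.0), (4, 2, 2.0), (4, 3, 2.0)],
    ["userId", "itemId", "rating"],
)

als = ALS(rank=10, maxIter=10, regParam=0.1,
          userCol="userId", itemCol="itemId", ratingCol="rating")
model = als.fit(ratings)

# predicted ratings for the observed user-item pairs
model.transform(ratings).show()

spark.stop()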

The ALS implementation in Spark MLlib
http://www.csdn.net/article/2015-05-07/2824641

Matrix factorization in collaborative filtering recommendation algorithms
http://www.cnblogs.com/pinard/p/6351319.html

Latent factor models based on matrix factorization
http://www.voidcn.com/blog/winone361/article/p-5031282.html

ALS for implicit feedback

This kind of model is also known as Weighted Regularized Matrix Factorization (WRMF).

        item_a  item_b  item_c
user_1  1       -       -
user_2  -       1       -
user_3  1       1       -
user_4  -       1       1

Notation:

  • Rui = User u's interaction with Item i
  • Xu = User u's vector
  • Yi = Item i's vector
  • Pui = 1 if Rui > 0 else 0
  • Cui = 1 + alpha x Rui

Typical values:

  • iteration = 20 (number of iterations)
  • rank = 20 ~ 200 (number of latent factors)
  • lambda = 0.1 ~ 0.001 (regularization parameter)
  • alpha = 40 (confidence weight)

Like explicit ALS, implicit ALS factorizes the m x n implicit feedback matrix R into two matrices X and Y: an m x k user latent factor matrix and a k x n item latent factor matrix. The difference is that two extra matrices are introduced: an m x n binary preference matrix P, where Pui indicates whether user u is interested in item i, and an m x n confidence (or strength) matrix C, where Cui indicates how confident we are that user u is interested in item i. The goal is still to compute the user and item matrices; compared with explicit ALS, the loss function additionally includes Cui and Pui, and all user-item pairs are considered, including missing values (Rui = 0).
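
For reference, the objective minimized by implicit ALS (as given in the Hu, Koren & Volinsky paper linked below), written in LaTeX notation:

\min_{x_*, y_*} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)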

Whenever a user has interacted with an item (Rui > 0), we assume the user has a preference for that item (Pui = 1); the more the user has interacted with it (the larger Rui), the larger Cui becomes, meaning the confidence is higher and the assumption is more trustworthy. If the user has had no interaction with the item (Rui = 0), we assume the user has no preference for it (Pui = 0), but since Cui = 1 + alpha x 0, this assumption carries relatively low confidence and is less trustworthy.

Explicit ALS predicts a user's rating of an item; implicit ALS predicts whether a user will be interested in an item: Pui = Xu x Yi.T. In other words, the predictions produced by implicit ALS are the P matrix reconstructed from X x Y.T. These predictions mostly fall within [0, 1], but can also be greater than 1 or less than 0; the larger the value, the stronger the user's preference for that item, and vice versa.
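
A minimal implicit-feedback ALS sketch with pyspark.ml, using values close to the typical ones listed above; the interaction counts are only illustrative:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("implicit-als").getOrCreate()

# Rui: interaction counts (clicks, plays, purchases, ...); only non-zero entries are stored
interactions = spark.createDataFrame(
    [(1, 1, 5.0), (2, 2, 1.0), (3, 1, 2.0), (3, 2, 8.0), (4, 2, 3.0), (4, 3, 1.0)],
    ["userId", "itemId", "count"],
)

als = ALS(rank=20, maxIter=20, regParam=0.01, alpha=40.0, implicitPrefs=True,
          userCol="userId", itemCol="itemId", ratingCol="count")
model = als.fit(interactions)

# the prediction column approximates Pui = Xu . Yi: a preference score, not a rating
model.transform(interactions).show()

spark.stop()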

Because ALS is a collaborative filtering method, it also suffers from the cold start problem: it cannot make recommendations for new users or new items.

Collaborative Filtering for Implicit Feedback Datasets
http://yifanhu.net/PUB/cf.pdf

A Gentle Introduction to Recommender Systems with Implicit Feedback
https://jessesw.com/Rec-System/

Machine learning @ Spotify
https://www.slideshare.net/AndySloane/machine-learning-spotify-madison-big-data-meetup/17

The ALS algorithm for collaborative filtering
http://www.tk4479.net/antkillerfarm/article/details/53734658

Collaborative filtering on implicit feedback data
http://blog.csdn.net/lingerlanlan/article/details/46917601

An analysis of Spark ML algorithm internals
https://github.com/endymecy/spark-ml-source-analysis

Linux commands cookbook

switch shell to another user

# the "-" gives you an environment as if that user had just logged in
$ sudo su - ubuntu

change file's modify time

$ touch -m -d '1 Jan 2006 12:34' tmp
$ touch -m -d '1 Jan 2006 12:34' tmp/old_file.txt

ref:
https://www.unixtutorial.org/2008/11/how-to-update-atime-and-mtime-for-a-file-in-unix/

delete old files under a directory

# delete anything under /data/storage/tmp/ that was last modified more than 2 days ago
$ find /data/storage/tmp/* -mtime +2 | xargs rm -Rf
$ find /data/storage/tmp/* -mtime +2 -exec rm {} \;

ref:
http://stackoverflow.com/questions/14731133/how-to-delete-all-files-older-than-3-days-when-argument-list-too-long

append string to file in command line

# append
$ echo "the last line" >> README.md

# replace
$ echo "replace all" > README.md

rename sub-folders

# "${f%???????????}" strips the trailing 11 characters ("migrations/") before appending "south_migrations"
$ for f in */migrations/; do mv -v "$f" "${f%???????????}south_migrations"; done

ref:
http://unix.stackexchange.com/questions/220176/rename-specific-level-of-sub-folders

list history commands

$ export HISTTIMEFORMAT="%Y%m%d %T  "
$ history

find public IP

$ wget -qO- http://ipecho.net/plain ; echo

ref:
http://askubuntu.com/questions/95910/command-for-determining-my-public-ip

count file lines

$ wc -l filename.txt

$ wc -l *.py

find files by name or content

$ find / -name virtualenvwrapper.sh

# search all files under the current directory for a string (recurses into subdirectories)
$ find . | xargs grep 'string'

$ find . -iname '*something*'

$ find *.html | xargs grep 'share_server.html'

# search all files in the current directory
$ grep "sublime_jedi_goto" *.*

# search all txt files under the doc/ directory
$ grep "sublime_jedi_goto" doc/*.txt

list files by date

$ ls -lrt

extract info from a file

$ cat uwsgi.log | grep error

display contents of all files in the current directory

$ grep . *
$ grep . *.html

list used ports

# list open files for a process
$ lsof | grep uwsgi

$ lsof -i | grep LISTEN
$ lsof -i -n -P | grep LISTEN

# TCP
$ sudo netstat -ntlp | grep uwsgi

# UDP
$ sudo netstat -nulp

# UNIX domain sockets
$ sudo netstat -nxlp

ping port

$ curl -I "10.148.70.84:9200"
$ curl -I "192.168.100.10:80"

$ sudo apt-get install nmap
$ nmap -p 4505 54.250.5.176
$ nmap -p 8000 10.10.100.70
$ nmap -p 5672 10.10.100.82

$ telnet 54.250.5.176 4505

ref:
http://stackoverflow.com/questions/12792856/what-ports-does-rabbitmq-use

show network traffic and bandwidth

$ tcpdump -i eth0

$ sudo apt-get install tcptrack
$ tcptrack -i eth0

ref:
http://askubuntu.com/questions/257263/how-to-display-network-traffic-in-terminal

list running processes

# show all processes
$ pstree -a

# also show pid
$ pstree -ap

# list the top 10 processes by memory usage
$ ps aux | sort -nk +4 | tail

# list processes matching a given name
$ ps aux | grep 'worker process'
$ ps aux | grep uwsgi

# show processes as a tree
$ ps auxf

# find a process and show its parent processes as a tree
$ ps f -opid,cmd -C python

kill processes

# list all processes currently in memory
$ ps aux

# filter by a string
$ ps aux | grep "mongo"

# kill it by PID
$ kill PID

# kill all processes matching a name
$ sudo killall -9 httpd
$ sudo killall salt
$ sudo pkill -f runserver