碼天狗週刊 第 115 期 @vinta - Python, MongoDB, Apache Spark

碼天狗週刊 第 115 期 @vinta - Python, MongoDB, Apache Spark

本文同步發表於 CodeTengu Weekly - Issue 115

Timezone in Python: Offset-naive and Offset-aware datetimes

關於 datetime 和 timezone 的處理可能是每個軟體工程師遲早都會遭遇的難題之一,以前一直沒有搞得很清楚,前陣子總算有時間稍微整理了一下 timezone 在 Python 裡需要注意的地方。跟大家分享一下。

之前幾期的 issues 裡,其他 curator 也都不約而同分享過 Falsehoods Programmers Believe About XXX,其中就有幾個主題是「時間」,有興趣的人可以翻一下。

延伸閱讀:

MongoDB Query Performance over Ranges & Indexes that Help

上禮拜換了新工作,負責解的第一個 issue 就是改善某個複雜的 MongoDB aggregation 的效能。雖然對 MongoDB 還不是很熟(只有在第一份工作用過一陣子,而且後來就改用 PostgreSQL 了),不過遇到 slow query 問題的起手式基本上都是先 explain 一下,從 queryPlannerexecutionStats 欄位可以知道哪個 stage 的哪個 query 用了哪個 index、掃描了多少數據和回傳了多少數據。最後透過修改了 query 的寫法並且加了一個「欄位順序正確」的 compound index,總算成功地把執行時間從 1.x 秒降到 0.01x 秒。

這篇文章講的就是「欄位順序」對 MongoDB 的 compound index 的影響。作者還給出了一個簡單的判斷方式:

The order of fields in an index should be:

  • First, fields on which you will query for exact values.
  • Second, fields on which you will sort.
  • Finally, fields on which you will query for a range of values.

延伸閱讀:

Spark 2.x Troubleshooting Guide

這份簡報和延伸閱讀的那篇文章列出了許多 Apache Spark 在使用上常見的問題和解法,這陣子在用 Spark 做推薦系統的時候,幾乎把這些地雷都踩過了,如果可以早一點讀過這些資料就好啦。

題外話,如果你也是用 Google Cloud Dataproc 跑 Spark cluster 的話,千萬記得可以用 Preemptible VMs,能夠省下不少費用。不要像我一樣白白被多收了好幾百塊美金,心寒啊。

延伸閱讀:

Learning at work

這篇文章是 Stripe 的工程師分享的「如何在工作中學習」的文章。不過重點其實不是如何學習,而是「在工作中」學習。

畢竟要求每個工程師都應該在下班之後還要花很多時間學新技術、開發 side project 或是貢獻 open source 專案其實是不太現實的嘛。而且一天工作 8 小時,但是可以真的進入 mindflow 專心寫程式的時間可能連 4 小時都不到(我承認我超少)。有時候可能是狀態不對或是你就是欠缺完成任務所需要的知識,這時候與其硬幹、用 workaround 或是產出一些自己都不明所以只是剛好能動的程式碼,那還不如好好地利用上班時間,仔細鑽研一下你要解決的問題、搞懂那些你之前沒搞清楚的東西。再不然也可以重新審視一遍那個你剛剛解決的 bug 的 root cause,因為每個 bug 其實都是一個學習的契機啊。

就像陈皓說的「多些时间能少写些代码」,共勉之。

碼天狗週刊 第 104 期 @vinta - Recommender System, Apache Spark, Machine Learning, MySQL

碼天狗週刊 第 104 期 @vinta - Recommender System, Apache Spark, Machine Learning, MySQL

本文同步發表於 CodeTengu Weekly - Issue 104

Build a recommender system with Spark: Logistic Regression

前陣子寫了幾篇文章專門在講用 Apache Spark 搭建一個 GitHub repo 的推薦系統,打算寫成一個系列,不過因為身體不適中斷了好一陣子,所以寫著寫著 GitHub 都推出自己的推薦系統了(攤手)。言歸正傳,這篇文章主要是在講用 Logistic Regression 來對推薦結果排序,重點放在特徵工程和 Machine Learning Pipeline,對 LR 演算法本身沒有太多著墨,畢竟它就是個線性模型嘛。

延伸閱讀:

接下來是個沈重的題外話,由衷地建議大家真的要好好注意自己用電腦、用手機的姿勢,因為我最近才被診斷出頸椎椎間盤突出壓迫到神經,這個發作起來真的不是在開玩笑的,手腳又麻又癢又痛,你根本沒辦法專心做任何事,連好好睡一覺都不行。我最近吃了一堆止痛藥和肌肉鬆弛劑,都他媽快變成燕南天了。大家就想像一下有個跳蛋在你的骨頭或是神經的深處,三不五時就震一下,喔,那可一點都不好玩。

Build a recommender system with Spark: Content-based and Elasticsearch

這一篇也是 GitHub 推薦系統系列的文章之一,講的是大家喜聞樂見的 Content-based Recommendation。我原本是打算把 repo 的文本資料轉成 Word2Vec 向量,然後計算各個 repo 之間的相似度(所謂的 Similarity Join),但是要計算這麼多 repo 之間的相似度實在太花時間了,就算用了 Locality Sensitive Hashing 也還是太久又容易 OOM。後來一想,尋找相似或相關物品這件事不就是搜尋引擎在做的嗎,所以後來就直接改用 Elasticsearch 了。用 document id 當作搜尋條件,一個 More Like This query 就解決了,清爽利落。畢竟不需要所有的事情都在 Spark 裡解決。

基于 Spark UI 性能优化与调试 —— 初级篇

在寫程式的時候偶爾會發生「我明明只是加了一行簡單的 code,為什麼整個程式的效能就掉了這麼多」的情形,只因為我們對那行 code 實際上到底做了什麼其實並不清楚。更慘的是你的程式還是跑在一個分散式系統上時。幸好 Spark 提供了一個非常棒的工具:Spark UI。透過 Event Timeline 和 DAG Visualization,你可以看到非常詳細的整個 Spark application 的執行過程,例如某個 job 的某個 stage 的某個 task 做了什麼、花了多少時間和在哪一台機器上執行,甚至能夠精確地定位到是在你的 code 的某個檔案的某一行的某個 function call。真希望所有語言和框架都有這樣方便的工具啊。

不過老實說 DAG Visualization 第一眼看起來真的是讓人眼花撩亂,尤其是當你操作的是 Spark SQL 和 DataFrame,但是 Spark UI 顯示的其實卻是低層的 RDD operations,是需要花一點時間熟悉的。雖然你可能得先對 Executor、RDD、Partition 或 Shuffe 這些東西有點概念。

延伸閱讀:

Google - Machine Learning Glossary

這是 Google 製作的一份機器學習的常見詞彙表,非常實用!

MySQL vs. MariaDB: Reality Check

Percona 製作了一份表格,比較了 MySQL、MariaDB 和 Percona Server for MySQL 之間的異同。對有在評估選用或是跳槽到其中之一的資料庫的人應該很有幫助。不過如果你看完還是不知道該選哪個,黃金法則:你就閉著眼睛挑最多人用的那個就好了。

雖然當年 MariaDB 是打著 "a drop-in replacement for MySQL" 的名號,但是現在都 2017 年了,滄海桑田啊。而且大家對「相容性」這三個字可能都有更現實的認知了,畢竟它們就是由不同的人在不同的時間以不同的方式開發的不同的產品啊。

延伸閱讀:

Build a recommender system with Spark: Logistic Regression

Build a recommender system with Spark: Logistic Regression

在這個系列的文章裡,我們將使用 Apache Spark、XGBoost、Elasticsearch 和 MySQL 等工具來搭建一個推薦系統的 Machine Learning Pipeline。推薦系統的組成可以粗略地分成 Candidate Generation 和 Ranking 兩個部分,前者是針對用戶產生候選物品集,常用的方法有 Collaborative Filtering、Content-based、標籤配對、熱門排行或人工精選等;後者則是對這些候選物品排序,以 Top N 的方式呈現最終的推薦結果,常用的方法有 Logistic Regression。

在本篇文章中,我們將以 Ranking 階段常用的方法之一:Logistic Regression 邏輯迴歸為例,利用 Apache Spark 的 Logistic Regression 模型建立一個 GitHub repositories 的推薦系統,以用戶對 repo 的打星紀錄和用戶與 repo 的各項屬性做為特徵,預測出用戶會不會打星某個 repo(分類問題)。最後訓練出來的模型就可以做為我們的推薦系統的 Ranking 模組。不過因為 LR 是線性模型,所以通常需要大量的 Feature Engineering 來習得非線性關係。所以這篇文章的重點是 Spark ML 的 Pipeline 機制和特徵工程,不會在演算法的部分著墨太多。

完整的程式碼可以在 https://github.com/vinta/albedo 找到。

系列文章:

Submit the Application

$ spark-submit \
--master spark://192.168.10.100:7077 \
--packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
--class ws.vinta.albedo.UserProfileBuilder \
target/albedo-1.0.0-SNAPSHOT.jar

$ spark-submit \
--master spark://192.168.10.100:7077 \
--packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
--class ws.vinta.albedo.RepoProfileBuilder \
target/albedo-1.0.0-SNAPSHOT.jar

$ spark-submit \
--master spark://192.168.10.100:7077 \
--packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
--class ws.vinta.albedo.LogisticRegressionRanker \
target/albedo-1.0.0-SNAPSHOT.jar

ref:
https://vinta.ws/code/setup-spark-scala-and-maven-with-intellij-idea.html
https://spark.apache.org/docs/latest/submitting-applications.html
https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application

Load Data

我們之前已經利用 GitHub API 和 BigQuery 上的 GitHub Archive 收集了 150 萬筆的打星紀錄和所屬的用戶、repo 數據。目前有以下幾個數據集,大致上是照著 GitHub API 建立的,欄位分別如下:

rawUserInfoDS.printSchema()
// root
 // |-- user_id: integer (nullable = true)
 // |-- user_login: string (nullable = true)
 // |-- user_account_type: string (nullable = true)
 // |-- user_name: string (nullable = true)
 // |-- user_company: string (nullable = true)
 // |-- user_blog: string (nullable = true)
 // |-- user_location: string (nullable = true)
 // |-- user_email: string (nullable = true)
 // |-- user_bio: string (nullable = true)
 // |-- user_public_repos_count: integer (nullable = true)
 // |-- user_public_gists_count: integer (nullable = true)
 // |-- user_followers_count: integer (nullable = true)
 // |-- user_following_count: integer (nullable = true)
 // |-- user_created_at: timestamp (nullable = true)
 // |-- user_updated_at: timestamp (nullable = true)

rawRepoInfoDS.printSchema()
// |-- repo_id: integer (nullable = true)
 // |-- repo_owner_id: integer (nullable = true)
 // |-- repo_owner_username: string (nullable = true)
 // |-- repo_owner_type: string (nullable = true)
 // |-- repo_name: string (nullable = true)
 // |-- repo_full_name: string (nullable = true)
 // |-- repo_description: string (nullable = true)
 // |-- repo_language: string (nullable = true)
 // |-- repo_created_at: timestamp (nullable = true)
 // |-- repo_updated_at: timestamp (nullable = true)
 // |-- repo_pushed_at: timestamp (nullable = true)
 // |-- repo_homepage: string (nullable = true)
 // |-- repo_size: integer (nullable = true)
 // |-- repo_stargazers_count: integer (nullable = true)
 // |-- repo_forks_count: integer (nullable = true)
 // |-- repo_subscribers_count: integer (nullable = true)
 // |-- repo_is_fork: boolean (nullable = true)
 // |-- repo_has_issues: boolean (nullable = true)
 // |-- repo_has_projects: boolean (nullable = true)
 // |-- repo_has_downloads: boolean (nullable = true)
 // |-- repo_has_wiki: boolean (nullable = true)
 // |-- repo_has_pages: boolean (nullable = true)
 // |-- repo_open_issues_count: integer (nullable = true)
 // |-- repo_topics: string (nullable = true)

rawStarringDS.printSchema()
// root
 // |-- user_id: integer (nullable = true)
 // |-- repo_id: integer (nullable = true)
 // |-- starred_at: timestamp (nullable = true)
 // |-- starring: double (nullable = true)

ref:
https://www.githubarchive.org/
http://ghtorrent.org/

載入資料之後,要做的第一件事應該就是 Exploratory Data Analysis (EDA) 了,把玩一下手上的數據。建議大家可以試試 Apache Zeppelin 或是 Databricks 的 Notebook,除了內建 Spark 支援的所有語言之外,也整合了 NoSQL 和 JDBC 支援的資料庫,要畫圖表也很方便,簡直比 Jupyter Notebook 還好用了。

ref:
https://zeppelin.apache.org/
https://databricks.com/

Build User Profile / Item Profile

在這個專案中,最主要的數據主體就是 user 和 repo,所以我們針對兩者各自建立 User Profile 和 Item Profile,作為之後在模型訓練階段會使用的特徵。我們把這個步驟跟模型訓練的流程分開,這樣對整個架構的搭建會更有彈性。實務上,我們可以用 user id 或 item id 當 key,直接把製作好的特徵存進 Redis 或其他 schemaless 的 NoSQL 資料庫,方便之後給多個模型取用;在做 real-time 推薦時,也可以很快地拿到特徵,只需要重新計算部份欄位即可。

不過因為這裡主要用的是來自 GitHub API 的資料,某種程度上人家已經幫我們做了很多資料清理和正規化的動作了,但是在現實中,你的系統要處理的數據通常不會這麼乾淨,可能來自各種 data source、有著各種格式,還會隨著時間而改變,通常得花上不少力氣做 Extract, Transform, Load (ETL),所以最好在寫 log(埋點)的時候就溝通好。而且在 production 環境中,數據是會一直變動的,要確保數據的時效性和容錯性,很重要的一個部分就是 monitoring。

ref:
http://www.algorithmdog.com/ad-rec-deploy
https://tech.meituan.com/online-feature-system02.html

礙於篇幅有限,以下的文章中我們只會挑幾個重要的部分說明。簡單說,在這個步驟的最後,我們會得到 userProfileDFrepoProfileDF 這兩個 DataFrame,分別存放製作好的特徵。詳細的程式碼如下:

ref:
https://github.com/vinta/albedo/blob/master/src/main/scala/ws/vinta/albedo/UserProfileBuilder.scala
https://github.com/vinta/albedo/blob/master/src/main/scala/ws/vinta/albedo/RepoProfileBuilder.scala

Feature Engineering

以推薦系統為例,特徵可以分成以下四種:

  • 用戶特徵:用戶本身的各種屬性,例如 user id、性別、職業或所在的城市等
  • 物品特徵:物品本身的各種屬性,例如 item id、作者、標題、分類、評分或所屬的標籤等
  • 交互特徵:用戶對物品做出的某項行為,該行為的 aggregation 或交叉特徵,例如是否看過同類型的電影、最近聽的歌曲的曲風分佈或上週買了多少高單價的商品
  • 上下文特徵:用戶對物品做出的某項行為,該行為的 metadata,例如發生的時間、使用的裝置或當前的 GPS 位置等

有些特徵是在資料採集階段就能拿到,有些特徵則會需要額外的步驟(例如透過外部的 API 或是其他模型)才能取得,也有些特徵必須即時更新。順道一提,因為我們要預測的是「某個用戶會不會打星某個 repo」,所以下述特徵裡的 user 可以是 repo stargazer 也可以是 repo owner。

原始特徵:

  • 用戶特徵
    • user_id
    • user_login
    • user_name
    • user_email
    • user_blog
    • user_bio
    • user_company
    • user_location
    • user_followers_coung
    • user_following_count
    • user_public_repos_count
    • user_public_gists_count
    • user_created_at
    • user_updated_at
  • 物品特徵
    • repo_id
    • repo_name
    • repo_owner
    • repo_owner_type
    • repo_language
    • repo_description
    • repo_homepage
    • repo_subscribers_count
    • repo_stargazers_count
    • repo_forks_count
    • repo_size
    • repo_created_at
    • repo_updated_at
    • repo_pushed_at
    • repo_has_issues
    • repo_has_projects
    • repo_has_downloads
    • repo_has_wiki
    • repo_has_pages
    • repo_open_issues_count
    • repo_topics
  • 交互特徵
    • user_stars_repo
    • user_follows_user
  • 上下文特徵
    • user_repo_starred_at

發想特徵:

  • 用戶特徵
    • user_days_between_created_at_today: 該用戶的註冊日期距離今天過了幾年
    • user_days_between_updated_at_today: 該用戶的更新日期距離今天過了幾天
    • user_repos_avg_stargazers_count: 該用戶名下的所有 repo(不含 fork 的)的平均星星數
    • user_organizations: 該用戶屬於哪些組織
    • user_has_null: 該用戶至少有一個欄位是 null
    • user_has_blog: 該用戶有沒有網站
    • user_is_freelancer: 該用戶的 bio 中是否包含 Freelancer 等字眼
    • user_is_junior: 該用戶的 bio 中是否包含 Beginner 或 Junior 等字眼
    • user_is_lead: 該用戶的 bio 中是否包含 Team Lead、Architect、Creator、CTO 或 VP of Engineering 等字眼
    • user_is_scholar: 該用戶的 bio 中是否包含 Researcher、Scientist、PhD 或 Professor 等字眼
    • user_is_pm: 該用戶的 bio 中是否包含 Product Manager 等字眼
    • user_knows_backend: 該用戶的 bio 中是否包含 Backend 或 Back end 等字眼
    • user_knows_data: 該用戶的 bio 中是否包含 Machine Learning、Deep Learning 或 Data Science 等字眼
    • user_knows_devops: 該用戶的 bio 中是否包含 DevOps、SRE、SysAdmin 或 Infrastructure 等字眼
    • user_knows_frontend: 該用戶的 bio 中是否包含 Frontend 或 Front end 等字眼
    • user_knows_mobile: 該用戶的 bio 中是否包含 Mobile、iOS 或 Android 等字眼
    • user_knows_recsys: 該用戶的 bio 中是否包含 Recommender System、Data Mining 或 Information Retrieval 等字眼
    • user_knows_web: 該用戶的 bio 中是否包含 Web Development 或 Fullstack 等字眼
  • 物品特徵
    • repo_created_at_days_since_today: 該 repo 的建立日期距離今天過了幾天
    • repo_updated_at_days_since_today: 該 repo 的更新日期距離今天過了幾天
    • repo_pushed_at_days_since_today: 該 repo 的提交日期距離今天過了幾天
    • repo_stargazers_count_in_30days: 該 repo 在 30 天內收到的星星數
    • repo_subscribers_stargazers_ratio: 該 repo 的 watch 數和 star 數的比例
    • repo_forks_stargazers_ratio: 該 repo 的 fork 數和 star 數的比例
    • repo_open_issues_stargazers_ratio: 該 repo 的 數和 star 數的比例
    • repo_releases_count: 該 repo 的 release 或 tag 數
    • repo_lisence: 該 repo 的授權條款
    • repo_readme: 該 repo 的 README 內容
    • repo_has_null: 該 repo 有至少一個欄位是 null
    • repo_has_readme: 該 repo 是否有 README 檔案
    • repo_has_changelog: 該 repo 是否有 CHANGELOG 檔案
    • repo_has_contributing: 該 repo 是否有 CONTRIBUTING 檔案
    • repo_has_tests: 該 repo 是否有測試
    • repo_has_ci: 該 repo 是否有 CI
    • repo_has_dockerfile: 該 repo 是否有 Dockerfile
    • repo_is_unmaintained: 該 repo 是否不再維護了
    • repo_is_awesome: 該 repo 是否被收錄進任何的 awesome-xxx 列表裡
    • repo_is_vinta_starred: 該 repo 是否被 @vinta aka 本文的作者打星了
  • 交互特徵
    • user_starred_repos_count: 該用戶總共打星了多少 repo
    • user_avg_daily_starred_repos_count: 該用戶平均每天打星多少 repo
    • user_forked_repos_count: 該用戶總共 fork 了多少 repo
    • user_follower_following_count_ratio: 該用戶的 follower 數和 following 數的比例
    • user_recent_searched_keywords: 該用戶最近搜尋的 50 個關鍵字
    • user_recent_commented_repos: 該用戶最近留言的 50 個 repo
    • user_recent_watched_repos: 該用戶最近訂閱的 50 個 repo
    • user_recent_starred_repos_descriptions: 該用戶最近打星的 50 個 repo 的描述
    • user_recent_starred_repos_languages: 該用戶最近打星的 50 個 repo 的語言
    • user_recent_starred_repos_topics: 該用戶最近打星的 50 個 repo 的標籤
    • user_follows_repo_owner: 該用戶是否追蹤該 repo 的作者
    • repo_language_index_in_user_recent_repo_languages: 該 repo 的語言出現在該用戶最近打星的語言列表的順序
    • repo_language_count_in_user_recent_repo_languages: 該 repo 的語言出現在該用戶最近打星的語言列表的次數
    • repo_topics_user_recent_topics_similarity: 該 repo 的標籤與該用戶最近打星的標籤列表的相似度
  • 上下文特徵
    • als_model_prediction: 來自 ALS 模型的預測值,該用戶對該 repo 的偏好程度
    • gbdt_model_index: 來自 GBDT 模型的 tree index,該 observation 的自動特徵

Feature Engineering 特徵工程中常見的方法
https://vinta.ws/code/feature-engineering.html

Detect Outliers

除了缺失值之外,離群值(異常值)也是需要注意的地方。如果是 continuous 特徵,用 box plot 可以很快地發現離群值;如果是 categorical 特徵,可以 SELECT COUNT(*) ... GROUP BY 一下,然後用 bar chart 查看每個 category 的數量。取決於你所要解決的問題,異常值可能可以直接忽略,也可能需要特別對待,例如搞清楚異常值出現的原因,是資料採集時的差錯或是某種隱含的深層的因素之類的。

ref:
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
https://www.slideshare.net/tw_dsconf/123-70852901

Impute Missing Values

可以利用 df.describe().show() 查看各個欄位的統計數據:countmeanstddevminmax。除了使用 df.where("some_column IS NULL") 之外,比較不同欄位的 count 差異也可以很快地發現哪些欄位有缺失值。順便觀察一下有缺失值的欄位和 target variable 有沒有什麼關聯。

這裡直接對 nullNaN 數據填充缺失值,因為以下幾個欄位都是字串類型,所以直接改成空字串,方便後續的處理。然後順便做一個 has_null 的特徵。

針對 user:

import org.apache.spark.sql.functions._

val nullableColumnNames = Array("user_name", "user_company", "user_blog", "user_location", "user_bio")

val imputedUserInfoDF = rawUserInfoDS
  .withColumn("user_has_null", when(nullableColumnNames.map(rawUserInfoDS(_).isNull).reduce(_ || _), true).otherwise(false))
  .na.fill("", nullableColumnNames)

針對 repo:

import org.apache.spark.sql.functions._

val nullableColumnNames = Array("repo_description", "repo_homepage")

val imputedRepoInfoDF = rawRepoInfoDS
  .withColumn("repo_has_null", when(nullableColumnNames.map(rawRepoInfoDS(_).isNull).reduce(_ || _), true).otherwise(false))
  .na.fill("", nullableColumnNames)

ref:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions

如果是數值類型的欄位可以考慮使用 Imputer

ref:
https://spark.apache.org/docs/latest/ml-features.html#imputer

Clean Data

針對 user,用 User-defined Function 對幾個文字欄位做一些正規化的處理:

import org.apache.spark.sql.functions._

val cleanUserInfoDF = imputedUserInfoDF
  .withColumn("user_clean_company", cleanCompanyUDF($"user_company"))
  .withColumn("user_clean_location", cleanLocationUDF($"user_location"))
  .withColumn("user_clean_bio", lower($"user_bio"))

針對 repo,過濾掉一些 repo_stargazers_count 太多和太少、description 欄位含有 "unmaintained" 或 "assignment" 等字眼的項目:

val reducedRepoInfo = imputedRepoInfoDF
  .where($"repo_is_fork" === false)
  .where($"repo_forks_count" <= 90000)
  .where($"repo_stargazers_count".between(30, 100000))

val unmaintainedWords = Array("%unmaintained%", "%no longer maintained%", "%deprecated%", "%moved to%")
val assignmentWords = Array("%assignment%", "%作業%", "%作业%")
val demoWords = Array("test", "%demo project%")
val blogWords = Array("my blog")

val cleanRepoInfoDF = reducedRepoInfo
  .withColumn("repo_clean_description", lower($"repo_description"))
  .withColumn("repo_is_unmaintained", when(unmaintainedWords.map($"repo_clean_description".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("repo_is_assignment", when(assignmentWords.map($"repo_clean_description".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("repo_is_demo", when(demoWords.map($"repo_clean_description".like(_)).reduce(_ or _) and $"repo_stargazers_count" <= 40, true).otherwise(false))
  .withColumn("repo_is_blog", when(blogWords.map($"repo_clean_description".like(_)).reduce(_ or _) and $"repo_stargazers_count" <= 40, true).otherwise(false))
  .where($"repo_is_unmaintained" === false)
  .where($"repo_is_assignment" === false)
  .where($"repo_is_demo" === false)
  .where($"repo_is_blog" === false)
  .withColumn("repo_clean_language", lower($"repo_language"))
  .withColumn("repo_clean_topics", lower($"repo_topics"))

Construct Features

針對 user,根據上述的「發想特徵」,製作出新的特徵:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val webThings = Array("web", "fullstack", "full stack")
val backendThings = Array("backend", "back end", "back-end")
val frontendThings = Array("frontend", "front end", "front-end")
val mobileThings = Array("mobile", "ios", "android")
val devopsThings = Array("devops", "sre", "admin", "infrastructure")
val dataThings = Array("machine learning", "deep learning", "data scien", "data analy")
val recsysThings = Array("data mining", "recommend", "information retrieval")

val leadTitles = Array("team lead", "architect", "creator", "director", "cto", "vp of engineering")
val scholarTitles = Array("researcher", "scientist", "phd", "professor")
val freelancerTitles = Array("freelance")
val juniorTitles = Array("junior", "beginner", "newbie")
val pmTitles = Array("product manager")

val userStarredReposCountDF = rawStarringDS
  .groupBy($"user_id")
  .agg(count("*").alias("user_starred_repos_count"))

val starringRepoInfoDF = rawStarringDS
  .select($"user_id", $"repo_id", $"starred_at")
  .join(rawRepoInfoDS, Seq("repo_id"))

val userTopLanguagesDF = starringRepoInfoDF
  .withColumn("rank", rank.over(Window.partitionBy($"user_id").orderBy($"starred_at".desc)))
  .where($"rank" <= 50)
  .groupBy($"user_id")
  .agg(collect_list(lower($"repo_language")).alias("user_recent_repo_languages"))
  .select($"user_id", $"user_recent_repo_languages")

val userTopTopicsDF = starringRepoInfoDF
  .where($"repo_topics" =!= "")
  .withColumn("rank", rank.over(Window.partitionBy($"user_id").orderBy($"starred_at".desc)))
  .where($"rank" <= 50)
  .groupBy($"user_id")
  .agg(concat_ws(",", collect_list(lower($"repo_topics"))).alias("temp_user_recent_repo_topics"))
  .select($"user_id", split($"temp_user_recent_repo_topics", ",").alias("user_recent_repo_topics"))

val userTopDescriptionDF = starringRepoInfoDF
  .where($"repo_description" =!= "")
  .withColumn("rank", rank.over(Window.partitionBy($"user_id").orderBy($"starred_at".desc)))
  .where($"rank" <= 50)
  .groupBy($"user_id")
  .agg(concat_ws(" ", collect_list(lower($"repo_description"))).alias("user_recent_repo_descriptions"))
  .select($"user_id", $"user_recent_repo_descriptions")

val constructedUserInfoDF = cleanUserInfoDF
  .withColumn("user_knows_web", when(webThings.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_knows_backend", when(backendThings.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_knows_frontend", when(frontendThings.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_knows_mobile", when(mobileThings.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_knows_devops", when(devopsThings.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_knows_data", when(dataThings.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_knows_recsys", when(recsysThings.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_is_lead", when(leadTitles.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_is_scholar", when(scholarTitles.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_is_freelancer", when(freelancerTitles.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_is_junior", when(juniorTitles.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_is_pm", when(pmTitles.map($"user_clean_bio".like(_)).reduce(_ or _), true).otherwise(false))
  .withColumn("user_followers_following_ratio", round($"user_followers_count" / ($"user_following_count" + lit(1.0)), 3))
  .withColumn("user_days_between_created_at_today", datediff(current_date(), $"user_created_at"))
  .withColumn("user_days_between_updated_at_today", datediff(current_date(), $"user_updated_at"))
  .join(userStarredReposCountDF, Seq("user_id"))
  .withColumn("user_avg_daily_starred_repos_count", round($"user_starred_repos_count" / ($"user_days_between_created_at_today" + lit(1.0)), 3))
  .join(userTopDescriptionDF, Seq("user_id"))
  .join(userTopTopicsDF, Seq("user_id"))
  .join(userTopLanguagesDF, Seq("user_id"))

針對 repo,根據上述的「發想特徵」,製作出新的特徵,意思到了就好:

import org.apache.spark.sql.functions._

val vintaStarredRepos = rawStarringDS
  .where($"user_id" === 652070)
  .select($"repo_id".as[Int])
  .collect()
  .to[List]

val constructedRepoInfoDF = cleanRepoInfoDF
  .withColumn("repo_has_activities_in_60days", datediff(current_date(), $"repo_pushed_at") <= 60)
  .withColumn("repo_has_homepage", when($"repo_homepage" === "", false).otherwise(true))
  .withColumn("repo_is_vinta_starred", when($"repo_id".isin(vintaStarredRepos: _*), true).otherwise(false))
  .withColumn("repo_days_between_created_at_today", datediff(current_date(), $"repo_created_at"))
  .withColumn("repo_days_between_updated_at_today", datediff(current_date(), $"repo_updated_at"))
  .withColumn("repo_days_between_pushed_at_today", datediff(current_date(), $"repo_pushed_at"))
  .withColumn("repo_subscribers_stargazers_ratio", round($"repo_subscribers_count" / ($"repo_stargazers_count" + lit(1.0)), 3))
  .withColumn("repo_forks_stargazers_ratio", round($"repo_forks_count" / ($"repo_stargazers_count" + lit(1.0)), 3))
  .withColumn("repo_open_issues_stargazers_ratio", round($"repo_open_issues_count" / ($"repo_stargazers_count" + lit(1.0)), 3))
  .withColumn("repo_text", lower(concat_ws(" ", $"repo_owner_username", $"repo_name", $"repo_language", $"repo_description")))

ref:
https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html

Convert Features

針對 user,這裡主要是對一些 categorical 特徵作 binning:

import org.apache.spark.sql.functions._

val companyCountDF = cleanUserInfoDF
  .groupBy($"user_clean_company")
  .agg(count("*").alias("count_per_user_company"))

val locationCountDF = cleanUserInfoDF
  .groupBy($"user_clean_location")
  .agg(count("*").alias("count_per_user_location"))

val transformedUserInfoDF = constructedUserInfoDF
  .join(companyCountDF, Seq("user_clean_company"))
  .join(locationCountDF, Seq("user_clean_location"))
  .withColumn("user_has_blog", when($"user_blog" === "", 0.0).otherwise(1.0))
  .withColumn("user_binned_company", when($"count_per_user_company" <= 5, "__other").otherwise($"user_clean_company"))
  .withColumn("user_binned_location", when($"count_per_user_location" <= 50, "__other").otherwise($"user_clean_location"))

針對 repo:

import org.apache.spark.sql.functions._

val languagesDF = cleanRepoInfoDF
  .groupBy($"repo_clean_language")
  .agg(count("*").alias("count_per_repo_language"))
  .select($"repo_clean_language", $"count_per_repo_language")
  .cache()

val transformedRepoInfoDF = constructedRepoInfoDF
  .join(languagesDF, Seq("repo_clean_language"))
  .withColumn("repo_binned_language", when($"count_per_repo_language" <= 30, "__other").otherwise($"repo_clean_language"))
  .withColumn("repo_clean_topics", split($"repo_topics", ","))

ref:
https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html

Prepare the Feature Pipeline

我們過濾掉那些打星了超多 repo 的用戶。從收集到的數據發現,有些用戶甚至打星了一兩萬個 repo,這些用戶可能是個爬蟲專用帳號或是他看到什麼就打星什麼,推薦系統對這樣的用戶來說可能沒什麼意義,還不如從數據集中拿掉。

import org.apache.spark.sql.functions._

val maxStarredReposCount = 2000

val userStarredReposCountDF = rawStarringDS
  .groupBy($"user_id")
  .agg(count("*").alias("user_starred_repos_count"))

val reducedStarringDF = rawStarringDS
  .join(userStarredReposCountDF, Seq("user_id"))
  .where($"user_starred_repos_count" <= maxStarredReposCount)
  .select($"user_id", $"repo_id", $"starred_at", $"starring")

val profileStarringDF = reducedStarringDF
  .join(userProfileDF, Seq("user_id"))
  .join(repoProfileDF, Seq("repo_id"))

Build the Feature Pipeline

把處理特徵的一連串流程寫成 Spark ML Pipeline,方便抽換或是加入新的 Transformer,例如 Standardization、One-hot Encoding 和 Word2Vec,也把 ALS 模型的預測值做為其中一項特徵。

import org.apache.spark.ml.feature._
import org.apache.spark.ml.recommendation.ALSModel
import ws.vinta.albedo.transformers.UserRepoTransformer

val profileStarringDF = reducedStarringDF
  .join(userProfileDF, Seq("user_id"))
  .join(repoProfileDF, Seq("repo_id"))
  .cache()

categoricalColumnNames += "user_id"
categoricalColumnNames += "repo_id"

val userRepoTransformer = new UserRepoTransformer()
  .setInputCols(Array("repo_language", "user_recent_repo_languages"))

continuousColumnNames += "repo_language_index_in_user_recent_repo_languages"
continuousColumnNames += "repo_language_count_in_user_recent_repo_languages"

val alsModelPath = s"${settings.dataDir}/${settings.today}/alsModel.parquet"
val alsModel = ALSModel.load(alsModelPath)
  .setUserCol("user_id")
  .setItemCol("repo_id")
  .setPredictionCol("als_score")
  .setColdStartStrategy("drop")

continuousColumnNames += "als_score"

val categoricalTransformers = categoricalColumnNames.flatMap((columnName: String) => {
  val stringIndexer = new StringIndexer()
    .setInputCol(columnName)
    .setOutputCol(s"${columnName}__idx")
    .setHandleInvalid("keep")

  val oneHotEncoder = new OneHotEncoder()
    .setInputCol(s"${columnName}__idx")
    .setOutputCol(s"${columnName}__ohe")
    .setDropLast(false)

  Array(stringIndexer, oneHotEncoder)
})

val listTransformers = listColumnNames.flatMap((columnName: String) => {
  val countVectorizerModel = new CountVectorizer()
    .setInputCol(columnName)
    .setOutputCol(s"${columnName}__cv")
    .setMinDF(10)
    .setMinTF(1)

  Array(countVectorizerModel)
})

val textTransformers = textColumnNames.flatMap((columnName: String) => {
  val hanLPTokenizer = new HanLPTokenizer()
    .setInputCol(columnName)
    .setOutputCol(s"${columnName}__words")
    .setShouldRemoveStopWords(true)

  val stopWordsRemover = new StopWordsRemover()
    .setInputCol(s"${columnName}__words")
    .setOutputCol(s"${columnName}__filtered_words")
    .setStopWords(StopWordsRemover.loadDefaultStopWords("english"))
  val word2VecModelPath = s"${settings.dataDir}/${settings.today}/word2VecModel.parquet"
  val word2VecModel = Word2VecModel.load(word2VecModelPath)
    .setInputCol(s"${columnName}__filtered_words")
    .setOutputCol(s"${columnName}__w2v")

  Array(hanLPTokenizer, stopWordsRemover, word2VecModel)
})

val finalBooleanColumnNames = booleanColumnNames.toArray
val finalContinuousColumnNames = continuousColumnNames.toArray
val finalCategoricalColumnNames = categoricalColumnNames.map(columnName => s"${columnName}__ohe").toArray
val finalListColumnNames = listColumnNames.map(columnName => s"${columnName}__cv").toArray
val finalTextColumnNames = textColumnNames.map(columnName => s"${columnName}__w2v").toArray
val vectorAssembler = new SimpleVectorAssembler()
  .setInputCols(finalBooleanColumnNames ++ finalContinuousColumnNames ++ finalCategoricalColumnNames ++ finalListColumnNames ++ finalTextColumnNames)
  .setOutputCol("features")

val featureStages = mutable.ArrayBuffer.empty[PipelineStage]
featureStages += userRepoTransformer
featureStages += alsModel
featureStages ++= categoricalTransformers
featureStages ++= listTransformers
featureStages ++= textTransformers
featureStages += vectorAssembler

val featurePipeline = new Pipeline().setStages(featureStages.toArray)
val featurePipelineModel = featurePipeline.fit(profileStarringDF)

ref:
https://spark.apache.org/docs/latest/ml-pipeline.html
https://spark.apache.org/docs/latest/ml-features.html

Handle Imbalanced Data

因為我們要訓練一個 Binary Classification 二元分類模型,會同時需要 positive(正樣本)和 negative(負樣本)。但是我們的原始數據 rawStarringDS 都是正樣本,也就是說我們只有「用戶有對哪些 repo 打星的資料」(正樣本),卻沒有「用戶沒有對哪些 repo 打星的資料」(負樣本)。我們當然是可以用「所有用戶沒有打星的 repo 做為負樣本」,但是考慮到這種做法產生的負樣本的數量實在太大,而且也不太合理,因為那些用戶沒有打星的 repo 不見得是因為他不喜歡,可能只是因為他不知道有那個 repo 存在。

我們後來採用的做法是「用熱門但是用戶沒有打星的 repo 做為負樣本」,我們寫了一個 Spark Transformer 來做這件事:

import ws.vinta.albedo.transformers.NegativeBalancer

import scala.collection.mutable

val sc = spark.sparkContext

val popularReposDS = loadPopularRepoDF()
val popularRepos = popularReposDS
  .select($"repo_id".as[Int])
  .collect()
  .to[mutable.LinkedHashSet]
val bcPopularRepos = sc.broadcast(popularRepos)

val negativeBalancer = new NegativeBalancer(bcPopularRepos)
  .setUserCol("user_id")
  .setItemCol("repo_id")
  .setTimeCol("starred_at")
  .setLabelCol("starring")
  .setNegativeValue(0.0)
  .setNegativePositiveRatio(2.0)
val balancedStarringDF = negativeBalancer.transform(reducedStarringDF)

ref:
https://github.com/vinta/albedo/blob/master/src/main/scala/ws/vinta/albedo/evaluators/RankingEvaluator.scala
http://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html

Split Data

直接使用 holdout 的方式,隨機分配不同的 row 到 training set 和 test set。其他的做法可能是根據時間來拆分,用以前的數據來預測之後的行為。

val profileBalancedStarringDF = balancedStarringDF
  .join(userProfileDF, Seq("user_id"))
  .join(repoProfileDF, Seq("repo_id"))

val tmpDF = featurePipelineModel.transform(profileBalancedStarringDF)
val keepColumnName = tmpDF.columns.filter((columnName: String) => {
  !columnName.endsWith("__idx") &&
  !columnName.endsWith("__ohe") &&
  !columnName.endsWith("__cv") &&
  !columnName.endsWith("__words") &&
  !columnName.endsWith("__filtered_words") &&
  !columnName.endsWith("__w2v")
})
val featuredBalancedStarringDF = tmpDF.select(keepColumnName.map(col): _*)

val Array(trainingFeaturedDF, testFeaturedDF) = featuredBalancedStarringDF.randomSplit(Array(0.9, 0.1))

Build the Model Pipeline

為了方便之後的擴充性,這裡也使用 Spark ML Pipeline 的寫法。Spark ML 的 LogisticRegression 可以額外設置一個 weightCol 來調整不同 row 的權重。

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.{Pipeline, PipelineStage}

import scala.collection.mutable

val weightSQL = """
SELECT *,
       1.0 AS default_weight,
       IF (starring = 1.0, 0.9, 0.1) AS positive_weight,
       IF (starring = 1.0 AND datediff(current_date(), starred_at) <= 365, 0.9, 0.1) AS recent_starred_weight
FROM __THIS__
""".stripMargin
val weightTransformer = new SQLTransformer()
  .setStatement(weightSQL)

val lr = new LogisticRegression()
  .setMaxIter(200)
  .setRegParam(0.7)
  .setElasticNetParam(0.0)
  .setStandardization(true)
  .setLabelCol("starring")
  .setFeaturesCol("standard_features")
  .setWeightCol("recent_starred_weight")

val modelStages = mutable.ArrayBuffer.empty[PipelineStage]
modelStages += weightTransformer
modelStages += lr

val modelPipeline = new Pipeline().setStages(modelStages.toArray)
val modelPipelineModel = modelPipeline.fit(trainingFeaturedDF)

ref:
https://spark.apache.org/docs/latest/ml-classification-regression.html

Evaluate the Model: Classification

因為 Logistic Regression 是二元分類模型,所以我們可以用 Spark ML 的 BinaryClassificationEvaluator 來評估結果。不過因為我們做的是推薦系統,真正在乎的是 Top N 的排序問題,所以這裡的 AUC 的數值參考一下就好。

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val testRankedDF = modelPipelineModel.transform(testFeaturedDF)

val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
  .setRawPredictionCol("rawPrediction")
  .setLabelCol("starring")

val classificationMetric = binaryClassificationEvaluator.evaluate(testRankedDF)
println(s"${binaryClassificationEvaluator.getMetricName} = $classificationMetric")
// areaUnderROC = 0.9450631491281277

ref:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html

Generate Candidates

推薦系統的另外一個重要部分就是產生候選物品集,這裡我們使用以下幾種方式:

  • ALS: 協同過濾的推薦
  • Content-based: 基於內容的推薦
  • Popularity: 基於熱門的推薦

不過因為這篇文章的主題是排序和特徵工程的 Machine Learning Pipeline,所以產生候選物品集的部分就不多說了,有興趣的人可以直接看底下連結的 source code 或是這個系列的其他文章。

import ws.vinta.albedo.recommenders.ALSRecommender
import ws.vinta.albedo.recommenders.ContentRecommender
import ws.vinta.albedo.recommenders.PopularityRecommender

val topK = 30

val alsRecommender = new ALSRecommender()
  .setUserCol("user_id")
  .setItemCol("repo_id")
  .setTopK(topK)

val contentRecommender = new ContentRecommender()
  .setUserCol("user_id")
  .setItemCol("repo_id")
  .setTopK(topK)
  .setEnableEvaluationMode(true)

val popularityRecommender = new PopularityRecommender()
  .setUserCol("user_id")
  .setItemCol("repo_id")
  .setTopK(topK)

val recommenders = mutable.ArrayBuffer.empty[Recommender]
recommenders += alsRecommender
recommenders += contentRecommender
recommenders += popularityRecommender

val candidateDF = recommenders
  .map((recommender: Recommender) => recommender.recommendForUsers(testUserDF))
  .reduce(_ union _)
  .select($"user_id", $"repo_id")
  .distinct()

// 每個 Recommender 的結果類似這樣:
// +-------+-------+----------+------+
// |user_id|repo_id|score     |source|
// +-------+-------+----------+------+
// |652070 |1239728|0.6731846 |als   |
// |652070 |854078 |0.7187486 |als   |
// |652070 |1502338|0.70165294|als   |
// |652070 |1184678|0.7434903 |als   |
// |652070 |547708 |0.7956538 |als   |
// +-------+-------+----------+------+

ref:
https://github.com/vinta/albedo/blob/master/src/main/scala/ws/vinta/albedo/recommenders/ALSRecommender.scala
https://github.com/vinta/albedo/blob/master/src/main/scala/ws/vinta/albedo/recommenders/ContentRecommender.scala
https://github.com/vinta/albedo/blob/master/src/main/scala/ws/vinta/albedo/recommenders/PopularityRecommender.scala

Predict the Ranking

把這些候選物品集丟給我們訓練好的 Logistic Regression 模型來排序。結果中的 probability 欄位的第 0 項表示結果為 0 的機率(negative)、第 1 項表示結果為 1 的機率(positive)。

val profileCandidateDF = candidateDF
  .join(userProfileDF, Seq("user_id"))
  .join(repoProfileDF, Seq("repo_id"))

val featuredCandidateDF = featurePipelineModel
  .transform(profileCandidateDF)

val rankedCandidateDF = modelPipelineModel
  .transform(featuredCandidateDF)

// rankedCandidateDF 的結果類似這樣:
// +-------+--------+----------+----------------------------------------+
// |user_id|repo_id |prediction|probability                             |
// +-------+--------+----------+----------------------------------------+
// |652070 |83467664|1.0       |[0.12711894229094317,0.8728810577090568]|
// |652070 |55099616|1.0       |[0.1422859437320775,0.8577140562679224] |
// |652070 |42266235|1.0       |[0.1462014853157966,0.8537985146842034] |
// |652070 |78012800|1.0       |[0.15576081067098502,0.844239189329015] |
// |652070 |5928761 |1.0       |[0.16149848941925066,0.8385015105807493]|
// +-------+--------+----------+----------------------------------------+

ref:
https://stackoverflow.com/questions/37903288/what-do-colum-rawprediction-and-probability-of-dataframe-mean-in-spark-mllib

Evaluate the Model: Ranking

最後我們使用 Information Retrieval 領域中用來評價排序能力的指標 NDCG (Normalized Discounted Cumulative Gain) 來評估排序的結果。Spark MLlib 有現成的 RankingMetrics 可以用,但是它只適用於 RDD-based 的 API,所以我們改寫成適合 DataFrame-based 的 Evaluator

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import ws.vinta.albedo.evaluators.RankingEvaluator

val userActualItemsDF = reducedStarringDF
  .withColumn("rank", rank().over(Window.partitionBy($"user_id").orderBy($"starred_at".desc)))
  .where($"rank" <= topK)
  .groupBy($"user_id")
  .agg(collect_list($"repo_id").alias("items"))

val userPredictedItemsDF = rankedCandidateDF
  .withColumn("rank", rank().over(Window.partitionBy($"user_id").orderBy(toArrayUDF($"probability").getItem(1).desc)))
  .where($"rank" <= topK)
  .groupBy($"user_id")
  .agg(collect_list($"repo_id").alias("items"))

val rankingEvaluator = new RankingEvaluator(userActualItemsDF)
  .setMetricName("[email protected]")
  .setK(topK)
  .setUserCol("user_id")
  .setItemsCol("items")
val rankingMetric = rankingEvaluator.evaluate(userPredictedItemsDF)
println(s"${rankingEvaluator.getFormattedMetricName} = $rankingMetric")
// [email protected] = 0.021114356461615493

ref:
https://github.com/vinta/albedo/blob/master/src/main/scala/ws/vinta/albedo/evaluators/RankingEvaluator.scala
https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#ranking-systems
https://weekly.codetengu.com/issues/83#kOxuVxW

Spark best practices

Spark best practices

Architecture

A Spark application is a set of processes running on a cluster, all these processes are coordinated by the driver program. The driver program is 1) the process where the main() method of your program runs, 2) the process running the code that creates a SparkSession, RDDs, DataFrames, and stages up or sends off transformations and actions.

Those processes that run computations and store data for your application are executors. Executors are 1) returning computed results to the driver program, 2) provoiding in-memory storage for cached RDDs/DataFrames.

Execution of a Spark program:

  • The driver program runs the Spark application, which creates a SparkSession upon start-up.
  • The SparkSession connects to a cluster manager (e.g., YARN/Mesos) which allocates resources.
  • Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Actions are happening on workers.
  • The driver program sends your application code (e.g., functions which applied on map()) to the executors.
  • Transformations and actions are queued up and optimized by the driver program, sent to executors to run, and then the executors send results back to the driver program.

名詞解釋:

  • Driver program:就是 master node,負責建立 SparkSession,你的 Spark app 的 main() 就是跑在這上面
  • Worker node:cluster 裡除了 driver 之外的那些機器,實際執行分散式運算的機器,基本上一台機器就是一個 worker 或 slave
  • Executor:worker node 上的一個個 processes,通常一個 core 對應一個 executor
  • Cluster Manager: 負責管理資源,通常是 YARN。driver program 和 workers 之間會透過 cluster manager 來溝通

ref:
https://spark.apache.org/docs/latest/cluster-overview.html
https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html

Resilient Distributed Dataset (RDD)

RDDs are divided into partitions: each partition can be considered as an immutable subset of the entire RDD. When you execute your Spark program, each partition gets sent to a worker.

對 RDD 的操作可以分為兩種:

  • 所有 return RDD 的操作就是 transformation
    • 會 lazy 地建立新的 RDD,例如 .map().flatMap().filter()
  • 所有不是 return RDD 的就是 action
    • 會 eager 地執行操作,例如 reduce()count().collect()

someRDD.foreach(println) 為例,foreach 是個 action,因為它回傳的是 Unitprintln 實際上是 print 到 executor 的 stdout 了,所以你在 driver program 根本看不到。除非你改成 someRDD.collect().foreach(println)

someRDD.take(10) 為例,take(10) 實際上是在 executors 上被運算出來的,但是會把那 10 筆結果傳回到 driver program 上。

ref:
https://zhangyi.gitbooks.io/spark-in-action/content/chapter2/rdd.html

logRDD.filter(line => line.contains("ERROR")).take(10) 為例,filter() 是 lazy 的好處是,Spark 知道最後只需要 take(10),所以當它 filter 集滿 10 個符合條件的 lines 時就可以不用繼續執行下去了,不需要對整個 dataset 做 filter。

logRDD.map(_.toLowerCase).filter(_.contains("error")).count() 為例,雖然 map()filter() 是兩個操作,但是因為它們都是 lazy evaluation,所以 Spark 能夠在 count() 階段判斷,其實可以在讀取每一行 log 的時候同時做 .toLowerCase.contains("error")

By default, RDDs are recomputed each time you run an action on them! Spark allows you to control what should be cached in memory.

ref:
https://www.coursera.org/learn/scala-spark-big-data/lecture/0GZV7/evaluation-in-spark-unlike-scala-collections

Shuffling

Moving data from one node to another across network is called "shuffling". A shuffle can occur when the resulting RDD depends on other elements from the same RDD or another RDD, you can also figure out whether a shuffle has been planned via 1)the return type of certain transformations (e.g. ShuffledRDD); 2) using toDebugString to see its execution plan.

Operatons that might cause a shuffle:

  • cogroup
  • groupWith
  • join
  • leftOuterJoin
  • rightOuterJoin
  • groupByKey
  • reduceByKey
  • combineByKey
  • distinct
  • intersection
  • repartition
  • coalesce

Narrow dependencies:

  • map
  • mapValues
  • flatMap
  • filter
  • mapPartitions
  • mapPartitionsWithIndex

Wide dependencies:

  • cogroup
  • groupWith
  • join
  • leftOuterJoin
  • rightOuterJoin
  • groupByKey
  • reduceByKey
  • combineByKey
  • distinct
  • intersection

ref:
https://www.coursera.org/learn/scala-spark-big-data/lecture/bT1YR/shuffling-what-it-is-and-why-its-important

Partitioning

Partitions never span multiple machines, data in the same partition are guaranteed to be on the same machine. The default number of partitions is the total number of cores on all executor nodes.

Customizing partitions is only possible when working with Pair RDD, because of partitioning is done based on keys. The most importing thing is that you must cache() or persist() your RDDs after re-partitioning.

Following operations hold to (and propagate) a partitioner:

  • cogroup
  • groupWith
  • groupByKey
  • reduceByKey
  • foldByKey
  • combineByKey
  • partitionBy
  • join
  • sort
  • mapValues (if parent has a partitioner)
  • flatMapValues (if parent has a partitioner)
  • filter (if parent has a partitioner)

All other operations (e.g. map()) will produce a result without a partitioner.

ref:
https://www.coursera.org/learn/scala-spark-big-data/lecture/Vkhm0/partitioning

You can chain a call to .repartition(n) after reading the text file to specify a different and larger number of partitions. You might set this higher to match the number of cores in your cluster, for example.
make num of partitions equal the num of cores in your cluster

ref:
https://www.safaribooksonline.com/library/view/learning-apache-spark/9781785885136/ch01s04.html
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/sparksqlshufflepartitions_draft.html

It's really important to repartition your dataset if you are going to cache it and use for queries. The optimalmal number of partitions is around 4-6 for each executor core, with 40 nodes and 6 executor cores we use 1000 partitions for best performance.

ref:
https://databricks.com/blog/2015/10/13/interactive-audience-analytics-with-apache-spark-and-hyperloglog.html
https://stackoverflow.com/questions/40416357/spark-sql-difference-between-df-repartition-and-dataframewriter-partitionby

Dataset

Dataset 的 transformation API 分成 untyped 和 typed,使用了 untyped API 之後會回傳 DataFrame,會失去 type;typed API 則會回傳 Dataset。如果是像 select() 這樣的 API,只要顯式地加上 type casting,例如 ds.select($"name".as[String], $"age".as[Int]),回傳的東西就會是 Dataset 而不是 DataFrame。

Untyped transformations:

  • 幾乎所有那些 DataFrame 可以用的 transformations,例如 groupBy()

Typed transformations:

  • map()
  • flatMap()
  • reduce()
  • groupByKey()
  • agg()
  • mapGroups()
  • flatMapGroups()
  • reduceGroups()

Datasets don't get all of the optimization the DataFrames get!

盡量使用 Spark SQL 的寫法(即 relational operations),例如 filter($"city".as[String] === "Boston")select($"age".as[Int]) 等,少用 higher-order functions(即 functional operations),例如 filter(p => p.city == "Boston")map() 等,Catalyst 對後者的優化不好。

ref:
https://www.coursera.org/learn/scala-spark-big-data/lecture/yrfPh/datasets
https://www.51zero.com/blog/2016/2/24/type-safety-on-spark-dataframes-part-1

Encoders

Encoders are what convert your data between JVM objects and Spark SQL's specilized internal representation. They're required by all Datasets.

Two ways to introduce Encoders:

  • Automaticcly via import spark.implicits._
  • Explicitly via org.apache.spark.sql.Encoder

ref:
https://www.coursera.org/learn/scala-spark-big-data/lecture/yrfPh/datasets

Broadcast

某個 DataFrame 小到一台機器可以吃得下,就可以 broadcast 它。

ref:
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/when_to_use_broadcast_variable.html

Spark UI

If your dataset is large, you can try repartitioning to a larger number to allow more parallelism on your job. A good indication of this is if in the Spark UI – you don’t have a lot of tasks, but each task is very slow to complete.

ref:
https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html
https://databricks.com/blog/2016/10/18/7-tips-to-debug-apache-spark-code-faster-with-databricks.html
https://docs.databricks.com/spark/latest/rdd-streaming/debugging-streaming-applications.html

Configurations

ref:
https://spark.apache.org/docs/latest/configuration.html
https://spark.apache.org/docs/latest/tuning.html#memory-tuning

My spark-defaults.conf

spark.driver.maxResultSize       2g
spark.jars.packages              com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41,org.apache.httpcomponents:httpclient:4.5.2,org.elasticsearch.client:elasticsearch-rest-high-level-client:5.6.2
spark.kryoserializer.buffer.max  1g
spark.serializer                 org.apache.spark.serializer.KryoSerializer

Allocate Resources (Executors, Cores, and Memory)

When submitting a Spark application via spark-submit, you may specify following options:

  • --driver-memory MEM: Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  • --executor-memory MEM: Memory per executor (e.g. 1000M, 2G) (Default: 1G).

YARN and Spark standalone only:

  • --executor-cores NUM: Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)

YARN only:

  • --driver-cores NUM: Number of cores used by the driver, only in cluster mode (Default: 1).
  • --num-executors NUM: Number of executors to launch (Default: 2).

計算方式:

driver: 2cores 7.5GB x 1
worker: 8cores 30GB x 4

  • actual memory for driver = driver memory - (10% overhead x 2)
    • 7.5 - (7.5 x 0.1 x 2) = 6
    • spark.driver.memory=6g
  • cores per executor = 5
    • for good HDFS throughput
    • spark.executor.cores=5
    • --executor-cores 5
  • total cores in cluster = (cores per node - 1) x total nodes
    • leave 1 core per node for YARN daemons
    • (8 - 1) x 4 = 28
  • total executors in cluster = (total cores in cluster / cores per executor) - 1
    • leave 1 executor for ApplicationManager
    • (28 / 5) - 1 = 4
    • spark.executor.instances=4
    • --num-executors 4
  • executors per node = (total executors in cluster + 1) / total nodes
    • (4 + 1) / 4 = 1
  • memory per executor = (memory per node / executors per node) - (10% overhead x 2) - 3
    • (30 / 1) - (30 x 0.1 x 2) - 3 = 21
    • spark.executor.memory=21g

ref:
https://spark.apache.org/docs/latest/submitting-applications.html
https://spark.apache.org/docs/latest/configuration.html
https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application
https://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications/21
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Build a recommender system with Spark: Content-based and Elasticsearch

Build a recommender system with Spark: Content-based and Elasticsearch

在這個系列的文章裡,我們將使用 Apache Spark、XGBoost、Elasticsearch 和 MySQL 等工具來搭建一個推薦系統的 Machine Learning Pipeline。推薦系統的組成可以粗略地分成 Candidate Generation 和 Ranking 兩個部分,前者是針對用戶產生候選物品集,常用的方法有 Collaborative Filtering、Content-based、標籤配對、熱門排行或人工精選等;後者則是對這些候選物品排序,以 Top N 的方式呈現最終的推薦結果,常用的方法有 Logistic Regression。

在本篇文章中,我們將以 Candidate Generation 階段常用的方法之一:Content-based recommendation 基於內容的推薦為例,利用 Elasticsearch 的 More Like This query 建立一個 GitHub repositories 的推薦系統,以用戶最近打星過的 repo 作為輸入數據,比對出相似的其他 repo 作為候選物品集。

題外話,我原本是打算用 Spark 把 repo 的文本資料轉成 Word2Vec 向量,然後事先計算好各個 repo 之間的相似度(所謂的 Similarity Join),但是要計算這麼多 repo 之間的相似度實在太耗時間和機器了,就算用了 DIMSUM 和 Locality Sensitive Hashing (LSH) 的 Approximate Nearest Neighbor Search 的效果也不是很好。後來一想,尋找相似或相關物品這件事不就是搜尋引擎在做的嗎,所以直接把 repo 的各種資料丟進 Elasticsearch,用 document id 當作搜尋條件,一個 More Like This query 就解決了,爽快。畢竟不需要所有的事情都在 Spark 裡解決嘛。

完整的程式碼可以在 https://github.com/vinta/albedo 找到。

系列文章:

Setup Elasticsearch

為了讓事情簡單一點,我們直接用官方包裝好的 Docker image。另外要注意的是,Elasticsearch 5.x/6.x 跟之前的版本比起來有不小的改動,例如 X-Pack、high-level REST client 和以後每個 index 只能有一個 mapping type 等等,建議大家有空可以翻一下文件。

# in elasticsearch.yml
bootstrap.memory_lock: true
cluster.name: albedo
discovery.type: single-node
http.host: 0.0.0.0
node.name: ${HOSTNAME}
xpack.security.enabled: false
# in docker-compose.yml
version: "3"
services:
  django:
    build: .
    hostname: django
    working_dir: /app
    env_file: .docker-assets/django.env
    command: .docker-assets/django_start.sh
    ports:
      - 8000:8000
    volumes:
      - ".:/app"
      - "../albedo-vendors/bin:/usr/local/bin"
      - "../albedo-vendors/dist-packages:/usr/local/lib/python3.5/dist-packages"
    links:
      - mysql
      - elasticsearch
  mysql:
    image: vinta/mysql:5.7
    hostname: mysql
    env_file: .docker-assets/mysql.env
    command: mysqld --character-set-server=utf8 --collation-server=utf8_unicode_ci
    ports:
      - 3306:3306
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:5.6.2
    ports:
      - 9200:9200
      - 9300:9300
    volumes:
      - "./.docker-assets/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml"
    environment:
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
$ docker-compose up

然後就可以在 http://127.0.0.1:9200/ 存取你的 Elasticsearch cluster 了。

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/docker.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/security-settings.html

Define the Mapping (Data Schema)

這裡用 elasticsearch-dsl-py 定義了一個 index 和 mapping type。

from elasticsearch.helpers import bulk
from elasticsearch_dsl import analyzer
from elasticsearch_dsl import Date, Integer, Keyword, Text, Boolean
from elasticsearch_dsl import Index, DocType
from elasticsearch_dsl.connections import connections

client = connections.create_connection(hosts=['elasticsearch'])

repo_index = Index('repo')
repo_index.settings(
    number_of_shards=1,
    number_of_replicas=0
)

text_analyzer = analyzer(
    'text_analyzer',
    char_filter=["html_strip"],
    tokenizer="standard",
    filter=["asciifolding", "lowercase", "snowball", "stop"]
)
repo_index.analyzer(text_analyzer)

@repo_index.doc_type
class RepoInfoDoc(DocType):
    owner_id = Keyword()
    owner_username = Keyword()
    owner_type = Keyword()
    name = Text(text_analyzer, fields={'raw': Keyword()})
    full_name = Text(text_analyzer, fields={'raw': Keyword()})
    description = Text(text_analyzer)
    language = Keyword()
    created_at = Date()
    updated_at = Date()
    pushed_at = Date()
    homepage = Keyword()
    size = Integer()
    stargazers_count = Integer()
    forks_count = Integer()
    subscribers_count = Integer()
    fork = Boolean()
    has_issues = Boolean()
    has_projects = Boolean()
    has_downloads = Boolean()
    has_wiki = Boolean()
    has_pages = Boolean()
    open_issues_count = Integer()
    topics = Keyword(multi=True)

    class Meta:
        index = repo_index._name

    @classmethod
    def bulk_save(cls, documents):
        dicts = (d.to_dict(include_meta=True) for d in documents)
        return bulk(client, dicts)

    def save(self, **kwargs):
        return super(RepoInfoDoc, self).save(**kwargs)

RepoInfoDoc.init()

Elasticsearch: More than a Search Engine
https://vinta.ws/code/elasticsearch-more-than-a-search-engine.html

ref:
https://github.com/elastic/elasticsearch-dsl-py

Import Data into Elasticsearch

你可以透過很多種手段把存在 MySQL 裡的資料倒進 Elasticsearch,例如 cronjob、Celery 或 MySQL binglog replication,不過因為我們主要的 data models 是用 Django ORM 寫的,這裡就簡單地寫個 Django command 把資料倒進去就好。

from django.core.management.base import BaseCommand

from app.mappings import RepoInfoDoc
from app.models import RepoInfo

class Command(BaseCommand):
    def handle(self, *args, **options):
        def batch_qs(qs, batch_size=500):
            total = qs.count()
            for start in range(0, total, batch_size):
                end = min(start + batch_size, total)
                yield (start, end, total, qs[start:end])

        large_qs = RepoInfo.objects.filter(stargazers_count__gte=10, stargazers_count__lte=290000, fork=False)
        for start, end, total, qs_chunk in batch_qs(large_qs):
            documents = []
            for repo_info in qs_chunk:
                repo_info_doc = RepoInfoDoc()
                repo_info_doc.meta.id = repo_info.id
                repo_info_doc.owner_id = repo_info.owner_id
                repo_info_doc.owner_username = repo_info.owner_username
                repo_info_doc.owner_type = repo_info.owner_type
                repo_info_doc.name = repo_info.name
                repo_info_doc.full_name = repo_info.full_name
                repo_info_doc.description = repo_info.description
                repo_info_doc.language = repo_info.language
                repo_info_doc.created_at = repo_info.created_at
                repo_info_doc.updated_at = repo_info.updated_at
                repo_info_doc.pushed_at = repo_info.pushed_at
                repo_info_doc.homepage = repo_info.homepage
                repo_info_doc.size = repo_info.size
                repo_info_doc.stargazers_count = repo_info.stargazers_count
                repo_info_doc.forks_count = repo_info.forks_count
                repo_info_doc.subscribers_count = repo_info.subscribers_count
                repo_info_doc.fork = repo_info.fork
                repo_info_doc.has_issues = repo_info.has_issues
                repo_info_doc.has_projects = repo_info.has_projects
                repo_info_doc.has_downloads = repo_info.has_downloads
                repo_info_doc.has_wiki = repo_info.has_wiki
                repo_info_doc.has_pages = repo_info.has_pages
                repo_info_doc.open_issues_count = repo_info.open_issues_count
                repo_info_doc.topics = repo_info.topics

                documents.append(repo_info_doc)

            RepoInfoDoc.bulk_save(documents)

noplay/python-mysql-replication
https://github.com/noplay/python-mysql-replication

Find Similar Items

因為之後會在 Spark 裡作為推薦系統的候選物品集的來源之一,我們會把 Elasticsearch 的 More Like This API 封裝成一個 Spark 的 Transformer,所以以下的部分是用 Scala 寫的。

Initialize High-level REST Client

Elasticsearch 5.x 之後官方建議使用 High-level REST Client,用法跟以前 Java 的 TransportClient 稍微有點不同。

import org.apache.http.HttpHost
import org.elasticsearch.client.{RestClient, RestHighLevelClient}

val lowClient = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")).build()
val highClient = new RestHighLevelClient(lowClient)

ref:
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-low-usage-initialization.html
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-getting-started-initialization.html

Perform the More Like This Query

我們會輸入一個 userDF,是一個要產生候選物品集的用戶的 DataFrame,然後會先拿到每個用戶最近打星過的 repo 的列表,repo id 就是 Elasticsearch 的 document id,以此為條件用 More Like This query 找出相似的其他 repo。

val userRecommendedItemDF = userDF
  .flatMap {
    case (userId: Int) => {
      val itemIds = selectUserStarredRepos(userId)

      val lowClient = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")).build()
      val highClient = new RestHighLevelClient(lowClient)

      val fields = Array("description", "full_name", "language", "topics")
      val texts = Array("")
      val items = itemIds.map((itemId: Int) => new Item("repo", "repo_info_doc", itemId.toString))
      val queryBuilder = moreLikeThisQuery(fields, texts, items)
        .minTermFreq(1)
        .maxQueryTerms(20)

      val searchSourceBuilder = new SearchSourceBuilder()
      searchSourceBuilder.query(queryBuilder)
      searchSourceBuilder.from(0)
      searchSourceBuilder.size($(topK))

      val searchRequest = new SearchRequest()
      searchRequest.indices("repo")
      searchRequest.types("repo_info_doc")
      searchRequest.source(searchSourceBuilder)

      val searchResponse = highClient.search(searchRequest)
      val hits = searchResponse.getHits
      val searchHits = hits.getHits

      val userItemScoreTuples = searchHits.map((searchHit: SearchHit) => {
        val itemId = searchHit.getId.toInt
        val score = searchHit.getScore
        (userId, itemId, score)
      })

      lowClient.close()

      userItemScoreTuples
    }
  }
  .toDF($(userCol), $(itemCol), $(scoreCol))
  .withColumn($(sourceCol), lit(source))

userRecommendedItemDF.show()
// +-------+--------+---------+-------+
// |user_id|repo_id |score    |source |
// +-------+--------+---------+-------+
// |652070 |26152923|44.360096|content|
// |652070 |28451314|38.752697|content|
// |652070 |16175350|35.676353|content|
// |652070 |10885469|30.280012|content|
// |652070 |24037308|28.488512|content|
// +-------+--------+---------+-------+

你可以在 GitHub 找到完整的程式碼
https://github.com/vinta/albedo/blob/master/src/main/scala/ws/vinta/albedo/ContentRecommenderBuilder.scala

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/java-specialized-queries.html