
Timezone in Python: Offset-naive and Offset-aware datetimes

2018-02-04 | 2026-03-17 | Vinta | Python

TL;DR: Always store datetimes in UTC and convert them to the appropriate timezone only for display.

A timezone offset is the difference between a timezone's local time and Coordinated Universal Time (UTC), usually expressed in hours and minutes. The offset of UTC is +00:00, and the offset of the Asia/Taipei timezone is +08:00, written UTC+08:00 (you could also write it as GMT+08:00). For everyday purposes there is no perceptible difference between Greenwich Mean Time (GMT) and UTC.

Local time minus its timezone offset gives UTC time. For instance, 18:00+08:00 in Asia/Taipei minus the offset +08:00 is 10:00+00:00, that is, 10 o'clock UTC. Conversely, UTC time plus the local timezone offset gives local time.
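
The arithmetic above can be checked with the standard library alone. This is a minimal sketch that assumes a fixed +08:00 offset stands in for Asia/Taipei; the variable names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed +08:00 offset stands in for Asia/Taipei.
TAIPEI_OFFSET = timedelta(hours=8)

local_naive = datetime(2018, 2, 4, 18, 0)  # wall-clock time in Taipei
utc_naive = local_naive - TAIPEI_OFFSET    # local time minus offset gives UTC
print(utc_naive)  # 2018-02-04 10:00:00

# The same relationship expressed with aware datetimes.
local_aware = local_naive.replace(tzinfo=timezone(TAIPEI_OFFSET))
utc_aware = local_aware.astimezone(timezone.utc)
print(utc_aware.isoformat())  # 2018-02-04T10:00:00+00:00
```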

ref:
https://opensource.com/article/17/5/understanding-datetime-python-primer
https://julien.danjou.info/blog/2015/python-and-timezones

Is it GMT+8 or UTC+8? (article in Chinese)
http://pansci.asia/archives/84978

Installation

$ pip install -U python-dateutil pytz tzlocal

Show System Timezone

import tzlocal

tzlocal.get_localzone()
# <DstTzInfo 'Asia/Taipei' LMT+8:06:00 STD>

tzlocal.get_localzone().zone
# 'Asia/Taipei'

from time import gmtime, strftime
print(strftime("%z", gmtime()))
# +0800

ref:
https://github.com/regebro/tzlocal
https://stackoverflow.com/questions/13218506/how-to-get-system-timezone-setting-and-pass-it-to-pytz-timezone/

Find Timezones Of A Certain Country

import pytz

pytz.country_timezones('tw')
# ['Asia/Taipei']

pytz.country_timezones('cn')
# ['Asia/Shanghai', 'Asia/Urumqi']

ref:
https://pythonhosted.org/pytz/#country-information

Offset-naive Datetime

A naive datetime is usually interpreted as local time, but it carries no tzinfo, so its meaning is ambiguous and it is a common source of bugs.

A naive datetime object contains no timezone information: datetime_obj.tzinfo is None. Datetime objects without a timezone should be treated as a "bug" in your application, because it is then up to the programmer to keep track of which timezone each value is in.

import datetime

import dateutil.parser

datetime.datetime.now()
# returns the current date and time in the local timezone, in this example: Asia/Taipei (UTC+08:00)
# datetime.datetime(2018, 2, 2, 9, 15, 6, 211358), naive

datetime.datetime.utcnow()
# returns the current date and time in UTC (deprecated since Python 3.12)
# datetime.datetime(2018, 2, 2, 1, 15, 6, 211358), naive

dateutil.parser.parse('2018-02-04T16:30:00')
# datetime.datetime(2018, 2, 4, 16, 30), naive

ref:
https://docs.python.org/3/library/datetime.html
https://dateutil.readthedocs.io/en/stable/

Offset-aware Datetime

An aware datetime object embeds timezone information. Rules of thumb for timezones in Python:

Always work with "offset-aware" datetime objects.
Always store datetime in UTC and do timezone conversion only when interacting with users.
Always use ISO 8601 as input and output string format.

There are two useful methods: pytz.utc.localize(naive_dt) converts a naive datetime into an offset-aware one, and aware_dt.astimezone(pytz.timezone('Asia/Taipei')) converts an offset-aware datetime to another timezone.

Avoid naive_dt.astimezone(some_tzinfo): the naive datetime is first assumed to be in the system timezone and only then converted to some_tzinfo, so the result depends on the machine the code runs on.

import datetime

import pytz

now_utc = pytz.utc.localize(datetime.datetime.utcnow())
# equivalent to datetime.datetime.now(pytz.utc)
# equivalent to datetime.datetime.utcnow().replace(tzinfo=datetime.timezone.utc)
# datetime.datetime(2018, 2, 4, 10, 17, 40, 679562, tzinfo=<UTC>), aware

now_taipei = now_utc.astimezone(pytz.timezone('Asia/Taipei'))
# convert to another timezone
# datetime.datetime(2018, 2, 4, 18, 17, 40, 679562, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>), aware

now_utc.isoformat()
# '2018-02-04T10:17:40.679562+00:00'

now_taipei.isoformat()
# '2018-02-04T18:17:40.679562+08:00'

now_utc == now_taipei
# True

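When working with pytz, call tz.localize(naive_dt) instead of naive_dt.replace(tzinfo=tz) or passing tzinfo=tz to the datetime constructor. Attaching a pytz timezone directly does not handle daylight saving time or historical offsets correctly: in the example below, dt2 ends up with the LMT+8:06 offset instead of CST+8:00.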

dt1 = datetime.datetime.now(pytz.timezone('Asia/Taipei'))
# datetime.datetime(2018, 2, 4, 18, 22, 28, 409332, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>), aware

dt2 = datetime.datetime(2018, 2, 4, 18, 22, 28, 409332, tzinfo=pytz.timezone('Asia/Taipei'))
# datetime.datetime(2018, 2, 4, 18, 22, 28, 409332, tzinfo=<DstTzInfo 'Asia/Taipei' LMT+8:06:00 STD>), aware

dt1 == dt2
# False

ref:
https://pythonhosted.org/pytz/

Naive and aware datetime objects cannot be ordered against each other: equality checks return False, and ordering comparisons raise TypeError.

naive = datetime.datetime.utcnow()
aware = pytz.utc.localize(naive)

naive == aware
# False

naive >= aware
# TypeError: can't compare offset-naive and offset-aware datetimes
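
One way to avoid the TypeError is to normalize everything to aware UTC datetimes before comparing. The helper below is a hypothetical sketch, not part of pytz or the standard library, and it assumes that naive values are already in UTC. It uses the standard library's fixed-offset datetime.timezone, for which replace(tzinfo=...) is safe.

```python
from datetime import datetime, timezone

def as_utc(dt, assume_tz=timezone.utc):
    """Return an aware datetime in UTC; naive input is assumed to be in assume_tz."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        dt = dt.replace(tzinfo=assume_tz)
    return dt.astimezone(timezone.utc)

naive = datetime(2018, 2, 4, 10, 0)
aware = datetime(2018, 2, 4, 10, 0, tzinfo=timezone.utc)

try:
    naive >= aware
except TypeError as exc:
    print(exc)  # mixing naive and aware values in an ordering comparison fails

print(as_utc(naive) >= aware)  # True
```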

Parse String to Datetime

python-dateutil usually comes in handy.

import dateutil.parser
import dateutil.tz

dt1 = dateutil.parser.parse('2018-02-04T19:30:00+08:00')
# datetime.datetime(2018, 2, 4, 19, 30, tzinfo=tzoffset(None, 28800)), aware

dt2 = dateutil.parser.parse('2018-02-04T11:30:00+00:00')
# datetime.datetime(2018, 2, 4, 11, 30, tzinfo=tzutc()), aware

dt3 = dateutil.parser.parse('2018-02-04T11:30:00Z')
# datetime.datetime(2018, 2, 4, 11, 30, tzinfo=tzutc()), aware

dt1 == dt2 == dt3
# True
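
If the input is known to be strict ISO 8601 with an explicit numeric offset, the standard library may be enough. This is a small sketch using datetime.fromisoformat (available since Python 3.7); note that the trailing "Z" form is only accepted from Python 3.11 onward, so it is left out here.

```python
from datetime import datetime, timedelta

dt1 = datetime.fromisoformat('2018-02-04T19:30:00+08:00')
dt2 = datetime.fromisoformat('2018-02-04T11:30:00+00:00')

print(dt1.utcoffset())  # 8:00:00
print(dt1 == dt2)       # True, both describe the same instant
```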

ref:
https://dateutil.readthedocs.io/en/stable/

Convert Datetime To Unix Timestamp

import datetime

naive_dt = datetime.datetime(2018, 9, 10, 0, 0, 0)
naive_timestamp = naive_dt.timestamp()
# naive_dt is interpreted in the local timezone, in this example: Asia/Taipei (UTC+08:00)

aware_dt = datetime.datetime(2018, 9, 10, 0, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(hours=8)))
aware_timestamp = aware_dt.timestamp()

naive_timestamp == aware_timestamp
# True, as long as the system timezone is UTC+08:00

# MongoDB stores all datetimes in UTC timezone
dt_fetched_from_mongodb.replace(tzinfo=datetime.timezone.utc).timestamp()
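
Unlike a naive datetime, an aware datetime produces the same timestamp on any machine. The sketch below uses a fixed +08:00 offset as a stand-in for Asia/Taipei, which is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone

taipei = timezone(timedelta(hours=8))  # assumed stand-in for Asia/Taipei
aware_dt = datetime(2018, 9, 10, 0, 0, 0, tzinfo=taipei)

ts = aware_dt.timestamp()
print(ts)  # 1536508800.0

# The same instant expressed in UTC yields the same timestamp.
same_instant_utc = aware_dt.astimezone(timezone.utc)
print(same_instant_utc.timestamp() == ts)  # True
```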

Parse Unix Timestamp To Datetime

import datetime
import time

import pytz

ts = time.time()
# seconds since the Epoch (1970-01-01T00:00:00 in UTC)
# 1517748706.063205

dt1 = datetime.datetime.fromtimestamp(ts)
# returns the date and time of the timestamp in the local timezone, in this example: Asia/Taipei (UTC+08:00)
# datetime.datetime(2018, 2, 4, 20, 51, 46, 63205), naive

dt2 = datetime.datetime.utcfromtimestamp(ts)
# returns the date and time of the timestamp in UTC
# datetime.datetime(2018, 2, 4, 12, 51, 46, 63205), naive

pytz.timezone('Asia/Taipei').localize(dt1) == pytz.utc.localize(dt2)
# True
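
To skip the naive intermediate step entirely, fromtimestamp() accepts a tz argument and returns an aware datetime directly. A minimal sketch, using whole seconds and an assumed fixed +08:00 offset for Taipei:

```python
from datetime import datetime, timedelta, timezone

ts = 1517748706  # whole seconds since the Epoch

dt_utc = datetime.fromtimestamp(ts, tz=timezone.utc)
dt_taipei = dt_utc.astimezone(timezone(timedelta(hours=8)))

print(dt_utc.isoformat())     # 2018-02-04T12:51:46+00:00
print(dt_taipei.isoformat())  # 2018-02-04T20:51:46+08:00
```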

ref:
https://stackoverflow.com/questions/13890935/does-pythons-time-time-return-the-local-or-utc-timestamp

We might receive a Unix timestamp from a JavaScript client.

var moment = require('moment')
var ts = moment('2018-02-02').unix()
// 1517500800

ref:
https://momentjs.com/docs/#/parsing/unix-timestamp/
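
On the Python side, the main thing to watch is the unit: moment's unix() returns seconds, while JavaScript's Date.now() returns milliseconds. The helper below is a hypothetical sketch (its name and the unit parameter are not from any library) that converts either form to an aware UTC datetime.

```python
from datetime import datetime, timedelta, timezone

def from_js_timestamp(value, unit='s'):
    """Convert a client timestamp in seconds ('s') or milliseconds ('ms') to aware UTC."""
    seconds = value / 1000 if unit == 'ms' else value
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

dt = from_js_timestamp(1517500800)
print(dt.isoformat())
# 2018-02-01T16:00:00+00:00

print(dt.astimezone(timezone(timedelta(hours=8))).isoformat())
# 2018-02-02T00:00:00+08:00, midnight in a UTC+08:00 client
```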

Store Datetime In Databases

MySQL lets developers decide which timezone to use, so you should convert datetimes to UTC before saving them to the database.
MongoDB assumes that all timestamps are in UTC, so you have to normalize datetimes to UTC before saving them.

ref:
https://tommikaikkonen.github.io/timezones/
https://blog.elsdoerfer.name/2008/03/03/fun-with-timezones-in-django-mysql/
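
The "store in UTC, convert on display" rule can be sketched as two small helpers. Both function names are hypothetical and no real database driver is involved; the fixed +08:00 offset is again an assumed stand-in for Asia/Taipei.

```python
from datetime import datetime, timedelta, timezone

def to_storage(dt):
    """Normalize an aware datetime to UTC before saving; reject naive input."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        raise ValueError('refusing to store a naive datetime')
    return dt.astimezone(timezone.utc)

def to_display(stored_utc, user_tz):
    """Convert a stored UTC datetime to the user's timezone for display."""
    return stored_utc.astimezone(user_tz)

taipei = timezone(timedelta(hours=8))
stored = to_storage(datetime(2018, 2, 4, 18, 0, tzinfo=taipei))
print(stored.isoformat())                      # 2018-02-04T10:00:00+00:00
print(to_display(stored, taipei).isoformat())  # 2018-02-04T18:00:00+08:00
```

Rejecting naive input at the storage boundary is a design choice of this sketch: it surfaces the ambiguity early instead of silently guessing a timezone.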

Tools

ref:
https://www.epochconverter.com/
https://www.timeanddate.com/worldclock/converter.html

Build a recommender system with Spark: Content-based and Elasticsearch

2017-10-10 | 2026-02-18 | Vinta | AI, Big Data

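In this series of articles, we use tools such as Apache Spark, XGBoost, Elasticsearch, and MySQL to build a machine learning pipeline for a recommender system. A recommender system can be roughly divided into two parts, Candidate Generation and Ranking. The former produces a set of candidate items for each user, commonly using Collaborative Filtering, content-based methods, tag matching, popularity rankings, or manual curation. The latter ranks those candidates and presents the final recommendations as a Top N list, commonly using Logistic Regression.

In this article, we take one of the methods commonly used in the Candidate Generation stage, content-based recommendation, as an example, and use Elasticsearch's More Like This query to build a recommender system for GitHub repositories. The repos a user starred recently are the input, and similar repos are matched as the candidate set.

As an aside, I originally planned to use Spark to turn each repo's text into Word2Vec vectors and precompute the similarity between every pair of repos (a so-called Similarity Join). But computing similarities between that many repos costs too much time and too many machines, and even Approximate Nearest Neighbor Search with DIMSUM and Locality Sensitive Hashing (LSH) did not work very well. Then it occurred to me that finding similar or related items is exactly what search engines do, so I simply dumped all the repo data into Elasticsearch and used document ids as the search condition: a single More Like This query solved it. After all, not everything needs to be solved inside Spark.

The complete code is available at https://github.com/vinta/albedo.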


Setup Elasticsearch

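To keep things simple, we use the official Docker image. Note that Elasticsearch 5.x/6.x changed quite a bit compared with earlier versions, for example X-Pack, the high-level REST client, and the upcoming restriction that each index can have only one mapping type. It is worth skimming the documentation when you have time.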

# in elasticsearch.yml
bootstrap.memory_lock: true
cluster.name: albedo
discovery.type: single-node
http.host: 0.0.0.0
node.name: ${HOSTNAME}
xpack.security.enabled: false

# in docker-compose.yml
version: "3"
services:
  django:
    build: .
    hostname: django
    working_dir: /app
    env_file: .docker-assets/django.env
    command: .docker-assets/django_start.sh
    ports:
      - 8000:8000
    volumes:
      - ".:/app"
      - "../albedo-vendors/bin:/usr/local/bin"
      - "../albedo-vendors/dist-packages:/usr/local/lib/python3.5/dist-packages"
    links:
      - mysql
      - elasticsearch
  mysql:
    image: vinta/mysql:5.7
    hostname: mysql
    env_file: .docker-assets/mysql.env
    command: mysqld --character-set-server=utf8 --collation-server=utf8_unicode_ci
    ports:
      - 3306:3306
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:5.6.2
    ports:
      - 9200:9200
      - 9300:9300
    volumes:
      - "./.docker-assets/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml"
    environment:
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"

$ docker-compose up

You can then access your Elasticsearch cluster at http://127.0.0.1:9200/.

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/docker.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/security-settings.html

Define the Mapping (Data Schema)

Here we use elasticsearch-dsl-py to define an index and a mapping type.

from elasticsearch.helpers import bulk
from elasticsearch_dsl import analyzer
from elasticsearch_dsl import Date, Integer, Keyword, Text, Boolean
from elasticsearch_dsl import Index, DocType
from elasticsearch_dsl.connections import connections

client = connections.create_connection(hosts=['elasticsearch'])

repo_index = Index('repo')
repo_index.settings(
    number_of_shards=1,
    number_of_replicas=0
)

text_analyzer = analyzer(
    'text_analyzer',
    char_filter=["html_strip"],
    tokenizer="standard",
    filter=["asciifolding", "lowercase", "snowball", "stop"]
)
repo_index.analyzer(text_analyzer)

@repo_index.doc_type
class RepoInfoDoc(DocType):
    owner_id = Keyword()
    owner_username = Keyword()
    owner_type = Keyword()
    name = Text(text_analyzer, fields={'raw': Keyword()})
    full_name = Text(text_analyzer, fields={'raw': Keyword()})
    description = Text(text_analyzer)
    language = Keyword()
    created_at = Date()
    updated_at = Date()
    pushed_at = Date()
    homepage = Keyword()
    size = Integer()
    stargazers_count = Integer()
    forks_count = Integer()
    subscribers_count = Integer()
    fork = Boolean()
    has_issues = Boolean()
    has_projects = Boolean()
    has_downloads = Boolean()
    has_wiki = Boolean()
    has_pages = Boolean()
    open_issues_count = Integer()
    topics = Keyword(multi=True)

    class Meta:
        index = repo_index._name

    @classmethod
    def bulk_save(cls, documents):
        dicts = (d.to_dict(include_meta=True) for d in documents)
        return bulk(client, dicts)

    def save(self, **kwargs):
        return super(RepoInfoDoc, self).save(**kwargs)

RepoInfoDoc.init()

Elasticsearch: More than a Search Engine
https://vinta.ws/code/elasticsearch-more-than-a-search-engine.html

ref:
https://github.com/elastic/elasticsearch-dsl-py

Import Data into Elasticsearch

There are many ways to copy data stored in MySQL into Elasticsearch, such as a cronjob, Celery, or MySQL binlog replication. Since our main data models are written with the Django ORM, we simply write a Django command to load the data.

from django.core.management.base import BaseCommand

from app.mappings import RepoInfoDoc
from app.models import RepoInfo

class Command(BaseCommand):
    def handle(self, *args, **options):
        def batch_qs(qs, batch_size=500):
            total = qs.count()
            for start in range(0, total, batch_size):
                end = min(start + batch_size, total)
                yield (start, end, total, qs[start:end])

        large_qs = RepoInfo.objects.filter(stargazers_count__gte=10, stargazers_count__lte=290000, fork=False)
        for start, end, total, qs_chunk in batch_qs(large_qs):
            documents = []
            for repo_info in qs_chunk:
                repo_info_doc = RepoInfoDoc()
                repo_info_doc.meta.id = repo_info.id
                repo_info_doc.owner_id = repo_info.owner_id
                repo_info_doc.owner_username = repo_info.owner_username
                repo_info_doc.owner_type = repo_info.owner_type
                repo_info_doc.name = repo_info.name
                repo_info_doc.full_name = repo_info.full_name
                repo_info_doc.description = repo_info.description
                repo_info_doc.language = repo_info.language
                repo_info_doc.created_at = repo_info.created_at
                repo_info_doc.updated_at = repo_info.updated_at
                repo_info_doc.pushed_at = repo_info.pushed_at
                repo_info_doc.homepage = repo_info.homepage
                repo_info_doc.size = repo_info.size
                repo_info_doc.stargazers_count = repo_info.stargazers_count
                repo_info_doc.forks_count = repo_info.forks_count
                repo_info_doc.subscribers_count = repo_info.subscribers_count
                repo_info_doc.fork = repo_info.fork
                repo_info_doc.has_issues = repo_info.has_issues
                repo_info_doc.has_projects = repo_info.has_projects
                repo_info_doc.has_downloads = repo_info.has_downloads
                repo_info_doc.has_wiki = repo_info.has_wiki
                repo_info_doc.has_pages = repo_info.has_pages
                repo_info_doc.open_issues_count = repo_info.open_issues_count
                repo_info_doc.topics = repo_info.topics

                documents.append(repo_info_doc)

            RepoInfoDoc.bulk_save(documents)
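
The slicing logic inside batch_qs can be tried without Django. The sketch below applies the same start/end arithmetic to a plain list; batch_ranges is a hypothetical helper name, and the list is only a stand-in for a queryset.

```python
def batch_ranges(total, batch_size=500):
    """Yield (start, end) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

rows = list(range(1234))  # stand-in for a queryset
chunks = [rows[start:end] for start, end in batch_ranges(len(rows))]
print([len(chunk) for chunk in chunks])  # [500, 500, 234]
```

When the same pattern is applied to a real queryset, it is probably worth adding an explicit order_by() so that consecutive slices do not overlap or skip rows.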

noplay/python-mysql-replication
https://github.com/noplay/python-mysql-replication

Find Similar Items

Because this will later serve as one of the candidate sources of the recommender system in Spark, we wrap Elasticsearch's More Like This API in a Spark Transformer, so the following part is written in Scala.

Initialize High-level REST Client

Since Elasticsearch 5.x, the official recommendation is to use the High-level REST Client, whose usage differs slightly from the older Java TransportClient.

import org.apache.http.HttpHost
import org.elasticsearch.client.{RestClient, RestHighLevelClient}

val lowClient = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")).build()
val highClient = new RestHighLevelClient(lowClient)

ref:
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-low-usage-initialization.html
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-getting-started-initialization.html

Perform the More Like This Query

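The input is userDF, a DataFrame of the users for whom we want to generate candidates. For each user we first get the list of recently starred repos. A repo id is also the Elasticsearch document id, so we use those ids as the condition of a More Like This query to find other similar repos.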

val userRecommendedItemDF = userDF
  .flatMap {
    case (userId: Int) => {
      val itemIds = selectUserStarredRepos(userId)

      val lowClient = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")).build()
      val highClient = new RestHighLevelClient(lowClient)

      val fields = Array("description", "full_name", "language", "topics")
      val texts = Array("")
      val items = itemIds.map((itemId: Int) => new Item("repo", "repo_info_doc", itemId.toString))
      val queryBuilder = moreLikeThisQuery(fields, texts, items)
        .minTermFreq(1)
        .maxQueryTerms(20)

      val searchSourceBuilder = new SearchSourceBuilder()
      searchSourceBuilder.query(queryBuilder)
      searchSourceBuilder.from(0)
      searchSourceBuilder.size($(topK))

      val searchRequest = new SearchRequest()
      searchRequest.indices("repo")
      searchRequest.types("repo_info_doc")
      searchRequest.source(searchSourceBuilder)

      val searchResponse = highClient.search(searchRequest)
      val hits = searchResponse.getHits
      val searchHits = hits.getHits

      val userItemScoreTuples = searchHits.map((searchHit: SearchHit) => {
        val itemId = searchHit.getId.toInt
        val score = searchHit.getScore
        (userId, itemId, score)
      })

      lowClient.close()

      userItemScoreTuples
    }
  }
  .toDF($(userCol), $(itemCol), $(scoreCol))
  .withColumn($(sourceCol), lit(source))

userRecommendedItemDF.show()
// +-------+--------+---------+-------+
// |user_id|repo_id |score    |source |
// +-------+--------+---------+-------+
// |652070 |26152923|44.360096|content|
// |652070 |28451314|38.752697|content|
// |652070 |16175350|35.676353|content|
// |652070 |10885469|30.280012|content|
// |652070 |24037308|28.488512|content|
// +-------+--------+---------+-------+
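
For readers who prefer to see the request itself, the Scala builder calls above correspond roughly to the JSON body sketched below in Python. The build_mlt_query function and its default values are assumptions for illustration. The _type key matches the 5.x/6.x setup used in this post and would not apply to newer Elasticsearch versions that dropped mapping types.

```python
import json

def build_mlt_query(item_ids, top_k=30, index='repo', doc_type='repo_info_doc'):
    """Build a More Like This search body that uses existing documents as the 'like' input."""
    like = [{'_index': index, '_type': doc_type, '_id': str(item_id)} for item_id in item_ids]
    return {
        'from': 0,
        'size': top_k,
        'query': {
            'more_like_this': {
                'fields': ['description', 'full_name', 'language', 'topics'],
                'like': like,
                'min_term_freq': 1,
                'max_query_terms': 20,
            }
        },
    }

body = build_mlt_query([26152923, 28451314], top_k=5)
print(json.dumps(body, indent=2))
```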

The complete code is available on GitHub:
https://github.com/vinta/albedo/blob/master/src/main/scala/ws/vinta/albedo/ContentRecommenderBuilder.scala

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/java-specialized-queries.html

Spark troubleshooting

2017-08-05 | 2026-03-17 | Vinta | AI, Big Data, DevOps

Apache Spark 2.x Troubleshooting Guide
https://www.slideshare.net/jcmia1/a-beginners-guide-on-troubleshooting-spark-applications
https://www.slideshare.net/jcmia1/apache-spark-20-tuning-guide

Check your cluster UI to ensure that workers are registered and have sufficient resources

PYSPARK_DRIVER_PYTHON="jupyter" \
PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip 0.0.0.0" \
pyspark \
--packages "org.xerial:sqlite-jdbc:3.16.1,com.github.fommil.netlib:all:1.1.2" \
--driver-memory 4g \
--executor-memory 20g \
--master spark://TechnoCore.local:7077

TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

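The --executor-memory you specified may exceed the memory available on the workers.

The Spark Master UI at http://localhost:8080/ shows how much memory each worker has in total. If the workers have different amounts of memory, Spark only picks the workers that have enough memory for --executor-memory.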

ref:
https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application

SparkContext was shut down

ERROR Executor: Exception in task 1.0 in stage 6034.0 (TID 21592)
java.lang.StackOverflowError
...
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(55,1494185401195,JobFailed(org.apache.spark.SparkException: Job 55 cancelled because SparkContext was shut down))

The executor may have run out of memory (Out Of Memory, OOM).

ref:
http://stackoverflow.com/questions/32822948/sparkcontext-was-shut-down-while-running-spark-on-a-large-dataset

Container exited with a non-zero exit code 56 (or some other numbers)

WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1504241464590_0001_01_000002 on host: albedo-w-1.c.albedo-157516.internal. Exit status: 56. Diagnostics: Exception from container-launch.
Container id: container_1504241464590_0001_01_000002
Exit code: 56
Stack trace: ExitCodeException exitCode=56:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Container exited with a non-zero exit code 56

The executor may have run out of memory (Out Of Memory, OOM).

ref:
http://stackoverflow.com/questions/39038460/understanding-spark-container-failure

Exception in thread "main" java.lang.StackOverflowError

Exception in thread "main" java.lang.StackOverflowError
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    ...

Solution:

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("./spark-data/checkpoint")

// sc.setCheckpointDir() already enables checkpointing,
// so setting checkpointInterval explicitly is optional
val als = new ALS()
  .setCheckpointInterval(2)

ref:
https://stackoverflow.com/questions/31484460/spark-gives-a-stackoverflowerror-when-training-using-als
https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk

Randomness of hash of string should be disabled via PYTHONHASHSEED

Solution:

$ cd $SPARK_HOME
$ cp conf/spark-env.sh.template conf/spark-env.sh
$ echo "export PYTHONHASHSEED=42" >> conf/spark-env.sh

ref:
https://issues.apache.org/jira/browse/SPARK-13330
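
The reason behind this setting can be reproduced without Spark. Python randomizes string hashes per process unless PYTHONHASHSEED is fixed, so separate worker processes would not agree on hash('some key'). The sketch below launches fresh interpreters with a chosen seed; the helper name is hypothetical.

```python
import os
import subprocess
import sys

def string_hash_in_new_process(seed):
    """Return hash('spark') as computed by a fresh Python process with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output([sys.executable, '-c', "print(hash('spark'))"], env=env)
    return int(out.decode().strip())

first = string_hash_in_new_process(42)
second = string_hash_in_new_process(42)
print(first == second)  # True: the same seed gives the same hash in every process
```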

It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

spark.sparkContext can only be accessed in the driver program, not on workers (for example, lambda functions passed to RDD operations and UDFs run on workers).

ref:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#passing-functions-to-spark
https://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/

Spark automatically creates closures:

for functions that run on RDDs at workers,
and for any global variables that are used by those workers.

One closure is sent per worker for every task. Closures are one way from the driver to the worker.

ref:
https://gerardnico.com/wiki/spark/closure

Unable to find encoder for type stored in a Dataset

Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases. someDF.as[SomeCaseClass]

Solution:

import spark.implicits._

yourDF.as[YourCaseClass]

ref:
https://stackoverflow.com/questions/38664972/why-is-unable-to-find-encoder-for-type-stored-in-a-dataset-when-creating-a-dat

Task not serializable

Caused by: java.io.NotSerializableException: Settings
Serialization stack:
    - object not serializable (class: Settings, value: Settings@2dfe2f00)
    - field (class: Settings$$anonfun$1, name: $outer, type: class Settings)
    - object (class Settings$$anonfun$1, <function1>)

Caused by: org.apache.spark.SparkException:
    Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)

This usually happens when a closure function uses an object from the driver program. Spark automatically serializes the referenced object and ships it to the worker nodes, so if that object or its class cannot be serialized, this error occurs.

ref:
https://www.safaribooksonline.com/library/view/spark-the-definitive/9781491912201/ch04.html#user-defined-functions
http://www.puroguramingu.com/2016/02/26/spark-dos-donts.html
https://stackoverflow.com/questions/36176011/spark-sql-udf-task-not-serialisable
https://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
https://mp.weixin.qq.com/s/BT6sXZlHcufAFLgTONCHsg
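
The same failure mode can be illustrated in plain Python. This is only an analogy: it uses the standard library's pickle rather than the Java serialization behind the Scala stack trace above, and the Settings class here is a made-up stand-in. The point is that shipping a whole object fails when one of its fields cannot be serialized, while shipping just the value you need works.

```python
import pickle
import threading

class Settings:
    """Hypothetical driver-side object that holds one unserializable field."""
    def __init__(self):
        self.threshold = 10
        self.lock = threading.Lock()  # locks cannot be pickled

def can_pickle(obj):
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False

settings = Settings()
print(can_pickle(settings))            # False: the lock inside drags the whole object down
print(can_pickle(settings.threshold))  # True: a plain int travels fine

# Copy only what the closure needs into a local variable before using it in a task.
threshold = settings.threshold
```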

If you only hit this error in a Databricks Notebook, it is because notebooks work slightly differently from a regular Spark application. You can try package cells.

ref:
https://docs.databricks.com/user-guide/notebooks/package-cells.html

java.lang.IllegalStateException: Cannot find any build directories.

java.lang.IllegalStateException: Cannot find any build directories.
    at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248)
    at org.apache.spark.launcher.AbstractCommandBuilder.getScalaVersion(AbstractCommandBuilder.java:240)
    at org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:194)
    at org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:117)
    at org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:39)
    at org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:45)
    at org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:63)
    at org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:51)
    at org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:145)
    at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)

A possible cause is that SPARK_HOME is not set, or your launch script does not pick up that environment variable.

Build a recommender system with Spark: Implicit ALS

2017-05-20 | 2026-02-18 | Vinta | AI, Big Data, Python

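In this article, we take one of the methods commonly used in the Candidate Generation stage, Collaborative Filtering, as an example, and use the ALS (Alternating Least Squares) model in Apache Spark to build a recommender system for GitHub repositories. Users' starring records are the training data, and repos a user may be interested in are recommended as the candidate set.

The complete code is available at https://github.com/vinta/albedo.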


Submit the Application

Because we read from a MySQL database through JDBC, the MySQL driver must be installed. Passing --packages "mysql:mysql-connector-java:5.1.41" installs the required Java packages on every machine in the cluster.

$ spark-submit \
--packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
--master spark://YOUR_SPARK_MASTER:7077 \
--py-files deps.zip \
train_als.py -u vinta

ref:
https://spark.apache.org/docs/latest/submitting-applications.html

Load Data

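Load data from the MySQL database. You can use the predicates parameter to specify WHERE conditions, although strictly speaking this parameter controls the number of partitions: each condition becomes one partition.

Suppose app_repostarring has the following columns: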

CREATE TABLE app_repostarring (
  id int(11) NOT NULL AUTO_INCREMENT,
  from_user_id int(11) NOT NULL,
  from_username varchar(39) NOT NULL,
  repo_owner_id int(11) NOT NULL,
  repo_owner_username varchar(39) NOT NULL,
  repo_owner_type varchar(16) NOT NULL,
  repo_id int(11) NOT NULL,
  repo_name varchar(100) NOT NULL,
  repo_full_name varchar(140) NOT NULL,
  repo_description varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  repo_language varchar(32) NOT NULL,
  repo_created_at datetime(6) NOT NULL,
  repo_updated_at datetime(6) NOT NULL,
  starred_at datetime(6) NOT NULL,
  stargazers_count int(11) NOT NULL,
  forks_count int(11) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY from_user_id_repo_id (from_user_id, repo_id)
);

def loadRawData():
    url = 'jdbc:mysql://127.0.0.1:3306/albedo?user=root&password=123&verifyServerCertificate=false&useSSL=false'
    properties = {'driver': 'com.mysql.jdbc.Driver'}
    rawDF = spark.read.jdbc(url, table='app_repostarring', properties=properties)
    return rawDF

rawDF = loadRawData()

ref:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=jdbc#pyspark.sql.DataFrameReader.jdbc
http://www.gatorsmile.io/numpartitionsinjdbc/
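
As a sketch of how predicates map to partitions, the hypothetical helper below splits an id range into non-overlapping WHERE conditions, one per partition. The function name and the choice of the id column are assumptions. The resulting list would be passed to spark.read.jdbc() through its predicates argument, which is shown only as a comment because it needs a live SparkSession and database.

```python
def build_id_predicates(min_id, max_id, num_partitions):
    """Split [min_id, max_id] into contiguous, non-overlapping WHERE conditions."""
    step = -(-(max_id - min_id + 1) // num_partitions)  # ceiling division
    predicates = []
    for lower in range(min_id, max_id + 1, step):
        upper = min(lower + step - 1, max_id)
        predicates.append('id >= {} AND id <= {}'.format(lower, upper))
    return predicates

predicates = build_id_predicates(1, 10, 3)
print(predicates)
# ['id >= 1 AND id <= 4', 'id >= 5 AND id <= 8', 'id >= 9 AND id <= 10']

# With a live SparkSession (not run here):
# rawDF = spark.read.jdbc(url, table='app_repostarring', predicates=predicates, properties=properties)
```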

Preprocess Data

Format Data

Reshape the raw data into the format user,item,rating,starred_at. starred_at is only used for ordering when evaluating the model and is not used in training, because Spark's ALS cannot easily incorporate side information.

from pyspark.ml import Transformer
import pyspark.sql.functions as F

class RatingBuilder(Transformer):

    def _transform(self, rawDF):
        # every starring record becomes an implicit rating of 1
        ratingDF = rawDF \
            .selectExpr('from_user_id AS user', 'repo_id AS item', '1 AS rating', 'starred_at') \
            .orderBy('user', F.col('starred_at').desc())
        return ratingDF

ratingBuilder = RatingBuilder()
ratingDF = ratingBuilder.transform(rawDF)
ratingDF.cache()

ref:
http://blog.ethanrosenthal.com/2016/11/07/implicit-mf-part-2/

Inspect Data

import pyspark.sql.functions as F

ratingDF.rdd.getNumPartitions()
# 200

ratingDF.agg(F.count('rating'), F.countDistinct('user'), F.countDistinct('item')).show()
# +-------------+--------------------+--------------------+
# |count(rating)|count(DISTINCT user)|count(DISTINCT item)|
# +-------------+--------------------+--------------------+
# |      3121629|               10483|              551216|
# +-------------+--------------------+--------------------+

stargazersCountDF = ratingDF \
    .groupBy('item') \
    .agg(F.count('user').alias('stargazers_count')) \
    .orderBy('stargazers_count', ascending=False)
stargazersCountDF.show(10)
# +--------+----------------+
# |    item|stargazers_count|
# +--------+----------------+
# | 2126244|            2211|
# |10270250|            1683|
# |  943149|            1605|
# |  291137|            1567|
# |13491895|            1526|
# | 9384267|            1480|
# | 3544424|            1468|
# | 7691631|            1441|
# |29028775|            1427|
# | 1334369|            1399|
# +--------+----------------+

starredCountDF = ratingDF \
    .groupBy('user') \
    .agg(F.count('item').alias('starred_count')) \
    .orderBy('starred_count', ascending=False)
starredCountDF.show(10)
# +-------+-------------+
# |   user|starred_count|
# +-------+-------------+
# |3947125|         8947|
# |5527642|         7978|
# | 446613|         7860|
# | 627410|         7800|
# |  13998|         6334|
# |2467194|         6327|
# |  63402|         6034|
# |2005841|         6024|
# |5073946|         5980|
# |   2296|         5862|
# +-------+-------------+

Clean Data

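You can filter out items starred by too few users and users who starred too few items, which increases the density of the matrix. This is also exactly the Cold Start problem: you simply do not have enough data about these items and users (consider content-based recommendation for them). Besides, if the items your system recommends have very few stars, then even if you have perfectly exploited the long tail, the "first impression" those recommendations make on users may not be great (and that may decide whether they keep using the system or actually try what you recommended).

You can also choose whether to filter out items starred by a huge number of users and users who starred a huge number of items. If more than 80 or 90 percent of users have starred an item, there may be no need to recommend something that popular, because other users will discover it sooner or later anyway. If a few users have starred more than half of all items, they may be some kind of web crawler, or simply people who star everything they see. Either way, they are probably not who you want to model, so consider removing them from the dataset.

In practice, if you have blacklists of users or items, such as spam accounts or NSFW content, you can filter them out in this step as well.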
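
To put a number on "density", the counts reported in the Inspect Data step can be plugged into a short calculation. This is plain arithmetic on those figures, not something the pipeline itself computes.

```python
# Counts taken from the Inspect Data output above.
num_ratings = 3121629
num_users = 10483
num_items = 551216

# Density is the share of user-item pairs that actually have a rating.
density = num_ratings / (num_users * num_items)
print('{:.4%}'.format(density))  # 0.0540%
```

In other words, only about 5 in every 10,000 cells of the user-item matrix are filled before any cleaning.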

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import Param
import pyspark.sql.functions as F

class DataCleaner(Transformer):

    @keyword_only
    def __init__(self, minItemStargazersCount=None, maxItemStargazersCount=None, minUserStarredCount=None, maxUserStarredCount=None):
        super(DataCleaner, self).__init__()
        self.minItemStargazersCount = Param(self, 'minItemStargazersCount', 'Remove items with fewer stargazers than this number')
        self.maxItemStargazersCount = Param(self, 'maxItemStargazersCount', 'Remove items with more stargazers than this number')
        self.minUserStarredCount = Param(self, 'minUserStarredCount', 'Remove users with fewer starred repos than this number')
        self.maxUserStarredCount = Param(self, 'maxUserStarredCount', 'Remove users with more starred repos than this number')
        self._setDefault(minItemStargazersCount=1, maxItemStargazersCount=50000, minUserStarredCount=1, maxUserStarredCount=50000)
        kwargs = self.__init__._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, minItemStargazersCount=None, maxItemStargazersCount=None, minUserStarredCount=None, maxUserStarredCount=None):
        kwargs = self.setParams._input_kwargs
        return self._set(**kwargs)

    def setMinItemStargazersCount(self, value):
        self._paramMap[self.minItemStargazersCount] = value
        return self

    def getMinItemStargazersCount(self):
        return self.getOrDefault(self.minItemStargazersCount)

    def setMaxItemStargazersCount(self, value):
        self._paramMap[self.maxItemStargazersCount] = value
        return self

    def getMaxItemStargazersCount(self):
        return self.getOrDefault(self.maxItemStargazersCount)

    def setMinUserStarredCount(self, value):
        self._paramMap[self.minUserStarredCount] = value
        return self

    def getMinUserStarredCount(self):
        return self.getOrDefault(self.minUserStarredCount)

    def setMaxUserStarredCount(self, value):
        self._paramMap[self.maxUserStarredCount] = value
        return self

    def getMaxUserStarredCount(self):
        return self.getOrDefault(self.maxUserStarredCount)

    def _transform(self, ratingDF):
        minItemStargazersCount = self.getMinItemStargazersCount()
        maxItemStargazersCount = self.getMaxItemStargazersCount()
        minUserStarredCount = self.getMinUserStarredCount()
        maxUserStarredCount = self.getMaxUserStarredCount()

        toKeepItemsDF = ratingDF \
            .groupBy('item') \
            .agg(F.count('user').alias('stargazers_count')) \
            .where('stargazers_count >= {0} AND stargazers_count <= {1}'.format(minItemStargazersCount, maxItemStargazersCount)) \
            .orderBy('stargazers_count', ascending=False) \
            .select('item', 'stargazers_count')
        temp1DF = ratingDF.join(toKeepItemsDF, 'item', 'inner')

        toKeepUsersDF = temp1DF \
            .groupBy('user') \
            .agg(F.count('item').alias('starred_count')) \
            .where('starred_count >= {0} AND starred_count <= {1}'.format(minUserStarredCount, maxUserStarredCount)) \
            .orderBy('starred_count', ascending=False) \
            .select('user', 'starred_count')
        temp2DF = temp1DF.join(toKeepUsersDF, 'user', 'inner')

        cleanDF = temp2DF.select('user', 'item', 'rating', 'starred_at')
        return cleanDF

dataCleaner = DataCleaner(
    minItemStargazersCount=2,
    maxItemStargazersCount=4000,
    minUserStarredCount=2,
    maxUserStarredCount=5000
)
cleanDF = dataCleaner.transform(ratingDF)

cleanDF.agg(F.count('rating'), F.countDistinct('user'), F.countDistinct('item')).show()
# +-------------+--------------------+--------------------+
# |count(rating)|count(DISTINCT user)|count(DISTINCT item)|
# +-------------+--------------------+--------------------+
# |      2761118|               10472|              245626|
# +-------------+--------------------+--------------------+
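As a rough check of how sparse the matrix still is after cleaning, the counts printed above can be turned into a density figure. This is only a small sketch in plain Python; the variable names are made up for illustration.

```python
# Counts copied from the cleanDF.agg(...) output above
num_ratings = 2761118
num_users = 10472
num_items = 245626

# Density = observed user-item pairs / all possible user-item pairs
density = float(num_ratings) / (num_users * num_items)
print('density: {:.4%}'.format(density))
```

The result is roughly 0.1%, so even the cleaned matrix is still very sparse.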

Generate Negative Samples

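For ALS with implicit feedback, manually adding negative samples (samples with R_ui = 0) is pointless, because the algorithm already treats missing / non-observed values as 0: the user did not act on the item, so the preference is P_ui = 0, and the confidence C_ui = 1 + alpha x 0 = 1 is lower than that of any positive sample. Moreover, Spark ML's ALS only computes the entries with R_ui > 0, so even if you manually add negative samples with R_ui = 0 or R_ui = -1, they have no real effect on the model.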
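To make the preference and confidence terms concrete, here is a small sketch of the usual implicit-feedback formulation, where P_ui is 1 when R_ui > 0 and 0 otherwise, and C_ui = 1 + alpha x R_ui. It is plain Python for illustration only, not Spark's implementation, and the function names are hypothetical.

```python
def preference(r_ui):
    # Binary preference: 1 if the user interacted with the item, else 0
    return 1.0 if r_ui > 0 else 0.0

def confidence(r_ui, alpha=40.0):
    # Confidence grows linearly with the observed interaction strength
    return 1.0 + alpha * r_ui

for r_ui in (0, 1, 3):
    print(r_ui, preference(r_ui), confidence(r_ui))
```

With alpha = 40, a single star already gives a confidence of 41, while a non-observed pair stays at the baseline confidence of 1.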

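Without negative samples you cannot compute binary-classifier metrics such as area under the ROC curve or area under the precision-recall curve, but you can switch to learning-to-rank metrics such as NDCG or Mean Average Precision. That said, the ALS loss function cannot directly optimize a metric like NDCG either.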

ref:
https://vinta.ws/code/generate-negative-samples-for-recommender-system.html

Split Data

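Matrix factorization has to consider every user-item pair, so if you feed the model data it has never seen, it cannot make a recommendation (the cold-start problem). If either the user or the item does not exist in the dataset the model was trained on, the prediction output by the ALS model is NaN. So you should try to make sure every user and item appears in both the training set and the test set, for example by randomly picking n ratings (or a fraction n of the ratings) of each user as the test set and using the remaining ratings as the training set (commonly called leave-n-out). If you use the holdout approach that is common in machine learning and randomly scatter all data points across the training set and the test set (for example df.randomSplit([0.7, 0.3])), there is a high chance that some users or items end up in only one of the two sets.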

ref:
https://jessesw.com/Rec-System/
http://blog.ethanrosenthal.com/2016/10/19/implicit-mf-part-1/

The LibRec documentation also lists many other ways to split the data, for example:

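Ratio-based splitting divides the data into two parts by a given ratio. The split can be done randomly over all the data, or along the user or item dimension. When a time feature is available, you can hold out the last portion of the data in chronological order for testing.
LooCV splitting is leave-one-user/item/rating-out: randomly pick one item of each user, or one user of each item, as test data and use the rest as training data. The implementation offers several user-based and item-based variants.
GivenN splitting holds out a specified number N of entries for each user as test cases and uses the remaining samples as training data.
KCV is K-fold cross-validation. The data is split into K parts; on each run one part is used as test data and the rest as training data, for K runs in total. The results of the K runs are combined to evaluate the performance of the recommendation algorithm.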

ref:
https://www.librec.net/dokuwiki/doku.php?id=DataModel_zh#splitter
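As an illustration of the GivenN / leave-n-out idea, here is a minimal pure-Python sketch that holds out n ratings per user. The function name leave_n_out and the (user, item, rating) tuple layout are assumptions made for this example; it is not LibRec's or Spark's implementation.

```python
import random
from collections import defaultdict

def leave_n_out(ratings, n=1, seed=None):
    """Hold out n randomly chosen ratings per user as the test set."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    training, test = [], []
    for user, rows in by_user.items():
        rows = rows[:]  # copy so the caller's data is not shuffled
        rng.shuffle(rows)
        # Keep at least one rating per user in the training set
        held_out = min(n, len(rows) - 1)
        test.extend(rows[:held_out])
        training.extend(rows[held_out:])
    return training, test

ratings = [
    (1, 10, 1), (1, 11, 1), (1, 12, 1),
    (2, 10, 1), (2, 13, 1),
    (3, 11, 1),
]
training, test = leave_n_out(ratings, n=1, seed=42)
print(len(training), len(test))
```

Every user keeps at least one rating in the training set, although an item can still end up only in the test set, so NaN predictions are reduced rather than ruled out.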

Here we use sampleBy() to write a simple method that randomly splits each user's items into a training set and a test set.

def randomSplitByUser(df, weights, seed=None):
    trainingRatio = weights[0]
    # Sample the same fraction of rows for every user (stratified by the 'user' column)
    fractions = {row['user']: trainingRatio for row in df.select('user').distinct().collect()}
    training = df.sampleBy('user', fractions, seed)
    # Everything that was not sampled into the training set becomes the test set
    testRDD = df.rdd.subtract(training.rdd)
    test = spark.createDataFrame(testRDD, df.schema)
    return training, test

training, test = randomSplitByUser(ratingDF, weights=[0.7, 0.3])

ref:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sampleBy

Train the Model

from pyspark.ml.recommendation import ALS

als = ALS(implicitPrefs=True, seed=42) \
    .setRank(50) \
    .setMaxIter(22) \
    .setRegParam(0.5) \
    .setAlpha(40)

alsModel = als.fit(training)

# These are the trained latent factors of the users and items
alsModel.userFactors.show()
alsModel.itemFactors.show()

ref:
https://spark.apache.org/docs/latest/ml-collaborative-filtering.html
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS

Predict Preferences

from pyspark.ml import Transformer

predictedDF = alsModel.transform(test)

class PredictionProcessor(Transformer):

    def _transform(self, predictedDF):
        nonNullDF = predictedDF.dropna(subset=['prediction', ])
        predictionDF = nonNullDF.withColumn('prediction', nonNullDF['prediction'].cast('double'))
        return predictionDF

# Drop the rows whose prediction is NaN
predictionProcessor = PredictionProcessor()
predictionDF = predictionProcessor.transform(predictedDF)
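Why unseen users or items produce NaN can be illustrated with a small sketch: a prediction is the dot product of a user factor vector and an item factor vector, and there is nothing to multiply when either one is missing. The dictionaries and the predict helper below are hypothetical toy data, not the ALS model's API.

```python
import math

# Toy latent factors keyed by id (made-up numbers)
user_factors = {1: [0.5, 1.0], 2: [1.5, -0.5]}
item_factors = {10: [2.0, 0.0], 11: [0.0, 1.0]}

def predict(user, item):
    # Return NaN when either side was never seen during training
    if user not in user_factors or item not in item_factors:
        return float('nan')
    return sum(u * i for u, i in zip(user_factors[user], item_factors[item]))

pairs = [(1, 10), (2, 11), (3, 10), (1, 99)]
predictions = [(u, i, predict(u, i)) for u, i in pairs]
# Keep only the rows with a real prediction, like dropna() above
kept = [row for row in predictions if not math.isnan(row[2])]
print(kept)
```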

Evaluate the Model

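Because Spark ML does not provide a ranking evaluator for DataFrames, we have to write our own, though internally it still uses RankingMetrics from Spark MLlib. Keep in mind that this is only an offline evaluation; when you actually go to production you will probably still need A/B testing.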

from pyspark import keyword_only
from pyspark.ml.evaluation import Evaluator
from pyspark.ml.param.shared import Param
from pyspark.mllib.evaluation import RankingMetrics
from pyspark.sql import Window
from pyspark.sql.functions import col
from pyspark.sql.functions import expr
import pyspark.sql.functions as F

class RankingEvaluator(Evaluator):

    @keyword_only
    def __init__(self, k=None):
        super(RankingEvaluator, self).__init__()
        self.k = Param(self, 'k', 'Top K')
        self._setDefault(k=30)
        kwargs = self.__init__._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, k=None):
        kwargs = self.setParams._input_kwargs
        return self._set(**kwargs)

    def isLargerBetter(self):
        return True

    def setK(self, value):
        self._paramMap[self.k] = value
        return self

    def getK(self):
        return self.getOrDefault(self.k)

    def _evaluate(self, predictedDF):
        k = self.getK()

        predictedDF.show()

        windowSpec = Window.partitionBy('user').orderBy(col('prediction').desc())
        perUserPredictedItemsDF = predictedDF \
            .select('user', 'item', 'prediction', F.rank().over(windowSpec).alias('rank')) \
            .where('rank <= {0}'.format(k)) \
            .groupBy('user') \
            .agg(expr('collect_list(item) as items'))

        windowSpec = Window.partitionBy('user').orderBy(col('starred_at').desc())
        perUserActualItemsDF = predictedDF \
            .select('user', 'item', 'starred_at', F.rank().over(windowSpec).alias('rank')) \
            .where('rank <= {0}'.format(k)) \
            .groupBy('user') \
            .agg(expr('collect_list(item) as items'))

        perUserItemsRDD = perUserPredictedItemsDF.join(F.broadcast(perUserActualItemsDF), 'user', 'inner') \
            .rdd \
            .map(lambda row: (row[1], row[2]))

        if perUserItemsRDD.isEmpty():
            return 0.0

        rankingMetrics = RankingMetrics(perUserItemsRDD)
        metric = rankingMetrics.ndcgAt(k)
        return metric

k = 30
rankingEvaluator = RankingEvaluator(k=k)
ndcg = rankingEvaluator.evaluate(predictionDF)
print('NDCG', ndcg)

ref:
https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html
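For intuition about what ndcgAt(k) measures, here is a minimal pure-Python sketch of NDCG@k with binary relevance for a single user. It follows the standard definition and is not a copy of Spark's code; RankingMetrics averages a per-user score of this kind over all users. The function name ndcg_at_k is made up for this example.

```python
import math

def ndcg_at_k(predicted, actual, k):
    """Binary-relevance NDCG@k for one user."""
    actual = set(actual)
    if not actual:
        return 0.0
    # DCG: each hit at 0-based position i contributes 1 / log2(i + 2)
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(predicted[:k])
        if item in actual
    )
    # Ideal DCG: all relevant items ranked at the very top
    ideal_hits = min(len(actual), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg

print(ndcg_at_k(['a', 'b', 'c'], ['a', 'c'], k=3))  # one relevant item ranked too low
print(ndcg_at_k(['a', 'c', 'b'], ['a', 'c'], k=3))  # perfect ranking
print(ndcg_at_k(['x', 'y', 'z'], ['a', 'c'], k=3))  # no hits
```

A hit near the top of the list is worth more than one further down, which is why the first call scores below the second even though both lists contain the same two relevant items.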

Recommend Items

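Let's get a feel for how well the recommender actually works. Here the results are simply printed instead of being saved to a database. Usually, though, you would not show the raw output of a recommender directly to users. It first goes through steps such as filtering, re-ranking, and generating recommendation reasons, or some hand-written rules are applied, such as forcibly inserting ads or currently promoted products, or filtering out things that many people click on but that are not actually very good. The recommender's output may of course also be used as the input of another machine learning model.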

ref:
https://www.zhihu.com/question/28247353

def recommendItems(rawDF, alsModel, username, topN=30, excludeKnownItems=False):
    userID = rawDF \
        .where('from_username = "{0}"'.format(username)) \
        .select('from_user_id') \
        .take(1)[0]['from_user_id']

    userItemsDF = alsModel \
        .itemFactors \
        .selectExpr('{0} AS user'.format(userID), 'id AS item')
    if excludeKnownItems:
        userKnownItemsDF = rawDF \
            .where('from_user_id = {0}'.format(userID)) \
            .selectExpr('repo_id AS item')
        userItemsDF = userItemsDF.join(userKnownItemsDF, 'item', 'left_anti')

    userPredictedDF = alsModel \
        .transform(userItemsDF) \
        .select('item', 'prediction') \
        .orderBy('prediction', ascending=False) \
        .limit(topN)

    repoDF = rawDF \
        .groupBy('repo_id', 'repo_full_name', 'repo_language') \
        .agg(F.max('stargazers_count').alias('stargazers_count'))

    recommendedItemsDF = userPredictedDF \
        .join(repoDF, userPredictedDF['item'] == repoDF['repo_id'], 'inner') \
        .select('prediction', 'repo_full_name', 'repo_language', 'stargazers_count') \
        .orderBy('prediction', ascending=False)

    return recommendedItemsDF

k = 30
username = 'vinta'
recommendedItemsDF = recommendItems(rawDF, alsModel, username, topN=k, excludeKnownItems=False)
for item in recommendedItemsDF.collect():
    repoName = item['repo_full_name']
    repoUrl = 'https://github.com/{0}'.format(repoName)
    print(repoUrl, item['prediction'], item['repo_language'], item['stargazers_count'])

ref:
https://github.com/vinta/albedo/blob/master/src/main/python/train_als.ipynb

Cross-validate Models

Use a Spark ML pipeline to run cross-validation and pick the best combination of hyperparameters.

rank: The number of latent factors in the model, or equivalently, the number of columns k in the user-feature and product-feature matrices.
regParam: A standard overfitting parameter, also usually called lambda. Higher values resist overfitting, but values that are too high hurt the factorization’s accuracy.
alpha: Controls the relative weight of observed versus unobserved user-product interactions in the factorization.
maxIter: The number of iterations that the factorization runs. More iterations take more time but may produce a better factorization.

from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

dataCleaner = DataCleaner()

als = ALS(implicitPrefs=True, seed=42)

predictionProcessor = PredictionProcessor()

pipeline = Pipeline(stages=[
    dataCleaner,
    als,
    predictionProcessor,
])

paramGrid = ParamGridBuilder() \
    .addGrid(dataCleaner.minItemStargazersCount, [1, 10, 100]) \
    .addGrid(dataCleaner.maxItemStargazersCount, [4000, ]) \
    .addGrid(dataCleaner.minUserStarredCount, [1, 10, 100]) \
    .addGrid(dataCleaner.maxUserStarredCount, [1000, 4000, ]) \
    .addGrid(als.rank, [50, 100]) \
    .addGrid(als.regParam, [0.01, 0.1, 0.5]) \
    .addGrid(als.alpha, [0.01, 0.89, 1, 40, ]) \
    .addGrid(als.maxIter, [22, ]) \
    .build()

rankingEvaluator = RankingEvaluator(k=30)

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=rankingEvaluator,
                    numFolds=2)

cvModel = cv.fit(ratingDF)

def printCrossValidationParameters(cvModel):
    metric_params_pairs = list(zip(cvModel.avgMetrics, cvModel.getEstimatorParamMaps()))
    metric_params_pairs.sort(key=lambda x: x[0], reverse=True)
    for pair in metric_params_pairs:
        metric, params = pair
        print('metric', metric)
        for k, v in params.items():
            print(k.name, v)
        print('')

printCrossValidationParameters(cvModel)
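Before launching a grid search like this, it can help to estimate its cost, since every parameter combination is fitted once per fold. The sketch below counts the combinations in plain Python using the same candidate values as the ParamGridBuilder above; the dictionary is only a stand-in for the real param grid.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder above
grid = {
    'minItemStargazersCount': [1, 10, 100],
    'maxItemStargazersCount': [4000],
    'minUserStarredCount': [1, 10, 100],
    'maxUserStarredCount': [1000, 4000],
    'rank': [50, 100],
    'regParam': [0.01, 0.1, 0.5],
    'alpha': [0.01, 0.89, 1, 40],
    'maxIter': [22],
}
num_folds = 2

combinations = list(product(*grid.values()))
num_fits = len(combinations) * num_folds
print(len(combinations), num_fits)
```

This grid has 432 combinations, which means 864 pipeline fits with numFolds=2, so trimming a value from any list shortens the run noticeably.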

ref:
https://spark.apache.org/docs/latest/ml-pipeline.html

Generate negative samples for recommender system?

2017-05-17 | 2026-02-18 | Vinta | AI

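According to the book 推荐系统实践 (Recommender System Practice), negative samples should be chosen according to the following principles: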

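For each user, keep positive and negative samples balanced (similar in number).
When sampling negative samples for a user, pick items that are very popular but that the user has not interacted with.
It is generally believed that an item being very popular while the user has not interacted with it is a stronger sign that the user is not interested in it. For an unpopular item, the user may simply never have come across it on the site, so you cannot say whether they are interested or not.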

ref:
http://www.duokan.com/reader/www/app.html?id=ed873c9e323511e28a9300163e0123ac
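The two principles above can be sketched in plain Python as follows. Sampling in proportion to item popularity is one possible reading of "pick popular items" and is an assumption of this example, not the book's exact procedure; the function name sample_negatives and the toy data are hypothetical.

```python
import random
from collections import Counter

def sample_negatives(positives, seed=None):
    """positives: dict mapping user -> set of items the user interacted with."""
    rng = random.Random(seed)
    # Item popularity = number of users who interacted with the item
    popularity = Counter(item for items in positives.values() for item in items)

    negatives = {}
    for user, items in positives.items():
        candidates = [item for item in popularity if item not in items]
        chosen = set()
        # Aim for as many negatives as positives, limited by the available candidates
        target = min(len(items), len(candidates))
        while len(chosen) < target:
            remaining = [item for item in candidates if item not in chosen]
            weights = [popularity[item] for item in remaining]
            # Popular items are more likely to be picked
            chosen.add(rng.choices(remaining, weights=weights, k=1)[0])
        negatives[user] = chosen
    return negatives

positives = {
    'alice': {'a', 'b'},
    'bob': {'a', 'c', 'd'},
    'carol': {'a', 'b', 'e', 'f'},
}
negatives = sample_negatives(positives, seed=42)
print(negatives)
```

Each user gets as many negatives as positives when enough unseen items exist, and never an item they have already interacted with.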

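However, if you use Spark ML's ALS(implicitPrefs=True), you do not need to add negative samples manually. For ALS with implicit feedback, manually adding negative samples (samples with R_ui = 0) is pointless, because the algorithm already treats missing / non-observed values as 0: the user did not act on the item, so the preference is P_ui = 0, and the confidence C_ui = 1 + alpha x 0 = 1 is lower than that of any positive sample. Moreover, Spark ML's ALS only computes the entries with R_ui > 0, so even if you manually add negative samples with R_ui = 0 or R_ui = -1, they have no real effect on the model.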

The models trained on the following three datasets are all the same:

from pyspark.ml.recommendation import ALS

matrix = [
    (1, 1, 0),
    (1, 2, 1),
    (1, 3, 0),
    (1, 4, 1),
    (1, 5, 1),
    (2, 1, 1),
    (2, 2, 1),
    (2, 3, 0),
    (2, 4, 1),
    (2, 5, 1),
    (3, 1, 1),
    (3, 2, 1),
    (3, 3, 1),
    (3, 4, 1),
    (3, 5, 0),
]
df0 = spark.createDataFrame(matrix, ['user', 'item', 'rating'])

matrix = [
    (1, 1, -1),
    (1, 2, 1),
    (1, 3, -1),
    (1, 4, 1),
    (1, 5, 1),
    (2, 1, 1),
    (2, 2, 1),
    (2, 3, -1),
    (2, 4, 1),
    (2, 5, 1),
    (3, 1, 1),
    (3, 2, 1),
    (3, 3, 1),
    (3, 4, 1),
    (3, 5, -1),
]
df1 = spark.createDataFrame(matrix, ['user', 'item', 'rating'])

matrix = [
    (1, 2, 1),
    (1, 4, 1),
    (1, 5, 1),
    (2, 1, 1),
    (2, 2, 1),
    (2, 4, 1),
    (2, 5, 1),
    (3, 1, 1),
    (3, 2, 1),
    (3, 3, 1),
    (3, 4, 1),
]
df2 = spark.createDataFrame(matrix, ['user', 'item', 'rating'])

als = ALS(implicitPrefs=True, seed=42, nonnegative=False).setRank(7).setMaxIter(15).setRegParam(0.01).setAlpha(40)
alsModel = als.fit(df0)
alsModel.userFactors.select('features').show(truncate=False)
alsModel.itemFactors.select('features').show(truncate=False)

als = ALS(implicitPrefs=True, seed=42, nonnegative=False).setRank(7).setMaxIter(15).setRegParam(0.01).setAlpha(40)
alsModel = als.fit(df1)
alsModel.userFactors.select('features').show(truncate=False)
alsModel.itemFactors.select('features').show(truncate=False)

als = ALS(implicitPrefs=True, seed=42, nonnegative=False).setRank(7).setMaxIter(15).setRegParam(0.01).setAlpha(40)
alsModel = als.fit(df2)
alsModel.userFactors.select('features').show(truncate=False)
alsModel.itemFactors.select('features').show(truncate=False)

ref:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L1626
https://github.com/apache/spark/commit/b05b3fd4bacff1d8b1edf4c710e7965abd2017a7
https://www.mail-archive.com/[email protected]/msg60240.html
http://apache-spark-user-list.1001560.n3.nabble.com/implicit-ALS-dataSet-td7067.html