Setup Scalable WordPress Sites on Kubernetes

This article shows how to deploy a scalable WordPress site on Google Kubernetes Engine, using a containerized version of the popular LEMP stack:

  • Linux (Docker containers)
  • NGINX
  • MySQL (Google Cloud SQL)
  • PHP (PHP-FPM)

Google Cloud Platform Pricing

Deploying a personal blog on Kubernetes sounds like overkill (I must admit, it is). Still, it is fun, and containerizing a traditional application like WordPress is excellent practice, and harder than you might think. More importantly, running a Kubernetes cluster on GKE can be quite cheap if you use preemptible VMs, which can be reclaimed at any time, so you also get Chaos Engineering for free!

ref:
https://cloud.google.com/pricing/list
https://cloud.google.com/sql/pricing
https://cloud.google.com/compute/all-pricing

Google Cloud SQL

Cloud SQL is the fully managed relational database service on Google Cloud. At the time of writing, its MySQL offering supported only versions 5.6 and 5.7.

You can create a MySQL instance with a few clicks in the Google Cloud Platform Console, or from the CLI. It is recommended to enable Private IP, which puts the instance on your VPC network so that it is never exposed to the public Internet. However, you have to turn on Public IP if you want to connect to it from your local machine. Otherwise, you might see an error like couldn't connect to "xxx": dial tcp 10.x.x.x:3307: connect: network is unreachable. Remember to set an IP allowlist (authorized networks) for Public IP.

Connect to a Cloud SQL instance from your local machine:

$ gcloud components install cloud_sql_proxy
$ cloud_sql_proxy -instances=YOUR_INSTANCE_CONNECTION_NAME=tcp:0.0.0.0:3306

$ mysql --host 127.0.0.1 --port 3306 -u root -p

ref:
https://cloud.google.com/sql/docs/mysql
https://cloud.google.com/sql/docs/mysql/sql-proxy

Google Kubernetes Engine

The master (control plane) of your Google Kubernetes Engine cluster is managed by GKE itself, so you only need to provision and pay for worker nodes. At the time of writing, there were no cluster management fees.

You can create a Kubernetes cluster from the Google Cloud Platform Console or the CLI, and there are some useful settings you might like to turn on:

Node Pools

Over-provisioning is human nature, so don't spend too much time choosing the right machine type for your Kubernetes cluster at the beginning: without real usage data at hand, you are very likely to over-provision anyway. Instead, after deploying your workloads, find out the actual resource usage from Stackdriver Monitoring or GKE usage metering, then adjust your node pools.

Some useful node pool configurations:

  • Enable preemptible nodes
  • Access scopes > Set access for each API:
    • Enable Cloud SQL

After the cluster is created, you can now configure your kubectl:

$ gcloud container clusters get-credentials YOUR_CLUSTER_NAME --zone YOUR_SELECTED_ZONE --project YOUR_PROJECT_ID
$ kubectl get nodes

If you are not familiar with Kubernetes, check out The Incomplete Guide to Google Kubernetes Engine.

WordPress

Here comes the tricky part: containerizing a WordPress site is not as simple as pulling a Docker image and setting replicas: 10, since WordPress is a thoroughly stateful application. The state lives in two places in particular:

  • MySQL Database
  • The wp-content folder

The dependency on MySQL is relatively easy to solve since the database is an external service. It could be managed, self-hosted, a single machine, master-slave, or multi-master. Horizontally scaling a database is another story, though, so we focus only on WordPress here.

The next one is the notorious wp-content folder, which contains plugins, themes, and uploads.

ref:
https://engineering.bitnami.com/articles/scaling-wordpress-in-kubernetes.html
https://dev.to/mfahlandt/scaling-properly-a-stateful-app-like-wordpress-with-kubernetes-engine-and-cloud-sql-in-google-cloud-27jh
https://thecode.co/blog/moving-wordpress-to-multiserver/

User-uploaded Media

Users (site owners, editors, or any logged-in users) can upload images or even videos to a WordPress site if you allow them to. It is best to copy uploaded files to Amazon S3 or Google Cloud Storage automatically right after the upload. Also, don't forget to configure a CDN to point at your bucket. Luckily, there are already plugins for such tasks; the plugins.txt example later in this article includes wp-stateless, which targets Google Cloud Storage.

Both storage services also support direct uploads, where the uploaded file goes straight to S3 or GCS without touching your servers, but you might need to write some code to achieve that.

Pre-installed Plugins and Themes

You would usually deploy multiple WordPress Pods in Kubernetes, and each Pod has its own resources: CPU, memory, and storage. Anything written to a Pod's local volume is ephemeral and only exists for the Pod's lifetime. When you install a new plugin through the WordPress admin dashboard, the plugin is installed only on the local disk of the one Pod that happens to serve your request. Because the Service load-balances across Pods, your subsequent requests will inevitably reach other Pods that do not have those plugin files, even though the plugin is marked as activated in the database, which leaves the site in an inconsistent state.

There are two solutions for plugins and themes:

  1. A shared writable network filesystem mounted by each Pod
  2. An immutable Docker image which pre-installs every needed plugin and theme

For the first solution, you can set up an NFS server, a Ceph cluster, or any other network-attached filesystem. An NFS server might be the simplest way, although it can easily become a single point of failure in your architecture. Fortunately, managed network filesystem services are available from the major cloud providers, such as Amazon EFS and Google Cloud Filestore. Kubernetes does offer the ReadWriteMany access mode for PersistentVolumes (the volume can be mounted as read-write by many nodes), but only a few volume types support it, and gcePersistentDisk and awsElasticBlockStore are not among them.

However, I personally adopt the second solution: building Docker images that contain pre-installed plugins and themes through CI. The result is immutable and avoids the network latency of NFS. Besides, I don't install new plugins very often. Regrettably, some plugins still write data directly to the local disk, and most of the time we cannot prevent it.

ref:
https://serverfault.com/questions/905795/dynamically-added-wordpress-plugins-on-kubernetes

Dockerfile

Here is a dead-simple script to download pre-defined plugins and themes, and you can use it in Dockerfile later:

#!/bin/bash
# Download and unpack every plugin and theme listed in plugins.txt and themes.txt.
set -ex

mkdir -p plugins
for download_url in $(cat plugins.txt)
do
    # -f makes curl fail on HTTP errors, so `set -e` aborts the build
    curl -fLs "$download_url" -o plugin.zip
    unzip -oq plugin.zip -d plugins/
    rm -f plugin.zip
done

mkdir -p themes
for download_url in $(cat themes.txt)
do
    curl -fLs "$download_url" -o theme.zip
    unzip -oq theme.zip -d themes/
    rm -f theme.zip
done

plugins.txt and themes.txt look like this:

https://downloads.wordpress.org/plugin/prismatic.2.2.zip
https://downloads.wordpress.org/plugin/wp-githuber-md.1.11.8.zip
https://downloads.wordpress.org/plugin/wp-stateless.2.2.7.zip
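
The download script relies on word splitting of the list files, so a comment line in plugins.txt would be split into words and each word handed to curl as a URL. Below is a minimal sketch of a more defensive reader, assuming you want to allow `#` comments and blank lines in the list files. `list_urls` is a hypothetical helper name and is not part of the original script.

```shell
#!/bin/sh
# list_urls FILE: print one download URL per line,
# skipping blank lines and lines that start with '#'.
# (Hypothetical helper, not part of the original install.sh.)
list_urls() {
    # The `|| [ -n "$line" ]` part keeps a last line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) continue ;;
        esac
        printf '%s\n' "$line"
    done < "$1"
}
```

It could then replace `$(cat plugins.txt)` in the loop, for example `for download_url in $(list_urls plugins.txt)`.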

Then you need to create a custom Dockerfile based on the official wordpress Docker image along with your customizations.

FROM wordpress:5.2.4-fpm as builder

WORKDIR /usr/src/wp-cli/
RUN curl -Os https://raw.githubusercontent.com/wp-cli/builds/gh-pages/phar/wp-cli.phar && \
    chmod +x wp-cli.phar && \
    mv wp-cli.phar wp

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    unzip && \
    apt-get purge -y --auto-remove -o APT::AutoRemove::RecommendsImportant=false && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /usr/src/app/
COPY wordpress/ /usr/src/app/
RUN chmod +x install.sh && \
    sh install.sh && \
    rm -rf \
    install.sh \
    plugins.txt \
    themes.txt

###

FROM wordpress:5.2.4-fpm

RUN mv "$PHP_INI_DIR/php.ini-production" "$PHP_INI_DIR/php.ini"
COPY php/custom.ini /usr/local/etc/php/conf.d/
COPY php-fpm/zz-docker.conf /usr/local/etc/php-fpm.d/

COPY --from=builder /usr/src/wp-cli/wp /usr/local/bin/
COPY --from=builder /usr/src/app/ /usr/src/wordpress/wp-content/
RUN cd /usr/src/wordpress/wp-content/ && \
    rm -rf \
    plugins/akismet/ \
    plugins/hello.php \
    themes/twentysixteen/ \
    themes/twentyseventeen/

# HACK: `101` is the user id of `nginx` user in `nginx:x.x.x-alpine` Docker image
# https://stackoverflow.com/questions/36824222/how-to-change-the-nginx-process-user-of-the-official-docker-image-nginx
RUN usermod -u 101 www-data && \
    groupmod -g 101 www-data

ENTRYPOINT ["docker-entrypoint.sh"]
CMD ["php-fpm"]

The multiple FROM statements are for multi-stage builds.

See more details on the GitHub repository:
https://github.com/vinta/vinta.ws/tree/master/docker/code-blog

Google Cloud Build

Next is a small cloudbuild.yaml file that builds Docker images in Google Cloud Build, triggered automatically by GitHub commits.

substitutions:
  _BLOG_IMAGE_NAME: my-blog
steps:
- id: my-blog-cache-image
  name: gcr.io/cloud-builders/docker
  entrypoint: "/bin/bash"
  args:
   - "-c"
   - |
     docker pull asia.gcr.io/$PROJECT_ID/$_BLOG_IMAGE_NAME:$BRANCH_NAME || exit 0
  waitFor: ["-"]
- id: my-blog-build-image
  name: gcr.io/cloud-builders/docker
  args: [
    "build",
    "--cache-from", "asia.gcr.io/$PROJECT_ID/$_BLOG_IMAGE_NAME:$BRANCH_NAME",
    "-t", "asia.gcr.io/$PROJECT_ID/$_BLOG_IMAGE_NAME:$BRANCH_NAME",
    "-t", "asia.gcr.io/$PROJECT_ID/$_BLOG_IMAGE_NAME:$SHORT_SHA",
    "docker/my-blog/",
  ]
  waitFor: ["my-blog-cache-image"]
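images:
- asia.gcr.io/$PROJECT_ID/$_BLOG_IMAGE_NAME:$BRANCH_NAME # pushed so the next build can pull it as a cache
- asia.gcr.io/$PROJECT_ID/$_BLOG_IMAGE_NAME:$SHORT_SHA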

Just put it into the root directory of your GitHub repository. Don't forget to store Docker images near your server's location, in my case, asia.gcr.io.

Moreover, it is recommended by the official documentation to use --cache-from for speeding up Docker builds.
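
To see which image references the build configuration produces, the tag scheme can be reproduced locally. This is a minimal sketch, assuming Cloud Build's documented behavior that $SHORT_SHA is the first seven characters of the commit SHA. `short_sha` and `image_ref` are hypothetical helper names, and the project and image names are made-up values.

```shell
#!/bin/sh
# short_sha COMMIT_SHA: first seven characters, mirroring $SHORT_SHA.
short_sha() {
    printf '%s\n' "$1" | cut -c1-7
}

# image_ref PROJECT_ID IMAGE_NAME TAG: full reference in the asia.gcr.io registry.
image_ref() {
    printf 'asia.gcr.io/%s/%s:%s\n' "$1" "$2" "$3"
}

# Example with made-up values; in a real checkout you could pass
# "$(git rev-parse HEAD)" instead of the literal SHA.
sha="0123456789abcdef0123456789abcdef01234567"
image_ref "my-project" "my-blog" "$(short_sha "$sha")"
```

The branch tag moves with every build and serves as the cache source, while the short-SHA tag identifies one specific build and is the safer one to reference in a Deployment.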

ref:
https://cloud.google.com/container-registry/docs/pushing-and-pulling#tag_the_local_image_with_the_registry_name
https://cloud.google.com/cloud-build/docs/speeding-up-builds

Deployments

Finally, here come the Kubernetes manifests. The era of YAML developers.

WordPress, PHP-FPM, and NGINX

You can run the WordPress site as a Deployment with an NGINX sidecar container that proxies to PHP-FPM via a UNIX socket.

ConfigMaps for both WordPress and NGINX:

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-blog-wp-config
data:
  wp-config.php: |
    <?php
    define('DB_NAME', 'xxx');
    define('DB_USER', 'xxx');
    define('DB_PASSWORD', 'xxx');
    define('DB_HOST', 'xxx');
    define('DB_CHARSET', 'utf8mb4');
    define('DB_COLLATE', '');

    define('AUTH_KEY',         'xxx');
    define('SECURE_AUTH_KEY',  'xxx');
    define('LOGGED_IN_KEY',    'xxx');
    define('NONCE_KEY',        'xxx');
    define('AUTH_SALT',        'xxx');
    define('SECURE_AUTH_SALT', 'xxx');
    define('LOGGED_IN_SALT',   'xxx');
    define('NONCE_SALT',       'xxx');

    $table_prefix = 'wp_';

    define('WP_DEBUG', false);

    if (isset($_SERVER['HTTP_X_FORWARDED_PROTO']) && $_SERVER['HTTP_X_FORWARDED_PROTO'] === 'https') {
      $_SERVER['HTTPS'] = 'on';
    }

    // WORDPRESS_CONFIG_EXTRA
    define('AUTOSAVE_INTERVAL', 86400);
    define('WP_POST_REVISIONS', false);

    if (!defined('ABSPATH')) {
      define('ABSPATH', dirname( __FILE__ ) . '/');
    }

    require_once(ABSPATH . 'wp-settings.php');
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-blog-nginx-site
data:
  default.conf: |
    server {
      listen 80;
      root /var/www/html;
      index index.php;

      if ($http_user_agent ~* (GoogleHC)) { # https://cloud.google.com/kubernetes-engine/docs/concepts/ingress#health_checks
        return 200;
      }

      location /blog/ { # WordPress is installed in a subfolder
        try_files $uri $uri/ /blog/index.php?q=$uri&$args;
      }

      location ~ [^/]\.php(/|$) {
        try_files $uri =404;
        fastcgi_split_path_info ^(.+?\.php)(/.*)$;
        include fastcgi_params;
        fastcgi_param HTTP_PROXY "";
        fastcgi_pass unix:/var/run/php-fpm.sock;
        fastcgi_index index.php;
        fastcgi_buffers 8 16k;
        fastcgi_buffer_size 32k;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param PATH_INFO $fastcgi_path_info;
      }
    }

The wordpress image supports setting configurations through environment variables, though I prefer to store the whole wp-config.php in ConfigMap, which is more convenient. It is also worth noting that you need to use the same set of WordPress secret keys (AUTH_KEY, LOGGED_IN_KEY, etc.) for all of your WordPress replicas. Otherwise, you might encounter login failures due to mismatched login cookies.
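
Because every replica must share the same keys, it makes sense to generate them once and paste the result into wp-config.php, rather than generating them when a container starts. Below is a minimal sketch, assuming /dev/urandom, head, and base64 are available on your machine. `generate_wp_keys` is a hypothetical helper name.

```shell
#!/bin/sh
# generate_wp_keys: print eight define() lines with random values.
# (Hypothetical helper; run it once and reuse the output for all replicas.)
generate_wp_keys() {
    for name in AUTH_KEY SECURE_AUTH_KEY LOGGED_IN_KEY NONCE_KEY \
                AUTH_SALT SECURE_AUTH_SALT LOGGED_IN_SALT NONCE_SALT; do
        # 48 random bytes encode to exactly 64 base64 characters, none of
        # which is a quote or a backslash, so the value is safe inside '...'.
        value="$(head -c 48 /dev/urandom | base64 | tr -d '\n')"
        printf "define('%s', '%s');\n" "$name" "$value"
    done
}

generate_wp_keys
```

Note that replacing the keys later invalidates existing login cookies, so every logged-in user has to sign in again.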

Of course, you can use a base64 encoded (NOT ENCRYPTED!) Secret to store sensitive data.
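
To make the "not encrypted" point concrete, here is a minimal sketch of the round trip, assuming a base64 tool that accepts `--decode` (GNU coreutils and macOS both do). The password is a made-up value.

```shell
#!/bin/sh
# Encode a value the way it appears under `data:` in a Secret manifest.
encoded="$(printf '%s' 'hunter2' | base64)"
echo "$encoded"

# Anyone who can read the manifest can reverse it without any key.
decoded="$(printf '%s' "$encoded" | base64 --decode)"
echo "$decoded"
```

In other words, treat a Secret manifest as sensitive as the plaintext itself, and keep it out of public repositories.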

ref:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/
https://kubernetes.io/docs/concepts/configuration/secret/

Service:

apiVersion: v1
kind: Service
metadata:
  name: my-blog
spec:
  selector:
    app: my-blog
  type: NodePort
  ports:
  - name: http
    port: 80
    targetPort: http

ref:
https://kubernetes.io/docs/concepts/services-networking/service/

Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-blog
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-blog
  template:
    metadata:
      labels:
        app: my-blog
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100 # prefer not to schedule two pods on the same node (a soft rule, not a guarantee)
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                    - my-blog
      volumes:
      - name: php-fpm-unix-socket
        emptyDir:
          medium: Memory
      - name: wordpress-root
        emptyDir:
          medium: Memory
      - name: my-blog-wp-config
        configMap:
          name: my-blog-wp-config
      - name: my-blog-nginx-site
        configMap:
          name: my-blog-nginx-site
      containers:
      - name: wordpress
        image: asia.gcr.io/YOUR_PROJECT_ID/YOUR_IMAGE_NAME:YOUR_IMAGE_TAG
        workingDir: /var/www/html/blog # HACK: specify the WordPress installation path: subfolder
        volumeMounts:
        - name: php-fpm-unix-socket
          mountPath: /var/run
        - name: wordpress-root
          mountPath: /var/www/html/blog
        - name: my-blog-wp-config
          mountPath: /var/www/html/blog/wp-config.php
          subPath: wp-config.php
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
      - name: nginx
        image: nginx:1.17.5-alpine
        volumeMounts:
        - name: php-fpm-unix-socket
          mountPath: /var/run
        - name: wordpress-root
          mountPath: /var/www/html/blog
          readOnly: true
        - name: my-blog-nginx-site
          mountPath: /etc/nginx/conf.d/
          readOnly: true
        ports:
        - name: http
          containerPort: 80
        resources:
          requests:
            cpu: 50m
            memory: 100Mi
          limits:
            cpu: 100m
            memory: 100Mi

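Setting podAntiAffinity is important when running apps on preemptible nodes: spreading replicas across nodes means a single preempted node is less likely to take down every replica at once.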
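
Since the anti-affinity rule is only a preference, it is worth checking how the replicas actually landed. Below is a minimal sketch that counts Pods per node. `pods_per_node` is a hypothetical helper name, and the kubectl command in the comment assumes a configured kubectl and the app=my-blog label from the Deployment.

```shell
#!/bin/sh
# pods_per_node: read one node name per line on stdin and print
# "NODE COUNT" pairs, one line per node.
pods_per_node() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Against a live cluster:
#   kubectl get pods -l app=my-blog --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | pods_per_node

# Offline example with made-up node names:
printf '%s\n' node-a node-b node-a | pods_per_node
```

A node with a count above 1 means the scheduler could not honor the preference, usually because there are fewer schedulable nodes than replicas.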

Pro tip: you can set emptyDir.medium: Memory to mount a tmpfs (RAM-backed filesystem) for a volume. Keep in mind that files written there count against the container's memory limit.

ref:
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

CronJob

WP-Cron is the way WordPress handles scheduling time-based tasks. The problem is how WP-Cron works: on every page load, a list of scheduled tasks is checked to see what needs to be run. Therefore, you might consider replacing WP-Cron with a regular Kubernetes CronJob.

// in wp-config.php
define('DISABLE_WP_CRON', true);

apiVersion: batch/v1 # use batch/v1beta1 on clusters older than Kubernetes 1.21
kind: CronJob
metadata:
  name: my-blog-wp-cron
spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          volumes:
          - name: my-blog-wp-config
            configMap:
              name: my-blog-wp-config
          containers:
          - name: wp-cron
            image: asia.gcr.io/YOUR_PROJECT_ID/YOUR_IMAGE_NAME:YOUR_IMAGE_TAG
            command: ["/usr/local/bin/php"]
            args:
            - /usr/src/wordpress/wp-cron.php
            volumeMounts:
            - name: my-blog-wp-config
              mountPath: /usr/src/wordpress/wp-config.php
              subPath: wp-config.php
              readOnly: true
          restartPolicy: OnFailure

ref:
https://developer.wordpress.org/plugins/cron/

Ingress

Lastly, you would need external access to Services in your Kubernetes cluster:

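apiVersion: networking.k8s.io/v1 # networking.k8s.io/v1beta1 was removed in Kubernetes 1.22
kind: Ingress
metadata:
  name: load-balancer
  annotations:
    kubernetes.io/ingress.class: "gce" # https://github.com/kubernetes/ingress-gce
spec:
  rules:
  - host: example.com
    http:
      paths:
      - path: /blog/*
        pathType: ImplementationSpecific
        backend:
          service:
            name: my-blog
            port:
              name: http
      - path: /* # everything else goes to the default frontend
        pathType: ImplementationSpecific
        backend:
          service:
            name: frontend
            port:
              name: http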

The frontend Service points to a default NGINX Deployment that serves requests other than WordPress.

See more details on the GitHub repository:
https://github.com/vinta/vinta.ws/tree/master/kubernetes

ref:
https://kubernetes.io/docs/concepts/services-networking/ingress/
https://cloud.google.com/kubernetes-engine/docs/concepts/ingress

SSL Certificates

HTTPS is absolutely required nowadays. There are solutions that automatically provision and manage TLS certificates for you, such as cert-manager.

Conclusions

If a picture is worth a thousand words, then a video is worth a million. This video accurately describes how we ultimately deploy a WordPress site on Kubernetes.

CodeTengu Weekly Issue 140 @vinta - MongoDB, Kubernetes, NGINX, Google Cloud Platform, MySQL

This article was also published in CodeTengu Weekly - Issue 140.

MongoDB cookbook: Queries and Aggregations

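As mentioned in Issue 130, MongoDB's Aggregation is actually very powerful. Combined with features such as $elemMatch, $project, $let, $unwind, and $facet, it can handle a lot of complex business logic directly, without a single extra line of application code. Which work belongs in the database and which belongs in the API server is, of course, a matter of opinion.

That said, MongoDB Aggregation pipelines are about as messy to write as Elasticsearch's Query DSL (thanks, JSON). Since I can never remember the usage and limitations of the various operators, I followed the Cookbook approach mentioned before and wrote myself a set of notes, good for review, quick lookup, and sharing alike.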

Kubernetes Best Practices with Sandeep Dinesh (Google)

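In this video, a Google engineer talks about best practices for using Kubernetes and containers; in the second half, someone from Weaveworks covers the challenges they ran into when building their own Kubernetes cluster and how they solved them.

Although much of the first half is already covered in the official Kubernetes and GKE documentation, it is still great to have someone thoughtfully organize it for you (just like this weekly you subscribe to). After all, the Kubernetes documentation is so ridiculously enormous that by the time you finish reading it you have turned into the shape of YAML. However, I still don't quite get why more and more people recommend Helm. It doesn't seem to mean much for ordinary users (it's not a PaaS, after all); I would rather just grab a copy of the Chart and maintain it myself, which also makes later upgrades and customization easier, since it is just a pile of YAML files anyway. The more practical use seems to be a team sharing one Chart to deploy production, staging, and dev environments?

Further reading: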

Tuning NGINX behind Google Cloud Platform HTTP(S) Load Balancer

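Because of certain characteristics of Google Cloud HTTP Load Balancing, if you run NGINX (or OpenResty) inside Google Kubernetes Engine, some extra configuration is needed, especially keepalive_timeout 620s; (the load balancer keeps idle backend connections open for 600 seconds, so NGINX's timeout has to be longer than that).

As an aside, Google Cloud's Load Balancer is impressive too: besides supporting QUIC, it also enables TCP BBR by default.

Further reading: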

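别废话,各种 SQL 到底加了什么锁? (Cut the Chatter: What Locks Do Various SQL Statements Actually Take?)

This series of articles is all about the locks MySQL InnoDB takes in various situations. The author writes in a very accessible way; this is my favorite kind of technical article.

Further reading: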

TeePublic

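A website I discovered last week that specializes in T-shirts, and the point is that every T-shirt on it is super! nerdy! It even has a category called Programmer, or you can just search for the names of a few movies, games, anime, or comics you like; surprises guaranteed. I bought eight shirts the first day I saw it. Recommended for all you fellow nerds.

Shared by @vinta!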

CodeTengu Weekly Issue 125 @vinta - Amazon Web Services, Google Cloud Platform, Kubernetes, DevOps, MySQL, Redis

This article was also published in CodeTengu Weekly - Issue 125.

Apex and Terraform: The easiest way to manage AWS Lambda functions

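I have always subscribed to RSS feeds, but whenever work gets busy I pile up a bunch of unread articles. Yet I noticed that even on busy workdays I still scroll Twitter every now and then and complain a little, so I simply created @vinta_rss_bot, which syncs articles from Feedly to Twitter through Zapier, so that I easily stumble upon them while scrolling. After more than a week of testing, it works well; you can try building an RSS bot of your own.

Although this RSS bot took only five minutes to set up with Zapier, without writing a single line of code, not everyone is a follower of the "God of Spacing". After @vinta_rss_bot tweeted a few articles whose titles had no spaces between Chinese and English text, I started to feel uncomfortable all over. Eventually I couldn't stand it anymore, so I wrote a space-inserting web API on AWS Lambda, api.pangu.space, and had Zapier call it before posting to Twitter.

(That was a rather long preamble.)

This article documents how I deployed AWS Lambda functions with Apex and Terraform. The main logic is simple and written in Go; the more troublesome part was actually configuring Amazon API Gateway and HTTPS for the custom domain. Since it is just a side project, I didn't use the more heavyweight Serverless.

Further reading: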

cert-manager: Automatically provision TLS certificates in Kubernetes

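Our company's Kubernetes cluster currently uses kube-lego to obtain TLS/SSL certificates from Let's Encrypt automatically. However, kube-lego has announced that it only supports up to Kubernetes v1.8, and its maintainers want everyone to switch to another tool, developed by the same people and doing the same thing: cert-manager.

This article documents how I deployed cert-manager in preparation for migrating from kube-lego. However, during testing I found that some cert-manager features were not quite mature yet, such as ingress-shim, and we hadn't really run into any problems using kube-lego on Kubernetes v1.9.6, so the conclusion was to hold off on migrating for now. Still, since the article was already written, I'm sharing it anyway in the hope that it helps someone.

Further reading: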

GCP products described in 4 words or less

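I used to use AWS more, but my current company uses Google Cloud Platform. This article gives you a quick overview of what is available on GCP.

I can't help complaining: when will Google Cloud Memorystore finally launch?

Although GCP still lags behind AWS in every respect (except Google Kubernetes Engine), Google Cloud's Stackdriver suite is genuinely great to use. For example, Logging lets you run full-text search over the stdout of all containers with no configuration at all (turns to look at ELK). Speaking of reading logs, kubetail is also nice; it is basically an enhanced kubectl logs -f. There is also Debugger, which lets you run a debugger directly on production code, which is really flashy.

Further reading: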

One Giant Leap For SQL: MySQL 8.0 Released

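MySQL 8.0 was released a while ago. This version makes great strides in supporting the SQL standard and finally breaks free from the curse of SQL-92. It may at last shake off the "Friends don't let friends use MySQL" reputation (for now it looks like MongoDB will inherit that stigma).

Come to think of it, because I had always used MySQL, I had no idea what window functions were; the first time I used OVER (PARTITION BY ... ORDER BY ...) was actually in Apache Spark (what a SQL bumpkin).

Further reading: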

Redis in Action

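Last week I spent some time studying how Redis RDB/AOF persistence and master/slave replication work, and found that besides the official documentation, the book Redis in Action is also very detailed (although some of the content may be a bit dated). It is endorsed by the author of Redis himself, so it is worth a read.

I can't help sharing: after reading the Redis 4.0 redis.conf carefully last week, I discovered a new setting, aof-use-rdb-preamble. In my tests, enabling it reduced the size of appendonly.aof by 50%. Give it a try when you have time.

Further reading: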

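金丝雀发布、滚动发布、蓝绿发布到底有什么差别?关键点是什么? (What Exactly Are the Differences Between Canary, Rolling, and Blue-Green Releases? What Are the Key Points?)

Only after reading this article did I finally understand what Canary Releases, Blue-green Deployment, and Rolling Update mean (embarrassing).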

HTTP codes as Valentine’s Day comics

This article introduces various HTTP status codes in comic form. It's almost too cute.

Shared by @vinta.

Monty Python's Flying Circus on Netflix

Ladies and gentlemen, Monty Python's Flying Circus is now on Netflix! If you don't know who Monty Python are, we introduced them in Issue 6!

Shared by @vinta!

CodeTengu Weekly Issue 104 @vinta - Recommender System, Apache Spark, Machine Learning, MySQL

This article was also published in CodeTengu Weekly - Issue 104.

Build a recommender system with Spark: Logistic Regression

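A while ago I wrote several articles about building a GitHub repo recommender system with Apache Spark, intending to make it a series. But I had to pause for quite a while due to health problems, and in the meantime GitHub launched its own recommender system (shrug). Back to the topic: this article is mainly about using Logistic Regression to rank recommendation results, focusing on feature engineering and the Machine Learning Pipeline. It doesn't dwell on the LR algorithm itself; after all, it is just a linear model.

Further reading:

Next, a heavy digression: I sincerely advise everyone to pay close attention to your posture when using computers and phones, because I was recently diagnosed with a herniated cervical disc compressing a nerve. When it flares up it is really no joke: your hands and feet go numb, itchy, and painful, you can't concentrate on anything, and you can't even get a good night's sleep. I have been taking so many painkillers and muscle relaxants lately that I'm damn near turning into Yan Nantian. Just imagine a little vibrator deep inside your bones or nerves buzzing every now and then; oh, that is no fun at all.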

Build a recommender system with Spark: Content-based and Elasticsearch

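This is another article in the GitHub recommender system series, covering the ever-popular Content-based Recommendation. I originally planned to convert the repos' text data into Word2Vec vectors and then compute the similarity between repos (the so-called Similarity Join), but computing similarity across that many repos simply took too long; even with Locality Sensitive Hashing it was still too slow and prone to OOM. Then it occurred to me that finding similar or related items is exactly what a search engine does, so I switched to Elasticsearch. Using the document id as the search condition, a single More Like This query solves it, clean and simple. After all, not everything needs to be solved inside Spark.

基于 Spark UI 性能优化与调试 —— 初级篇 (Performance Tuning and Debugging with Spark UI: Beginner's Edition)

When programming, you occasionally hit the situation of "I only added one simple line of code, so why did the whole program's performance drop so much?", simply because we don't really know what that line actually does. It is even worse when your program runs on a distributed system. Fortunately, Spark provides an excellent tool: Spark UI. Through the Event Timeline and DAG Visualization, you can see the execution of an entire Spark application in great detail, such as what a given task of a given stage of a given job did, how long it took, and which machine it ran on, and you can even pinpoint the exact function call on the exact line of the exact file in your code. I wish every language and framework had such a convenient tool.

To be honest, though, DAG Visualization is dizzying at first sight, especially when you work with Spark SQL and DataFrames while Spark UI actually shows the low-level RDD operations; it takes some time to get used to. You may also need some grasp of concepts such as Executor, RDD, Partition, and Shuffle first.

Further reading: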

Google - Machine Learning Glossary

This is a glossary of common machine learning terms produced by Google. Very useful!

MySQL vs. MariaDB: Reality Check

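Percona put together a table comparing the similarities and differences among MySQL, MariaDB, and Percona Server for MySQL. It should be helpful for anyone evaluating or switching to one of these databases. If you still can't decide after reading it, the golden rule: just close your eyes and pick the one most people use.

Although MariaDB originally billed itself as "a drop-in replacement for MySQL", it is already 2017 and times have changed. People probably also have a more realistic understanding of the word "compatibility" now; after all, they are different products developed by different people at different times in different ways.

Further reading: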