碼天狗週刊 第 115 期 @vinta - Python, MongoDB, Apache Spark

碼天狗週刊 第 115 期 @vinta - Python, MongoDB, Apache Spark

本文同步發表於 CodeTengu Weekly - Issue 115

Timezone in Python: Offset-naive and Offset-aware datetimes

關於 datetime 和 timezone 的處理可能是每個軟體工程師遲早都會遭遇的難題之一,以前一直沒有搞得很清楚,前陣子總算有時間稍微整理了一下 timezone 在 Python 裡需要注意的地方。跟大家分享一下。

之前幾期的 issues 裡,其他 curator 也都不約而同分享過 Falsehoods Programmers Believe About XXX,其中就有幾個主題是「時間」,有興趣的人可以翻一下。

延伸閱讀:

MongoDB Query Performance over Ranges & Indexes that Help

上禮拜換了新工作,負責解的第一個 issue 就是改善某個複雜的 MongoDB aggregation 的效能。雖然對 MongoDB 還不是很熟(只有在第一份工作用過一陣子,而且後來就改用 PostgreSQL 了),不過遇到 slow query 問題的起手式基本上都是先 explain 一下,從 queryPlannerexecutionStats 欄位可以知道哪個 stage 的哪個 query 用了哪個 index、掃描了多少數據和回傳了多少數據。最後透過修改了 query 的寫法並且加了一個「欄位順序正確」的 compound index,總算成功地把執行時間從 1.x 秒降到 0.01x 秒。

這篇文章講的就是「欄位順序」對 MongoDB 的 compound index 的影響。作者還給出了一個簡單的判斷方式:

The order of fields in an index should be:

  • First, fields on which you will query for exact values.
  • Second, fields on which you will sort.
  • Finally, fields on which you will query for a range of values.

延伸閱讀:

Spark 2.x Troubleshooting Guide

這份簡報和延伸閱讀的那篇文章列出了許多 Apache Spark 在使用上常見的問題和解法,這陣子在用 Spark 做推薦系統的時候,幾乎把這些地雷都踩過了,如果可以早一點讀過這些資料就好啦。

題外話,如果你也是用 Google Cloud Dataproc 跑 Spark cluster 的話,千萬記得可以用 Preemptible VMs,能夠省下不少費用。不要像我一樣白白被多收了好幾百塊美金,心寒啊。

延伸閱讀:

Learning at work

這篇文章是 Stripe 的工程師分享的「如何在工作中學習」的文章。不過重點其實不是如何學習,而是「在工作中」學習。

畢竟要求每個工程師都應該在下班之後還要花很多時間學新技術、開發 side project 或是貢獻 open source 專案其實是不太現實的嘛。而且一天工作 8 小時,但是可以真的進入 mindflow 專心寫程式的時間可能連 4 小時都不到(我承認我超少)。有時候可能是狀態不對或是你就是欠缺完成任務所需要的知識,這時候與其硬幹、用 workaround 或是產出一些自己都不明所以只是剛好能動的程式碼,那還不如好好地利用上班時間,仔細鑽研一下你要解決的問題、搞懂那些你之前沒搞清楚的東西。再不然也可以重新審視一遍那個你剛剛解決的 bug 的 root cause,因為每個 bug 其實都是一個學習的契機啊。

就像陈皓說的「多些时间能少写些代码」,共勉之。

Timezone in Python: Offset-naive and Offset-aware datetimes

Timezone in Python: Offset-naive and Offset-aware datetimes

TL;DR: You should always store datetimes in UTC and convert to proper timezone on display.

A timezone offset refers to how many hours the timezone is from Coordinated Universal Time (UTC). The offset of UTC is +00:00, and the offset of Asia/Taipei timezone is UTC+08:00 (you could also present it as GMT+08:00). Basically, there is no perceptible difference between Greenwich Mean Time (GMT) and UTC.

The local time subtracts the offset of its timezone is UTC time. For instance, 18:00+08:00 of Asia/Taipei minuses timezone offset +08:00 is 10:00+00:00, 10 o'clock of UTC. On the other hand, UTC time pluses local timezone offset is local time.

ref:
https://opensource.com/article/17/5/understanding-datetime-python-primer
https://julien.danjou.info/blog/2015/python-and-timezones

到底是 GMT+8 還是 UTC+8?
http://pansci.asia/archives/84978

Install

$ pip install -U python-dateutil pytz tzlocal

Show System Timezone

import tzlocal

tzlocal.get_localzone()
# <DstTzInfo 'Asia/Taipei' LMT+8:06:00 STD>

tzlocal.get_localzone().zone
# 'Asia/Taipei'

from time import gmtime, strftime
print(strftime("%z", gmtime()))
# +0800

ref:
https://github.com/regebro/tzlocal
https://stackoverflow.com/questions/13218506/how-to-get-system-timezone-setting-and-pass-it-to-pytz-timezone/

Find Timezones of a Certain Country

import pytz

pytz.country_timezones('tw')
# ['Asia/Taipei']

pytz.country_timezones('cn')
# ['Asia/Shanghai', 'Asia/Urumqi']

ref:
https://pythonhosted.org/pytz/#country-information

Offset-naive datetime

A naive datetime object contains no timezone information. The datetime_obj.tzinfo will be set to None if the object is naive. Actually, datetime objects without timezone should be considered as a "bug" in your application. It is up for the programmer to keep track of which timezone users are working in.

import datetime

import dateutil.parser

datetime.datetime.now()
# return the current date and time in local timezone, in this example: Asia/Taipei (UTC+08:00)
# datetime.datetime(2018, 2, 2, 9, 15, 6, 211358)), naive

datetime.datetime.utcnow()
# return the current date and time in UTC
# datetime.datetime(2018, 2, 2, 1, 15, 6, 211358), naive

dateutil.parser.parse('2018-02-04T16:30:00')
# datetime.datetime(2018, 2, 4, 16, 30), naive

ref:
https://docs.python.org/3/library/datetime.html
https://dateutil.readthedocs.io/en/stable/

Offset-aware datetime

A aware datetime object embeds a timezone information. Rules of thumb for timezone in Python:

  • Always work with "offset-aware" datetime objects.
  • Always store datetime in UTC and do timezone conversion only when interacting with users.
  • Always use ISO 8601 as input and output string format.

There are two useful methods: pytz.utc.localize(naive_dt) for converting naive datetime to timezone offset-aware one, and aware_dt.astimezone(pytz.timezone('Asia/Taipei')) for switching timezones.

import datetime

import pytz

now_utc = pytz.utc.localize(datetime.datetime.utcnow())
# equals to datetime.datetime.now(pytz.utc)
# datetime.datetime(2018, 2, 4, 10, 17, 40, 679562, tzinfo=<UTC>), aware

now_taipei = now_utc.astimezone(pytz.timezone('Asia/Taipei'))
# convert to another timezone
# datetime.datetime(2018, 2, 4, 18, 17, 40, 679562, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>), aware

now_utc.isoformat()
# '2018-02-04T10:17:40.679562+00:00'

now_taipei.isoformat()
# '2018-02-04T18:17:40.679562+08:00'

now_utc == now_taipei
# True

For working with pytz, it is recommended to call tz.localize(naive_dt) instead of naive_dt.replace(tzinfo=tz). dt.replace(tzinfo=tz) does not handle daylight savings time correctly.

dt1 = datetime.datetime.now(pytz.timezone('Asia/Taipei'))
# datetime.datetime(2018, 2, 4, 18, 22, 28, 409332, tzinfo=<DstTzInfo 'Asia/Taipei' CST+8:00:00 STD>), aware

dt2 = datetime.datetime(2018, 2, 4, 18, 22, 28, 409332, tzinfo=pytz.timezone('Asia/Taipei'))
# datetime.datetime(2018, 2, 4, 18, 22, 28, 409332, tzinfo=<DstTzInfo 'Asia/Taipei' LMT+8:06:00 STD>), aware

dt1 == dt2
# False

ref:
https://pythonhosted.org/pytz/

Naive and aware datetime objects are not comparable.

naive = datetime.datetime.utcnow()
aware = pytz.utc.localize(naive)

naive == aware
# False

naive >= aware
# TypeError: can't compare offset-naive and offset-aware datetimes

Parse String to Datetime

python-dateutil usually comes in handy.

import dateutil.parser
import dateutil.tz

dt1 = dateutil.parser.parse('2018-02-04T19:30:00+08:00')
# datetime.datetime(2018, 2, 4, 19, 30, tzinfo=tzoffset(None, 28800)), aware

dt2 = dateutil.parser.parse('2018-02-04T11:30:00+00:00')
# datetime.datetime(2018, 2, 4, 11, 30, tzinfo=tzutc()), aware

dt3 = dateutil.parser.parse('2018-02-04T11:30:00Z')
# datetime.datetime(2018, 2, 4, 11, 30, tzinfo=tzutc()), aware

dt1 == dt2 == dt3
# True

ref:
https://dateutil.readthedocs.io/en/stable/

Parse Unix Timestamp to Datetime

import datetime
import time

import pytz

ts = time.time()
# seconds since the Epoch (1970-01-01T00:00:00 in UTC)
# 1517748706.063205

dt1 = datetime.datetime.fromtimestamp(ts)
# return the date and time of the timestamp in local timezone, in this example: Asia/Taipei (UTC+08:00)
# datetime.datetime(2018, 2, 4, 20, 51, 46, 63205), naive

dt2 = datetime.datetime.utcfromtimestamp(ts)
# return the date and time of the timestamp in UTC timezone
# datetime.datetime(2018, 2, 4, 12, 51, 46, 63205), naive

pytz.timezone('Asia/Taipei').localize(dt1) == pytz.utc.localize(dt2)
# True

ref:
https://stackoverflow.com/questions/13890935/does-pythons-time-time-return-the-local-or-utc-timestamp

We might receive an Unix timestamp from a JavaScript client.

var moment = require('moment')
var ts = moment('2018-02-02').unix()
// 1517500800

ref:
https://momentjs.com/docs/#/parsing/unix-timestamp/

Store datetime in Databases

  • MySQL lets developers decide what timezone should be used, and you should convert datetime to UTC before saving into database.
  • MongoDB assumes that all the timestamp are in UTC, and you have to normalize datetime to UTC.

ref:
https://tommikaikkonen.github.io/timezones/
https://blog.elsdoerfer.name/2008/03/03/fun-with-timezones-in-django-mysql/

Tools

ref:
https://www.epochconverter.com/
https://www.timeanddate.com/worldclock/converter.html

datetime format in Python / JavaScript

Python

I recommend dateutil.

import datetime
import dateutil

>>> dateutil.parser.parse('2014-12-24T16:15:16')
datetime.datetime(2014, 12, 24, 16, 15, 16)

>>> datetime_obj = datetime.datetime.strptime('2014-12-24T16:15:16', '%Y-%m-%dT%H:%M:%S')
datetime.datetime(2014, 12, 24, 16, 15, 16)

>>> datetime_obj = datetime.datetime.strptime('201408282300', '%Y%m%d%H%M')
datetime.datetime(2014, 8, 28, 23, 0)

>>> datetime_obj.strftime('%Y-%m-%d %H:%M')

strftime >> datetime -> str
strptime >> str --> datetime

ref:
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior

Django Template

class DriverInfoForm(forms.ModelForm):
    service_time_start = forms.TimeField(
        widget=forms.TimeInput(format='%H:%M'),
        input_formats=['%H:%M', ]
    )
日期:{{ withdraw.presented_at|date:"%Y 年 %n 月" }}
聯絡時間:{{ driver.service_time_start|date:"H:i" }} - {{ driver.service_time_end|date:"H:i" }}

要注意的是
Django 似乎不能 parse AM / PM
所以儘量用 24 小時制

ref:
https://docs.djangoproject.com/en/dev/ref/templates/builtins/#date

JavaScript

moment("201408292300", "YYYYMMDDHHmm");