AWS DynamoDB Notes

AWS DynamoDB is a fully managed key-value and document NoSQL database service provided by Amazon Web Services. Its pricing model is based on the read and write throughput you provision (plus storage), rather than on the running hours of database instances.

ref:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html
http://www.slideshare.net/AmazonWebServices/design-patterns-using-amazon-dynamodb

Glossary

DynamoDB is schema-less.

  • table: a table is a collection of items.
  • item: an item is a collection of attributes (key-value pairs).
  • attribute: an attribute is similar to a field or column in other databases.
  • primary key: one or two attributes that can uniquely identify every item in a table.
    • partition key (aka hash key): a simple primary key, composed of one attribute.
    • partition key and sort key (aka range key): a composite primary key, composed of two attributes.

ref:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.CoreComponents.html
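
For illustration, an item in a table with a composite primary key might look like this (a hypothetical table; the attribute names are made up):

# partition key: "publication", sort key: "issueNumber";
# every other key-value pair is a regular attribute
item = {
    'publication': 'weekly',   # partition key (hash key)
    'issueNumber': 42,         # sort key (range key)
    'title': 'Issue 42',
    'publishedAt': 1467302400,
}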

Global Secondary Index (GSI)

A secondary index is a second set of keys in addition to the primary key.
A table can have multiple secondary indexes.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html

A GSI can be created on a table whose primary key is either a partition key or a partition + sort key.
Like the primary key, a GSI can be simple or composite.
GSIs can be added and removed at any time.

If you don't need strong consistency, or the data for an individual partition key would exceed 10 GB, use a GSI.

ref:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
http://iamgarlic.blogspot.tw/2015/01/amazon-dynamodb-global-secondary-index.html
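
As an illustration of querying a GSI, here is a minimal sketch using boto3, the Python SDK (an assumption on my part; the notes below use the JavaScript SDK and the CLI). The table and index names follow the create-table commands in the Commands section, and the publication value is made up:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb', region_name='ap-northeast-1')
table = dynamodb.Table('CodeTengu_WeeklyIssue')

# query the GSI instead of the primary key;
# GSI queries are always eventually consistent
response = table.query(
    IndexName='publication_published_at',
    KeyConditionExpression=Key('publication').eq('weekly') & Key('publishedAt').gt(1451606400),
)
print(response['Items'])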

Local Secondary Index (LSI)

An LSI can only be created on a table whose primary key is a partition + sort key.
An LSI must use the table's original partition key together with another attribute as its partition + sort key (an LSI is always composite).
LSIs can only be defined when the table is created.

ref:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LSI.html
http://iamgarlic.blogspot.tw/2015/01/amazon-dynamodb-local-secondary-index.html
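
Since an LSI can only be declared at creation time, here is a minimal sketch of what that looks like with boto3 (again an assumption; the table and attribute names are hypothetical and unrelated to the tables created in the Commands section below):

import boto3

dynamodb = boto3.client('dynamodb', region_name='ap-northeast-1')

# the LSI reuses the table's partition key (userId) with a different
# sort key (createdAt); it can only be defined here, at creation time
dynamodb.create_table(
    TableName='Example_Message',
    AttributeDefinitions=[
        {'AttributeName': 'userId', 'AttributeType': 'S'},
        {'AttributeName': 'messageId', 'AttributeType': 'S'},
        {'AttributeName': 'createdAt', 'AttributeType': 'N'},
    ],
    KeySchema=[
        {'AttributeName': 'userId', 'KeyType': 'HASH'},
        {'AttributeName': 'messageId', 'KeyType': 'RANGE'},
    ],
    LocalSecondaryIndexes=[
        {
            'IndexName': 'userId_createdAt',
            'KeySchema': [
                {'AttributeName': 'userId', 'KeyType': 'HASH'},
                {'AttributeName': 'createdAt', 'KeyType': 'RANGE'},
            ],
            'Projection': {'ProjectionType': 'ALL'},
        },
    ],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
)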

Query and Scan

Avoid scan whenever you can: it reads every single item in the table.

The primary key and local secondary indexes can only be specified when the table is created and cannot be changed afterwards; global secondary indexes have no such restriction.

If the primary key is a partition + sort key:
a get requires both the partition key and the sort key,
while a query requires the partition key and the sort key is optional.

Whether it is part of the primary key, a GSI, or an LSI,
a partition key attribute can only be queried with =;
it has no rich query capability (>, <, between, begins_with, and so on).
Only the sort key supports rich query conditions.
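
For example, with boto3 (a sketch; the notes below use the JavaScript SDK and the CLI), a query on the CodeTengu_WeeklyPost table can only use equality on the partition key, while the sort key accepts range conditions:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb', region_name='ap-northeast-1').Table('CodeTengu_WeeklyPost')

# partition key: equality only; sort key: rich conditions such as between
response = table.query(
    KeyConditionExpression=Key('issueNumber').eq(42) & Key('id').between(1, 100),
)
print(response['Items'])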

Best Practices
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BestPractices.html

Choosing a Partition Key
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

Querying DynamoDB by date
http://stackoverflow.com/questions/14836600/querying-dynamodb-by-date

Pick an item randomly
http://stackoverflow.com/questions/10666364/aws-dynamodb-pick-a-record-item-randomly

ref:
https://www.uplift.agency/blog/posts/2016/03/clearcare-dynamodb
https://medium.com/building-timehop/one-year-of-dynamodb-at-timehop-f761d9fe5fa1#.3g97b3lqy

Commands

DynamoDB is schema-less: when creating a table, you only define the attributes that are used in the primary key and in any secondary indexes.

# you can use the project name as a prefix for the table name
# read / write capacity units can be changed at any time afterwards
$ aws dynamodb create-table \
--table-name CodeTengu_Preference \
--attribute-definitions AttributeName=name,AttributeType=S \
--key-schema AttributeName=name,KeyType=HASH \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5

$ aws dynamodb create-table \
--table-name CodeTengu_WeeklyIssue \
--attribute-definitions AttributeName=number,AttributeType=N AttributeName=publication,AttributeType=S AttributeName=publishedAt,AttributeType=N \
--key-schema AttributeName=number,KeyType=HASH \
--global-secondary-indexes IndexName=publication_published_at,KeySchema='[{AttributeName=publication,KeyType=HASH},{AttributeName=publishedAt,KeyType=RANGE}]',Projection='{ProjectionType=ALL}',ProvisionedThroughput='{ReadCapacityUnits=5,WriteCapacityUnits=5}' \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5

$ aws dynamodb create-table \
--table-name CodeTengu_WeeklyPost \
--attribute-definitions AttributeName=issueNumber,AttributeType=N AttributeName=id,AttributeType=N  AttributeName=categoryCode,AttributeType=S \
--key-schema AttributeName=issueNumber,KeyType=HASH AttributeName=id,KeyType=RANGE \
--global-secondary-indexes IndexName=categoryCode_id,KeySchema='[{AttributeName=categoryCode,KeyType=HASH},{AttributeName=id,KeyType=RANGE}]',Projection='{ProjectionType=ALL}',ProvisionedThroughput='{ReadCapacityUnits=5,WriteCapacityUnits=5}' \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5

ref:
http://docs.aws.amazon.com/cli/latest/reference/dynamodb/create-table.html
http://docs.aws.amazon.com/cli/latest/reference/dynamodb/update-table.html

$ aws dynamodb put-item \
--table-name CodeTengu_Preference \
--item file://fixtures/curated_api_config.json \
--return-consumed-capacity TOTAL

# fixtures/curated_api_config.json
{
  "name": { "S": "curated_api_config" },
  "apiKey": { "S": "xxx" }
}

ref:
http://docs.aws.amazon.com/cli/latest/reference/dynamodb/put-item.html

$ aws dynamodb get-item \
--table-name CodeTengu_WeeklyIssue \
--key '{"number": {"N": "42"}}'

ref:
http://docs.aws.amazon.com/cli/latest/reference/dynamodb/get-item.html

Usage

You should use AWS.DynamoDB.DocumentClient
instead of using AWS.DynamoDB directly.

const AWS = require('aws-sdk');

const dynamodb = new AWS.DynamoDB({ apiVersion: '2012-08-10', region: 'ap-northeast-1' });
const dynamodbClient = new AWS.DynamoDB.DocumentClient({ service: dynamodb });

const params = {
  RequestItems: {
    CodeTengu_Preference: {
      Keys: [
        { name: 'xxx' },
      ],
    },
  },
};

dynamodbClient.batchGet(params, (err, data) => {
  if (err) {
    console.log('fail');
    console.log(err);
  } else {
    console.log('success');
    console.log(data);
  }
});

ref:
http://aws.amazon.com/sdk-for-node-js/
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB.html
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB/DocumentClient.html

The complete code is available on GitHub:
https://github.com/CodeTengu/lambdabaku

argparse: Create a Command-line App with Python

argparse is a Python standard library module that makes it easy to write command-line applications. You should use it instead of optparse, which has been deprecated since Python 2.7.

ref:
https://docs.python.org/3/library/argparse.html

Basic example

#!/usr/bin/env python

from __future__ import print_function

import argparse
import sys

import jokekappa

class JokeKappaCLI(object):

    def __init__(self):
        parser = argparse.ArgumentParser(
            prog='jokekappa',
            description='humor is a serious thing, you should take it seriously',
        )
        self.parser = parser
        self.parser.add_argument('-v', '--version', action='version', version=jokekappa.__version__)

        self.subparsers = parser.add_subparsers(title='sub commands')
        self.subparsers \
            .add_parser('one', help='print one joke randomly') \
            .set_defaults(func=self.tell_joke)
        self.subparsers \
            .add_parser('all', help='print all jokes') \
            .set_defaults(func=self.tell_jokes)
        self.subparsers \
            .add_parser('update', help='update jokes from sources') \
            .set_defaults(func=self.update_jokes)

    def parse_args(self):
        if len(sys.argv) == 1:
            namespace = self.parser.parse_args(['one', ])
        else:
            namespace = self.parser.parse_args()
        namespace.func()

    def tell_joke(self):
        joke = jokekappa.get_joke()
        print(joke['content'])

    def tell_jokes(self):
        for joke in jokekappa.get_jokes():
            print(joke['content'])

    def update_jokes(self):
        jokekappa.update_jokes()
        print('Done')

def main():
    JokeKappaCLI().parse_args()

if __name__ == '__main__':
    main()

ref:
https://github.com/CodeTengu/JokeKappa
https://stackoverflow.com/questions/5176691/argparse-how-to-specify-a-default-subcommand

Advanced example

In the following code, you create a Python module called pangu and a command-line tool also called pangu; both share the same codebase.

in pangu.py

from __future__ import print_function

import argparse
import sys

__version__ = '3.3.0'
__all__ = ['spacing_text', 'PanguCLI']

def spacing_text(text):
    """
    You can find the real code at https://github.com/vinta/pangu.py
    """
    return text.upper().strip()

class PanguCLI(object):

    def __init__(self):
        parser = argparse.ArgumentParser(
            prog='pangu',
            description='paranoid text spacing',
        )
        self.parser = parser
        self.parser.add_argument('-v', '--version', action='version', version=__version__)
        self.parser.add_argument('text', action='store', type=str)

    def parse(self):
        if not sys.stdin.isatty():
            print(spacing_text(sys.stdin.read()))
        elif len(sys.argv) > 1:
            namespace = self.parser.parse_args()
            print(spacing_text(namespace.text))
        else:
            self.parser.print_help()
        sys.exit(0)

if __name__ == '__main__':
    PanguCLI().parse()

in bin/pangu

#!/usr/bin/env python

from pangu import PanguCLI

if __name__ == '__main__':
    PanguCLI().parse()

in setup.py

from setuptools import setup

setup(
    name='pangu',
    py_modules=['pangu', ],
    scripts=['bin/pangu', ],
    ...
)

As a result, there are several ways to invoke it:

$ pangu "abc"
$ python -m pangu "abc"
$ echo "abc" | pangu
$ echo "abc" | python -m pangu
ABC

ref:
https://github.com/vinta/pangu.py

Accept a list as an option

parser.add_argument('-u', '--usernames', type=lambda x: x.split(','), dest='usernames', required=True)
# your_command -u vinta
# your_command -u vinta,saiday

Accept a conditional argument

ref:
https://stackoverflow.com/questions/15459997/passing-integer-lists-to-python
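
A minimal sketch of one way to handle it (my own example, not taken from the linked answer): make one option required only when another flag is present, by validating after parse_args():

import argparse

parser = argparse.ArgumentParser(prog='your_command')
parser.add_argument('--save', action='store_true')
parser.add_argument('--output', help='only meaningful together with --save')

args = parser.parse_args()

# --output is conditionally required: only when --save is given
if args.save and not args.output:
    parser.error('--output is required when --save is used')

# your_command --save --output result.txt  # ok
# your_command --save                      # error: --output is required when --save is used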

awscli: Command-line Interface for Amazon Web Services

awscli is the official command-line interface for Amazon Web Services (AWS).

ref:
https://github.com/aws/aws-cli

Configuration

$ pip install awscli

$ aws configure

ref:
https://docs.aws.amazon.com/cli/latest/index.html

S3

Download A Folder

$ aws s3 sync \
s3://files.vinta.ws/static/images/stickers/ \
.

ref:
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html#using-s3-commands-managing-objects

Rename A Folder

$ aws s3 cp \
s3://files.vinta.ws/static/images/stickers_BACKUP/ \
s3://files.vinta.ws/static/images/stickers/ \
--recursive

ref:
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html

Make A Folder Public Read

$ aws s3 sync \
s3://files.vinta.ws/static/ \
s3://files.vinta.ws/static/ \
--grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers

Upload Files

# also make them public read
$ aws s3 cp \
. \
s3://files.vinta.ws/static/images/stickers/ \
--recursive \
--grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers

$ aws s3 cp \
db.sqlite3 \
s3://files.albedo.one/

$ aws s3 sync \
./ \
s3://files.albedo.one/ \
--exclude "*" --include "*.pickle"

Copy Files Between S3 Buckets

$ aws s3 sync s3://your_bucket_1/media s3://your_bucket_2/media \
--acl "public-read" \
--exclude "track_audio/*"

Remove Files

$ aws s3 rm s3://your_bucket_1/media/track_audio --recursive

ref:
https://docs.aws.amazon.com/cli/latest/reference/s3/rm.html

Where Do You Read Technical Articles?

A while ago some friends and I started a tech weekly, CodeTengu Weekly 碼天狗週刊. While deciding what to include each week, it occurred to me that "Where do you read technical articles?" might itself be a valuable and practical topic, so here I'd like to share the daily information sources I find worthwhile.

Weeklies you can subscribe to

Miscellaneous

Programming languages

Databases

DevOps

Machine Learning

Websites you can browse

If I tried to recommend every website and blog worth reading, eight years wouldn't be enough, and hardly anyone uses an RSS reader anymore (a real pity, since they are so convenient), so here I'll only mention a few news aggregators. On these sites you can follow specific topics such as Python, Golang, Apache Cassandra, or Docker, and they will automatically push related articles to you. What's special is that they also adjust the recommendations based on your personal preferences and the people you follow on Twitter.

The first service of this kind I used was Zite, but after it kept recommending news like "a six-meter python in India swallowed a child", I deleted it. Zite has since been acquired and folded into Flipboard, but I've lost confidence in it.

Updated 2015.09.06:

People you can follow

Below are developers who like to share technical articles on Twitter and who tweet fairly frequently:

On Twitter

On Facebook

Tools for Profiling your Python Projects

The first aim of profiling is to test a representative system and identify what is slow, what uses too much RAM, and what causes too much disk or network I/O. Keep in mind that profiling typically adds overhead to your code.

In this post, I will introduce tools you can use to profile your Python or Django projects, including: timer, pycallgraph, cProfile, line-profiler, and memory-profiler.

ref:
https://stackoverflow.com/questions/582336/how-can-you-profile-a-script
https://www.airpair.com/python/posts/optimizing-python-code

timer

The simplest way to profile a piece of code.

ref:
https://docs.python.org/3/library/timeit.html
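
For example, with the timeit module from the standard library (a minimal sketch):

import timeit

# time a small snippet 10000 times
print(timeit.timeit('"-".join(str(n) for n in range(100))', number=10000))

# or time an existing zero-argument callable
def slow_function():
    return sum(i * i for i in range(10000))

print(timeit.timeit(slow_function, number=1000))

# there is also a command-line interface
$ python -m timeit '"-".join(str(n) for n in range(100))'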

pycallgraph

pycallgraph is a Python module that creates call graph visualizations for Python applications.

ref:
https://pycallgraph.readthedocs.org/en/latest/

$ sudo apt-get install graphviz
$ pip install pycallgraph
# in your_app/middlewares.py
from pycallgraph import Config
from pycallgraph import PyCallGraph
from pycallgraph.globbing_filter import GlobbingFilter
from pycallgraph.output import GraphvizOutput
import time

class PyCallGraphMiddleware(object):

    def process_view(self, request, callback, callback_args, callback_kwargs):
        if 'graph' in request.GET:
            config = Config()
            config.trace_filter = GlobbingFilter(include=['rest_framework.*', 'api.*', 'music.*'])
            graphviz = GraphvizOutput(output_file='pycallgraph-{}.png'.format(time.time()))
            pycallgraph = PyCallGraph(output=graphviz, config=config)
            pycallgraph.start()

            self.pycallgraph = pycallgraph

    def process_response(self, request, response):
        if 'graph' in request.GET:
            self.pycallgraph.done()

        return response
# in settings.py
MIDDLEWARE_CLASSES = (
    'your_app.middlewares.PyCallGraphMiddleware',
    ...
)
$ python manage.py runserver 0.0.0.0:8000
$ open http://127.0.0.1:8000/your_endpoint/?graph=true

cProfile

cProfile is a tool in Python's standard library to understand which functions in your code take the longest to run. It will give you a high-level view of the performance problem so you can direct your attention to the critical functions.

ref:
http://igor.kupczynski.info/2015/01/16/profiling-python-scripts.html
https://ymichael.com/2014/03/08/profiling-python-with-cprofile.html

$ python -m cProfile manage.py test member
$ python -m cProfile -o my-profile-data.out manage.py test --failfast
$ python -m cProfile -o my-profile-data.out manage.py runserver 0.0.0.0:8000

$ pip install cprofilev
$ cprofilev -f my-profile-data.out -a 0.0.0.0 -p 4000
$ open http://127.0.0.1:4000
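
You can also profile a single call directly in code and inspect the dump with pstats (a minimal sketch):

import cProfile
import pstats

def expensive_function():
    return sorted(str(i) for i in range(100000))

cProfile.run('expensive_function()', 'my-profile-data.out')

# show the 10 functions with the highest cumulative time
stats = pstats.Stats('my-profile-data.out')
stats.strip_dirs().sort_stats('cumulative').print_stats(10)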

cProfile with django-cprofile-middleware

$ pip install django-cprofile-middleware
# in settings.py
MIDDLEWARE_CLASSES = (
    ...
    'django_cprofile_middleware.middleware.ProfilerMiddleware',
)

Open any URL with a ?prof suffix to profile it, for instance, http://localhost:8000/foo/?prof

ref:
https://github.com/omarish/django-cprofile-middleware

cProfile with django-extension and kcachegrind

kcachegrind is a profiling data visualization tool, used to determine the most time consuming execution parts of a program.

ref:
http://django-extensions.readthedocs.org/en/latest/runprofileserver.html

$ pip install django-extensions
# in settings.py
INSTALLED_APPS += (
    'django_extensions',
)
$ mkdir -p my-profile-data

$ python manage.py runprofileserver \
--noreload \
--nomedia \
--nostatic \
--kcachegrind \
--prof-path=my-profile-data \
0.0.0.0:8000

$ brew install qcachegrind --with-graphviz
$ qcachegrind my-profile-data/root.003563ms.1441992439.prof
# or
$ sudo apt-get install kcachegrind
$ kcachegrind my-profile-data/root.003563ms.1441992439.prof

cProfile with django-debug-toolbar

You can only use django-debug-toolbar if your view returns HTML, since it needs a place in the DOM to inject its debug panels.

ref:
https://github.com/django-debug-toolbar/django-debug-toolbar

$ pip install django-debug-toolbar
# in settings.py
INSTALLED_APPS += (
    'debug_toolbar',
)

DEBUG_TOOLBAR_PANELS = [
    ...
    'debug_toolbar.panels.profiling.ProfilingPanel',
    ...
]

line-profiler

line-profiler is a module for doing line-by-line profiling of functions. One of my favorite tools.

ref:
https://github.com/rkern/line_profiler

$ pip install line-profiler
# in your_app/views.py
def do_line_profiler(view=None, extra_view=None):
    import line_profiler

    def wrapper(view):
        def wrapped(*args, **kwargs):
            prof = line_profiler.LineProfiler()
            prof.add_function(view)
            if extra_view:
                [prof.add_function(v) for v in extra_view]
            with prof:
                resp = view(*args, **kwargs)
            prof.print_stats()
            return resp

        return wrapped

    if view:
        return wrapper(view)

    return wrapper

@do_line_profiler
def your_view(request):
    pass

ref:
https://djangosnippets.org/snippets/10483/
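
Outside of Django, the usual standalone workflow is to decorate the functions you want to measure with @profile (a builtin injected by kernprof) and run the script through kernprof; your_script.py is a placeholder:

# in your_script.py
@profile
def slow_function():
    total = 0
    for i in range(100000):
        total += i * i
    return total

slow_function()

$ kernprof -l -v your_script.py
# view a previously saved result
$ python -m line_profiler your_script.py.lprof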

There is a pure Python alternative: pprofile.
https://github.com/vpelletier/pprofile

line-profiler with django-devserver

ref:
https://github.com/dcramer/django-devserver

$ pip install git+git://github.com/dcramer/django-devserver#egg=django-devserver

in settings.py

INSTALLED_APPS += (
    'devserver',
)

DEVSERVER_MODULES = (
    ...
    'devserver.modules.profile.LineProfilerModule',
    ...
)

DEVSERVER_AUTO_PROFILE = False

in your_app/views.py

from devserver.modules.profile import devserver_profile

@devserver_profile()
def your_view(request):
    pass

line-profiler with django-debug-toolbar-line-profiler

ref:
http://django-debug-toolbar.readthedocs.org/en/latest/
https://github.com/dmclain/django-debug-toolbar-line-profiler

$ pip install django-debug-toolbar django-debug-toolbar-line-profiler
# in settings.py
INSTALLED_APPS += (
    'debug_toolbar',
    'debug_toolbar_line_profiler',
)

DEBUG_TOOLBAR_PANELS = [
    ...
    'debug_toolbar_line_profiler.panel.ProfilingPanel',
    ...
]

memory-profiler

This is a Python module for monitoring memory consumption of a process as well as line-by-line analysis of memory consumption for Python programs.

ref:
https://pypi.python.org/pypi/memory_profiler

$ pip install memory-profiler psutil
# in your_app/views.py
from memory_profiler import profile

@profile(precision=4)
def your_view(request):
    pass
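
It also works outside of Django on a standalone script: decorate a function with @profile and run the module directly, or use mprof to sample memory usage over time (a sketch of the documented command-line usage; your_script.py is a placeholder):

$ python -m memory_profiler your_script.py

# sample memory usage over time and plot it (plotting requires matplotlib)
$ mprof run your_script.py
$ mprof plot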

There are other options:
http://stackoverflow.com/questions/110259/which-python-memory-profiler-is-recommended

dogslow

ref:
https://bitbucket.org/evzijst/dogslow

django-slow-tests

ref:
https://github.com/realpython/django-slow-tests