Apex and Terraform: The easiest way to manage AWS Lambda functions

AWS Lambda lets you run code without provisioning or managing servers, a model known as Serverless or Function as a Service (FaaS).

Apex is a command-line tool written in Go for managing and deploying serverless functions on AWS Lambda. It also integrates with Terraform to manage the surrounding cloud infrastructure, for instance, wiring your AWS Lambda functions to Amazon API Gateway.

ref:
https://aws.amazon.com/lambda/
https://aws.amazon.com/api-gateway/
https://github.com/apex/apex

You can browse the projects built in this post on GitHub:
https://github.com/vinta/pangu.space
https://github.com/CodeTengu/LambdaBaku

Install

$ curl https://raw.githubusercontent.com/apex/apex/master/install.sh | sh

ref:
http://apex.run/#installation

Initialize

It is recommended to configure your AWS credentials with awscli.

$ pip install awscli
$ aws configure

ref:
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html

To use Apex to manage Lambda functions, make sure your AWS credentials have at least the following IAM permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "iam:CreateRole",
        "iam:CreatePolicy",
        "iam:AttachRolePolicy",
        "iam:PassRole",
        "lambda:GetFunction",
        "lambda:ListFunctions",
        "lambda:CreateFunction",
        "lambda:DeleteFunction",
        "lambda:InvokeFunction",
        "lambda:GetFunctionConfiguration",
        "lambda:UpdateFunctionConfiguration",
        "lambda:UpdateFunctionCode",
        "lambda:CreateAlias",
        "lambda:UpdateAlias",
        "lambda:GetAlias",
        "lambda:ListAliases",
        "lambda:ListVersionsByFunction",
        "logs:FilterLogEvents",
        "cloudwatch:GetMetricStatistics"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

$ apex init

ref:
http://apex.run/#getting-started

After running apex init, Apex creates a Role and a Policy. You should be able to find them in the AWS IAM Management Console. If you want to access other AWS resources (for instance, S3 buckets, DynamoDB tables, or SNS) in your Lambda functions, you must create a new Policy that grants the appropriate permissions and attach it to the Role that Apex created.

Here is an example Policy that allows operations on specific DynamoDB tables:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt123456789",
            "Effect": "Allow",
            "Action": [
                "dynamodb:*"
            ],
            "Resource": [
                "arn:aws:dynamodb:ap-northeast-1:123456789:table/CodeTengu_Preference",
                "arn:aws:dynamodb:ap-northeast-1:123456789:table/CodeTengu_Preference/*",
                "arn:aws:dynamodb:ap-northeast-1:123456789:table/CodeTengu_WeeklyIssue",
                "arn:aws:dynamodb:ap-northeast-1:123456789:table/CodeTengu_WeeklyIssue/*",
                "arn:aws:dynamodb:ap-northeast-1:123456789:table/CodeTengu_WeeklyPost",
                "arn:aws:dynamodb:ap-northeast-1:123456789:table/CodeTengu_WeeklyPost/*"
            ]
        }
    ]
}
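
If the table list changes often, the Resource entries can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical helper named build_dynamodb_policy (it is not part of Apex or AWS); the region, account ID and table names are the ones from the example above:

```python
import json


def build_dynamodb_policy(region, account_id, table_names):
    """Return an IAM policy dict granting dynamodb:* on the given tables.

    build_dynamodb_policy is a hypothetical helper, not part of Apex or AWS.
    """
    resources = []
    for name in table_names:
        arn = "arn:aws:dynamodb:{}:{}:table/{}".format(region, account_id, name)
        # The table itself, plus everything under it (the "/*" entries above).
        resources.extend([arn, arn + "/*"])
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["dynamodb:*"], "Resource": resources}
        ],
    }


policy = build_dynamodb_policy(
    "ap-northeast-1",
    "123456789",
    ["CodeTengu_Preference", "CodeTengu_WeeklyIssue", "CodeTengu_WeeklyPost"],
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console when creating the Policy.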

Write Lambda Functions

ref:
https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html
https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html

Node.js

The simplest handler:

const aws = require('aws-sdk');

exports.handle = (event, context, callback) => {
  doYourShit();
  callback(null, 'DONE');
};

ref:
https://docs.aws.amazon.com/lambda/latest/dg/programming-model.html

Call another Lambda function from within a Lambda function:

You must make sure your Lambda role has permission to invoke other Lambda functions (lambda:InvokeFunction).

const util = require('util');

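const aws = require('aws-sdk');

// create the Lambda service client used below
const lambda = new aws.Lambda();

const params = {
  FunctionName: 'LambdaBaku_syncIssue',
  InvocationType: 'Event', // means asynchronous execution
  // curatedIssue comes from the surrounding handler code
  Payload: JSON.stringify({ issue_number: curatedIssue.number }),
};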

lambda.invoke(params, (err, data) => {
  if (err) {
    console.log('FAIL', params);
    console.log(util.inspect(err));
  } else {
    console.log(data);
  }
});

ref:
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Lambda.html
https://stackoverflow.com/questions/31714788/can-an-aws-lambda-function-call-another
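
The same asynchronous call can be made from a Python function. A sketch, assuming boto3 is installed; invoke_async is a hypothetical helper name, and the client is passed in as an argument so the function can be exercised without touching AWS:

```python
import json


def invoke_async(lambda_client, function_name, payload):
    """Fire-and-forget invocation of another Lambda function.

    lambda_client is expected to be boto3.client("lambda");
    invoke_async itself is a hypothetical helper name.
    """
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous execution
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # For InvocationType="Event", Lambda answers with HTTP 202 on success.
    return response.get("StatusCode") == 202
```

In a real function you would call it as invoke_async(boto3.client("lambda"), "LambdaBaku_syncIssue", {"issue_number": 1}).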

Go

Write a Lambda function triggered by Amazon API Gateway:

package main

import (
    "encoding/json"
    "errors"
    "log"

    "github.com/aws/aws-lambda-go/events"
    "github.com/aws/aws-lambda-go/lambda"
    "github.com/vinta/pangu"
)

var (
    // ErrTextNotProvided is thrown when text is not provided in HTTP query string
    ErrTextNotProvided = errors.New("No text was provided in HTTP query string")
)

// Handler is the AWS Lambda function handler
func Handler(request events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
    log.Printf("request id: %s\n", request.RequestContext.RequestID)

    text, ok := request.QueryStringParameters["t"]
    if !ok {
        errMap := map[string]string{
            "message": ErrTextNotProvided.Error(),
        }
        errMapJSON, _ := json.MarshalIndent(errMap, "", " ")

        return events.APIGatewayProxyResponse{
            Body: string(errMapJSON),
            StatusCode: 400,
        }, nil
    }

    log.Printf("text: %s\n", text)

    textPlainHeaders := map[string]string{
        "content-type": "text/plain; charset=utf-8",
    }

    return events.APIGatewayProxyResponse{
        Body: pangu.SpacingText(text),
        Headers: textPlainHeaders,
        StatusCode: 200,
    }, nil
}

func main() {
    lambda.Start(Handler)
}

ref:
https://aws.amazon.com/blogs/compute/announcing-go-support-for-aws-lambda/
https://docs.aws.amazon.com/lambda/latest/dg/go-programming-model-handler-types.html
https://docs.aws.amazon.com/lambda/latest/dg/go-programming-model-errors.html

Your "Integration Request" configurations in API Gateway should look like:

  • Integration type: Lambda Function
  • Use Lambda Proxy integration: Yes
  • Lambda Region: ap-northeast-1
  • Lambda Function: panguspace_spacing_text
  • Invoke with caller credentials: No
  • Credentials cache: Do not add caller credentials to cache key
  • Use Default Timeout: Yes

It's also worth noting that, with Lambda Proxy integration, the API response is defined mainly by the APIGatewayProxyResponse returned from the Lambda function code. The "Integration Response" and "Method Response" configurations in API Gateway do not matter.

ref:
https://docs.aws.amazon.com/apigateway/latest/developerguide/getting-started-with-lambda-integration.html
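
With proxy integration the function returns the whole HTTP response itself. For comparison, here is a sketch of the same contract in Python; the handler name and the add_spacing placeholder (standing in for pangu.SpacingText) are assumptions made for illustration:

```python
import json


def add_spacing(text):
    # Placeholder for the real spacing logic (pangu.SpacingText in the Go code).
    return text


def handler(event, context):
    """Lambda proxy integration: return statusCode, headers and body."""
    # API Gateway sends null when the request has no query string.
    params = event.get("queryStringParameters") or {}
    if "t" not in params:
        message = {"message": "No text was provided in HTTP query string"}
        return {"statusCode": 400, "body": json.dumps(message)}
    return {
        "statusCode": 200,
        "headers": {"content-type": "text/plain; charset=utf-8"},
        "body": add_spacing(params["t"]),
    }
```

The status code, headers and body all come from the returned dict, which is why the response settings in the API Gateway console have no effect here.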

Usage

Deploy all functions:

$ apex deploy

ref:
http://apex.run/#deploying-functions

Invoke a function:

# invoke a function directly
$ apex invoke spacing_text --logs
{
    "statusCode": 400,
    "headers": null,
    "body": "{\"message\": \"No text was provided in HTTP query string\"}"
}

# invoke a function with an API Gateway event
$ cat fixtures/spacing_text_event.json
{
    "queryStringParameters": {"t": "與PM戰鬥的人,應當小心自己不要成為PM"}
}
$ apex invoke spacing_text --logs < fixtures/spacing_text_event.json
{
    "statusCode": 200,
    "headers": {"content-type": "text/plain; charset=utf-8"},
    "body": "與 PM 戰鬥的人,應當小心自己不要成為 PM"
}

ref:
http://apex.run/#invoking-functions

View logs, which might be delayed by several seconds:

$ apex logs -f

Pack a function:

$ apex build spacing_text > spacing_text.zip

Configure API Gateway

Create API Keys

To set up API keys, do the following:

  1. Configure your API methods to require an API key
  2. Deploy your API
  3. Create an API key for the API in a region
  4. Create a Usage Plan and associate the API key with a certain Stage

In step 1, your "Method Request" configurations in API Gateway should look like:

  • Authorization: NONE
  • Request Validator: NONE
  • API Key Required: true

Now you are able to call the API with an x-api-key header:

$ curl -H "x-api-key: YOUR-API-KEY" https://xxx.execute-api.ap-northeast-1.amazonaws.com/v1/your-endpoint/

ref:
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-create-usage-plans-with-rest-api.html
https://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-use-postman-to-call-api.html
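
The same request can be built from Python with the standard library only. A sketch, where build_request is a hypothetical helper and the URL and key are the placeholders from the curl command above:

```python
import urllib.request


def build_request(url, api_key):
    """Build a GET request carrying the API key in the x-api-key header."""
    return urllib.request.Request(url, headers={"x-api-key": api_key})


request = build_request(
    "https://xxx.execute-api.ap-northeast-1.amazonaws.com/v1/your-endpoint/",
    "YOUR-API-KEY",
)
# Sending it needs a real endpoint and key:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```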

You can also release your APIs without API keys if you like.

Set Up a Custom Domain

To set up a custom domain that is managed by Cloudflare, see the following link:
https://stackoverflow.com/a/46061708/885524

It might take a long time to generate "Target Domain Name" (xxx.cloudfront.net).

Don't forget to add "Base Path Mappings" in API Gateway Custom Domain Names:

  • api.pangu.space
    • Target Domain Name: xxx.cloudfront.net
    • ACM Certificate: *.pangu.space
    • Base Path Mappings:
      • Path: /v1
      • Destination: Pangu:v1

Manage Infrastructure with Terraform

Terraform is a tool for managing your cloud infrastructure as code.

$ brew install terraform

$ tree .
.
├── functions
│   ├── introduce
│   │   └── main.go
│   └── spacing_text
│       └── main.go
└── infrastructure
    ├── main.tf
    └── variables.tf

Define variables and data sources:

# infrastructure/variables.tf
data "aws_caller_identity" "current" {}

variable "aws_region" {}
variable "apex_environment" {}
variable "apex_function_role" {}

variable "apex_function_arns" {
  type = "map"
}

variable "apex_function_names" {
  type = "map"
}

variable "apex_function_introduce" {}
variable "apex_function_spacing_text" {}

ref:
https://www.terraform.io/docs/providers/aws/d/caller_identity.html

Define AWS resources:

# infrastructure/main.tf
resource "aws_api_gateway_rest_api" "pangu" {
  name = "Pangu"
}

resource "aws_api_gateway_method" "pangu_root" {
  rest_api_id   = "${aws_api_gateway_rest_api.pangu.id}"
  resource_id   = "${aws_api_gateway_rest_api.pangu.root_resource_id}"
  http_method   = "GET"
  authorization = "NONE"
}

resource "aws_api_gateway_integration" "pangu_root_get" {
  rest_api_id             = "${aws_api_gateway_rest_api.pangu.id}"
  resource_id             = "${aws_api_gateway_rest_api.pangu.root_resource_id}"
  http_method             = "${aws_api_gateway_method.pangu_root.http_method}"
  integration_http_method = "POST"
  type                    = "AWS_PROXY"
  uri                     = "arn:aws:apigateway:${var.aws_region}:lambda:path/2015-03-31/functions/${var.apex_function_introduce}/invocations"
}

resource "aws_api_gateway_method_response" "pangu_root_get_200" {
  rest_api_id = "${aws_api_gateway_rest_api.pangu.id}"
  resource_id = "${aws_api_gateway_rest_api.pangu.root_resource_id}"
  http_method = "${aws_api_gateway_method.pangu_root.http_method}"
  status_code = "200"

  response_models = {
    "application/json" = "Empty"
  }

  response_parameters = {
    "method.response.header.Access-Control-Allow-Origin" = true
  }
}

resource "aws_api_gateway_resource" "pangu_spacing_text" {
  rest_api_id = "${aws_api_gateway_rest_api.pangu.id}"
  parent_id   = "${aws_api_gateway_rest_api.pangu.root_resource_id}"
  path_part   = "spacing-text"
}

resource "aws_api_gateway_method" "pangu_spacing_text_get" {
  rest_api_id      = "${aws_api_gateway_rest_api.pangu.id}"
  resource_id      = "${aws_api_gateway_resource.pangu_spacing_text.id}"
  http_method      = "GET"
  authorization    = "NONE"
  api_key_required = true
}

resource "aws_api_gateway_integration" "pangu_spacing_text_get" {
  rest_api_id             = "${aws_api_gateway_rest_api.pangu.id}"
  resource_id             = "${aws_api_gateway_resource.pangu_spacing_text.id}"
  http_method             = "${aws_api_gateway_method.pangu_spacing_text_get.http_method}"
  integration_http_method = "POST"
  type                    = "AWS_PROXY"
  uri                     = "arn:aws:apigateway:${var.aws_region}:lambda:path/2015-03-31/functions/${var.apex_function_spacing_text}/invocations"
}

resource "aws_api_gateway_method_response" "pangu_spacing_text_get_200" {
  rest_api_id = "${aws_api_gateway_rest_api.pangu.id}"
  resource_id = "${aws_api_gateway_resource.pangu_spacing_text.id}"
  http_method = "${aws_api_gateway_method.pangu_spacing_text_get.http_method}"
  status_code = "200"

  response_models = {
    "application/json" = "Empty"
  }

  response_parameters = {
    "method.response.header.Access-Control-Allow-Origin" = true
  }
}

resource "aws_api_gateway_deployment" "pangu" {
  depends_on = [
    "aws_api_gateway_method.pangu_root",
    "aws_api_gateway_integration.pangu_root_get",
    "aws_api_gateway_method_response.pangu_root_get_200",
    "aws_api_gateway_resource.pangu_spacing_text",
    "aws_api_gateway_method.pangu_spacing_text_get",
    "aws_api_gateway_integration.pangu_spacing_text_get",
    "aws_api_gateway_method_response.pangu_spacing_text_get_200",
  ]

  rest_api_id = "${aws_api_gateway_rest_api.pangu.id}"
  stage_name  = "v1"
}

resource "aws_lambda_permission" "pangu_root_get" {
  statement_id  = "AllowInvokeFromAPIGateway"
  action        = "lambda:InvokeFunction"
  function_name = "${var.apex_function_introduce}"
  principal     = "apigateway.amazonaws.com"

  source_arn = "arn:aws:execute-api:${var.aws_region}:${data.aws_caller_identity.current.account_id}:${aws_api_gateway_rest_api.pangu.id}/*/${aws_api_gateway_integration.pangu_root_get.http_method}/"
}

resource "aws_lambda_permission" "pangu_spacing_text" {
  statement_id  = "AllowInvokeFromAPIGateway"
  action        = "lambda:InvokeFunction"
  function_name = "${var.apex_function_spacing_text}"
  principal     = "apigateway.amazonaws.com"

  source_arn = "arn:aws:execute-api:${var.aws_region}:${data.aws_caller_identity.current.account_id}:${aws_api_gateway_rest_api.pangu.id}/*/${aws_api_gateway_integration.pangu_spacing_text_get.http_method}${aws_api_gateway_resource.pangu_spacing_text.path}"
}

ref:
https://www.terraform.io/docs/providers/aws/guides/serverless-with-aws-lambda-and-api-gateway.html
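
The source_arn values above follow the execute-api ARN layout: region, account ID and API ID, followed by stage/METHOD/path, with * matching any stage. A small sketch that makes the layout explicit; execute_api_arn is a hypothetical helper, and the API ID abc123 is a made-up placeholder:

```python
def execute_api_arn(region, account_id, api_id, method, path, stage="*"):
    """Format an execute-api ARN usable as a Lambda permission source_arn.

    execute_api_arn is a hypothetical helper; path must start with "/".
    """
    return "arn:aws:execute-api:{}:{}:{}/{}/{}{}".format(
        region, account_id, api_id, stage, method, path
    )


# "abc123" and the account ID are made-up placeholder values.
print(execute_api_arn("ap-northeast-1", "123456789", "abc123", "GET", "/spacing-text"))
print(execute_api_arn("ap-northeast-1", "123456789", "abc123", "GET", "/"))
```

The second call corresponds to the root resource, whose path is just "/".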

# download provider plugins
$ apex infra init

# view the generated execution plan
$ apex infra plan

# deploy your infrastructure
$ apex infra apply
$ apex infra apply -auto-approve

ref:
http://apex.run/#managing-infrastructure

Remotely debug a Python app inside a Docker container in Visual Studio Code

Visual Studio Code with the Python extension has a "Remote Debugging" feature, which means you can attach to a real remote host as well as to a container on localhost.

In this article, we are going to debug a Flask app inside a local Docker container through VS Code's fancy debugger, while still being able to leverage Flask's auto-reloading mechanism. The same approach should apply to other Python apps.

ref:
https://code.visualstudio.com/docs/editor/debugging
https://code.visualstudio.com/docs/python/debugging#_remote-debugging

Install

On both the host OS and the container, install ptvsd==3.0.0. Currently, later versions of ptvsd are only experimentally supported.

$ pip3 install ptvsd==3.0.0

ref:
https://github.com/Microsoft/ptvsd
https://github.com/Microsoft/vscode-python/projects/6

Prepare

Here are the Dockerfile and docker-compose.yml used in this article:

# Dockerfile
FROM python:3.6.4-alpine3.6 AS builder

WORKDIR /usr/src/app/

RUN apk add --no-cache --virtual .build-deps \
    build-base \
    openjpeg-dev \
    openssl-dev \
    zlib-dev

COPY requirements.txt .
RUN pip install --user -r requirements.txt

FROM python:3.6.4-alpine3.6

ENV PATH=$PATH:/root/.local/bin
ENV FLASK_APP=app.py

WORKDIR /usr/src/app/

RUN apk add --no-cache --virtual .run-deps \
    openjpeg \
    openssl

EXPOSE 8000/tcp

COPY --from=builder /root/.local/ /root/.local/
COPY . .

# docker-compose.yml
version: '3'
services:
    db:
        image: mongo:3.4
        ports:
            - "27017:27017"
        volumes:
            - mongo-volume:/data/db
    web:
        build: .
        command: .docker-assets/start-web.sh
        ports:
            - "3000:3000"
            - "8000:8000"
        volumes:
            - .:/usr/src/app
            - ../vendors:/root/.local
        depends_on:
            - db
volumes:
    mongo-volume:

Usage

Method 1: Debug with --no-debugger, --reload and --without-threads

The convenient but slightly fragile way: with auto-reloading enabled, you can change your source code on the fly. However, the debugger may take much longer to attach with this method; --reload does not seem to be fully compatible with Remote Debugging.

We put the ptvsd code in sitecustomize.py so that ptvsd runs every time auto-reloading is triggered.

Steps:

  1. Set breakpoints
  2. Run your Flask app with --no-debugger, --reload and --without-threads
  3. Start the debugger with {"type": "python", "request": "attach", "preLaunchTask": "Enable remote debug"}
  4. The pre-launch task automatically copies the ptvsd code to site-packages/sitecustomize.py
  5. Click the "Debug Anyway" button
  6. Visit the part of the code that contains breakpoints

# site-packages/sitecustomize.py
try:
    import socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.close()
    import ptvsd
    ptvsd.enable_attach('my_secret', address=('0.0.0.0', 3000))
    print('ptvsd is started')
    # ptvsd.wait_for_attach()
    # print('debugger is attached')
except OSError as exc:
    print(exc)

ref:
https://docs.python.org/3/library/site.html
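
With --reload, Flask runs more than one Python process and each of them imports sitecustomize.py, so a second enable_attach on the same port can fail. That is presumably why the snippet above catches OSError. A sketch of an explicit check, where port_is_free is a hypothetical helper and not part of ptvsd:

```python
import socket


def port_is_free(host, port):
    """Return True if nothing is bound to (host, port) yet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError:
        # Typically "address already in use".
        return False
    finally:
        sock.close()
```

In sitecustomize.py it could guard the call, e.g. only run ptvsd.enable_attach('my_secret', address=('0.0.0.0', 3000)) when port_is_free('0.0.0.0', 3000) is true. The check and the bind are not atomic, so keeping the except OSError fallback is still a good idea.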

# .docker-assets/start-web.sh
rm -f /root/.local/lib/python3.6/site-packages/sitecustomize.py
pip3 install --user -r requirements.txt ptvsd==3.0.0
python -m flask run -h 0.0.0.0 -p 8000 --no-debugger --reload --without-threads

// .vscode/tasks.json
{
    "version": "2.0.0",
    "tasks": [
        {
            "label": "Enable remote debug",
            "type": "shell",
            "isBackground": true,
            "command": "docker cp sitecustomize.py project_web_1:/root/.local/lib/python3.6/site-packages/sitecustomize.py"
        }
    ]
}

ref:
https://code.visualstudio.com/docs/editor/tasks

// .vscode/launch.json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Attach",
            "type": "python",
            "request": "attach",
            "localRoot": "${workspaceFolder}",
            "remoteRoot": "/usr/src/app",
            "port": 3000,
            "secret": "my_secret",
            "host": "localhost",
            "preLaunchTask": "Enable remote debug"
        }
    ]
}

ref:
https://code.visualstudio.com/docs/editor/debugging#_launch-configurations

Method 2: Debug with --no-debugger and --no-reload

The inconvenient but slightly more reliable way: if you change any Python code, you need to restart the Flask app and re-attach the debugger in Visual Studio Code.

Steps:

  1. Set breakpoints
  2. Add ptvsd code to your FLASK_APP file
  3. Run your Flask app with --no-debugger and --no-reload
  4. Start the debugger with {"type": "python", "request": "attach"}
  5. Visit the part of the code that contains breakpoints

# in app.py
import ptvsd
ptvsd.enable_attach('my_secret', address=('0.0.0.0', 3000))
print('ptvsd is started')
# ptvsd.wait_for_attach()
# print('debugger is attached')

ref:
http://ramkulkarni.com/blog/debugging-django-project-in-docker/
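
Since the attach code now lives in app.py, it can be guarded so that it only runs when you ask for it. A sketch, assuming two hypothetical environment variables, REMOTE_DEBUG and REMOTE_DEBUG_PORT, which are not something Flask or ptvsd define:

```python
import os


def debug_address(environ):
    """Return (host, port) if remote debugging is requested, else None.

    REMOTE_DEBUG and REMOTE_DEBUG_PORT are hypothetical variable names.
    """
    if environ.get("REMOTE_DEBUG") != "1":
        return None
    return ("0.0.0.0", int(environ.get("REMOTE_DEBUG_PORT", "3000")))


address = debug_address(os.environ)
if address is not None:
    import ptvsd  # imported lazily so other environments do not need it

    ptvsd.enable_attach('my_secret', address=address)
    print('ptvsd is started')
```

Setting REMOTE_DEBUG=1 in docker-compose.yml would then switch the debugger on for the web service only.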

# .docker-assets/start-web.sh
pip3 install --user -r requirements.txt ptvsd==3.0.0
python -m flask run -h 0.0.0.0 -p 8000 --no-debugger --no-reload

// .vscode/launch.json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Attach",
            "type": "python",
            "request": "attach",
            "localRoot": "${workspaceFolder}",
            "remoteRoot": "/usr/src/app",
            "port": 3000,
            "secret": "my_secret",
            "host": "localhost"
        }
    ]
}

Method 3: Don't use Remote Debugging, Run Debugger Locally

Simply run your Flask app on localhost (macOS) instead of putting it in a container. You can still host your database, cache server and message queue inside containers; your Python app communicates with those services through ports that are exposed on 127.0.0.1. Therefore, you can use VS Code's debugger without strange tricks.

In practice, it is okay that your local development environment is different from the production environment.

# docker-compose.yml
version: '3'
services:
    db:
        image: mongo:3.6
        ports:
            - "27017:27017"
        volumes:
            - mongo-volume:/data/db
    cache:
        image: redis:4.0
        ports:
            - "6379:6379"
volumes:
    mongo-volume:

// .vscode/launch.json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Flask",
            "type": "python",
            "request": "launch",
            "stopOnEntry": false,
            "pythonPath": "${config:python.pythonPath}",
            "module": "flask",
            "cwd": "${workspaceFolder}",
            "args": [
                "run",
                "-h",
                "0.0.0.0",
                "-p",
                "8000",
                "--no-debugger",
                "--no-reload"
            ],
            "envFile": "${workspaceFolder}/.env",
            "debugOptions": [
                "RedirectOutput"
            ]
        }
    ]
}

Sadly, you cannot use --reload while launching your app in the debugger. Nevertheless, most of the time you don't really need the debugger; a fast auto-reloading workflow is good enough. All you need is a Makefile for running the Flask app and the Celery worker on macOS: make run_web and make run_worker.

# Makefile
install:
    pipenv install
    pipenv run pip install git+https://github.com/gorakhargosh/watchdog.git

shell:
    pipenv run python -m flask shell

run_web:
    pipenv run python -m flask run -h 0.0.0.0 -p 8000 --debugger --reload

run_worker:
    pipenv run watchmedo auto-restart -d . -p '*.py' -R -- celery -A app:celery worker -l info -E -P gevent -Ofair

Bonus

You should try enabling debug.inlineValues, which shows variable values inline in the editor while debugging. It's awesome!

// settings.json
{
    "debug.inlineValues": true
}

ref:
https://code.visualstudio.com/updates/v1_9#_inline-variable-values-in-source-code

Issues

Starting the Python debugger is fucking slow
https://github.com/Microsoft/vscode-python/issues/106

Debugging library functions won't work currently
https://github.com/Microsoft/vscode-python/issues/111

Pylint for remote projects
https://gist.github.com/IBestuzhev/d022446f71267591be76fb48152175b7

Run a Celery task at a specific time

Schedule Tasks

You can run any Celery task at a specific time through the eta (short for "Estimated Time of Arrival") parameter.

import datetime

import celery

@celery.shared_task(bind=True)
def add_tag(task, user_id, tag):
    User.objects.filter(id=user_id, tags__ne=tag).update(push__tags=tag)
    return True

user_id = '582ee32a5b9c861c87dc297e'
tag = 'new_tag'
started_at = datetime.datetime(2018, 3, 12, tzinfo=datetime.timezone.utc)
add_tag.apply_async((user_id, tag), eta=started_at)

ref:
http://docs.celeryproject.org/en/master/userguide/calling.html#eta-and-countdown
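
eta takes an absolute datetime, while countdown is a shortcut for "this many seconds from now". A sketch that computes an eta with the standard library only; eta_in is a hypothetical helper, not a Celery function:

```python
import datetime


def eta_in(seconds, now=None):
    """Return a timezone-aware UTC datetime `seconds` after `now`."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return now + datetime.timedelta(seconds=seconds)


# With the task above, these two calls would schedule roughly the same run:
# add_tag.apply_async((user_id, tag), eta=eta_in(3600))
# add_tag.apply_async((user_id, tag), countdown=3600)
```

Using timezone-aware datetimes, as the example above does with datetime.timezone.utc, avoids ambiguity about which clock the eta refers to.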

Revoke Tasks

A revoked task is not executed: workers discard it when its eta arrives.

from celery.result import AsyncResult

AsyncResult(task_id).revoke()

ref:
http://docs.celeryproject.org/en/latest/reference/celery.result.html#celery.result.AsyncResult.revoke

Revoking tasks works by sending a broadcast message to all the workers, the workers then keep a list of revoked tasks in memory. When a worker starts up it will synchronize revoked tasks with other workers in the cluster.

The list of revoked tasks is in-memory so if all workers restart the list of revoked ids will also vanish. If you want to preserve this list between restarts you need to specify a file for these to be stored in by using the --statedb argument to celery worker.

ref:
http://docs.celeryproject.org/en/latest/userguide/workers.html#worker-persistent-revokes
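
To see why the list disappears on restart, here is a toy model of the behavior described above. It is not Celery's implementation; RevokedRegistry and its JSON state file are assumptions made purely for illustration:

```python
import json
import os


class RevokedRegistry:
    """Toy model: revoked task ids live in memory, optionally backed by a file."""

    def __init__(self, state_path=None):
        self.state_path = state_path
        self.revoked = set()
        # A "restart" with a state file reloads the ids revoked earlier.
        if state_path and os.path.exists(state_path):
            with open(state_path) as f:
                self.revoked = set(json.load(f))

    def revoke(self, task_id):
        self.revoked.add(task_id)
        if self.state_path:
            with open(self.state_path, "w") as f:
                json.dump(sorted(self.revoked), f)

    def should_run(self, task_id):
        return task_id not in self.revoked
```

Creating a new RevokedRegistry() without a path plays the role of a worker restart without --statedb: the new instance knows nothing about earlier revokes.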

Lazy evaluation in Django middlewares

Attach a lazily evaluated function as an attribute of request in a middleware.

from django.contrib.gis.geoip import GeoIP
from django.utils.functional import SimpleLazyObject


def get_country_code(request):
    g = GeoIP()
    location = g.country(request.META['REMOTE_ADDR'])
    country_code = location.get('country_code', 'TW')

    return country_code


class CountryAndSiteMiddleware(object):

    def process_request(self, request):
        request.COUNTRY_CODE = SimpleLazyObject(lambda: get_country_code(request))

Then you can use request.COUNTRY_CODE whenever you want; the GeoIP lookup only runs the first time the value is actually used.
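
The point of the wrapper is that the expensive lookup is deferred until first use and then cached. A framework-free sketch of that idea; Lazy is a simplified stand-in, not Django's SimpleLazyObject (which proxies attribute access instead of requiring an explicit call), and expensive_lookup is a hypothetical replacement for get_country_code:

```python
class Lazy:
    """Call `factory` at most once, on first use, and cache the result."""

    _UNSET = object()

    def __init__(self, factory):
        self._factory = factory
        self._value = Lazy._UNSET

    def get(self):
        if self._value is Lazy._UNSET:
            self._value = self._factory()
        return self._value


calls = []


def expensive_lookup():
    calls.append(1)  # record each real evaluation
    return "TW"


country_code = Lazy(expensive_lookup)
assert calls == []  # nothing evaluated yet
assert country_code.get() == "TW"
assert country_code.get() == "TW"
assert len(calls) == 1  # evaluated exactly once
```

Requests that never touch the value never pay for the lookup, which is what makes this pattern attractive in a middleware that runs on every request.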

Integrate with webpages using CasperJS (built on top of PhantomJS)

PhantomJS is a headless, scriptable WebKit runtime (aka browser) with a JavaScript API.

Usage

In script.js:

Log in and delete rarely used movie tags on Douban.

var casper = require('casper').create({
  pageSettings: {
    loadImages: true,
    loadPlugins: false
  },
  logLevel: 'debug',
  verbose: true
});

// save session cookies
var fs = require('fs');
var page = require('webpage').create();

var cookieFile = 'cookies.json';

var saveSessionCookie = function() {
  try {
    fs.statSync(cookieFile);
  } catch (e) {
    fs.write(cookieFile, JSON.stringify(phantom.cookies), 'w');
  }
}

if (fs.isFile(cookieFile)) {
  Array.prototype.forEach.call(JSON.parse(fs.read(cookieFile)), function(x) {
    phantom.addCookie(x);
  });
}

// script
var loginUrl = 'https://accounts.douban.com/login';
var startUrl = 'https://movie.douban.com/people/vinta/all';

var tags_do_not_delete = [
  '丹麦', '新西兰', '新加坡', '以色列', '印度', '意大利', '瑞典', '墨西哥', '俄罗斯', '西班牙', '比利时'
];

casper.start(loginUrl, function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  this.capture('login.png');

  var data = {
    form_email: 'xxx',
    form_password: 'xxx'
  };

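  // Douban may ask for a captcha.
  // If so, run: casperjs script.js --remote-debugger-port=9000
  // Open login.png first to read the captcha,
  // then enter it manually in the console at http://127.0.0.1:9000/
  // data['captcha-solution'] = '123';

  // fill and submit the form only after it has been rendered
  this.waitForSelector('form#lzform', function() {
    this.fill('form#lzform', data, true);
  });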
});

casper.then(function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  saveSessionCookie();

  this.capture('all.png');

  this.open(startUrl).then(function() {
    this.waitForSelector('#open_tags', function() {
      this.click('#open_tags');
    });

    this.waitWhileSelector('#open_tags');
  });
});

casper.then(function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  var links = this.evaluate(function() {
    var tagList = document.querySelectorAll('ul.tag-list li a');
    var theLinks = Array.prototype.map.call(tagList, function(elem) {
        return {
          tag: elem.textContent.trim(),
          href: elem.getAttribute('href'),
          count: parseInt(elem.nextElementSibling.textContent, 10)
        };
    });

    return theLinks;
  });

  var filteredLinks = links.filter(function(link) {
    if (link.count < 5 && tags_do_not_delete.indexOf(link.tag) == -1) {
      return true;
    }
    return false;
  });

  this.each(filteredLinks, function(self, link) {
    this.echo(link.tag + ', ' + link.count);

    self.thenOpen(link.href, function() {
      this.echo(this.getCurrentUrl());
      this.echo(this.getTitle());

      this.waitForSelector('#tag-del', function() {
        this.click('#tag-del');
      });

      this.waitForSelector('input[name="del_submit"]', function() {
        this.click('input[name="del_submit"]');
      });
    });
  });
});

casper.run();

To evaluate JavaScript code in the context of the webpage, you must use the evaluate() function. That context is sandboxed.

ref:
http://docs.casperjs.org/en/latest/modules/index.html

ref:
https://github.com/vinta/playground/blob/master/casperjs/script.js

Save session cookies

--cookies-file=xxx.txt only stores non-session cookies, but session cookies are usually the ones that keep you logged in. You have to save every cookie manually.

var casper = require('casper').create();

// save session cookies
var fs = require('fs');
var page = require('webpage').create();

var cookieFile = 'cookies.json';

var saveSessionCookie = function() {
  try {
    fs.statSync(cookieFile);
  } catch (e) {
    fs.write(cookieFile, JSON.stringify(phantom.cookies), 'w');
  }
}

if (fs.isFile(cookieFile)) {
  Array.prototype.forEach.call(JSON.parse(fs.read(cookieFile)), function(x) {
    phantom.addCookie(x);
  });
}

casper.start('yourUrl', function() {
  // do your shit
});

ref:
http://stackoverflow.com/questions/18739354/how-can-i-use-persisted-cookies-from-a-file-using-phantomjs

Run

$ docker run --rm -v `pwd`:/data vinta/casperjs:1.1.3 script.js

# or

$ brew install casperjs
$ casperjs script.js --disk-cache=true

ref:
https://hub.docker.com/r/vinta/casperjs/
https://hub.docker.com/r/zopanix/casperjs/

ref:
http://phantomjs.org/api/command-line.html

Run in debugging mode

$ casperjs script.js --remote-debugger-port=9000
$ open http://127.0.0.1:9000/

  • Click the first link (something like "file:///usr/local/Cellar/xxx").
  • In the Sources tab, press the "Enable Debugging" button.
  • In the Console tab, type "__run();" to start.
  • Once a breakpoint is hit, you can use the Console tab to debug.

ref:
http://phantomjs.org/troubleshooting.html