Amazon EKS: Setup kubernetes-external-secrets with AWS Secrets Manager

kubernetes-external-secrets allows you to use external secret management systems, like AWS Secrets Manager, to securely add Secrets in Kubernetes, so Pods can access Secrets normally.

ref:
https://github.com/external-secrets/kubernetes-external-secrets

AWS Secrets Manager

For instance, we create a secret named YOUR_SECRET on AWS Secrets Manager in the same region as our EKS cluster, using DefaultEncryptionKey as the encryption key. The content of the secret looks like:

{
  "KEY_1": "VALUE_1",
  "KEY_2": "VALUE_2"
}

We can retrieve the secret value:

aws secretsmanager get-secret-value --profile=perp \
--region ap-northeast-1 \
--secret-id YOUR_SECRET
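The command prints a JSON document in which the secret content is itself a JSON string stored under SecretString. Here is a minimal Python sketch of unpacking it, assuming a hypothetical sample response (the ARN and values are placeholders, not real output):

```python
import json

# Hypothetical sample of what `aws secretsmanager get-secret-value` prints;
# the ARN and the values are placeholders, not real output.
sample_response = json.dumps({
    "ARN": "arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:YOUR_SECRET-AbCdEf",
    "Name": "YOUR_SECRET",
    "SecretString": json.dumps({"KEY_1": "VALUE_1", "KEY_2": "VALUE_2"}),
})


def parse_secret(response_text: str) -> dict:
    """Return the key/value pairs stored in a JSON-formatted secret."""
    response = json.loads(response_text)
    # SecretString is a string that holds JSON, so it needs a second decode.
    return json.loads(response["SecretString"])


secret = parse_secret(sample_response)
print(sorted(secret))
```

The double decoding is the part that is easy to miss when piping the CLI output into other tools.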

kubernetes-external-secrets

For kubernetes-external-secrets to work properly, it must be granted access to AWS Secrets Manager. To achieve that, we need to create an IAM role for kubernetes-external-secrets' service account.

ref:
https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html

Configure Secrets Backends

Create an IAM OIDC provider for the cluster:

eksctl utils associate-iam-oidc-provider --profile=perp \
--region ap-northeast-1 \
--cluster perp-staging \
--approve

aws iam list-open-id-connect-providers --profile=perp

ref:
https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html

Create an IAM policy that allows the role to access all secrets we created on AWS Secrets Manager:

AWS_ACCOUNT_ID=$(aws sts get-caller-identity --profile=perp --query "Account" --output text)

cat <<EOF > policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:ListSecretVersionIds"
      ],
      "Resource": [
        "arn:aws:secretsmanager:ap-northeast-1:${AWS_ACCOUNT_ID}:secret:*"
      ]
    }
  ]
}
EOF

aws iam create-policy --profile=perp \
--policy-name perp-staging-secrets-policy --policy-document file://policy.json

Attach the above IAM policy to an IAM role, and define AssumeRole for the service account external-secrets-kubernetes-external-secrets which will be created later:

AWS_ACCOUNT_ID=$(aws sts get-caller-identity --profile=perp --query "Account" --output text)
OIDC_PROVIDER=$(aws eks describe-cluster --profile=perp --name perp-staging --region ap-northeast-1 --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///") 

cat <<EOF > trust.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:aud": "sts.amazonaws.com",
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:default:external-secrets-kubernetes-external-secrets"
        }
      }
    }
  ]
}
EOF

aws iam create-role --profile=perp \
--role-name perp-staging-secrets-role \
--assume-role-policy-document file://trust.json

aws iam attach-role-policy --profile=perp \
--role-name perp-staging-secrets-role \
--policy-arn YOUR_POLICY_ARN
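To make the shape of trust.json easier to see, here is a small Python sketch that builds the same document. The function name, the account ID, and the issuer URL are hypothetical placeholders; only the policy structure comes from the heredoc above.

```python
import json


def build_trust_policy(account_id: str, oidc_issuer_url: str,
                       namespace: str, service_account: str) -> dict:
    """Build an IAM trust policy that lets one service account assume a role."""
    # Same effect as the sed call above: drop the leading "https://".
    provider = oidc_issuer_url
    if provider.startswith("https://"):
        provider = provider[len("https://"):]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{provider}:aud": "sts.amazonaws.com",
                        f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    }
                },
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_trust_policy(
    "123456789012",
    "https://oidc.eks.ap-northeast-1.amazonaws.com/id/EXAMPLE",
    "default",
    "external-secrets-kubernetes-external-secrets",
)
print(json.dumps(policy, indent=2))
```

Note that the sub condition pins the role to a single namespace and service account name, so it has to match what the Helm chart creates later.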

ref:
https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html
https://gist.github.com/lukaszbudnik/f1f42bd5a57430e3c25034200ba44c2e

Deploy kubernetes-external-secrets Controller

helm repo add external-secrets https://external-secrets.github.io/kubernetes-external-secrets/

helm install external-secrets \
external-secrets/kubernetes-external-secrets \
--skip-crds \
--set env.AWS_REGION=ap-northeast-1 \
--set securityContext.fsGroup=65534 \
--set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"='YOUR_ROLE_ARN'

helm list

This automatically creates a service account named external-secrets-kubernetes-external-secrets in Kubernetes.

ref:
https://github.com/external-secrets/kubernetes-external-secrets/tree/master/charts/kubernetes-external-secrets

Deploy ExternalSecret

ExternalSecret example-secret will generate a Secret object with the same name. The manifest below defines it together with a Deployment that consumes the generated Secret:

apiVersion: kubernetes-client.io/v1
kind: ExternalSecret
metadata:
  name: example-secret
spec:
  backendType: secretsManager
  region: ap-northeast-1
  dataFrom:
    - YOUR_SECRET
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-app
        image: busybox:latest
        envFrom:
        - secretRef:
            name: example-secret

kubectl get secret example-secret -o jsonpath="{.data.KEY_1}" | base64 --decode
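The base64 --decode step is needed because Kubernetes stores every value in a Secret's data field base64-encoded. A short Python sketch of the same decoding, using a hypothetical data section built from the sample keys above:

```python
import base64

# Hypothetical `.data` section of the generated Secret; each value is
# base64-encoded, as Kubernetes stores it.
secret_data = {
    "KEY_1": base64.b64encode(b"VALUE_1").decode(),
    "KEY_2": base64.b64encode(b"VALUE_2").decode(),
}


def decode_secret_data(data: dict) -> dict:
    """Decode every base64 value of a Secret's data field into text."""
    return {key: base64.b64decode(value).decode() for key, value in data.items()}


decoded = decode_secret_data(secret_data)
print(decoded["KEY_1"])
```

Pods that consume the Secret through envFrom receive the decoded values, so this only matters when inspecting the Secret directly.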

ref:
https://gist.github.com/lukaszbudnik/f1f42bd5a57430e3c25034200ba44c2e

Amazon EKS: Create a Kubernetes cluster via ClusterConfig

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes as a Service on AWS. IMAO, Google Cloud's GKE is still the best choice of managed Kubernetes service if you're not stuck in AWS.

ref:
https://docs.aws.amazon.com/eks/latest/userguide/eks-ug.pdf
https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html

Also see:
https://vinta.ws/code/the-complete-guide-to-google-kubernetes-engine-gke.html

Installation

We need to install some command-line tools: aws, eksctl and kubectl.

brew tap weaveworks/tap
brew install awscli weaveworks/tap/eksctl kubernetes-cli

k9s and fubectl are also recommended; they provide fancy terminal UIs to interact with your Kubernetes clusters.

brew install k9s

curl -LO https://rawgit.com/kubermatic/fubectl/master/fubectl.source
source <path-to>/fubectl.source

ref:
https://github.com/derailed/k9s
https://github.com/kubermatic/fubectl

Create Cluster

We use a ClusterConfig to define our cluster.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: perp-staging
  region: ap-northeast-1
# All workloads in the "fargate" Kubernetes namespace will be scheduled onto Fargate
fargateProfiles:
  - name: fp-default
    selectors:
      - namespace: fargate
# https://eksctl.io/usage/schema/
managedNodeGroups:
  - name: managed-ng-m5-4xlarge
    instanceType: m5.4xlarge
    instancePrefix: m5-4xlarge
    minSize: 1
    maxSize: 5
    desiredCapacity: 3
    volumeSize: 100
    iam:
      withAddonPolicies:
        cloudWatch: true
        albIngress: true
        ebs: true
        efs: true
# Enable envelope encryption for Kubernetes Secrets
secretsEncryption:
  keyARN: "arn:aws:kms:YOUR_KMS_ARN"
# Enable CloudWatch logging for Kubernetes components
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]

Create the cluster.

eksctl create cluster --profile=perp -f clusterconfig.yaml

ref:
https://eksctl.io/usage/creating-and-managing-clusters/
https://github.com/weaveworks/eksctl/tree/master/examples

We can also use the same config file to update our cluster, but not all configurations are supported currently.

eksctl upgrade cluster --profile=perp -f clusterconfig.yaml

Access Cluster

aws eks --profile=perp update-kubeconfig \
--region ap-northeast-1 \
--name perp-staging \
--alias vinta@perp-staging

ref:
https://docs.aws.amazon.com/eks/latest/userguide/create-kubeconfig.html

Delete Cluster

# You might need to manually delete/detach the following resources first:
# Detach non-default policies for FargatePodExecutionRole and NodeInstanceRole
# Fargate Profile
# EC2 Network Interfaces 
# EC2 ALB
eksctl delete cluster --profile=perp \
--region ap-northeast-1 \
--name perp-staging

Then you can delete the CloudFormation stack on AWS Management Console.

Cluster Authentication

kubectl get configmap aws-auth -n kube-system -o yaml

We must copy mapRoles from the above ConfigMap, and add the mapUsers section:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  # NOTE: mapRoles are copied from "kubectl get configmap aws-auth -n kube-system -o yaml"
  mapRoles: |
    - rolearn: YOUR_ARN_FargatePodExecutionRole
      username: system:node:{{SessionName}}
      groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
    - rolearn: YOUR_ARN_NodeInstanceRole
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
  # Only IAM users listed here can access this cluster
  mapUsers: |
    - userarn: YOUR_USER_ARN
      username: YOUR_AWS_USERNAME
      groups:
        - system:masters

kubectl apply -f aws-auth.yaml
kubectl describe configmap -n kube-system aws-auth

ref:
https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html
https://aws.amazon.com/premiumsupport/knowledge-center/amazon-eks-cluster-access/

Setup Container Logging for Fargate Nodes

Create an IAM policy and attach it to the pod execution role specified for your Fargate profile. The --role-name should be the name of FargatePodExecutionRole; you can find it under the "Resources" tab in the CloudFormation stack of your EKS cluster.

curl -so permissions.json https://raw.githubusercontent.com/aws-samples/amazon-eks-fluent-logging-examples/mainline/examples/fargate/cloudwatchlogs/permissions.json

aws iam create-policy --profile=perp \
--policy-name eks-fargate-logging-policy \
--policy-document file://permissions.json

aws iam attach-role-policy --profile=perp \
--policy-arn arn:aws:iam::XXX:policy/eks-fargate-logging-policy \
--role-name eksctl-perp-staging-cluste-FargatePodExecutionRole-XXX

Configure Kubernetes to send container logs on Fargate nodes to CloudWatch via Fluent Bit.

kind: Namespace
apiVersion: v1
metadata:
  name: aws-observability
  labels:
    aws-observability: enabled
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: aws-logging
  namespace: aws-observability
data:
  output.conf: |
    [OUTPUT]
        Name cloudwatch_logs
        Match *
        region ap-northeast-1
        log_group_name /aws/eks/perp-staging/containers
        log_stream_prefix fluent-bit-
        auto_create_group On

kubectl apply -f aws-logs-fargate.yaml

ref:
https://docs.aws.amazon.com/eks/latest/userguide/fargate-logging.html
https://docs.fluentbit.io/manual/pipeline/outputs/cloudwatch

Setup Container Logging for EC2 Nodes (CloudWatch Container Insights)

Deploy Fluent Bit as a DaemonSet to send container logs to CloudWatch Logs.

ClusterName=perp-staging
RegionName=ap-northeast-1
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
[[ ${FluentBitReadFromHead} = 'On' ]] && FluentBitReadFromTail='Off'|| FluentBitReadFromTail='On'
[[ -z ${FluentBitHttpPort} ]] && FluentBitHttpServer='Off' || FluentBitHttpServer='On'
curl -s https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | sed 's/{{cluster_name}}/'${ClusterName}'/;s/{{region_name}}/'${RegionName}'/;s/{{http_server_toggle}}/"'${FluentBitHttpServer}'"/;s/{{http_server_port}}/"'${FluentBitHttpPort}'"/;s/{{read_from_head}}/"'${FluentBitReadFromHead}'"/;s/{{read_from_tail}}/"'${FluentBitReadFromTail}'"/' > aws-logs-ec2.yaml
kubectl apply -f aws-logs-ec2.yaml
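The two conditionals and the long sed expression are dense, so here is a rough Python sketch of the same logic. The function names and the two-line template are hypothetical; only the placeholder names and the On/Off rules are taken from the commands above.

```python
def fluent_bit_settings(http_port: str, read_from_head: str) -> dict:
    """Mirror the two shell conditionals used above."""
    return {
        "http_server_port": http_port,
        # The HTTP server is switched off only when no port is given.
        "http_server_toggle": "Off" if http_port == "" else "On",
        "read_from_head": read_from_head,
        # Reading from tail is the opposite of reading from head.
        "read_from_tail": "Off" if read_from_head == "On" else "On",
    }


def render(template: str, cluster_name: str, region_name: str, settings: dict) -> str:
    """Fill the {{placeholder}} markers in a manifest template."""
    values = {"cluster_name": cluster_name, "region_name": region_name}
    # The sed expressions wrap the four Fluent Bit values in double quotes.
    values.update({key: f'"{value}"' for key, value in settings.items()})
    for name, value in values.items():
        # Unlike sed without the g flag, str.replace substitutes every match.
        template = template.replace("{{" + name + "}}", value)
    return template


settings = fluent_bit_settings("2020", "Off")
# Hypothetical two-line template, standing in for the downloaded manifest.
template = "name: {{cluster_name}}\nhttp: {{http_server_toggle}}\n"
print(render(template, "perp-staging", "ap-northeast-1", settings))
```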

ref:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-quickstart.html

Fix "No Access-Control-Allow-Origin header" for S3 and CloudFront

To avoid the error "No 'Access-Control-Allow-Origin' header is present on the requested resource":

  • Enable CORS on your S3 bucket
  • Forward the appropriate headers on your CloudFront distribution

Enable CORS on S3 Bucket

In S3 -> [your bucket] -> Permissions -> Cross-origin resource sharing (CORS):

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": []
    }
]
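To get a feel for what this configuration permits, here is a simplified Python model of how such rules could be evaluated. This is an assumption-laden sketch, not S3's actual implementation: it only looks at the origin and the method, and it uses fnmatch for wildcard matching, which is more permissive than S3's single "*" wildcard.

```python
from fnmatch import fnmatchcase

# The same rules as the bucket configuration above.
cors_rules = [
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET"],
        "AllowedOrigins": ["*"],
        "ExposeHeaders": [],
    }
]


def is_allowed(rules: list, origin: str, method: str) -> bool:
    """Return True if any rule allows this origin and HTTP method."""
    for rule in rules:
        origin_ok = any(fnmatchcase(origin, pattern) for pattern in rule["AllowedOrigins"])
        method_ok = method in rule["AllowedMethods"]
        if origin_ok and method_ok:
            return True
    return False


# https://example.com is a placeholder origin.
print(is_allowed(cors_rules, "https://example.com", "GET"))
print(is_allowed(cors_rules, "https://example.com", "PUT"))
```

With the rules above, any origin may read objects with GET, while other methods are not covered by any rule.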

ref:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/cors.html
https://docs.aws.amazon.com/AmazonS3/latest/userguide/ManageCorsUsing.html

Configure Behaviors on CloudFront Distribution

In CloudFront -> [your distribution] -> Behaviors -> Create Behavior:

  • Path Pattern: *
  • Allowed HTTP Methods: GET, HEAD, OPTIONS
  • Cached HTTP Methods: +OPTIONS
  • Origin Request Policy: Managed-CORS-S3Origin
    • This policy actually whitelists the following headers:
      • Access-Control-Request-Headers
      • Access-Control-Request-Method
      • Origin

ref:
https://aws.amazon.com/premiumsupport/knowledge-center/no-access-control-allow-origin-error/

Validate it's working:

fetch("https://metadata.perp.exchange/config.production.json")
.then((res) => res.json())
.then((out) => { console.log(out) })
.catch((err) => { throw err });

How HTTPS Works in Layman's Terms - TLS 1.2 and 1.3

HTTPS stands for Hypertext Transfer Protocol Secure, also commonly referred to as HTTP over TLS or HTTP over SSL. HTTPS is not a separate protocol from HTTP; it merely means using SSL/TLS to encrypt HTTP requests and responses.

ref:
https://howhttps.works/

SSL/TLS

Basically, SSL (Secure Sockets Layer) and TLS (Transport Layer Security) are the same thing: TLS is the modern version of the now-deprecated SSL. In most contexts, the two terms are interchangeable.

TLS is a security protocol which mainly performs three tasks:

  • Privacy - encrypting data between client and server using Encryption Algorithms.
  • Authentication - ensuring that server is who it claims to be using Certificates.
  • Integrity - verifying that data has not been forged or tampered with using Message Authentication Code (MAC).

In addition to the use case of HTTPS, TLS can also be used to encrypt other communications such as Email or VoIP.

ref:
https://www.cloudflare.com/learning/ssl/transport-layer-security-tls/

Encryption Algorithms

There are 2 types of encryption algorithms:

  • Symmetric Encryption
    • There is only one key: the client and server use the same key to encrypt and decrypt.
    • Fast and cheap (nanoseconds per operation).
    • A common algorithm is AES.
  • Asymmetric Encryption (also known as Public-key Encryption)
    • There is a pair of two keys: the public key encrypts the message, and only the corresponding private key can decrypt it.
    • Slow and expensive (microseconds to milliseconds per operation).
    • Some common algorithms are RSA and Diffie-Hellman (DH).

TLS actually uses both Asymmetric Encryption and Symmetric Encryption, a so-called hybrid cryptosystem. Simply speaking, TLS first uses an asymmetric algorithm to exchange shared secrets between both sides, then generates a symmetric key (the session key) from the shared secrets, and finally uses the session key to encrypt application data (HTTP request/response). A cryptographic system that involves certificates and public-key encryption is often called Public Key Infrastructure (PKI).

I'm not an expert in Cryptography or Information Security, but I'm going to talk about what RSA and Diffie-Hellman are a little bit since they are crucial to TLS.

ref:
https://www.thesslstore.com/blog/types-of-encryption-encryption-algorithms-how-to-choose-the-right-one/

RSA

There is a pair of two keys in RSA: the public key can be shared publicly, and the private key, as the name suggests, must be kept secret. Data encrypted with the public key can only be decrypted with the private key, and vice versa. Since RSA is relatively slow (key generation in particular), we usually generate one key pair and reuse it for every connection, so RSA keys are considered static.

RSA is also often used for Digital Signatures. In this case, the message is hashed first since RSA operations can't handle messages longer than the key size. The sender generates a signature by signing (encrypting) the hash with its own private key, and sends both the message and the signature to the receiver. The receiver also hashes the message first, verifies (decrypts) the signature with the corresponding public key, and checks whether the decrypted hash equals the hash of the message. If they are equal, the message was indeed sent by the sender because no one else has the private key.
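The sign-then-verify flow can be sketched in a few lines of Python. This is a toy "textbook RSA" illustration only: the tiny key (n = 3233, e = 17, d = 2753) and the trick of reducing the hash modulo n are assumptions made so the numbers stay readable. Real RSA signatures use keys of 2048 bits or more and a padding scheme.

```python
import hashlib

# Toy key pair built from the primes 61 and 53; far too small for real use.
N = 61 * 53   # modulus, part of both keys
E = 17        # public exponent
D = 2753      # private exponent, since (17 * 2753) mod 3120 == 1


def digest(message: bytes) -> int:
    """Hash the message and shrink the hash so it fits the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Sender: "encrypt" the hash with the private key."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Receiver: "decrypt" the signature with the public key and compare hashes."""
    return pow(signature, E, N) == digest(message)


message = b"hello"
signature = sign(message)
print(verify(message, signature))
print(verify(message, (signature + 1) % N))
```

The second call uses a tampered signature, so the decrypted value no longer matches the hash of the message.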

ref:
https://www.comparitech.com/blog/information-security/digital-signatures/

Diffie-Hellman (DH)

There are many variants of Diffie-Hellman, for instance, Diffie-Hellman Ephemeral (DHE), Elliptic Curve Diffie-Hellman (ECDH), and Elliptic Curve Diffie-Hellman Ephemeral (ECDHE).

Let's talk about how DHE works first:

  1. Both client and server agree on a set of DH parameters: g (generator) and p (prime).
    • Instead of being exchanged, g and p are usually predefined in the software that client and server use.
    • These values are public so it's ok that an attacker knows them.
  2. The client and server each generate a random number as the private key, and calculate the public key from DH formula 1: (g^own_private_key) mod p.
    • There are 2 key pairs:
      • Client private key
      • Client public key
      • Server private key
      • Server public key
    • Since the 2 private keys are randomly generated for every connection, this is the "Ephemeral" part of DHE.
  3. The client and server each send their public key to the other side.
  4. The client and server each calculate the same shared secret from DH formula 2: (the_other's_public_key^own_private_key) mod p.
    • Client's shared secret = (server_public_key^client_private_key) mod p.
    • Server's shared secret = (client_public_key^server_private_key) mod p.
    • Magically, client's shared secret == server's shared secret.

Then we can use the shared secret for Symmetric Encryption.
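The four steps can be tried out with a toy Python sketch. The parameters g = 5 and p = 23 are illustrative assumptions chosen so the arithmetic is easy to follow; real deployments use primes of 2048 bits or more (or elliptic curves).

```python
import secrets

# Step 1: public DH parameters, tiny on purpose.
G = 5
P = 23


def generate_private_key() -> int:
    """Step 2a: pick a random private key between 1 and P - 2."""
    return secrets.randbelow(P - 2) + 1


def public_key(private_key: int) -> int:
    """Step 2b, DH formula 1: (g^own_private_key) mod p."""
    return pow(G, private_key, P)


def shared_secret(other_public_key: int, private_key: int) -> int:
    """Step 4, DH formula 2: (the_other's_public_key^own_private_key) mod p."""
    return pow(other_public_key, private_key, P)


client_private = generate_private_key()
server_private = generate_private_key()

# Step 3: only the public keys travel over the network.
client_public = public_key(client_private)
server_public = public_key(server_private)

client_secret = shared_secret(server_public, client_private)
server_secret = shared_secret(client_public, server_private)
print(client_secret == server_secret)
```

Both sides end up with the same number because (g^a)^b mod p equals (g^b)^a mod p, while an eavesdropper only sees g, p, and the two public keys.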

In step 2, if both client and server always use the same private keys for every connection, that is Static DH. With DHE, because the key pairs are temporary, a compromise of one connection's private keys does not jeopardize the privacy of other connections. This is known as Perfect Forward Secrecy (PFS). Moreover, if we replace the DH formulas in steps 2 and 4 with elliptic curve formulas, that is ECDH (or ECDHE when the keys are ephemeral).

ref:
https://www.wst.space/ssl-part-2-diffie-hellman-key-exchange/
https://crypto.stackexchange.com/questions/67797/in-diffie-hellman-are-g-and-p-universal-constants-or-are-they-chosen-by-one

Certificates

To obtain a valid SSL certificate, the server first needs to create a Certificate Signing Request (CSR) file with its private key (an RSA key, for example) and submit it to a Certificate Authority (CA). A CA is an organization, a trusted third party that generates and gives out SSL certificates. The CA signs the certificate with its private key, allowing clients to verify it with the CA's public key. Operating systems and browsers come with the public keys of all of the major CAs pre-installed.

An SSL certificate contains:

  • The domain name and associated subdomains
  • The issuer (CA)
  • The CA's digital signature
  • The expiration date of the certificate
  • The server's public key

What the server presents is actually a chain of multiple certificates (Chain of Trust), usually three or more: the server's certificate, the intermediate CA's certificate, and the root CA's certificate. In a TLS communication, the client first verifies the signature of the server's certificate with the intermediate CA's public key, then checks the signature of the intermediate CA's certificate with the root CA's public key. Finally, the root CA's certificate is inherently trusted by OSs or browsers. If any verification fails or the root certificate is not trusted (for example, a self-signed certificate), the TLS communication terminates.

Clients also check if the certificate is for the correct domain.
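The walk up the chain can be modeled with a small Python sketch. It is deliberately incomplete: it only checks that each certificate names the next one as its issuer and that the root is in a trust store, and it skips the cryptographic signature checks, expiration dates, and domain matching that real clients perform. The Cert class and all names in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject: str  # who the certificate belongs to
    issuer: str   # who signed it


def chain_is_trusted(chain: list, trusted_roots: set) -> bool:
    """Check a chain ordered from the server certificate up to the root."""
    if not chain:
        return False
    for cert, parent in zip(chain, chain[1:]):
        # A real client would verify cert's signature with parent's public key here.
        if cert.issuer != parent.subject:
            return False
    root = chain[-1]
    # The root signs itself and must already be known to the OS or browser.
    return root.issuer == root.subject and root.subject in trusted_roots


chain = [
    Cert(subject="example.com", issuer="Example Intermediate CA"),
    Cert(subject="Example Intermediate CA", issuer="Example Root CA"),
    Cert(subject="Example Root CA", issuer="Example Root CA"),
]
self_signed = [Cert(subject="example.com", issuer="example.com")]
trusted_roots = {"Example Root CA"}

print(chain_is_trusted(chain, trusted_roots))
print(chain_is_trusted(self_signed, trusted_roots))
```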

ref:
https://www.cloudflare.com/learning/ssl/what-is-an-ssl-certificate/
https://security.stackexchange.com/questions/56389/ssl-certificate-framework-101-how-does-the-browser-actually-verify-the-validity

Message Authentication Code (MAC) Algorithms

Simply speaking, the sender calculates a MAC code by doing mac_code = mac_function(key, message), then sends the message along with the MAC code to the receiver. The receiver also calculates a MAC code of the message using the same MAC algorithm and the shared secret key, and checks whether both MAC values are equal. If they are equal, the integrity of the message is confirmed. A MAC code is sometimes called a tag or a cryptographic checksum.

A common MAC algorithm is HMAC.
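A minimal Python sketch of that flow with HMAC-SHA256 from the standard library; the key and messages are made-up placeholders:

```python
import hashlib
import hmac


def make_mac(key: bytes, message: bytes) -> bytes:
    """mac_code = mac_function(key, message), here with HMAC-SHA256."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_mac(key: bytes, message: bytes, mac_code: bytes) -> bool:
    """Recompute the MAC code and compare it in constant time."""
    return hmac.compare_digest(make_mac(key, message), mac_code)


key = b"shared-secret-key"        # placeholder key known to both sides
message = b"transfer 100 coins"   # placeholder message
mac_code = make_mac(key, message)

print(verify_mac(key, message, mac_code))
print(verify_mac(key, b"transfer 900 coins", mac_code))
```

Unlike a digital signature, both sides hold the same key, so a MAC proves integrity between the two parties but cannot prove to a third party which of them wrote the message.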

ref:
https://crypto.stackexchange.com/questions/5646/what-are-the-differences-between-a-digital-signature-a-mac-and-a-hash

TLS Handshake

The TLS handshake is the foundational part of an HTTPS communication which happens after establishing the TCP connection and before the HTTP request/response cycle. The purpose of the TLS handshake is to negotiate a session key used to encrypt/decrypt HTTP data.

Most major browsers dropped support for TLS 1.0 and 1.1 in 2020, and TLS 1.2 and 1.3 are currently the most widely used versions of TLS, so we will focus on these two. TLS 1.2 supports multiple algorithms for key exchange, but TLS 1.3 only uses Diffie-Hellman (RSA key exchange has been completely removed), so we are going to talk about the TLS handshake with RSA in TLS 1.2 and the TLS handshake with Diffie-Hellman in TLS 1.3.

ref:
https://www.cloudflare.com/learning/ssl/what-happens-in-a-tls-handshake/
https://www.thesslstore.com/blog/explaining-ssl-handshake/
https://www.thesslstore.com/blog/cipher-suites-algorithms-security-settings/

TLS Handshake with RSA in TLS 1.2

Before the TLS handshake, both client and server need to establish a TCP connection if HTTP/1.1 or HTTP/2 is used (HTTP/3 would be another story). TCP is bi-directional, and there could be multiple packets sent in one trip.

TLS handshake in TLS 1.2 takes 2 roundtrips:

| Client                                            | Server                                            |
|---------------------------------------------------|---------------------------------------------------|
| -> TCP: SYN                                       |                                                   |
|                                                   | <- TCP: SYN ACK                                   |
| -> TCP: ACK                                       |                                                   |
| -> TLS: Client Hello (plaintext)                  |                                                   |
|                                                   | <- TLS: Server Hello (plaintext)                  |
|                                                   | <- TLS: Server Certificate (plaintext)            |
|                                                   | <- TLS: Server Hello Done (plaintext)             |
| -> TLS: Client Key Exchange (plaintext)           |                                                   |
| -> TLS: Client Change Cipher Spec (plaintext)     |                                                   |
| -> TLS: Client Handshake Finished (encrypted)     |                                                   |
|                                                   | <- TLS: Server Change Cipher Spec (plaintext)     |
|                                                   | <- TLS: Server Handshake Finished (encrypted)     |
| -> HTTP: Request (encrypted)                      |                                                   |
|                                                   | <- HTTP: Response (encrypted)                     |

  • Client Hello
    • Sending the following data:
      • A random "Client Random" string.
      • A list of supported Cipher Suites (a list of cryptographic algorithms).
  • Server Hello
    • Sending the following data:
      • A random "Server Random" string.
      • The selected Cipher Suite.
  • Server Certificate
    • Sending the SSL certificate, which contains the server's public key.
  • Server Hello Done
    • Telling the client that it has sent all of the above messages.
  • Client Key Exchange
    • Sending a random "Pre-master Secret" encrypted with the server's public key.
  • Client Change Cipher Spec
    • Generating the session key from Client Random, Server Random, and Pre-master Secret.
    • Telling the server that it's ready for encrypted communication.
  • Client Handshake Finished
    • Sending verification data encrypted with the session key.
  • Server Change Cipher Spec
    • Generating the session key from Client Random, Server Random, and Pre-master Secret.
    • Telling the client that it's ready for encrypted communication.
  • Server Handshake Finished
    • Sending verification data encrypted with the session key.

After the TLS handshake, both client and server start to encrypt/decrypt application data with the session key.
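How the three values are combined can be sketched in Python. In TLS 1.2, both sides feed the Pre-master Secret, Client Random, and Server Random into a pseudorandom function (PRF) built on HMAC to get a 48-byte master secret; the actual encryption keys are then expanded from it with a further PRF call. The sketch below covers only the master secret step with SHA-256, and the random inputs are placeholders rather than values from a real handshake.

```python
import hashlib
import hmac
import os


def p_sha256(secret: bytes, seed: bytes, length: int) -> bytes:
    """Expand secret and seed into `length` bytes by chaining HMAC-SHA256."""
    output = b""
    a = seed  # A(0) = seed
    while len(output) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()  # A(i) = HMAC(secret, A(i-1))
        output += hmac.new(secret, a + seed, hashlib.sha256).digest()
    return output[:length]


def master_secret(pre_master: bytes, client_random: bytes, server_random: bytes) -> bytes:
    """Derive the 48-byte master secret using the "master secret" label."""
    seed = b"master secret" + client_random + server_random
    return p_sha256(pre_master, seed, 48)


# Placeholder inputs; in a real handshake they come from the messages above.
client_random = os.urandom(32)
server_random = os.urandom(32)
pre_master = os.urandom(48)

# Client and server run the same computation independently.
client_side = master_secret(pre_master, client_random, server_random)
server_side = master_secret(pre_master, client_random, server_random)
print(client_side == server_side)
```

Because the Pre-master Secret is the only secret input, anyone who later obtains the server's RSA private key can decrypt it from a recorded handshake and recompute the keys, which is why this mode lacks forward secrecy.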

For a complete byte-by-byte illustration of TLS 1.2 Handshake, see:
https://tls.ulfheim.net/

ref:
https://medium.com/kuranda-labs-engineering/tls-6d9f75adba9f

TLS Handshake with Elliptic Curve Diffie-Hellman Ephemeral (ECDHE) in TLS 1.3

TLS, like any encrypted communication, adds performance overhead. One of the significant changes of TLS 1.3 is that the TLS handshake only requires one roundtrip instead of 2, which makes TLS 1.3 much faster than older versions.

TLS 1.3 has also reduced the number of supported Cipher Suites from 37 to 5 by removing weak and less-used cryptographic algorithms, and only supports ephemeral Diffie-Hellman (typically ECDHE) for key exchange, which means that the client can send its key share information right away at the beginning of the handshake. In other words, TLS 1.3 merges Client Key Exchange into Client Hello.

TLS handshake in TLS 1.3 takes 1 (and a half actually) roundtrip:

| Client                                            | Server                                            |
|---------------------------------------------------|---------------------------------------------------|
| -> TCP: SYN                                       |                                                   |
|                                                   | <- TCP: SYN ACK                                   |
| -> TCP: ACK                                       |                                                   |
| -> TLS: Client Hello (plaintext)                  |                                                   |
|                                                   | <- TLS: Server Hello (plaintext)                  |
|                                                   | <- TLS: Wrapper (encrypted)                       |
|                                                   |         Server Certificate                        |
|                                                   |         Server Handshake Finished                 |
| -> TLS: Wrapper (encrypted)                       |                                                   |
|         Client Handshake Finished                 |                                                   |
| -> HTTP: Request (encrypted)                      |                                                   |
|                                                   | <- HTTP: Response (encrypted)                     |

  • Client Hello
    • Before Client Hello, calculating an ephemeral key pair for ECDHE key share based on the selected curve.
    • Sending the following data:
      • A random "Client Random" string.
      • A list of supported Cipher Suites.
      • The selected curve (DH parameters: g and p) for ECDHE.
      • The client's ECDHE public key.
  • Server Hello
    • Before Server Hello, calculating an ephemeral key pair for ECDHE key share based on the client's selected curve.
    • Sending the following data:
      • A random "Server Random" string.
      • The selected Cipher Suite.
      • The server's ECDHE public key.
  • Server Wrapper
    • Before Server Wrapper, generating the handshake key from its own ECDHE private key and the client's ECDHE public key.
    • Containing the following records:
      • Server Certificate
      • Server Handshake Finished
      • They are encrypted with the handshake key.
    • After Server Wrapper, generating the session key from Client Random, Server Random, and the handshake key.
  • Client Wrapper
    • Before Client Wrapper, generating the handshake key from its own ECDHE private key and the server's ECDHE public key.
    • Containing the following records:
      • Client Handshake Finished
      • It is encrypted with the handshake key.
    • After Client Wrapper, generating the session key from Client Random, Server Random, and the handshake key.

After the TLS handshake, both client and server start to encrypt/decrypt application data with the session key.
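TLS 1.3 turns the ECDHE shared secret into keys with HKDF, an extract-then-expand key derivation function built on HMAC. The Python sketch below implements generic HKDF with SHA-256 only. It is not the full TLS 1.3 key schedule, which adds its own labels and a hash of the handshake messages; the shared secret and the label strings here are hypothetical placeholders.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes


def hkdf_extract(salt: bytes, input_key_material: bytes) -> bytes:
    """Condense the input (e.g. an ECDHE shared secret) into a fixed-size key."""
    return hmac.new(salt or b"\x00" * HASH_LEN, input_key_material, hashlib.sha256).digest()


def hkdf_expand(pseudo_random_key: bytes, info: bytes, length: int) -> bytes:
    """Stretch the extracted key into `length` bytes bound to a context label."""
    output, block, counter = b"", b"", 1
    while len(output) < length:
        # T(i) = HMAC(PRK, T(i-1) | info | i)
        block = hmac.new(pseudo_random_key, block + info + bytes([counter]), hashlib.sha256).digest()
        output += block
        counter += 1
    return output[:length]


# Placeholder for the secret both sides computed with ECDHE.
ecdhe_shared_secret = bytes(range(32))

handshake_secret = hkdf_extract(b"", ecdhe_shared_secret)
# Hypothetical labels; different labels yield independent keys.
client_key = hkdf_expand(handshake_secret, b"client handshake", 32)
server_key = hkdf_expand(handshake_secret, b"server handshake", 32)
print(client_key != server_key)
```

Deriving separate keys per direction and per phase from one shared secret is what lets the server start encrypting right after Server Hello.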

For a complete byte-by-byte illustration of TLS 1.3 Handshake, see:
https://tls13.ulfheim.net/

ref:
https://www.thesslstore.com/blog/tls-1-3-everything-possibly-needed-know/

The Incomplete Guide to Google Kubernetes Engine

Kubernetes is the de facto standard of container orchestration (deploying workloads on distributed systems). Google Kubernetes Engine (GKE) is the managed Kubernetes as a Service provided by Google Cloud Platform.

Currently, GKE is still your best choice compared to other managed Kubernetes services, e.g., Azure Kubernetes Service (AKS) and Amazon Elastic Kubernetes Service (EKS).

ref:
https://kubernetes.io/
https://cloud.google.com/kubernetes-engine/

You can find the sample project on GitHub.
https://github.com/vinta/simple-project-on-k8s

Installation

Install gcloud to create Kubernetes clusters on Google Cloud Platform.

Install kubectl to interact with any Kubernetes cluster.

$ brew install kubernetes-cli
# or
$ gcloud components install kubectl
$ gcloud components update

ref:
https://cloud.google.com/sdk/docs/
https://kubernetes.io/docs/tasks/tools/install-kubectl/

Some useful tools:

Concepts

Nodes

  • Cluster: A set of machines, called nodes, that run containerized applications.
  • Node: A single virtual or physical machine that provides hardware resources.
  • Edge Node: The node which is exposed to the Internet.
  • Master Node: The node which is responsible for managing the whole cluster.

Objects

  • Pod: A group of tightly related containers. Each Pod is like a logical host that has its own IP, hostname, and storage.
  • PodPreset: A set of pre-defined configurations that can be injected into Pods automatically.
  • Service: A load balancer for a set of Pods selected by labels, also called Service Discovery.
  • Ingress: A reverse proxy that acts as an entry point to the cluster, which allows domain-based and path-based routing to different Services.
  • ConfigMap: Key-value configuration data that can be mounted into containers or consumed as environment variables.
  • Secret: Similar to ConfigMap but for storing sensitive data only.
  • Volume: An ephemeral file system whose lifetime is the same as the Pod.
  • PersistentVolume: A persistent file system that can be mounted to the cluster, without being associated with any particular node.
  • PersistentVolumeClaim: A binding between a Pod and a PersistentVolume.
  • StorageClass: A storage provisioner which allows users to request storage dynamically.
  • Namespace: The way to partition a single cluster into multiple virtual groups.

Controllers

  • ReplicationController: Ensures that a specified number of Pods are always running.
  • ReplicaSet: The next-generation ReplicationController.
  • Deployment: The recommended way to deploy stateless Pods.
  • StatefulSet: Similar to Deployment but provides guarantees about the ordering and unique names of Pods.
  • DaemonSet: Ensures a copy of a Pod is running on every node.
  • Job: Creates Pods that run to completion (exit with 0).
  • CronJob: A Job which can run at a specific time or run regularly.
  • HorizontalPodAutoscaler: Automatically scales the number of Pods based on CPU and memory utilization or custom metric targets.

ref:
https://kubernetes.io/docs/concepts/
https://kubernetes.io/docs/reference/glossary/?all=true

Setup Google Cloud Accounts

Make sure you use the right Google Cloud Platform account.

$ gcloud init
# or
$ gcloud config configurations list
$ gcloud config configurations activate default
$ gcloud config set project simple-project-198818
$ gcloud config set compute/region asia-east1
$ gcloud config set compute/zone asia-east1-a
$ gcloud config list

Create Clusters

Create a regional cluster in asia-east1 region which has 1 node in each of the asia-east1 zones using --region=asia-east1 --num-nodes=1. By default, a cluster only creates its cluster master and nodes in a single compute zone.

# show available OSs and versions of Kubernetes
$ gcloud container get-server-config

# show available CPU platforms in the desired zone
$ gcloud compute zones describe asia-east1-a
availableCpuPlatforms:
- Intel Skylake
- Intel Broadwell
- Intel Haswell
- Intel Ivy Bridge

$ gcloud container clusters create demo \
--cluster-version=1.11.6-gke.6 \
--node-version=1.11.6-gke.6 \
--scopes=gke-default,cloud-platform,storage-full,compute-ro,pubsub,https://www.googleapis.com/auth/cloud_debugger \
--region=asia-east1 \
--num-nodes=1 \
--enable-autoscaling --min-nodes=1 --max-nodes=10 \
--maintenance-window=20:00 \
--machine-type=n1-standard-4 \
--min-cpu-platform="Intel Skylake" \
--enable-ip-alias \
--create-subnetwork="" \
--image-type=UBUNTU \
--node-labels=custom.kubernetes.io/fs-type=xfs

$ gcloud container clusters describe demo --region=asia-east1

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-04T04:48:55Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.5-gke.5", GitCommit:"9aba9c1237d9d2347bef28652b93b1cba3aca6d8", GitTreeState:"clean", BuildDate:"2018-12-11T02:36:50Z", GoVersion:"go1.10.3b4", Compiler:"gc", Platform:"linux/amd64"}

$ kubectl get nodes -o wide

You can only get a regional cluster by creating a whole new cluster; Google currently won't allow you to turn an existing cluster into a regional one.

ref:
https://cloud.google.com/sdk/gcloud/reference/container/clusters/create
https://cloud.google.com/compute/docs/machine-types
https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters
https://cloud.google.com/kubernetes-engine/docs/how-to/min-cpu-platform
https://cloud.google.com/kubernetes-engine/docs/how-to/alias-ips

Google Kubernetes Engine clusters running Kubernetes version 1.8+ enable Role-Based Access Control (RBAC) by default. Therefore, you must explicitly provide the --enable-legacy-authorization option if you want to fall back to legacy authorization (ABAC) instead.

ref:
https://cloud.google.com/kubernetes-engine/docs/how-to/role-based-access-control

Delete the cluster. After you delete the cluster, you might also need to manually delete persistent disks (under Compute Engine), load balancers (under Network services) and static IPs (under VPC network) which belong to the cluster on Google Cloud Platform Console.

$ gcloud container clusters delete demo --region=asia-east1

Create Node Pools

Create a node pool with preemptible VMs, which are much cheaper than regular instances, using --preemptible.

You might receive a "The connection to the server x.x.x.x was refused - did you specify the right host or port?" error while the cluster is being upgraded, which includes adding new node pools.

$ gcloud container node-pools create n1-standard-4-pre \
--cluster=demo \
--node-version=1.11.6-gke.6 \
--scopes=gke-default,storage-full,compute-ro,pubsub,https://www.googleapis.com/auth/cloud_debugger \
--region=asia-east1 \
--num-nodes=1 \
--enable-autoscaling --min-nodes=1 --max-nodes=10 \
--machine-type=n1-standard-4 \
--min-cpu-platform="Intel Skylake" \
--node-labels=custom.kubernetes.io/scopes-storage-full=true \
--enable-autorepair \
--preemptible

$ gcloud container node-pools list --cluster=demo --region=asia-east1

$ gcloud container operations list

ref:
https://cloud.google.com/sdk/gcloud/reference/container/node-pools/create
https://cloud.google.com/kubernetes-engine/docs/concepts/preemptible-vm
https://cloud.google.com/compute/docs/regions-zones/

Build Docker Images

You could use Google Cloud Build or any Continuous Integration (CI) service to automatically build Docker images and push them to Google Container Registry.

Furthermore, you need to tag your Docker images appropriately with the registry name format: region_name.gcr.io/your_project_id/your_image_name:version.

ref:
https://cloud.google.com/container-builder/
https://cloud.google.com/container-registry/

An example of cloudbuild.yaml:

substitutions:
  _REPO_NAME: simple-api
steps:
- id: pull-image
  name: gcr.io/cloud-builders/docker
  entrypoint: "/bin/sh"
  args: [
    "-c",
    "docker pull asia.gcr.io/$PROJECT_ID/$_REPO_NAME:$BRANCH_NAME || true"
  ]
  waitFor: [
    "-"
  ]
- id: build-image
  name: gcr.io/cloud-builders/docker
  args: [
    "build",
    "--cache-from", "asia.gcr.io/$PROJECT_ID/$_REPO_NAME:$BRANCH_NAME",
    "--label", "git.commit=$SHORT_SHA",
    "--label", "git.branch=$BRANCH_NAME",
    "--label", "ci.build-id=$BUILD_ID",
    "-t", "asia.gcr.io/$PROJECT_ID/$_REPO_NAME:$SHORT_SHA",
    "simple-api/"
  ]
  waitFor: [
    "pull-image",
  ]
images:
  - asia.gcr.io/$PROJECT_ID/$_REPO_NAME:$SHORT_SHA

ref:
https://cloud.google.com/container-builder/docs/build-config
https://cloud.google.com/container-builder/docs/create-custom-build-steps

Of course, you could also manually push Docker images to Google Container Registry.

$ gcloud auth configure-docker && \
gcloud config set project simple-project-198818 && \
export PROJECT_ID="$(gcloud config get-value project -q)"

$ docker build --rm -t asia.gcr.io/${PROJECT_ID}/simple-api:v1 simple-api/

$ gcloud docker -- push asia.gcr.io/${PROJECT_ID}/simple-api:v1

$ gcloud container images list --repository=asia.gcr.io/${PROJECT_ID}

ref:
https://cloud.google.com/container-registry/docs/pushing-and-pulling

Moreover, you should always adopt Multi-Stage builds for your Dockerfiles.

FROM python:3.6.8-alpine3.7 AS builder

ENV PATH=$PATH:/root/.local/bin
ENV PIP_DISABLE_PIP_VERSION_CHECK=1

WORKDIR /usr/src/app/

RUN apk add --no-cache --virtual .build-deps \
        build-base \
        linux-headers \
        openssl-dev \
        zlib-dev

COPY requirements.txt .

RUN pip install --user -r requirements.txt && \
    find $(python -m site --user-base) -type f -name "*.pyc" -delete && \
    find $(python -m site --user-base) -type f -name "*.pyo" -delete && \
    find $(python -m site --user-base) -type d -name "__pycache__" -delete

###

FROM python:3.6.8-alpine3.7

ENV PATH=$PATH:/root/.local/bin
ENV FLASK_APP=app.py

WORKDIR /usr/src/app/

RUN apk add --no-cache --virtual .run-deps \
    ca-certificates \
    curl \
    openssl \
    zlib

COPY --from=builder /root/.local/ /root/.local/
COPY . .

EXPOSE 8000

CMD ["uwsgi", "--ini", "config/uwsgi.ini", "--single-interpreter", "--enable-threads", "--http", ":8000"]

ref:
https://medium.com/@tonistiigi/advanced-multi-stage-build-patterns-6f741b852fae

Create Pods

No, you should never create Pods directly (so-called naked Pods). Use a Deployment instead.

ref:
https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/

A Pod goes through the following lifecycle phases:

  • Pending
  • Running
  • Succeeded
  • Failed
  • Unknown

ref:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/

Inspect Pods

Show information about Pods.

$ kubectl get all

$ kubectl get deploy

$ kubectl get pods
$ kubectl get pods -l app=simple-api

$ kubectl describe pod simple-api-5bbf4dd4f9-8b4c9
$ kubectl get pod simple-api-5bbf4dd4f9-8b4c9 -o yaml

ref:
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#describe
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#get

Execute a command in a container.

$ kubectl exec -i -t simple-api-5bbf4dd4f9-8b4c9 -- sh

ref:
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#exec

Tail Pod logs. It is also recommended to use kubetail.

$ kubectl logs simple-api-5bbf4dd4f9-8b4c9 -f
$ kubectl logs deploy/simple-api -f
$ kubectl logs statefulset/mongodb-rs0 -f

$ kubetail simple-api
$ kubetail simple-worker
$ kubetail mongodb-rs0 -c db

ref:
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#logs
https://github.com/johanhaleby/kubetail

List all Pods on a certain node.

$ kubectl describe node gke-demo-default-pool-fb33ac26-frkw
...
Non-terminated Pods:         (7 in total)
  Namespace                  Name                                              CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                              ------------  ----------  ---------------  -------------
  default                    mongodb-rs0-1                                     2100m (53%)   4 (102%)    4G (30%)         4G (30%)
  default                    simple-api-84554476df-w5b5g                       500m (25%)    1 (51%)     1G (16%)         1G (16%)
  default                    simple-worker-6495b6b74b-rqplv                    500m (25%)    1 (51%)     1G (16%)         1G (16%)
  kube-system                fluentd-gcp-v3.0.0-848nq                          100m (2%)     0 (0%)      200Mi (1%)       300Mi (2%)
  kube-system                heapster-v1.5.3-6447d67f78-7psb2                  138m (3%)     138m (3%)   301856Ki (2%)    301856Ki (2%)
  kube-system                kube-dns-788979dc8f-5zvfk                         260m (6%)     0 (0%)      110Mi (0%)       170Mi (1%)
  kube-system                kube-proxy-gke-demo-default-pool-3c058fcf-x7cv    100m (2%)     0 (0%)      0 (0%)           0 (0%)
...

$ kubectl get pods --all-namespaces -o wide --sort-by="{.spec.nodeName}"

Check resource usage.

$ kubectl top pods
$ kubectl top nodes

ref:
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#top
https://kubernetes.io/docs/tasks/debug-application-cluster/

Restart Pods.

# you could simply kill Pods which would restart automatically if your Pods are managed by any Deployment
$ kubectl delete pods -l app=simple-worker

# you could replace a resource by providing a manifest
$ kubectl replace --force -f simple-api/

ref:
https://stackoverflow.com/questions/40259178/how-to-restart-kubernetes-pods

Completely delete resources.

$ kubectl delete -f simple-api/ -R
$ kubectl delete deploy simple-api
$ kubectl delete deploy -l app=simple,role=worker

# delete a Pod forcefully
$ kubectl delete pod simple-api-668d465985-886h5 --grace-period=0 --force
$ kubectl delete deploy simple-api --grace-period=0 --force

# delete all resources under a namespace
$ kubectl delete daemonsets,deployments,services,statefulset,pvc,pv --all --namespace tick

ref:
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#delete

Create ConfigMaps

Create an environment-variable-like ConfigMap.

kind: ConfigMap
apiVersion: v1
metadata:
  name: simple-api
data:
  FLASK_ENV: production
  MONGODB_URL: mongodb://mongodb-rs0-0.mongodb-rs0.default.svc.cluster.local,mongodb-rs0-1.mongodb-rs0.default.svc.cluster.local,mongodb-rs0-2.mongodb-rs0.default.svc.cluster.local/demo?readPreference=secondaryPreferred&maxPoolSize=10
  CACHE_URL: redis://redis-cache.default.svc.cluster.local/0
  CELERY_BROKER_URL: redis://redis-broker.default.svc.cluster.local/0
  CELERY_RESULT_BACKEND: redis://redis-broker.default.svc.cluster.local/1

Load environment variables from a ConfigMap:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: simple-api
  labels:
    app: simple-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-api
  template:
    metadata:
      labels:
        app: simple-api
    spec:
      containers:
      - name: simple-api
        image: asia.gcr.io/simple-project-198818/simple-api:4fc4199
        command: ["uwsgi", "--ini", "config/uwsgi.ini", "--single-interpreter", "--enable-threads", "--http", ":8000"]
        envFrom:
        - configMapRef:
            name: simple-api
        ports:
        - containerPort: 8000

Create a file-like ConfigMap.

kind: ConfigMap
apiVersion: v1
metadata:
  name: redis-cache
data:
  redis.conf: |-
    maxmemory-policy allkeys-lfu
    appendonly no
    save ""

Mount files from a ConfigMap:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: redis-cache
  labels:
    app: redis-cache
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      volumes:
      - name: config
        configMap:
          name: redis-cache
      containers:
      - name: redis
        image: redis:4.0.10-alpine
        command: ["redis-server"]
        args: ["/etc/redis/redis.conf", "--loglevel", "verbose", "--maxmemory", "1g"]
        volumeMounts:
        - name: config
          mountPath: /etc/redis
        ports:
        - containerPort: 6379

ref:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/

Only mount a single file with subPath.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: redis-cache
  labels:
    app: redis-cache
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      volumes:
      - name: config
        configMap:
          name: redis-cache
      containers:
      - name: redis
        image: redis:4.0.10-alpine
        command: ["redis-server"]
        args: ["/etc/redis/redis.conf", "--loglevel", "verbose", "--maxmemory", "1g"]
        volumeMounts:
        - name: config
          mountPath: /etc/redis/redis.conf
          subPath: redis.conf
        ports:
        - containerPort: 6379

ref:
https://github.com/kubernetes/kubernetes/issues/44815#issuecomment-297077509

It is worth noting that changing a ConfigMap or Secret won't trigger a re-deployment of the Deployments that use it. A workaround is to change the name of the ConfigMap every time you change its content, so that the Deployment's Pod template changes as well. If you consume a ConfigMap as environment variables, you must trigger a re-deployment explicitly, because environment variables are only read when a container starts.

ref:
https://github.com/kubernetes/kubernetes/issues/22368
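
The renaming workaround can be automated by deriving the ConfigMap name from its content. The following is a minimal shell sketch, not part of the original setup: configmap_name_with_hash is a hypothetical helper, and the 8-character SHA-256 suffix is an arbitrary choice. It assumes sha256sum (or shasum on macOS) is installed.

```shell
# Hypothetical helper: print "<base-name>-<first 8 hex chars of the file's SHA-256>".
# Any change to the file content yields a different name, so a Deployment
# that references this name gets a new Pod template on the next apply.
configmap_name_with_hash() {
  base_name="$1"
  file="$2"
  if command -v sha256sum >/dev/null 2>&1; then
    hash="$(sha256sum < "$file" | cut -c1-8)"
  else
    # assumption: macOS ships shasum instead of sha256sum
    hash="$(shasum -a 256 < "$file" | cut -c1-8)"
  fi
  printf '%s-%s\n' "$base_name" "$hash"
}
```

For example, configmap_name_with_hash redis-cache redis.conf prints a name of the form redis-cache-<hash>, which you would then use both as the ConfigMap's metadata.name and in the Deployment's volumes[].configMap.name.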

Create Secrets

First of all, Secrets are only base64 encoded, not encrypted.

Encode and decode a Secret value.

$ echo -n 'YOUR_SECRET_KEY' | base64
WU9VUl9TRUNSRVRfS0VZ

$ echo 'WU9VUl9TRUNSRVRfS0VZ' | base64 --decode
YOUR_SECRET_KEY

Create an environment-variable-like Secret.

kind: Secret
apiVersion: v1
metadata:
  name: simple-api
data:
  SECRET_KEY: WU9VUl9TRUNSRVRfS0VZ

Export data (base64-encoded) from a Secret.

$ kubectl get secret simple-project-com --export=true -o yaml

ref:
https://kubernetes.io/docs/concepts/configuration/secret/

Create Deployments With Probes

Deployments are designed for stateless (or nearly stateless) services. A Deployment controls ReplicaSets, and a ReplicaSet controls Pods.

ref:
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/

livenessProbe can be used to determine when an application must be restarted by Kubernetes, while readinessProbe can be used to determine when a container is ready to accept traffic.

ref:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

It is also a best practice to always specify resource requests and limits: resources.requests and resources.limits.

ref:
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/

Create a Deployment with probes.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: simple-api
  labels:
    app: simple-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-api
  template:
    metadata:
      labels:
        app: simple-api
    spec:
      containers:
      - name: simple-api
        image: asia.gcr.io/simple-project-198818/simple-api:4fc4199
        command: ["uwsgi", "--ini", "config/uwsgi.ini", "--single-interpreter", "--enable-threads", "--http", ":8000"]
        envFrom:
        - configMapRef:
            name: simple-api
        ports:
        - containerPort: 8000
        livenessProbe:
          exec:
            command: ["curl", "-fsS", "-m", "0.1", "-H", "User-Agent: KubernetesHealthCheck/1.0", "http://127.0.0.1:8000/health"]
          initialDelaySeconds: 5
          periodSeconds: 1
          successThreshold: 1
          failureThreshold: 5
        readinessProbe:
          exec:
            command: ["curl", "-fsS", "-m", "0.1", "-H", "User-Agent: KubernetesHealthCheck/1.0", "http://127.0.0.1:8000/health"]
          initialDelaySeconds: 3
          periodSeconds: 1
          successThreshold: 1
          failureThreshold: 3
        resources:
          requests:
            cpu: 500m
            memory: 1G
          limits:
            cpu: 1000m
            memory: 1G

Create another Deployment of Celery workers.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: simple-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: simple-worker
  template:
    metadata:
      labels:
        app: simple-worker
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: simple-worker
        image: asia.gcr.io/simple-project-198818/simple-api:4fc4199
        command: ["celery", "-A", "app:celery", "worker", "--without-gossip", "-Ofair", "-l", "info"]
        envFrom:
        - configMapRef:
            name: simple-api
        readinessProbe:
          exec:
            command: ["sh", "-c", "celery inspect -q -A app:celery -d celery@$(hostname) --timeout 10 ping"]
          initialDelaySeconds: 15
          periodSeconds: 15
          timeoutSeconds: 10
          successThreshold: 1
          failureThreshold: 3
        resources:
          requests:
            cpu: 500m
            memory: 1G
          limits:
            cpu: 1000m
            memory: 1G

$ kubectl apply -f simple-api/ -R
$ kubectl get pods

The minimum value of timeoutSeconds is 1, so you might need to use exec.command to run a command that enforces its own shorter timeout, e.g., curl -m 0.1.

ref:
https://cloudplatform.googleblog.com/2018/05/Kubernetes-best-practices-Setting-up-health-checks-with-readiness-and-liveness-probes.html

Create Deployments With InitContainers

If multiple Init Containers are specified for a Pod, those Containers are run one at a time in sequential order. Each must succeed before the next can run. When all of the Init Containers have run to completion, Kubernetes initializes regular containers as usual.

kind: Service
apiVersion: v1
metadata:
  name: gcs-proxy-media-simple-project-com
spec:
  type: NodePort
  selector:
    app: gcs-proxy-media-simple-project-com
  ports:
    - name: http
      port: 80
      targetPort: 80
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: google-cloud-storage-proxy
data:
  nginx.conf: |-
    worker_processes auto;

    http {
      include mime.types;
      default_type application/octet-stream;

      server {
        listen 80;

        if ( $http_user_agent ~* (GoogleHC|KubernetesHealthCheck) ) {
          return 200;
        }

        root /usr/share/nginx/html;
        open_file_cache max=10000 inactive=10m;
        open_file_cache_valid 1m;
        open_file_cache_min_uses 1;
        open_file_cache_errors on;

        include /etc/nginx/conf.d/*.conf;
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gcs-proxy-media-simple-project-com
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gcs-proxy-media-simple-project-com
  template:
    metadata:
      labels:
        app: gcs-proxy-media-simple-project-com
    spec:
      volumes:
      - name: nginx-config
        configMap:
          name: google-cloud-storage-proxy
      - name: nginx-config-extra
        emptyDir: {}
      initContainers:
      - name: create-robots-txt
        image: busybox
        command: ["sh", "-c"]
        args:
        - |
            set -euo pipefail
            cat << 'EOF' > /etc/nginx/conf.d/robots.txt
            User-agent: *
            Disallow: /
            EOF
        volumeMounts:
        - name: nginx-config-extra
          mountPath: /etc/nginx/conf.d/
      - name: create-nginx-extra-conf
        image: busybox
        command: ["sh", "-c"]
        args:
        - |
            set -euo pipefail
            cat << 'EOF' > /etc/nginx/conf.d/extra.conf
            location /robots.txt {
              alias /etc/nginx/conf.d/robots.txt;
            }
            EOF
        volumeMounts:
        - name: nginx-config-extra
          mountPath: /etc/nginx/conf.d/
      containers:
      - name: http
        image: swaglive/openresty:gcsfuse
        imagePullPolicy: Always
        args: ["nginx", "-c", "/usr/local/openresty/nginx/conf/nginx.conf", "-g", "daemon off;"]
        ports:
        - containerPort: 80
        securityContext:
          privileged: true
          capabilities:
            add: ["SYS_ADMIN"]
        env:
          - name: GCSFUSE_OPTIONS
            value: "--debug_gcs --implicit-dirs --stat-cache-ttl 1s --type-cache-ttl 24h --limit-bytes-per-sec -1 --limit-ops-per-sec -1 -o ro,allow_other"
          - name: GOOGLE_CLOUD_STORAGE_BUCKET
            value: asia.contents.simple-project.com
        volumeMounts:
        - name: nginx-config
          mountPath: /usr/local/openresty/nginx/conf/nginx.conf
          subPath: nginx.conf
          readOnly: true
        - name: nginx-config-extra
          mountPath: /etc/nginx/conf.d/
          readOnly: true
        readinessProbe:
          httpGet:
            port: 80
            path: /
            httpHeaders:
            - name: User-Agent
              value: "KubernetesHealthCheck/1.0"
          timeoutSeconds: 1
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 1
          successThreshold: 1
        resources:
          requests:
            cpu: 0m
            memory: 500Mi
          limits:
            cpu: 1000m
            memory: 500Mi

$ kubectl exec -i -t simple-api-5968cfc48d-8g755 -- sh
> curl http://gcs-proxy-media-simple-project-com/robots.txt
User-agent: *
Disallow: /

ref:
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340

Create Deployments With Canary Deployment

TODO

ref:
https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#canary-deployments
https://medium.com/google-cloud/kubernetes-canary-deployments-for-mere-mortals-13728ce032fe

Rollback A Deployment

Yes, you could publish a deployment with kubectl apply --record and roll it back with kubectl rollout undo. However, the simplest way might be to just git checkout the previous commit and deploy again with kubectl apply.

The formal way.

$ kubectl apply -f simple-api/ -R --record
$ kubectl rollout history deploy/simple-api
$ kubectl rollout undo deploy/simple-api --to-revision=2

The git way.

$ git checkout b7ed8d5
$ kubectl apply -f simple-api/ -R
$ kubectl get pods

ref:
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-back-a-deployment

Scale A Deployment

Simply increase the number of spec.replicas and deploy again.

$ kubectl apply -f simple-api/ -R
# or
$ kubectl scale --replicas=10 deploy/simple-api

$ kubectl get pods

ref:
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#scale
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#scaling-a-deployment

Create HorizontalPodAutoscalers (HPA)

The Horizontal Pod Autoscaler automatically scales the number of Pods in a Deployment based on observed CPU utilization, memory usage, or custom metrics. HPA only applies to scalable controllers such as Deployments, ReplicaSets, and ReplicationControllers; it does not apply to objects that cannot be scaled, e.g., DaemonSets.

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: simple-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: simple-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80
  - type: Resource
    resource:
      name: memory
      targetAverageValue: 800M
---
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: simple-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: simple-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80
  - type: Resource
    resource:
      name: memory
      targetAverageValue: 500M

$ kubectl apply -f simple-api/hpa.yaml

$ kubectl get hpa --watch
NAME            REFERENCE                  TARGETS                   MINPODS   MAXPODS   REPLICAS   AGE
simple-api      Deployment/simple-api      18685952/800M, 4%/80%     2         20        3          10m
simple-worker   Deployment/simple-worker   122834944/500M, 11%/80%   2         10        3          10m

ref:
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
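
As a rough guide to how the autoscaler picks a replica count, the Kubernetes documentation describes the core calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with the result kept between minReplicas and maxReplicas. The sketch below reproduces only that arithmetic with integer inputs. hpa_desired_replicas is a hypothetical helper, and the real controller also applies a tolerance and scale-down delays that are ignored here.

```shell
# Hypothetical helper: approximate the HPA replica calculation.
# Arguments: current replicas, current metric value, target metric value,
#            minReplicas, maxReplicas (all integers, e.g., CPU utilization in %).
hpa_desired_replicas() {
  current_replicas="$1"
  current_metric="$2"
  target_metric="$3"
  min_replicas="$4"
  max_replicas="$5"

  # integer ceiling of current_replicas * current_metric / target_metric
  desired=$(( (current_replicas * current_metric + target_metric - 1) / target_metric ))

  # clamp the result to the [minReplicas, maxReplicas] range
  if [ "$desired" -lt "$min_replicas" ]; then desired="$min_replicas"; fi
  if [ "$desired" -gt "$max_replicas" ]; then desired="$max_replicas"; fi

  echo "$desired"
}
```

For example, hpa_desired_replicas 4 120 80 2 20 prints 6: four replicas running at 120% CPU against an 80% target call for six replicas.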

You could run some load testing.

ref:
https://medium.com/@jonbcampos/kubernetes-horizontal-pod-scaling-190e95c258f5

There is also Cluster Autoscaler in Google Kubernetes Engine.

$ gcloud container clusters update demo \
--enable-autoscaling --min-nodes=1 --max-nodes=10 \
--node-pool=default-pool

ref:
https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler

Create VerticalPodAutoscalers (VPA)

TODO

ref:
https://medium.com/@Mohamed.ahmed/kubernetes-autoscaling-101-cluster-autoscaler-horizontal-pod-autoscaler-and-vertical-pod-2a441d9ad231

Create PodDisruptionBudget (PDB)

  • Voluntary disruptions: actions initiated by application owners or admins.
  • Involuntary disruptions: unavoidable cases like hardware failures or system software error.

PodDisruptionBudgets only constrain voluntary disruptions; an involuntary disruption such as a hardware failure does not consult the PodDisruptionBudget. A PDB cannot prevent involuntary disruptions from occurring, but they do count against the budget.

Create a PodDisruptionBudget for a stateless application.

kind: PodDisruptionBudget
apiVersion: policy/v1beta1
metadata:
  name: simple-api
spec:
  minAvailable: 90%
  selector:
    matchLabels:
      app: simple-api

Create a PodDisruptionBudget for a multiple-instance stateful application.

kind: PodDisruptionBudget
apiVersion: policy/v1beta1
metadata:
  name: mongodb-rs0
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: mongodb-rs0

$ kubectl apply -f simple-api/pdb.yaml
$ kubectl apply -f mongodb/pdb.yaml

$ kubectl get pdb
NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
mongodb-rs0   2               N/A               1                     48m
simple-api    90%             N/A               0                     48m

ref:
https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
https://kubernetes.io/docs/tasks/run-application/configure-pdb/
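
The ALLOWED DISRUPTIONS column can be approximated by hand: it is the number of currently healthy Pods minus the number that minAvailable requires, never going below zero, and a percentage is rounded up to the next whole Pod. A minimal sketch of that arithmetic follows; pdb_allowed_disruptions is a hypothetical helper, not a kubectl feature.

```shell
# Hypothetical helper: estimate the allowed disruptions of a PDB.
# Arguments: healthy Pods, expected Pods (replicas), minAvailable ("2" or "90%").
pdb_allowed_disruptions() {
  healthy="$1"
  expected="$2"
  min_available="$3"

  case "$min_available" in
    *%)
      # percentages are resolved against the expected count and rounded up
      pct="${min_available%\%}"
      desired_healthy=$(( (expected * pct + 99) / 100 ))
      ;;
    *)
      desired_healthy="$min_available"
      ;;
  esac

  allowed=$(( healthy - desired_healthy ))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}
```

With three healthy replicas, pdb_allowed_disruptions 3 3 90% prints 0 and pdb_allowed_disruptions 3 3 2 prints 1. That would match the kubectl get pdb output above if both workloads were running three Pods, which is an assumption since the replica counts are not shown there.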

Actually, you can also get similar guarantees during rolling updates by using .spec.strategy.rollingUpdate.

  • maxUnavailable: The maximum number of Pods that can be unavailable during the update process.
  • maxSurge: The maximum number of Pods that can be created over the desired number of Pods.

This ensures that total ready Pods >= total desired Pods - maxUnavailable and total Pods <= total desired Pods + maxSurge.

ref:
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#writing-a-deployment-spec
https://cloud.google.com/kubernetes-engine/docs/how-to/updating-apps
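
To make the two inequalities concrete, the sketch below computes both bounds for a given replica count. Both settings accept an absolute number or a percentage; per the Kubernetes documentation, a maxUnavailable percentage is rounded down and a maxSurge percentage is rounded up. resolve_count and rolling_update_bounds are hypothetical helpers written for this illustration.

```shell
# Hypothetical helper: resolve an absolute number or a percentage of "total".
# mode is "up" or "down" and only matters for percentages.
resolve_count() {
  value="$1"
  total="$2"
  mode="$3"
  case "$value" in
    *%)
      pct="${value%\%}"
      if [ "$mode" = "up" ]; then
        echo $(( (total * pct + 99) / 100 ))
      else
        echo $(( total * pct / 100 ))
      fi
      ;;
    *)
      echo "$value"
      ;;
  esac
}

# Arguments: desired replicas, maxUnavailable, maxSurge.
rolling_update_bounds() {
  desired="$1"
  unavailable="$(resolve_count "$2" "$desired" down)"
  surge="$(resolve_count "$3" "$desired" up)"
  echo "min_ready=$(( desired - unavailable )) max_total=$(( desired + surge ))"
}
```

For example, rolling_update_bounds 10 25% 25% prints min_ready=8 max_total=13: during the update at least 8 Pods stay ready and at most 13 Pods exist at once.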

Create Services

A Service is basically a load balancer for a set of Pods which are selected by labels. Since you can't rely on a Pod's IP, which changes every time the Pod is destroyed and recreated, you should always provide a Service as the entry point for your Pods (your so-called microservice).

Typically, containers you run in the cluster are not accessible from the Internet, because they do not have external IP addresses. You must explicitly expose your application by creating a Service or an Ingress.

There are the following Service types:

  • ClusterIP: A virtual IP which is only reachable from within the cluster. Also, the default Service type.
  • NodePort: It opens a specific port on all Nodes, and any traffic sent to the specific port on any node is forwarded to the Service.
  • LoadBalancer: It builds on NodePorts by additionally configuring the cloud provider to create an external load balancer.
  • ExternalName: It maps the Service to an external DNS name via a CNAME record, e.g., your MySQL RDS on AWS.

Create a Service.

kind: Service
apiVersion: v1
metadata:
  name: simple-api
spec:
  type: NodePort
  selector:
    app: simple-api
  ports:
    - name: http
      port: 80
      targetPort: 8000

type: NodePort is enough in most cases. spec.selector must match the labels defined in the Pod template of the corresponding Deployment; likewise, spec.ports.targetPort and spec.ports.protocol must match the port and protocol exposed by the container.

$ kubectl apply -f simple-api/ -R

$ kubectl get svc,endpoints

$ kubespy trace service simple-api
[ADDED v1/Service] default/simple-api
[ADDED v1/Endpoints] default/simple-api
    Directs traffic to the following live Pods:
        - [Ready] simple-api-6b4b4c4bfb-g5dln @ 10.28.1.42
        - [Ready] simple-api-6b4b4c4bfb-h66dg @ 10.28.8.24

ref:
https://kubernetes.io/docs/concepts/services-networking/service/
https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0

After a Service is created, kube-dns creates a corresponding DNS A record named your-service.your-namespace.svc.cluster.local which resolves to an internal IP in the cluster. In this case: simple-api.default.svc.cluster.local. Headless Services (without a cluster IP) are also assigned a DNS A record of the same form. Unlike normal Services, this A record resolves directly to the set of IPs of the Pods selected by the Service. Clients are expected to consume the whole set of IPs or use round-robin selection from the set.

You should always prefer DNS names of a Service over injected environment variables, e.g., FOO_SERVICE_HOST and FOO_SERVICE_PORT.

ref:
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
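
The naming scheme is regular enough to generate in scripts. The sketch below builds the Service name described above, plus the per-Pod name that a StatefulSet member gets behind its headless Service (the form used in MONGODB_URL earlier). service_fqdn and pod_fqdn are hypothetical helpers, and cluster.local is assumed as the cluster domain; it can be configured differently.

```shell
# Hypothetical helper: build the DNS name of a Service.
# Arguments: service name, namespace (default: default),
#            cluster domain (default: cluster.local, an assumption).
service_fqdn() {
  service="$1"
  namespace="${2:-default}"
  cluster_domain="${3:-cluster.local}"
  printf '%s.%s.svc.%s\n' "$service" "$namespace" "$cluster_domain"
}

# Hypothetical helper: build the DNS name of a single Pod behind a headless Service.
# Arguments: pod name, followed by the same arguments as service_fqdn.
pod_fqdn() {
  pod="$1"
  shift
  printf '%s.%s\n' "$pod" "$(service_fqdn "$@")"
}
```

For example, service_fqdn simple-api prints simple-api.default.svc.cluster.local, and pod_fqdn mongodb-rs0-0 mongodb-rs0 prints mongodb-rs0-0.mongodb-rs0.default.svc.cluster.local.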

For more detail about Kubernetes networking, go to:
https://github.com/hackstoic/kubernetes_practice/blob/master/%E7%BD%91%E7%BB%9C.md
https://containerops.org/2017/01/30/kubernetes-services-and-ingress-under-x-ray/
https://www.safaribooksonline.com/library/view/kubernetes-up-and/9781491935668/ch07.html

Configure Services With Google Cloud CDN

kind: BackendConfig
apiVersion: cloud.google.com/v1beta1
metadata:
  name: cdn
spec:
  cdn:
    enabled: true
    cachePolicy:
      includeHost: false
      includeProtocol: false
      includeQueryString: false
---
kind: Service
apiVersion: v1
metadata:
  name: gcs-proxy-media-simple-project-com
  annotations:
    beta.cloud.google.com/backend-config: '{"ports": {"http":"cdn"}}'
    cloud.google.com/neg: '{"ingress": true}'
spec:
  selector:
    app: gcs-proxy-media-simple-project-com
  ports:
    - name: http
      port: 80
      targetPort: 80

ref:
https://cloud.google.com/kubernetes-engine/docs/concepts/backendconfig

Configure Services With Network Endpoint Groups (NEGs)

To use container-native load balancing, you must create a cluster with the --enable-ip-alias flag and add an annotation to your Services. However, the load balancer is not created until you create an Ingress for the Service.

kind: Service
apiVersion: v1
metadata:
  name: simple-api
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
spec:
  selector:
    app: simple-api
  ports:
    - name: http
      port: 80
      targetPort: 8000

ref:
https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing

Create An Internal Load Balancer

ref:
https://medium.com/@johnjjung/creating-an-inter-kubernetes-cluster-services-using-an-internal-loadbalancer-137f768bb3fc

Use Port Forwarding

Access a Service or a Pod on your local machine with port forwarding.

# 8080 is the local port and 80 is the remote port
$ kubectl port-forward svc/simple-api 8080:80

# port forward to a Pod directly
$ kubectl port-forward mongo-rs0-0 27017:27017

$ open http://127.0.0.1:8080/

ref:
https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/

Create An Ingress

Pods in Kubernetes are not reachable from outside the cluster, so you need a way to expose your Pods to the Internet. Even though you could associate Pods with a Service of the right type, i.e., NodePort or LoadBalancer, the recommended way to expose services is using Ingress. You can do a lot of different things with an Ingress, and there are many types of Ingress controllers that have different capabilities.

There are some reasons to choose Ingress over Service:

  • A Service is an internal load balancer, and an Ingress is a gateway for external access to Services
  • A Service is an L4 load balancer, and an Ingress is an L7 load balancer
  • Ingress allows domain-based and path-based routing to different Services
  • It is not efficient to create a cloud provider's load balancer for each Service you want to expose

Create an Ingress which is implemented using Google Cloud Load Balancing (L7 HTTP load balancer). You should make sure Services exist before creating the Ingress.

kind: Ingress
apiVersion: extensions/v1beta1
metadata:
  name: simple-project
  annotations:
    kubernetes.io/ingress.class: "gce"
    # kubernetes.io/tls-acme: "true"
    # ingress.kubernetes.io/ssl-redirect: "true"
spec:
  # tls:
  # - secretName: simple-project-com-tls
  #   hosts:
  #   - simple-project.com
  #   - www.simple-project.com
  #   - api.simple-project.com
  rules:
  - host: simple-project.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: simple-frontend
          servicePort: 80
  - host: www.simple-project.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: simple-frontend
          servicePort: 80
  - host: api.simple-project.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: simple-api
          servicePort: 80
  - host: asia.contents.simple-project.com
    http:
      paths:
      - path: /*
        backend:
          serviceName: gcs-proxy-media-simple-project-com
          servicePort: 80
  backend:
    serviceName: simple-api
    servicePort: 80

It might take several minutes to spin up a Google HTTP load balancer (including acquiring the public IP), and at least 5 minutes before the GCE API starts health-checking backends. After getting your public IP, you could go to your domain provider and create new DNS records which point to the IP.

$ kubectl apply -f ingress.yaml

$ kubectl describe ing simple-project

ref:
https://kubernetes.io/docs/concepts/services-networking/ingress/
https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/
https://www.joyfulbikeshedding.com/blog/2018-03-26-studying-the-kubernetes-ingress-system.html

To read more about Google Load balancer, go to:
https://cloud.google.com/kubernetes-engine/docs/tutorials/http-balancer
https://cloud.google.com/compute/docs/load-balancing/http/backend-service

Setup The Ingress With TLS Certificates

To automatically create HTTPS certificates for your domains:

Create Ingress Controllers

Kubernetes supports multiple Ingress controllers:

ref:
https://container-solutions.com/production-ready-ingress-kubernetes/

Create StorageClasses

StorageClass provides a way to define different available storage types, for instance, ext4 SSD, XFS SSD, CephFS, NFS. You could specify what you want in PersistentVolumeClaim or StatefulSet.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ssd-xfs
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  fsType: xfs
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ssd-regional
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  zones: asia-east1-a, asia-east1-b, asia-east1-c
  replication-type: regional-pd

$ kubectl apply -f storageclass.yaml
$ kubectl get sc
NAME                 PROVISIONER            AGE
ssd                  kubernetes.io/gce-pd   5s
ssd-regional         kubernetes.io/gce-pd   4s
ssd-xfs              kubernetes.io/gce-pd   3s
standard (default)   kubernetes.io/gce-pd   1h

ref:
https://kubernetes.io/docs/concepts/storage/storage-classes/#gce

Create PersistentVolumeClaims

A Volume is just a directory which you could mount into containers, and it is shared by all containers inside the same Pod. It also has an explicit lifetime: the same as the Pod that encloses it. A Volume can come from various sources: a remote Git repo, a file path on the host machine, a folder from a PersistentVolumeClaim, or data from a ConfigMap or a Secret.

PersistentVolumes are used to manage durable storage in a cluster. Unlike Volumes, PersistentVolumes have a lifecycle independent of any individual Pod. On Google Kubernetes Engine, PersistentVolumes are typically backed by Google Compute Engine Persistent Disks. Typically, you don't have to create PersistentVolumes explicitly. In Kubernetes 1.6 and later versions, you only need to create PersistentVolumeClaim, and the corresponding PersistentVolume would be dynamically provisioned with StorageClasses. Pods use PersistentVolumeClaims as Volumes.

Be careful when creating a Deployment with a PersistentVolumeClaim. In most cases, you do not want multiple replicas of a Deployment writing data into the same PersistentVolumeClaim.
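
A minimal sketch of a standalone PersistentVolumeClaim, written to a file in the same way as the other manifests. It assumes the ssd StorageClass defined earlier; the claim name simple-data and the 10Gi size are hypothetical values chosen for illustration.

```shell
# Write a PersistentVolumeClaim manifest; name and size are hypothetical.
cat <<EOF > pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: simple-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ssd
  resources:
    requests:
      storage: 10Gi
EOF

# Against a real cluster you would then run (not executed here):
# kubectl apply -f pvc.yaml
# kubectl get pvc simple-data
```

With ReadWriteOnce, the volume can only be mounted read-write by a single node, which is another reason to avoid sharing one claim across several Deployment replicas.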

ref:
https://kubernetes.io/docs/concepts/storage/volumes/
https://kubernetes.io/docs/concepts/storage/persistent-volumes/
https://cloud.google.com/kubernetes-engine/docs/concepts/persistent-volumes

Also, IOPS depends on the disk size and the node size. You need to claim a large disk if you want high IOPS, even if you use very little of its capacity.
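
To make the sizing trade-off concrete, here is a back-of-the-envelope sketch. The rate of 30 IOPS per GB for pd-ssd is an assumption, not a figure from this article, and per-node caps are ignored; check the performance page linked below for current numbers. The target of 6000 IOPS is hypothetical.

```shell
# ASSUMPTION: pd-ssd delivers 30 IOPS per GB; verify on the linked performance page.
iops_per_gb=30
# Hypothetical target for illustration.
target_iops=6000

# Round up so the claimed disk is never smaller than needed.
required_gb=$(( (target_iops + iops_per_gb - 1) / iops_per_gb ))

echo "claim at least ${required_gb}G to reach ${target_iops} IOPS"
```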

ref:
https://cloud.google.com/compute/docs/disks/performance

On Kubernetes v1.10+, it is possible to create local PersistentVolumes for your StatefulSets. Previously, PersistentVolumes only supported remote volume types, for instance, GCE's Persistent Disk and AWS's EBS. However, using local storage ties your applications to that specific node, making your application harder to schedule.

ref:
https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/

Create A StatefulSet

Pods created under a StatefulSet have a few unique attributes: the name of each Pod is not random; instead, each Pod gets an ordinal name. In addition, Pods are created one at a time instead of all at once, which can help when bootstrapping a stateful system. StatefulSet also deletes/updates one Pod at a time, in reverse order with respect to its ordinal index, and it waits for each to be completely shut down before deleting the next.
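
The ordering can be sketched as follows, using the mongodb-rs0 name and three replicas from the example that follows. This only prints the expected order under the default ordered Pod management; it does not talk to a cluster.

```shell
# Name and replica count match the mongodb-rs0 StatefulSet example.
statefulset="mongodb-rs0"
replicas=3

# Creation order: ordinals ascend from 0.
create_order=""
i=0
while [ "$i" -lt "$replicas" ]; do
  create_order="${create_order}${statefulset}-${i} "
  i=$((i + 1))
done

# Termination order: ordinals descend from replicas-1.
delete_order=""
i=$((replicas - 1))
while [ "$i" -ge 0 ]; do
  delete_order="${delete_order}${statefulset}-${i} "
  i=$((i - 1))
done

echo "create: ${create_order}"
echo "delete: ${delete_order}"
```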

Rule of thumb: once you find out that you need PersistentVolume for the component, you might just consider using StatefulSet.

ref:
https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/
https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
https://akomljen.com/kubernetes-persistent-volumes-with-deployment-and-statefulset/

Create a StatefulSet of a three-node MongoDB replica set.

kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: default-view
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - kind: ServiceAccount
    name: default
    namespace: default
---
kind: Service
apiVersion: v1
metadata:
  name: mongodb-rs0
spec:
  clusterIP: None
  selector:
    app: mongodb-rs0
  ports:
    - port: 27017
      targetPort: 27017
---
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: mongodb-rs0
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  serviceName: mongodb-rs0
  selector:
    matchLabels:
      app: mongodb-rs0
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: ssd-xfs
      resources:
        requests:
          storage: 100G
  template:
    metadata:
      labels:
        app: mongodb-rs0
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: custom.kubernetes.io/fs-type
                operator: In
                values:
                - "xfs"
              - key: cloud.google.com/gke-preemptible
                operator: NotIn
                values:
                - "true"
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: "kubernetes.io/hostname"
              labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                    - mongodb-rs0
      terminationGracePeriodSeconds: 10
      containers:
      - name: db
        image: mongo:3.6.5
        command: ["mongod"]
        args: ["--bind_ip_all", "--replSet", "rs0"]
        ports:
        - containerPort: 27017
        volumeMounts:
        - name: data
          mountPath: /data/db
        readinessProbe:
          exec:
            command: ["mongo", "--eval", "db.adminCommand('ping')"]
        resources:
          requests:
            cpu: 2
            memory: 4G
          limits:
            cpu: 4
            memory: 4G
      - name: sidecar
        image: cvallance/mongo-k8s-sidecar
        env:
          - name: MONGO_SIDECAR_POD_LABELS
            value: app=mongodb-rs0
          - name: KUBE_NAMESPACE
            value: default
          - name: KUBERNETES_MONGO_SERVICE_NAME
            value: mongodb-rs0

$ kubectl apply -f storageclass.yaml
$ kubectl apply -f mongodb/ -R

$ kubectl get pods

$ kubetail mongodb -c db
$ kubetail mongodb -c sidecar

$ kubectl scale statefulset mongodb-rs0 --replicas=4

The purpose of cvallance/mongo-k8s-sidecar is to automatically add new Pods to the replica set, and remove Pods from it, as you scale the MongoDB StatefulSet up or down.

ref:
https://github.com/cvallance/mongo-k8s-sidecar
https://kubernetes.io/blog/2017/01/running-mongodb-on-kubernetes-with-statefulsets/
https://medium.com/@thakur.vaibhav23/scaling-mongodb-on-kubernetes-32e446c16b82

Create A Headless Service For A StatefulSet

Headless Services (clusterIP: None) are just like normal Kubernetes Services, except they don’t do any load balancing for you. For a typical StatefulSet component, for instance, a database with Master-Slave replication, you don't want Kubernetes load balancing in order to prevent writing data to slaves accidentally.

When headless Services combine with StatefulSets, they can give you unique DNS addresses which return A records that point directly to Pods themselves. DNS names are in the format of static-pod-name.headless-service-name.namespace.svc.cluster.local.
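
As a small sketch of that naming format, the loop below prints the per-Pod DNS names for the redis-broker StatefulSet in the default namespace. It assumes two replicas and the default cluster domain cluster.local; the variable names are illustrative.

```shell
# Values follow the redis-broker example; replicas=2 is assumed for illustration.
statefulset="redis-broker"
headless_service="redis-broker"
namespace="default"
replicas=2

pod_dns_names=""
i=0
while [ "$i" -lt "$replicas" ]; do
  # Pod names are <statefulset>-<ordinal>, with ordinals starting at 0.
  name="${statefulset}-${i}.${headless_service}.${namespace}.svc.cluster.local"
  pod_dns_names="${pod_dns_names}${name}
"
  i=$((i + 1))
done

printf '%s' "$pod_dns_names"
```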

kind: Service
apiVersion: v1
metadata:
  name: redis-broker
spec:
  clusterIP: None
  selector:
    app: redis-broker
  ports:
  - port: 6379
    targetPort: 6379
---
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: redis-broker
spec:
  replicas: 1
  serviceName: redis-broker
  selector:
    matchLabels:
      app: redis-broker
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: ssd
      resources:
        requests:
          storage: 32Gi
  template:
    metadata:
      labels:
        app: redis-broker
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-preemptible
                operator: NotIn
                values:
                - "true"
      volumes:
      - name: config
        configMap:
          name: redis-broker
      containers:
      - name: redis
        image: redis:4.0.10-alpine
        command: ["redis-server"]
        args: ["/etc/redis/redis.conf", "--loglevel", "verbose", "--maxmemory", "1g"]
        ports:
        - containerPort: 6379
        volumeMounts:
        - name: data
          mountPath: /data
        - name: config
          mountPath: /etc/redis
        readinessProbe:
          exec:
            command: ["sh", "-c", "redis-cli -h $(hostname) ping"]
          initialDelaySeconds: 5
          timeoutSeconds: 1
          periodSeconds: 1
          successThreshold: 1
          failureThreshold: 3
        resources:
          requests:
            cpu: 250m
            memory: 1G
          limits:
            cpu: 1000m
            memory: 1G

If redis-broker has 2 replicas, nslookup redis-broker.default.svc.cluster.local returns multiple A records. Returning multiple A records for a single DNS lookup is commonly known as round-robin DNS.

$ kubectl run -i -t --image busybox dns-test --restart=Never --rm /bin/sh

> nslookup redis-broker.default.svc.cluster.local
Server: 10.63.240.10
Address 1: 10.63.240.10 kube-dns.kube-system.svc.cluster.local
Name: redis-broker.default.svc.cluster.local
Address 1: 10.60.6.2 redis-broker-0.redis-broker.default.svc.cluster.local
Address 2: 10.60.6.7 redis-broker-1.redis-broker.default.svc.cluster.local

> nslookup redis-broker-0.redis-broker.default.svc.cluster.local
Server: 10.63.240.10
Address 1: 10.63.240.10 kube-dns.kube-system.svc.cluster.local
Name: redis-broker-0.redis-broker.default
Address 1: 10.60.6.2 redis-broker-0.redis-broker.default.svc.cluster.local

ref:
https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#services
https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#using-stable-network-identities

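Moreover, there is no port re-mapping for a headless Service because its DNS name resolves directly to Pod IPs. In the following example, the Service declares port 4444, but clients must connect to the container's port 8086.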

kind: Service
apiVersion: v1
metadata:
  namespace: tick
  name: influxdb
spec:
  clusterIP: None
  selector:
    app: influxdb
  ports:
  - name: api
    port: 4444
    targetPort: 8086
  - name: admin
    port: 8083
    targetPort: 8083

$ kubectl apply -f tick/ -R
$ kubectl get svc --namespace tick
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
influxdb   ClusterIP   None         <none>        4444/TCP,8083/TCP   1h

$ curl http://influxdb.tick.svc.cluster.local:4444/ping
curl: (7) Failed to connect to influxdb.tick.svc.cluster.local port 4444: Connection refused

$ curl -I http://influxdb.tick.svc.cluster.local:8086/ping
HTTP/1.1 204 No Content
Content-Type: application/json
Request-Id: 7fc09a56-8538-11e8-8d1d-000000000000

Create A DaemonSet

Create a DaemonSet which changes OS kernel configurations on each node.

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: thp-disabler
spec:
  selector:
    matchLabels:
      app: thp-disabler
  template:
    metadata:
      labels:
        app: thp-disabler
    spec:
      hostPID: true
      containers:
      - name: configurer
        image: gcr.io/google-containers/startup-script:v1
        securityContext:
          privileged: true
        env:
        - name: STARTUP_SCRIPT
          value: |
            #! /bin/bash
            set -o errexit
            set -o pipefail
            set -o nounset

            echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
            echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag

ref:
https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/

Create A CronJob

Back up your MongoDB database every hour.

kind: CronJob
apiVersion: batch/v1beta1
metadata:
  name: backup-mongodb-rs0
spec:
  suspend: false
  schedule: "30 * * * *"
  startingDeadlineSeconds: 600
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: custom.kubernetes.io/scopes-storage-full
                    operator: In
                    values:
                    - "true"
          volumes:
          - name: backups-dir
            emptyDir: {}
          initContainers:
          - name: clean
            image: busybox
            # run through a shell so that the /backups/* glob is expanded
            command: ["sh", "-c", "rm -rf /backups/*"]
            volumeMounts:
            - name: backups-dir
              mountPath: /backups
          - name: backup
            image: vinta/mongodb-tools:4.0.1
            workingDir: /backups
            command: ["sh", "-c"]
            args:
            - mongodump --host=$MONGODB_URL --readPreference=secondaryPreferred --oplog --gzip --archive=$(date +%Y-%m-%dT%H-%M-%S).tar.gz
            env:
            - name: MONGODB_URL
              value: mongodb-rs0-0.mongodb-rs0.default.svc.cluster.local,mongodb-rs0-1.mongodb-rs0.default.svc.cluster.local,mongodb-rs0-2.mongodb-rs0.default.svc.cluster.local
            volumeMounts:
            - name: backups-dir
              mountPath: /backups
            resources:
              requests:
                cpu: 2
                memory: 2G
          containers:
          - name: upload
            image: google/cloud-sdk:alpine
            workingDir: /backups
            command: ["sh", "-c"]
            args:
            - gsutil -m cp -r . gs://$(GOOGLE_CLOUD_STORAGE_BUCKET)
            env:
            - name: GOOGLE_CLOUD_STORAGE_BUCKET
              value: simple-project-backups
            volumeMounts:
            - name: backups-dir
              mountPath: /backups
              readOnly: true

Note: For Kubernetes to expand an environment variable in the command or args field, the variable must be written in parentheses: $(VAR). A plain $VAR is only expanded when the command runs through a shell, e.g., sh -c.
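
A related detail of exec-form command arrays: they are not run through a shell, so glob patterns such as /backups/* are passed literally unless the command is wrapped in sh -c. The sketch below reproduces the difference locally; the scratch directory stands in for the /backups volume and the file names are hypothetical.

```shell
# Scratch directory standing in for the /backups volume (hypothetical files).
dir="$(mktemp -d)"
touch "$dir/a.tar.gz" "$dir/b.tar.gz"

# Exec form, like ["rm", "-rf", "/backups/*"]: the pattern is passed
# literally, nothing matches, and the files survive.
rm -rf "$dir/*"
after_exec_form="$(ls "$dir" | wc -l | tr -d ' ')"

# Shell form, like ["sh", "-c", "rm -rf /backups/*"]: the shell expands the glob.
sh -c "rm -rf $dir/*"
after_shell_form="$(ls "$dir" | wc -l | tr -d ' ')"

echo "files left after exec form:  $after_exec_form"
echo "files left after shell form: $after_shell_form"
rmdir "$dir"
```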

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: simple-api-send-email
spec:
  schedule: "*/30 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: simple-api-send-email
            image: asia.gcr.io/simple-project-198818/simple-api:4fc4199
            command: ["flask", "shell", "-c"]
            args:
            - |
              from bar.tasks import send_email
              send_email.delay('Hey!', 'Stand up!', to=['[email protected]'])
            envFrom:
            - configMapRef:
                name: simple-api

You could just run a simple Python script as a CronJob since everything is containerized.

ref:
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/

Define NodeAffinity And PodAffinity

Prevent Pods from being scheduled on preemptible nodes. Also, you should always prefer nodeAffinity over nodeSelector.

kind: StatefulSet
apiVersion: apps/v1
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-preemptible
                operator: NotIn
                values:
                - "true"

ref:
https://medium.com/google-cloud/using-preemptible-vms-to-cut-kubernetes-engine-bills-in-half-de2481b8e814

spec.affinity.podAntiAffinity ensures that Pods of the same Deployment or StatefulSet are not co-located on a single node.

kind: StatefulSet
apiVersion: apps/v1
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: "kubernetes.io/hostname"
              labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                    - mongodb-rs0

ref:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

Migrate Pods from Old Nodes to New Nodes

  • Cordon marks old nodes as unschedulable
  • Drain evicts all Pods on old nodes

for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=n1-standard-4-pre -o=name); do
  kubectl cordon "$node";
done

for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=n1-standard-4-pre -o=name); do
  kubectl drain --ignore-daemonsets --delete-local-data --grace-period=2 "$node";
done

$ kubectl get nodes
NAME                                       STATUS                     ROLES     AGE       VERSION
gke-demo-default-pool-3c058fcf-x7cv        Ready                      <none>    2h        v1.11.6-gke.6
gke-demo-default-pool-58da1098-1h00        Ready                      <none>    2h        v1.11.6-gke.6
gke-demo-default-pool-fc34abbf-9dwr        Ready                      <none>    2h        v1.11.6-gke.6
gke-demo-n1-standard-4-pre-1a54e45a-0m7p   Ready,SchedulingDisabled   <none>    58m       v1.11.6-gke.6
gke-demo-n1-standard-4-pre-1a54e45a-mx3h   Ready,SchedulingDisabled   <none>    58m       v1.11.6-gke.6
gke-demo-n1-standard-4-pre-1a54e45a-qhdz   Ready,SchedulingDisabled   <none>    58m       v1.11.6-gke.6

ref:
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#cordon
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain
https://cloud.google.com/kubernetes-engine/docs/tutorials/migrating-node-pool

Show Objects' Events

$ kubectl get events -w
$ kubectl get events -w --sort-by=.metadata.creationTimestamp
$ kubectl get events -w --sort-by=.metadata.creationTimestamp | grep mongo

ref:
https://kubernetes.io/docs/tasks/debug-application-cluster/

You could find more comprehensive logs on Google Cloud Stackdriver Logging if you are using GKE.

View Pods' Logs on Stackdriver Logging

You could use the following search formats.

textPayload:"OBJECT_FINALIZE"

logName="projects/simple-project-198818/logs/worker"
textPayload:"Added media preset"

logName="projects/simple-project-198818/logs/beat"
textPayload:"backend_cleanup"

resource.labels.pod_id="simple-api-6744bf74db-529qf"
textPayload:"5adb2bd460d6487649fe82ea"
timestamp>="2018-04-21T12:00:00Z"
timestamp<="2018-04-21T16:00:00Z"

resource.type="k8s_container"
resource.labels.cluster_name="production"
resource.labels.namespace_id="default"
resource.labels.pod_id:"simple-worker"
textPayload:"ConcurrentObjectUseError"

resource.type="k8s_node"
resource.labels.location="asia-east1"
resource.labels.cluster_name="production"
logName="projects/simple-project-198818/logs/node-problem-detector"

# see a Pod's logs
resource.type="k8s_container"
resource.labels.cluster_name="production"
resource.labels.namespace_id="default"
resource.labels.pod_name="cache-redis-0"
"start"

# see a Node's logs
resource.type="k8s_node"
resource.labels.location="asia-east1"
resource.labels.cluster_name="production"
resource.labels.node_name="gke-production-n1-highmem-32-p0-2bd334ec-v4ng"
"start"

ref:
https://kubernetes.io/docs/tasks/debug-application-cluster/logging-stackdriver/
https://cloud.google.com/logging/docs/view/advanced-filters

Best Practices

ref:
https://cloud.google.com/solutions/best-practices-for-building-containers
https://medium.com/@sachin.arote1/kubernetes-best-practices-9b1435a4cb53
https://medium.com/@brendanrius/scaling-kubernetes-for-25m-users-a7937e3536a0

Common Issues

Switch Contexts

Get authentication credentials to allow your kubectl to interact with the cluster.

$ gcloud container clusters get-credentials demo --project simple-project-198818

ref:
https://cloud.google.com/sdk/gcloud/reference/container/clusters/get-credentials
https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/

A Context is roughly a configuration profile which indicates the cluster, the namespace, and the user you use. Contexts are stored in ~/.kube/config.

$ kubectl config get-contexts
$ kubectl config use-context gke_simple-project-198818_asia-east1_demo
$ kubectl config view

ref:
https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/

The recommended way to switch contexts is using fubectl.

$ kcs

ref:
https://github.com/kubermatic/fubectl

Pending Pods

One of the most common reasons for Pending Pods is a lack of resources.

$ kubectl describe pod mongodb-rs0-1
...
Events:
Type       Reason              Age                  From                 Message
----       ------              ----                 ----                 -------
Warning    FailedScheduling    3m (x739 over 1d)    default-scheduler    0/3 nodes are available: 1 ExistingPodsAntiAffinityRulesNotMatch, 1 MatchInterPodAffinity, 1 NodeNotReady, 2 NoVolumeZoneConflict, 3 Insufficient cpu, 3 Insufficient memory, 3 MatchNodeSelector.
...

You could resize the node pool to add more nodes to the cluster.

$ gcloud container clusters resize demo --node-pool=n1-standard-4-pre --size=5 --region=asia-east1

ref:
https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/

Init:Error Pods

$ kubectl describe pod mongodump-sh0-1543978800-bdkhl
$ kubectl logs mongodump-sh0-1543978800-bdkhl -c mongodump

ref:
https://kubernetes.io/docs/tasks/debug-application-cluster/debug-init-containers/#accessing-logs-from-init-containers

CrashLoopBackOff Pods

CrashLoopBackOff means the Pod is starting, then crashing, then starting again and crashing again.

When in doubt, kubectl describe.

$ kubectl describe pod the-pod-name
$ kubectl logs the-pod-name --previous

ref:
https://www.krenger.ch/blog/crashloopbackoff-and-how-to-fix-it/
https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/