Deploy graph-node (The Graph) in Kubernetes (AWS EKS)

Table of Contents

graph-node is an open source software that indexes blockchain data, also known as an indexer. Though the cost of running a self-hosted graph node could be pretty high. We're going to deploy a self-hosted graph node on Amazon Elastic Kubernetes Service (EKS).

In this article, we have two approaches to deploy self-hosted graph node:

A single graph node with a single PostgreSQL database
A graph node cluster with a primary-secondary PostgreSQL cluster
- A graph node cluster consists of one index node and multiple query nodes
- For the database, we will simply use a Multi-AZ DB cluster on AWS RDS

ref:
https://github.com/graphprotocol/graph-node

Create a PostgreSQL Database on AWS RDS

Hardware requirements for running a graph-node:
https://thegraph.com/docs/en/network/indexing/#what-are-the-hardware-requirements
https://docs.thegraph.academy/official-docs/indexer/testnet/graph-protocol-testnet-baremetal/1_architectureconsiderations

A Single DB Instance

We use the following settings on staging.

Version: PostgreSQL 13.7-R1
Template: Dev/Test
Deployment option: Single DB instance
DB instance identifier: graph-node
Master username: graph_node
Auto generate a password: Yes
DB instance class: db.t3.medium (2 vCPU 4G RAM)
Storage type: gp2
Allocated storage: 500 GB
Enable storage autoscaling: No
Compute resource: Don’t connect to an EC2 compute resource
Network type: IPv4
VPC: eksctl-perp-staging-cluster/VPC
DB Subnet group: default-vpc
Public access: Yes
VPC security group: graph-node
Availability Zone: ap-northeast-1d
Initial database name: graph_node

A Multi-AZ DB Cluster

We use the following settings on production.

Version: PostgreSQL 13.7-R1
Template: Production
Deployment option: Multi-AZ DB Cluster
DB instance identifier: graph-node-cluster
Master username: graph_node
Auto generate a password: Yes
DB instance class: db.m6gd.2xlarge (8 vCPU 32G RAM)
Storage type: io1
Allocated storage: 500 GB
Provisioned IOPS: 1000
Enable storage autoscaling: No
Compute resource: Don’t connect to an EC2 compute resource
VPC: eksctl-perp-production-cluster/VPC
DB Subnet group: default-vpc
Public access: Yes
VPC security group: graph-node

Unfortunately, AWS currently does not have Reserved Instances (RIs) Plan for Multi-AZ DB clusters. Use "Multi-AZ DB instance" or "Single DB instance" instead if the cost is a big concern to you.

RDS Remote Access

You could test your database remote access. Also, make sure the security group's inbound rules include 5432 port for PostgreSQL.

brew install postgresql

psql --host=YOUR_RDB_ENDPOINT --port=5432 --username=graph_node --password --dbname=postgres
# or
createdb -h YOUR_RDB_ENDPOINT -p 5432 -U graph_node graph_node

Create a Dedicated EKS Node Group

This step is optional.

eksctl --profile=perp create nodegroup 
--cluster perp-production 
--region ap-southeast-1 
--name "managed-graph-node-m5-xlarge" 
--node-type "m5.xlarge" 
--nodes 3 
--nodes-min 3 
--nodes-max 3 
--managed 
--asg-access 
--alb-ingress-access

Deploy graph-node in Kubernetes

Deployments for a Single DB Instance

apiVersion: v1
kind: Service
metadata:
  name: graph-node
spec:
  clusterIP: None
  selector:
    app: graph-node
  ports:
    - name: jsonrpc
      port: 8000
      targetPort: 8000
    - name: websocket
      port: 8001
      targetPort: 8001
    - name: admin
      port: 8020
      targetPort: 8020
    - name: index-node
      port: 8030
      targetPort: 8030
    - name: metrics
      port: 8040
      targetPort: 8040
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: graph-node-config
data:
  # https://github.com/graphprotocol/graph-node/blob/master/docs/environment-variables.md
  GRAPH_LOG: info
  EXPERIMENTAL_SUBGRAPH_VERSION_SWITCHING_MODE: synced
---
apiVersion: v1
kind: Secret
metadata:
  name: graph-node-secret
type: Opaque
data:
  postgres_host: xxx
  postgres_user: xxx
  postgres_db: xxx
  postgres_pass: xxx
  ipfs: https://ipfs.network.thegraph.com
  ethereum: optimism:https://YOUR_RPC_ENDPOINT_1 optimism:https://YOUR_RPC_ENDPOINT_2
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: graph-node
spec:
  replicas: 1
  selector:
    matchLabels:
      app: graph-node
  serviceName: graph-node
  template:
    metadata:
      labels:
        app: graph-node
      annotations:
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
    spec:
      containers:
        # https://github.com/graphprotocol/graph-node/releases
        # https://hub.docker.com/r/graphprotocol/graph-node
        - name: graph-node
          image: graphprotocol/graph-node:v0.29.0
          envFrom:
            - secretRef:
                name: graph-node-secret
            - configMapRef:
                name: graph-node-config
          ports:
            - containerPort: 8000
              protocol: TCP
            - containerPort: 8001
              protocol: TCP
            - containerPort: 8020
              protocol: TCP
            - containerPort: 8030
              protocol: TCP
            - containerPort: 8040
              protocol: TCP
          resources:
            requests:
              cpu: 2000m
              memory: 4G
            limits:
              cpu: 4000m
              memory: 4G

Since Kubernetes Secrets are, by default, stored unencrypted in the API server's underlying data store (etcd). Anyone with API access can retrieve or modify a Secret. They're not secret at all. So instead of storing sensitive data in Secrets, you might want to use Secrets Store CSI Driver.

ref:
https://github.com/graphprotocol/graph-node/blob/master/docs/environment-variables.md
https://hub.docker.com/r/graphprotocol/graph-node

Deployments for a Multi-AZ DB Cluster

There are two types of nodes in a graph node cluster:

Index Node: Only indexing data from the blockchain, not serving queries at all
Query Node: Only serving GraphQL queries, not indexing data at all

Indexing subgraphs doesn't require too much CPU and memory resources, but serving queries does, especially when you enable GraphQL caching.

Index Node

Technically, we could further split an index node into Ingestor and Indexer: the former fetches blockchain data from RPC providers periodically, and the latter indexes entities based on mappings. That's another story though.

apiVersion: v1
kind: ConfigMap
metadata:
  name: graph-node-cluster-index-config-file
data:
  config.toml: |
    [store]
    [store.primary]
    connection = "postgresql://USER:PASSWORD@RDS_WRITER_ENDPOINT/DB"
    weight = 1
    pool_size = 2

    [chains]
    ingestor = "graph-node-cluster-index-0"
    [chains.optimism]
    shard = "primary"
    provider = [
      { label = "optimism-rpc-proxy", url = "http://rpc-proxy-for-graph-node.default.svc.cluster.local:8000", features = ["archive"] }
      # { label = "optimism-quicknode", url = "https://YOUR_RPC_ENDPOINT_1", features = ["archive"] },
      # { label = "optimism-alchemy", url = "https://YOUR_RPC_ENDPOINT_2", features = ["archive"] },
      # { label = "optimism-infura", url = "https://YOUR_RPC_ENDPOINT_3", features = ["archive"] }
    ]

    [deployment]
    [[deployment.rule]]
    indexers = [ "graph-node-cluster-index-0" ]
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: graph-node-cluster-index-config-env
data:
  # https://github.com/graphprotocol/graph-node/blob/master/docs/environment-variables.md
  GRAPH_LOG: "info"
  GRAPH_KILL_IF_UNRESPONSIVE: "true"
  ETHEREUM_POLLING_INTERVAL: "100"
  ETHEREUM_BLOCK_BATCH_SIZE: "10"
  GRAPH_STORE_WRITE_QUEUE: "50"
  EXPERIMENTAL_SUBGRAPH_VERSION_SWITCHING_MODE: "synced"
---
apiVersion: v1
kind: Service
metadata:
  name: graph-node-cluster-index
spec:
  clusterIP: None
  selector:
    app: graph-node-cluster-index
  ports:
    - name: jsonrpc
      port: 8000
      targetPort: 8000
    - name: websocket
      port: 8001
      targetPort: 8001
    - name: admin
      port: 8020
      targetPort: 8020
    - name: index-node
      port: 8030
      targetPort: 8030
    - name: metrics
      port: 8040
      targetPort: 8040
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: graph-node-cluster-index
spec:
  replicas: 1
  selector:
    matchLabels:
      app: graph-node-cluster-index
  serviceName: graph-node-cluster-index
  template:
    metadata:
      labels:
        app: graph-node-cluster-index
      annotations:
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
    spec:
      terminationGracePeriodSeconds: 10
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: alpha.eksctl.io/nodegroup-name
                    operator: In
                    values:
                      - managed-graph-node-m5-xlarge
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      # should schedule index node to the zone that writer instance is located
                      - ap-southeast-1a
      volumes:
        - name: graph-node-cluster-index-config-file
          configMap:
            name: graph-node-cluster-index-config-file
      containers:
        # https://github.com/graphprotocol/graph-node/releases
        # https://hub.docker.com/r/graphprotocol/graph-node
        - name: graph-node-cluster-index
          image: graphprotocol/graph-node:v0.29.0
          command:
            [
              "/bin/sh",
              "-c",
              'graph-node --node-id $HOSTNAME --config "/config.toml" --ipfs "https://ipfs.network.thegraph.com"',
            ]
          envFrom:
            - configMapRef:
                name: graph-node-cluster-index-config-env
          ports:
            - containerPort: 8000
              protocol: TCP
            - containerPort: 8001
              protocol: TCP
            - containerPort: 8020
              protocol: TCP
            - containerPort: 8030
              protocol: TCP
            - containerPort: 8040
              protocol: TCP
          resources:
            requests:
              cpu: 1000m
              memory: 1G
            limits:
              cpu: 2000m
              memory: 1G
          volumeMounts:
            - name: graph-node-cluster-index-config-file
              subPath: config.toml
              mountPath: /config.toml
              readOnly: true

ref:
https://github.com/graphprotocol/graph-node/blob/master/docs/config.md
https://github.com/graphprotocol/graph-node/blob/master/docs/environment-variables.md

The key factors to the efficiency of syncing/indexing subgraphs are:

The latency of the RPC provider
The write throughput of the database

I didn't find any graph-node configs or environment variables that can speed up the syncing process noticeably. If you know, please tell me.

ref:
https://github.com/graphprotocol/graph-node/issues/3756

If you're interested in building a RPC proxy with healthcheck of block number and latency, see Deploy Ethereum RPC Provider Load Balancer with HAProxy in Kubernetes (AWS EKS). graph-node itself cannot detect if an RPC provider's block is delayed.

Query Nodes

The most important config is DISABLE_BLOCK_INGESTOR: "true" which basically configures the node as a query node.

apiVersion: v1
kind: ConfigMap
metadata:
  name: graph-node-cluster-query-config-env
data:
  # https://github.com/graphprotocol/graph-node/blob/master/docs/environment-variables.md
  DISABLE_BLOCK_INGESTOR: "true" # this node won't ingest blockchain data
  GRAPH_LOG_QUERY_TIMING: "gql"
  GRAPH_GRAPHQL_QUERY_TIMEOUT: "600"
  GRAPH_QUERY_CACHE_BLOCKS: "60"
  GRAPH_QUERY_CACHE_MAX_MEM: "2000" # the actual used memory could be 3x
  GRAPH_QUERY_CACHE_STALE_PERIOD: "100"
  EXPERIMENTAL_SUBGRAPH_VERSION_SWITCHING_MODE: "synced"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: graph-node-cluster-query-config-file
data:
  config.toml: |
    [store]
    [store.primary]
    connection = "postgresql://USER:PASSWORD@RDS_WRITER_ENDPOINT/DB" # connect to RDS writer instance
    weight = 0
    pool_size = 2
    [store.primary.replicas.repl1]
    connection = "postgresql://USER:PASSWORD@RDS_READER_ENDPOINT/DB" # connect to RDS reader instances (multiple)
    weight = 1
    pool_size = 100

    [chains]
    ingestor = "graph-node-cluster-index-0"

    [deployment]
    [[deployment.rule]]
    indexers = [ "graph-node-cluster-index-0" ]

    [general]
    query = "graph-node-cluster-query*"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: graph-node-cluster-query
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  minReadySeconds: 5
  selector:
    matchLabels:
      app: graph-node-cluster-query
  template:
    metadata:
      labels:
        app: graph-node-cluster-query
    spec:
      terminationGracePeriodSeconds: 40
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: alpha.eksctl.io/nodegroup-name
                    operator: In
                    values:
                      - managed-graph-node-m5-xlarge
      volumes:
        - name: graph-node-cluster-query-config-file
          configMap:
            name: graph-node-cluster-query-config-file
      containers:
        # https://github.com/graphprotocol/graph-node/releases
        # https://hub.docker.com/r/graphprotocol/graph-node
        - name: graph-node-cluster-query
          image: graphprotocol/graph-node:v0.29.0
          command:
            [
              "/bin/sh",
              "-c",
              'graph-node --node-id $HOSTNAME --config "/config.toml" --ipfs "https://ipfs.network.thegraph.com"',
            ]
          envFrom:
            - configMapRef:
                name: graph-node-cluster-query-config-env
          ports:
            - containerPort: 8000
              protocol: TCP
            - containerPort: 8001
              protocol: TCP
            - containerPort: 8030
              protocol: TCP
          resources:
            requests:
              cpu: 1000m
              memory: 8G
            limits:
              cpu: 2000m
              memory: 8G
          volumeMounts:
            - name: graph-node-cluster-query-config-file
              subPath: config.toml
              mountPath: /config.toml
              readOnly: true

ref:
https://github.com/graphprotocol/graph-node/blob/master/docs/config.md
https://github.com/graphprotocol/graph-node/blob/master/docs/environment-variables.md

It's also strongly recommended to mark subgraph schemas as immutable with @entity(immutable: true). Immutable entities are much faster to write and to query, so should be used whenever possible. The query time reduces by 80% in our case.

ref:
https://thegraph.com/docs/en/developing/creating-a-subgraph/#defining-entities

Setup an Ingress for graph-node

WebSocket connections are inherently sticky. If the client requests a connection upgrade to WebSockets, the target that returns an HTTP 101 status code to accept the connection upgrade is the target used in the WebSockets connection. After the WebSockets upgrade is complete, cookie-based stickiness is not used. You don't need to enable stickiness for ALB.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: graph-node-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-FS-1-2-Res-2020-10
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:XXX
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=false
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=3600,deletion_protection.enabled=true
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/actions.subgraph-api: >
      {"type": "forward", "forwardConfig": {"targetGroups": [
        {"serviceName": "graph-node-cluster-query", "servicePort": 8000, "weight": 100}
      ]}}
    alb.ingress.kubernetes.io/actions.subgraph-ws: >
      {"type": "forward", "forwardConfig": {"targetGroups": [
        {"serviceName": "graph-node-cluster-query", "servicePort": 8001, "weight": 100}
      ]}}
    alb.ingress.kubernetes.io/actions.subgraph-hc: >
      {"type": "forward", "forwardConfig": {"targetGroups": [
        {"serviceName": "graph-node-cluster-query", "servicePort": 8030, "weight": 100}
      ]}}
spec:
  rules:
    - host: "subgraph-api.example.com"
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: subgraph-api
                port:
                  name: use-annotation
    - host: "subgraph-ws.example.com"
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: subgraph-ws
                port:
                  name: use-annotation
    - host: "subgraph-hc.example.com"
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: subgraph-hc
                port:
                  name: use-annotation

Deploy a Subgraph

kubectl port-forward service/graph-node-cluster-index 8020:8020
npx graph create your_org/your_subgraph --node http://127.0.0.1:8020
graph deploy --node http://127.0.0.1:8020 your_org/your_subgraph

ref:
https://github.com/graphprotocol/graph-cli

Here are also some useful commands for maintenance tasks:

kubectl logs -l app=graph-node-cluster-query -f | grep "query_time_ms"

# show total pool connections according to config file
graphman --config config.toml config pools graph-node-cluster-query-zone-a-78b467665d-tqw7g

# show info
graphman --config config.toml info QmbCm38nL6C7v3FS5snzjnQyDMvV8FkM72C3o5RCfGpM9P

# reassign a deployment to a index node
graphman --config config.toml reassign QmbCm38nL6C7v3FS5snzjnQyDMvV8FkM72C3o5RCfGpM9P graph-node-cluster-index-pending-0
graphman --config config.toml reassign QmbCm38nL6C7v3FS5snzjnQyDMvV8FkM72C3o5RCfGpM9P graph-node-cluster-index-0

# stop indexing a deployment
graphman --config config.toml unassign QmbCm38nL6C7v3FS5snzjnQyDMvV8FkM72C3o5RCfGpM9P

# remove unused deployments
graphman --config config.toml unused record
graphman --config config.toml unused remove -d QmdoXB4zJVD5n38omVezMaAGEwsubhKdivgUmpBCUxXKDh

ref:
https://github.com/graphprotocol/graph-node/blob/master/docs/graphman.md