IPFS: The (Very Slow) Distributed Permanent Web

IPFS stands for InterPlanetary File System, but you can simply think of it as a distributed, permanent, but ridiculously slow and not-quite-working version of the web. You can upload any static file or static website to IPFS, and the whole swarm will probably distribute your files to the moon, which might be why IPFS is so fucking slow.

ref:
https://ipfs.io/

Installation

Install on macOS.

$ brew install ipfs

Initialize and start your IPFS node.

$ ipfs init
initializing IPFS node at /Users/vinta/.ipfs
generating 2048-bit RSA keypair... done
peer identity: QmfNy1th16zscbpxe8Q2EQdQkNFn7Y3Rp9kGZWL1EQDyw6

$ ipfs daemon

ref:
https://ipfs.io/docs/commands/#ipfs-init
https://ipfs.io/docs/commands/#ipfs-daemon

Alternatively, you can run your IPFS node in a Docker container.

# docker-compose.yml
version: "3"
services:
    ipfs:
        image: ipfs/go-ipfs:v0.4.15
        working_dir: /export
        ports:
            - "4001:4001" # Swarm
            - "5001:5001" # web UI
            - "8080:8080" # HTTP proxy
        volumes:
            - "~/.ipfs:/data/ipfs"
            - "~/.ipfs/export:/export"

ref:
https://hub.docker.com/r/ipfs/go-ipfs/

Usage

Show Node Info

$ ipfs id
{
    "ID": "QmfNy1th16zscbpxe8Q2EQdQkNFn7Y3Rp9kGZWL1EQDyw6",
    "PublicKey": "A_LONG_LONG_LONG_KEY,
    "Addresses": [
        "/ip4/127.0.0.1/tcp/4001/ipfs/QmfNy1th16zscbpxe8Q2EQdQkNFn7Y3Rp9kGZWL1EQDyw6",
        "/ip4/172.19.0.2/tcp/4001/ipfs/QmfNy1th16zscbpxe8Q2EQdQkNFn7Y3Rp9kGZWL1EQDyw6"
    ],
    "AgentVersion": "go-ipfs/0.4.14/5db3846",
    "ProtocolVersion": "ipfs/0.1.0"
}

ref:
https://ipfs.io/docs/getting-started/

Add Other Nodes to Your Bootstrap List

This one is from Muzeum, https://muzeum.pro/.

$ ipfs bootstrap add /ip4/52.221.121.238/tcp/4001/ipfs/QmTKYdZDkqHiY24kPynSmKbmRdk7cJxWsvvfvvvZArQ1N9

# you could also connect to a node directly
$ ipfs swarm connect /ip4/52.221.121.238/tcp/4001/ipfs/QmTKYdZDkqHiY24kPynSmKbmRdk7cJxWsvvfvvvZArQ1N9

ref:
https://ipfs.io/docs/commands/#ipfs-bootstrap
https://ipfs.io/docs/commands/#ipfs-swarm

Add Files to IPFS

Every IPFS node allocates 10 GB of storage by default (you can raise it via the Datastore.StorageMax setting in the node's config), and a node only stores the data it requests or pins, which means each node holds only a small part of all the data on IPFS. If there are not enough nodes, your data might not be distributed to anyone except your own node.

Your content is automatically pinned when you ipfs add it.

$ ipfs add -r mysite
added QmRticJ3P5fnb9GGnUj3U9XMkYvGEnv9AQfk6YmgRhivYA mysite/index.html
added QmY9cxiHqTFoWamkQVkpmmqzBrY3hCBEL2XNu3NtX74Fuu mysite/readme.md
added QmTLhFgeWLacpbiGNYmhchHGQAhfNyDZcLt5akJFFLV89V mysite

If files/folders under the folder change, the hash of the folder changes too.

$ vim mysite/index.html
$ ipfs add -r mysite
added QmQTTe3deLfeULKjPHnQTcyFuCmY5JZiwSTiPT4nSt1KVK mysite/index.html # changed
added QmS85tb3aKQNurFm51FaxtK6NyNei4ej3gDR21baDZXRoU mysite            # changed

ref:
https://ipfs.io/docs/commands/#ipfs-add

Pin Files from IPFS

Pinning means storing IPFS files on your local node and preventing them from being garbage collected; pinned content is also much faster to access. You only need to run ipfs pin add for content someone else uploaded, since content you ipfs add yourself is pinned automatically.

$ ipfs pin add -r --progress /ipns/ipfs.soundscape.net/

$ ipfs pin add --progress /ipns/ipfs.soundscape.net/music_group/index.json
pinned QmZwTEhdjT4MyvEnWndVEJzBjp8zGGZH1cEBpshBQs75rY recursively

$ ipfs pin add --progress /ipns/ipfs.soundscape.net/music_album/index.json
pinned QmSAuGU5xt5SdR2ca2EDgeHFATSrAQhTfTYpYs9K9qmqED recursively

$ ipfs pin add --progress /ipns/ipfs.soundscape.net/music_recording/index.json
pinned QmcTiadA9jRMXx77tydPa6492QJAtjXkKkA4gERaFksy94 recursively

$ ipfs pin add --progress /ipns/ipfs.soundscape.net/music_composition/index.json
pinned QmTfqVaGVRnaPRQgYypGYXUvTK1UcDfK5VWYvU4rwK3m26 recursively

P.S. Sometimes when I ipfs pin add a file which is not on my node, the command just hangs there. I'm not sure why, but once I access the file first (through curl or any browser), ipfs pin add works fine. That doesn't really make sense, though: if I have already fetched the file, I could simply ipfs add it and it would be pinned automatically.

ref:
https://ipfs.io/docs/commands/#ipfs-pin

Get Files

You have several ways to get files or folders from IPFS:

  • ipfs get dir-hash -o readable-dir-name
    • ipfs get QmbMQNcg8TTo5dXZPtuxbns1XVq6cZJaa7vNqZzeJpKwfk -o mysite
  • ipfs get file-hash -o readable-file-name.ext
    • ipfs get Qmd286K6pohQcTKYqnS1YhWrCiS4gz7Xi34sdwMe9USZ7u -o cat.jpg
  • ipfs get /ipfs/dir-hash/path/to/file.txt
    • ipfs get /ipfs/QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG/readme
  • ipfs get /ipns/example.com/path/to/file.txt
    • ipfs get /ipns/ipfs.soundscape.net/music_group/index.json

You can also access IPFS files through any HTTP gateway, public or local (see the Python sketch after this list):

  • curl https://ipfs.io/ipns/peer-id/path/to/file.txt
    • curl https://ipfs.io/ipns/QmfNy1th16zscbpxe8Q2EQdQkNFn7Y3Rp9kGZWL1EQDyw6/index.html
  • curl https://ipfs.io/ipns/example.com/path/to/file.txt
    • curl https://ipfs.io/ipns/ipfs.soundscape.net/music_group/index.json
  • curl http://127.0.0.1:8080/ipns/example.com/path/to/file.txt
    • curl http://127.0.0.1:8080/ipns/ipfs.soundscape.net/music_group/index.json
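
Since a gateway speaks plain HTTP, any HTTP client works, not just curl. A minimal Python sketch, assuming the local gateway from the docker-compose setup above is listening on port 8080:

import requests

# fetch a JSON file through the local IPFS gateway
url = 'http://127.0.0.1:8080/ipns/ipfs.soundscape.net/music_group/index.json'
r = requests.get(url)
r.raise_for_status()
print(r.json())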

Download IPFS objects with ipfs get.

$ ipfs ls QmW2WQi7j6c7UgJTarActp7tDNikE4B2qXtFCfLPdsgaTQ
Qmd286K6pohQcTKYqnS1YhWrCiS4gz7Xi34sdwMe9USZ7u 443362 cat.jpg

# you could get a folder
$ ipfs get QmW2WQi7j6c7UgJTarActp7tDNikE4B2qXtFCfLPdsgaTQ
$ ls QmW2WQi7j6c7UgJTarActp7tDNikE4B2qXtFCfLPdsgaTQ
cat.jpg

# as well as a file
$ ipfs get Qmd286K6pohQcTKYqnS1YhWrCiS4gz7Xi34sdwMe9USZ7u -o cat.jpg

# get files and rename them
$ mkdir -p soundscape/music_group/ soundscape/music_album/ soundscape/music_recording/ soundscape/music_composition/ && \
  ipfs get /ipns/ipfs.soundscape.net/music_group/index.json -o soundscape/music_group/index.json; \
  ipfs get /ipns/ipfs.soundscape.net/music_album/index.json -o soundscape/music_album/index.json; \
  ipfs get /ipns/ipfs.soundscape.net/music_recording/index.json -o soundscape/music_recording/index.json; \
  ipfs get /ipns/ipfs.soundscape.net/music_composition/index.json -o soundscape/music_composition/index.json

# get whole folders
$ ipfs get /ipns/ipfs.soundscape.net/music_group; \
  ipfs get /ipns/ipfs.soundscape.net/music_album; \
  ipfs get /ipns/ipfs.soundscape.net/music_recording; \
  ipfs get /ipns/ipfs.soundscape.net/music_composition

ref:
https://ipfs.io/docs/commands/#ipfs-get
https://discuss.ipfs.io/t/trying-to-better-understand-the-pinning-concept/754

Display IPFS object data with ipfs cat.

$ ipfs cat Qmd286K6pohQcTKYqnS1YhWrCiS4gz7Xi34sdwMe9USZ7u > cat.jpg
$ ipfs cat QmS4ustL54uo8FzR9455qaxZwuMiUhyvMcX9Ba8nUH4uVv/readme

Publish a Website to IPNS

IPNS stands for InterPlanetary Naming System.

Every time you change files under a folder, the hash of the folder also changes, so you need a static reference that always points to the latest hash of your folder. With IPNS, you can publish your static website (a folder) under such a static reference: your peer ID, which is the hash of your public key.

By default, every IPFS node has only one keypair, so you can only publish one folder under your peer ID. However, you can generate additional keypairs with ipfs key gen and publish multiple folders.

$ ipfs add -r mysite
added QmeqHWZgvgx5C7T6DakX75CJDRgAUoSDZayLYrcnAP8Fma mysite/index.html
added QmUtuRphD9rJgRkfxwj7DcyFEAcSeH3Q1fK8nHxxoDiKK5 mysite

$ ipfs name publish QmUtuRphD9rJgRkfxwj7DcyFEAcSeH3Q1fK8nHxxoDiKK5
published to QmfNy1th16zscbpxe8Q2EQdQkNFn7Y3Rp9kGZWL1EQDyw6: /ipfs/QmUtuRphD9rJgRkfxwj7DcyFEAcSeH3Q1fK8nHxxoDiKK5

$ ipfs name resolve QmfNy1th16zscbpxe8Q2EQdQkNFn7Y3Rp9kGZWL1EQDyw6
/ipfs/QmUtuRphD9rJgRkfxwj7DcyFEAcSeH3Q1fK8nHxxoDiKK5

After you change something, publish the new hash again.

$ vim mysite/index.html
$ ipfs add -r mysite
added QmNjbhdks8RUgDt6QiNFe5QGe2HrbCsq5FKda9D9hLVkkU mysite/index.html # changed
added QmbMQNcg8TTo5dXZPtuxbns1XVq6cZJaa7vNqZzeJpKwfk mysite            # changed

$ ipfs name publish QmbMQNcg8TTo5dXZPtuxbns1XVq6cZJaa7vNqZzeJpKwfk
published to QmfNy1th16zscbpxe8Q2EQdQkNFn7Y3Rp9kGZWL1EQDyw6: /ipfs/QmbMQNcg8TTo5dXZPtuxbns1XVq6cZJaa7vNqZzeJpKwfk

ref:
https://ipfs.io/docs/commands/#ipfs-name

Create a Domain Name Alias for Your Peer ID

The hash is not very human-friendly. Fortunately, you can (and probably should) associate a domain name with your peer ID.

First, you need to add a TXT record to your domain whose value is dnslink=/ipns/YOUR_PEER_ID. In the following example, we assume the domain name you choose is ipfs.kittenphile.com.

$ dig +short TXT ipfs.kittenphile.com
"dnslink=/ipns/QmfNy1th16zscbpxe8Q2EQdQkNFn7Y3Rp9kGZWL1EQDyw6"

$ ipfs name resolve -r ipfs.kittenphile.com
/ipfs/QmaE2DcNxGjPGPfzfTQuTBTW9D57abVSv319WqC89Av1y1

ref:
https://ipfs.io/docs/examples/example-viewer/example#../websites/README.md
https://hackernoon.com/ten-terrible-attempts-to-make-the-inter-planetary-file-system-human-friendly-e4e95df0c6fa

Public Gateway

If you run a public gateway and people retrieve files through it, the gateway fetches and caches the data but does not pin it, so those files are removed on the next garbage collection run.

ref:
https://discuss.ipfs.io/t/public-facing-gateway-and-pinning/449

Parallel tasks in Python: concurrent.futures

TL;DR: concurrent.futures is well suited to embarrassingly parallel tasks. You can write concurrent code with a simple for loop.

executor.map() runs the same function multiple times with different parameters, while executor.submit() schedules a single call to any function with arbitrary parameters.

Install

concurrent.futures is part of the standard library in Python 3.2+. If you're using an older version of Python, you need to install the futures package.

$ pip install futures

ref:
https://docs.python.org/3/library/concurrent.futures.html

executor.map()

Use ProcessPoolExecutor for CPU-intensive tasks and ThreadPoolExecutor for network operations or other I/O-bound work. ProcessPoolExecutor is built on the multiprocessing module, so it is not affected by the GIL (Global Interpreter Lock), but it also means that only picklable objects can be executed and returned.

In Python 3.5+, executor.map() receives an optional argument: chunksize. For very long iterables, using a large value for chunksize can significantly improve performance compared to the default size of 1. With ThreadPoolExecutor, chunksize has no effect.

from concurrent.futures import ThreadPoolExecutor
import time

import requests

def fetch(a):
    url = 'http://httpbin.org/get?a={0}'.format(a)
    r = requests.get(url)
    result = r.json()['args']
    return result

start = time.time()

# if max_workers is None or not given, it will default to the number of processors, multiplied by 5
with ThreadPoolExecutor(max_workers=None) as executor:
    for result in executor.map(fetch, range(42)):
        print('response: {0}'.format(result))

print('time: {0}'.format(time.time() - start))

You might want to change the value of max_workers to 1 and observe the difference.
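
For CPU-bound work you would swap in ProcessPoolExecutor. Below is a minimal sketch, assuming a made-up is_prime() task, which also demonstrates the chunksize argument mentioned above:

from concurrent.futures import ProcessPoolExecutor
import math
import time

def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            return False
    return True

if __name__ == '__main__':
    numbers = list(range(10 ** 6, 10 ** 6 + 10000))

    start = time.time()

    # chunksize batches submissions to the worker processes,
    # which reduces inter-process overhead for long iterables
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(is_prime, numbers, chunksize=100))

    print('primes found: {0}'.format(sum(results)))
    print('time: {0}'.format(time.time() - start))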

ref:
https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures
https://www.blog.pythonlibrary.org/2016/08/03/python-3-concurrency-the-concurrent-futures-module/
http://masnun.com/2016/03/29/python-a-quick-introduction-to-the-concurrent-futures-module.html

executor.submit()

executor.submit() returns a Future object. A Future is basically an object that encapsulates an asynchronous execution of a function that will finish (or raise an exception) in the future.

The main difference between map() and as_completed() is that map() returns results in the order in which you pass the iterables, while as_completed() yields whichever future completes first. Also, iterating over map() gives you the results themselves, whereas iterating over as_completed(futures) gives you the Future objects (you call future.result() on each to get its value).
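
A minimal sketch of that ordering difference, using a made-up work() function that just sleeps for n seconds; the example below then does the same kind of thing with real HTTP requests:

from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def work(n):
    time.sleep(n)
    return n

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in input order: 3, 2, 1
    print(list(executor.map(work, [3, 2, 1])))

    # as_completed() yields futures in completion order: 1, 2, 3
    futures = [executor.submit(work, n) for n in [3, 2, 1]]
    print([f.result() for f in as_completed(futures)])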

from concurrent.futures import ThreadPoolExecutor, as_completed
import time

import requests

def fetch(url, timeout):
    r = requests.get(url, timeout=timeout)
    data = r.json()['args']
    return data

start = time.time()

with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {}
    for i in range(42):
        url = 'https://httpbin.org/get?i={0}'.format(i)
        future = executor.submit(fetch, url, 60)
        futures[future] = url

    for future in as_completed(futures):
        url = futures[future]
        try:
            data = future.result()
        except Exception as exc:
            print(exc)
        else:
            print('fetch {0}, get {1}'.format(url, data))

print('time: {0}'.format(time.time() - start))

ref:
https://docs.python.org/3/library/concurrent.futures.html#future-objects

Discussion

ref:
https://news.ycombinator.com/item?id=16737129

Read and write files in Go

Reading file line by line

If you want to read a file line by line, bufio.Scanner is the recommended tool. Scanner has one drawback, though: when a line is too long (over 64K), it fails with a bufio.Scanner: token too long error. In that case you have to fall back to bufio.Reader.

fin, err := os.Open(path)
if err != nil {
    fmt.Println(err)
}
defer fin.Close()

scanner := bufio.NewScanner(fin)
for scanner.Scan() {
    line := scanner.Text()
    fmt.Fprintln(os.Stdout, line)
}

if err := scanner.Err(); err != nil {
    fmt.Fprintln(os.Stderr, err)
}

If you know the maximum length of the tokens you will be reading, you can raise the limit with scanner.Buffer() (available since Go 1.6) instead of copying the bufio.Scanner code into your project and changing the MaxScanTokenSize constant.

ref:
http://stackoverflow.com/questions/6141604/go-readline-string
http://stackoverflow.com/questions/1821811/how-to-read-write-from-to-file
http://stackoverflow.com/questions/8757389/reading-file-line-by-line-in-go
http://stackoverflow.com/questions/5884154/golang-read-text-file-into-string-array-and-write
https://github.com/polaris1119/The-Golang-Standard-Library-by-Example/blob/master/chapter01/01.4.md

bufio
https://golang.org/pkg/bufio/
https://golang.org/pkg/bufio/#Scanner

Reading and writing file line by line

fmt.Fprintln(writer, line) is even faster than bw.WriteString(line). (TextSpacing() in the function below is the author's own string-transforming helper; any func(string) string would fit here.)

func FileSpacing(filename string, w io.Writer) (err error) {
    fr, err := os.Open(filename)
    if err != nil {
        return err
    }
    defer fr.Close()

    br := bufio.NewReader(fr)
    bw := bufio.NewWriter(w)

    for {
        line, err := br.ReadString('\n')
        if err == nil {
            fmt.Fprint(bw, TextSpacing(line))
        } else {
            if err == io.EOF {
                fmt.Fprint(bw, TextSpacing(line))
                break
            }
            return err
        }
    }
    defer bw.Flush()

    return nil
}

Copy a file

fin, err := os.Open("source.txt")
if err != nil {
    log.Fatal(err)
}
defer fin.Close()

fout, err := os.Create("destination.txt")
if err != nil {
    log.Fatal(err)
}
defer fout.Close()

io.Copy(fout, fin)

ref:
http://stackoverflow.com/questions/23272663/transfer-a-big-file-in-golang
http://golang.org/pkg/io/#Copy

Compute MD5 of a file

func md5Of(filename string) string {
    var result []byte

    file, err := os.Open(filename)
    checkError(err) // checkError is the author's error-handling helper (not shown here)
    defer file.Close()

    hash := md5.New()
    _, err = io.Copy(hash, file)
    checkError(err)

    checksum := hex.EncodeToString(hash.Sum(result))

    return checksum
}

ref:
http://stackoverflow.com/questions/29505089/how-can-i-compare-two-files-in-golang

Read and save file in Django / Python

File and ImageFile accept a Python file object or a StringIO object, while ContentFile accepts a string.

ref:
https://docs.djangoproject.com/en/dev/ref/files/file/#the-file-object

Django Form

image_file = request.FILES['file']

# Method 1
profile.mugshot.save(image_file.name, image_file)

# Method 2
profile.mugshot = image_file

profile.save()

ref:
File Upload with Form in Django

open('/path/to/file.png')

from django.core.files import File

with open('/home/vinta/image.png', 'rb') as f:
    profile.mugshot = File(f)
    profile.save()

Django ContentFile

import os
import uuid

from django.core.files.base import ContentFile

import requests

url = 'http://vinta.ws/static/photo.jpg'
r = requests.get(url)
file_url, file_ext = os.path.splitext(r.url)
file_name = '%s%s' % (str(uuid.uuid4()).replace('-', ''), file_ext)

profile.mugshot.save(file_name, ContentFile(r.content), save=False)

# if profile.mugshot is an ImageField,
# you can check whether the downloaded file is a valid image like this
try:
    profile.mugshot.width
except TypeError:
    raise RuntimeError('invalid image format')

profile.save()

Data URI, Base64

from binascii import a2b_base64

from django.core.files.base import ContentFile

data_uri = 'data:image/jpeg;base64,/9j/4AAQSkZJRg....'
head, data = data_uri.split(',')
binary_data = a2b_base64(data)

# Method 1: save into the model's FileField
profile.mugshot.save('whatever.jpg', ContentFile(binary_data), save=False)
profile.save()

# this does not work, because the file name is missing
profile.mugshot = ContentFile(binary_data)
profile.save()

# Method 2: write the binary data to a local file
f = open('image.png', 'wb')
f.write(binary_data)
f.close()

# Method 3: load the binary data with PIL
from StringIO import StringIO
from PIL import Image
img = Image.open(StringIO(binary_data))
print img.size

ref:
http://stackoverflow.com/questions/19395649/python-pil-create-and-save-image-from-data-uri

StringIO, PIL image

Think of StringIO as the file object you would get from open('/home/vinta/some_file.txt', 'rb').

from StringIO import StringIO

from PIL import Image
import requests

r = requests.get('http://vinta.ws/static/photo.jpg')
img = Image.open(StringIO(r.content))
print img.size

StringIO, PIL image, Django

from StringIO import StringIO

from django.core.files.base import ContentFile

from PIL import Image

raw_img_io = StringIO(binary_data)
img = Image.open(raw_img_io)
img = img.resize((524, 328), Image.ANTIALIAS)
img_io = StringIO()
img.save(img_io, 'PNG', quality=100)

profile.image.save('whatever.png', ContentFile(img_io.getvalue()), save=False)
profile.save()

ref:
http://stackoverflow.com/questions/3723220/how-do-you-convert-a-pil-image-to-a-django-file

Download file from URL, tempfile

import os
import tempfile
import requests
import xlrd

try:
    file_path = report.file.path
    temp = None
except NotImplementedError:
    url = report.file.url
    r = requests.get(url, stream=True)
    file_url, file_ext = os.path.splitext(r.url)

    # with delete=True, the temp file would be removed automatically after temp.close()
    temp = tempfile.NamedTemporaryFile(prefix='report_file_', suffix=file_ext, dir='/tmp', delete=False)
    file_path = temp.name

    with open(file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()

wb = xlrd.open_workbook(file_path)

...

# because we used tempfile.NamedTemporaryFile(delete=False),
# we have to remove the file ourselves
try:
    os.remove(temp.name)
except AttributeError:
    pass

ref:
http://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py
http://pymotw.com/2/tempfile/

Write to files in Python

Writing data to disk is slow, so it is better to accumulate some data and write it out in one go (i.e. buffering). Python's open('/path/to/file/') is buffered by default (the third argument is buffering), whether you read, write, or append.

The optional buffering argument specifies the file’s desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used.

In [1]: import io
In [2]: io.DEFAULT_BUFFER_SIZE
Out[2]: 8192
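
You can also pass the buffering argument explicitly. A minimal sketch (the file name is made up):

# 1 means line buffered: data is flushed whenever a newline is written
f = open('output.txt', 'w', 1)
f.write('hello\n')
f.close()

# any larger value means a buffer of (approximately) that many bytes
f = open('output.txt', 'a', 4096)
f.write('world\n')  # stays in the buffer until it fills, f.flush(), or f.close()
f.close()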

For example, a Django management command that dumps song IDs to a file and relies on the default buffering:

import time

from django.core.management.base import NoArgsCommand

from music.models import Song


class Command(NoArgsCommand):

    def handle_noargs(self, **options):
        start = time.time()

        total_lines = 0
        with open('dump_song_data.txt', 'w') as f:
            # header_line = 'user id, song id\n'
            # f.write(header_line)

            songs = Song.objects.filter(play_count__gte=10000)
            for song in songs.iterator():
                line = '%s\n' % (song.id)
                f.write(line)

                total_lines += 1

        end = time.time()

        print(total_lines)
        print(end - start)

ref:
https://docs.python.org/2/library/functions.html#open
http://stackoverflow.com/questions/3167494/how-often-does-python-flush-to-a-file
http://stackoverflow.com/questions/16669428/process-very-large-20gb-text-file-line-by-line
http://stackoverflow.com/questions/1896674/python-how-to-read-huge-text-file-into-memory