Elasticsearch: More than a Search Engine

Elasticsearch is a schemaless, document-oriented search engine with a set of powerful querying APIs. It also works well as a NoSQL database.

ref:
https://www.elastic.co/products/elasticsearch

Glossary

The definitions below are based on Elasticsearch 5.6 and may differ from older versions. In newer versions of Elasticsearch, each index can only have one mapping type, whereas earlier versions allowed multiple.

  • cluster: a cluster consists of one or more nodes and automatically elects a master node
  • node: a machine running Elasticsearch is a node
  • index: similar to a table in a relational database
  • mapping: similar to a table schema in a relational database
  • field: similar to a column in a relational database
  • document: similar to a row in a relational database
  • text: arbitrary unstructured text; text is analyzed into terms before it can be searched
  • term: what is actually stored in Elasticsearch
  • analysis: the process of turning text into terms, e.g. normalization, tokenization, and stopword removal

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html

Mapping

A mapping can be specified explicitly. If it is not, Elasticsearch automatically creates one based on the first document indexed, so some field settings (such as the analyzer) may not match your expectations. It is usually better to define the mapping manually.

Commonly used field data types:

  • text: a string type used for full-text search
  • keyword: a string type used for exact-value filters, sorting, and aggregations (the not_analyzed of older versions)

In Elasticsearch, the same field can be indexed as multiple data types. For example, via the fields property, a location field can be indexed as both text and keyword, for full-text search and exact-value filtering respectively. Each sub-field can also use a different analyzer.
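
A rough sketch of an explicit mapping with such a multi-field, using elasticsearch-dsl-py (the Repo doc type and its fields are made up):

from elasticsearch_dsl import DocType, Keyword, Text

class Repo(DocType):
    # index the same string both as analyzed text (full-text search)
    # and as a raw keyword (exact-value filter, sort, aggregate)
    name = Text(analyzer='standard', fields={'raw': Keyword()})
    language = Keyword()

    class Meta:
        index = 'repo'

# explicitly create the index and put the mapping,
# instead of relying on dynamic mapping
Repo.init()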

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

Analysis

An analyzer consists of three parts:

  • Character filters
  • Tokenizers
  • Token filters

You can compose your own analyzer. Here is an example using elasticsearch-dsl-py:

from elasticsearch_dsl import DocType, Date, Integer, Keyword, Text, Boolean
from elasticsearch_dsl import analyzer, tokenizer

text_analyzer = analyzer('text_analyzer',
    char_filter=["html_strip"],
    tokenizer="standard",
    filter=["asciifolding", "lowercase", "snowball", "stop"]
)

cjk_analyzer = analyzer('cjk_analyzer',
    char_filter=["html_strip"],
    tokenizer=tokenizer('trigram', 'nGram', min_gram=2, max_gram=3),
    filter=["asciifolding", "lowercase", "snowball", "stop"]
)

ref:
http://elasticsearch-dsl.readthedocs.io/en/latest/persistence.html

Testing Analyzer

To test how a piece of text is analyzed by a given combination of tokenizer and filters:

POST http://127.0.0.1:9200/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Is this déja vu?"
}

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

Chinese Word Segmentation

Word segmentation plugins such as ik do not work particularly well. The built-in cjk analyzer plus an nGram tokenizer is probably the better choice (you can use a multi-field index). Alternatively, segment the text before feeding it into Elasticsearch, e.g. with Jieba, join the segmented tokens with spaces, and use Elasticsearch's whitespace tokenizer.
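
A minimal sketch of the pre-segmentation approach with Jieba (the index and field names are made up; the field is assumed to use a whitespace analyzer):

# -*- coding: utf-8 -*-
import jieba

def segment(text):
    # cut the text with Jieba and join the tokens with spaces,
    # so Elasticsearch's whitespace tokenizer can split them back into terms
    return ' '.join(jieba.cut(text))

doc = {'title': segment(u'中文搜尋經驗分享')}
# es.index(index='posts', doc_type='post', body=doc)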

Chinese search experience sharing (中文搜尋經驗分享)
https://blog.liang2.tw/2015Talk-Chinese-Search/

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#cjk-analyzer
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html << not recommended
https://github.com/medcl/elasticsearch-analysis-ik << serviceable

RESTful APIs

Show useful information for humans
http://127.0.0.1:9200/_cat
https://www.elastic.co/guide/en/elasticsearch/reference/current/cat.html

List all indices and aliases
http://127.0.0.1:9200/_aliases

List mappings under an index
http://127.0.0.1:9200/repo/_mapping

List documents under an index
http://127.0.0.1:9200/repo/_search

Query DSL

The query is the main body of your search and affects relevance scoring;
a filter is a precondition of the search that only includes or excludes documents, without scoring.
https://www.elastic.co/guide/en/elasticsearch/guide/master/search-in-depth.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html

For exact-value queries, use term
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html

For full-text queries, use match
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
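
A rough sketch of the difference, using the official Python client (the repo index and its fields are hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch(['127.0.0.1:9200'])

# term: looks up the exact term stored in the index, the query string is NOT analyzed
es.search(index='repo', body={
    'query': {'term': {'language': 'python'}},
})

# match: the query string is analyzed first, then matched against the indexed terms
es.search(index='repo', body={
    'query': {'match': {'description': 'web crawling framework'}},
})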

To search multiple fields at once, use multi_match
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html

To combine conditions with AND (must), OR (should), and NOT (must_not), use bool
https://www.elastic.co/guide/en/elasticsearch/guide/master/bool-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

To combine filters with queries, use the filter clause of a bool query
(the standalone filtered query was deprecated in 2.0 and removed in 5.0)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
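
For example, a bool query that mixes full-text matching with exact-value filtering might look like this (index and field names are made up):

from elasticsearch import Elasticsearch

es = Elasticsearch(['127.0.0.1:9200'])

es.search(index='repo', body={
    'query': {
        'bool': {
            'must': [{'match': {'description': 'search engine'}}],  # AND, contributes to the score
            'must_not': [{'term': {'is_archived': True}}],          # NOT
            'filter': [{'term': {'language': 'python'}}],           # exact value, no scoring
        },
    },
})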

More Like This query
Besides free text, you can also pass document ids directly to find similar documents
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
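
A sketch of a more_like_this query that uses an existing document as the "like" input (index, type, and id are hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch(['127.0.0.1:9200'])

es.search(index='repo', body={
    'query': {
        'more_like_this': {
            'fields': ['title', 'description'],
            # "like" accepts free text and/or existing documents referenced by id
            'like': [
                {'_index': 'repo', '_type': 'repo', '_id': '12345'},
                'web crawling',
            ],
            'min_term_freq': 1,
            'max_query_terms': 25,
        },
    },
})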

Boost specific fields at query time
https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_boosting_query_clauses.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-boosting-query.html
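
The simplest form of query-time boosting is weighting fields in a multi_match query (a sketch, field names made up):

from elasticsearch import Elasticsearch

es = Elasticsearch(['127.0.0.1:9200'])

es.search(index='repo', body={
    'query': {
        'multi_match': {
            'query': 'machine learning',
            # 'title^3' means a match in title weighs 3 times more than one in description
            'fields': ['title^3', 'description'],
        },
    },
})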

Multi-index, Multi-type

Besides searching a single type, you can also search across multiple indices and types (see the sketch after the list below):

  • /_search: Search all types in all indices
  • /gb/_search: Search all types in the gb index
  • /gb,us/_search: Search all types in the gb and us indices
  • /g*,u*/_search: Search all types in any indices beginning with g or beginning with u
  • /gb/user/_search: Search type user in the gb index
  • /gb,us/user,tweet/_search: Search types user and tweet in the gb and us indices
  • /_all/user,tweet/_search: Search types user and tweet in all indices
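
With the official Python client, the same index/type patterns map to the index and doc_type arguments (a sketch):

from elasticsearch import Elasticsearch

es = Elasticsearch(['127.0.0.1:9200'])

# search types user and tweet in the gb and us indices
es.search(index='gb,us', doc_type='user,tweet', body={
    'query': {'match_all': {}},
})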

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html#search-multi-index-type

Read and Write Files in Django and Python

File and ImageFile accept a Python file object or a StringIO object,
while ContentFile accepts a string.

ref:
https://docs.djangoproject.com/en/dev/ref/files/file/#the-file-object

Django Form

image_file = request.FILES['file']

# Option 1
profile.mugshot.save(image_file.name, image_file)

# Option 2
profile.mugshot = image_file

profile.save()

from django.core.files import File

with open('/home/vinta/image.png', 'rb') as f:
    profile.mugshot = File(f)
    profile.save()

Django ContentFile

import os
import uuid

from django.core.files.base import ContentFile

import requests

url = 'http://vinta.ws/static/photo.jpg'
r = requests.get(url)
file_url, file_ext = os.path.splitext(r.url)
file_name = '%s%s' % (str(uuid.uuid4()).replace('-', ''), file_ext)

profile.mugshot.save(file_name, ContentFile(r.content), save=False)

# if profile.mugshot is an ImageField,
# you can check whether it is a valid image like this
try:
    profile.mugshot.width
except TypeError:
    raise RuntimeError('invalid image file')

profile.save()

Data URI, Base64

from binascii import a2b_base64

from django.core.files.base import ContentFile

data_uri = 'data:image/jpeg;base64,/9j/4AAQSkZJRg....'
head, data = data_uri.split(',')
binary_data = a2b_base64(data)

# Option 1
profile.mugshot.save('whatever.jpg', ContentFile(binary_data), save=False)
profile.save()

# this does NOT work because the file name is missing
profile.mugshot = ContentFile(binary_data)
profile.save()

# Option 2
f = open('image.png', 'wb')
f.write(binary_data)
f.close()

# Option 3
from StringIO import StringIO
from PIL import Image
img = Image.open(StringIO(binary_data))
print img.size

ref:
https://stackoverflow.com/questions/19395649/python-pil-create-and-save-image-from-data-uri

StringIO, PIL image

Think of StringIO as the file object you would get from open('/home/vinta/some_file.txt', 'rb').

from StringIO import StringIO

from PIL import Image
import requests

r = requests.get('http://vinta.ws/static/photo.jpg')
img = Image.open(StringIO(r.content))
print img.size

StringIO, PIL image, Django

from StringIO import StringIO

from django.core.files.base import ContentFile

from PIL import Image

img = Image.open(instance.file)
# or
raw_img_io = StringIO(binary_data)
img = Image.open(raw_img_io)
img = img.resize((524, 328), Image.ANTIALIAS)
img_io = StringIO()
img.save(img_io, 'PNG', quality=100)

profile.image.save('whatever.png', ContentFile(img_io.getvalue()), save=False)
profile.save()

ref:
https://stackoverflow.com/questions/3723220/how-do-you-convert-a-pil-image-to-a-django-file

Download file from URL, tempfile

import os
import tempfile
import requests
import xlrd

try:
    file_path = report.file.path
    temp = None
except NotImplementedError:
    url = report.file.url
    r = requests.get(url, stream=True)
    file_url, file_ext = os.path.splitext(r.url)

    # with delete=True, the file would be deleted automatically after temp.close()
    temp = tempfile.NamedTemporaryFile(prefix='report_file_', suffix=file_ext, dir='/tmp', delete=False)
    file_path = temp.name

    with open(file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()

wb = xlrd.open_workbook(file_path)

# because we used tempfile.NamedTemporaryFile(delete=False),
# we have to remove the file ourselves
try:
    os.remove(temp.name)
except AttributeError:
    pass

ref:
https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests

HTTP Cache Headers in Django

Conditional View Processing

from django.views.decorators.http import condition

def latest_entry(request, blog_id):
    return Entry.objects.filter(blog=blog_id).latest("published").published

@condition(last_modified_func=latest_entry)
def front_page(request, blog_id):
    ...

ref:
https://docs.djangoproject.com/en/dev/topics/conditional-view-processing/

Cache Middlewares

If Django's cache middlewares are enabled (i.e. UpdateCacheMiddleware and FetchFromCacheMiddleware),
every response gets tagged with Cache-Control: max-age=600,
where 600 comes from the CACHE_MIDDLEWARE_SECONDS setting.

As long as max-age > 0 is set,
the Cache-Control, Expires, and Last-Modified headers are added to the response automatically.
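
A minimal sketch of the relevant settings (middleware order matters: UpdateCacheMiddleware first, FetchFromCacheMiddleware last):

# settings.py
MIDDLEWARE = [
    'django.middleware.cache.UpdateCacheMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.middleware.cache.FetchFromCacheMiddleware',
]

CACHE_MIDDLEWARE_ALIAS = 'default'
CACHE_MIDDLEWARE_SECONDS = 600  # becomes Cache-Control: max-age=600
CACHE_MIDDLEWARE_KEY_PREFIX = ''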

ref:
https://docs.djangoproject.com/en/dev/ref/settings/#std:setting-CACHE_MIDDLEWARE_SECONDS

Never Cache Decorator

from django.views.decorators.cache import never_cache

@never_cache
def myview(request):
    pass

If you simply do not want the response to be cached,
use this approach.
@never_cache only sets Cache-Control: max-age=0 (max-age=0 means the response expires immediately).

Cache Control Decorator

from django.views.decorators.cache import cache_control

class SongDetail(SVAPIDetailView):
    serializer_class = api_serializers.SongDetailSerializer

    @cache_control(no_store=True, no_cache=True, max_age=0)
    def get(self, request, song_id):
        do_something()

        return Response(data)

Not sure why, but if only no-store and no-cache are set,
iOS's AFNetworking still caches the response,
even though no-store is supposed to have the highest priority.
The current workaround is to use Cache-Control: max-age=0.

ref:
https://docs.djangoproject.com/en/dev/topics/cache/#controlling-cache-using-other-headers

Scrapy: The Web Scraping Framework for Python

Scrapy is a fast high-level web crawling and web scraping framework.

ref:
https://doc.scrapy.org/en/latest/

Install

# on Ubuntu
$ sudo apt-get install libxml2-dev libxslt1-dev libffi-dev

# on Mac
$ brew install libffi

$ pip install scrapy service_identity

Usage

# interactive shell
# http://doc.scrapy.org/en/latest/intro/tutorial.html#trying-selectors-in-the-shell
$ scrapy shell "http://www.wendyslookbook.com/2013/09/the-frame-a-digital-glossy/"
# or
$ scrapy shell --spider=seemodel
>>> view(response)
>>> fetch(req_or_url)

# create a project
$ scrapy startproject blackwindow

# create a spider
$ scrapy genspider fancy www.fancy.com

# run spider
$ scrapy crawl fancy
$ scrapy crawl pinterest -L ERROR

Spider
The program that crawls the data; use parse() to define what you want to extract

Item
Defines the fields of the scraped data; think of it as a Django model

Pipeline
Post-processes the scraped items, e.g. cleaning HTML or checking for duplicates

Under the hood, Scrapy is built on lxml and Twisted.
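
A minimal sketch of an Item and a Pipeline (the fields mirror the HeelsItem used in the login spider below; the pipeline and its drop condition are made up, and it would be enabled via ITEM_PIPELINES in settings.py):

import scrapy
from scrapy.exceptions import DropItem

class HeelsItem(scrapy.Item):
    comment = scrapy.Field()
    image_urls = scrapy.Field()
    source_url = scrapy.Field()

class DropEmptyImagePipeline(object):
    def process_item(self, item, spider):
        # drop items that didn't scrape any image
        if not item.get('image_urls'):
            raise DropItem('no image_urls in %s' % item.get('source_url'))
        return item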

ref:
https://github.com/vinta/BlackWidow

Tips

Debugging

from scrapy.shell import inspect_response
inspect_response(response, self)

These two lines will invoke the interactive shell.

Relative XPath

divs = response.xpath('//div')
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print p.extract()

Access Django Model in Scrapy

def setup_django_env(django_settings_dir):
    import imp
    import os
    import sys

    from django.core.management import setup_environ

    django_project_path = os.path.abspath(os.path.join(django_settings_dir, '..'))
    sys.path.append(django_project_path)
    sys.path.append(django_settings_dir)

    f, filename, desc = imp.find_module('settings', [django_settings_dir, ])
    project = imp.load_module('settings', f, filename, desc)

    setup_environ(project)

# where Django settings.py placed
DJANGO_SETTINGS_DIR = '/all_projects/heelsfetishism/heelsfetishism'
setup_django_env(DJANGO_SETTINGS_DIR)

then you can import Django's modules in scrapy, like this:

from django.contrib.auth.models import User

from app.models import SomeModel

State

http://doc.scrapy.org/en/latest/topics/jobs.html#keeping-persistent-state-between-batches

def parse_item(self, response):
    # parse item here
    self.state['items_count'] = self.state.get('items_count', 0) + 1

Close Spider

from scrapy.exceptions import CloseSpider

# can only be raised inside a spider, not in a pipeline
raise CloseSpider('Stop')

Login in Spider

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request, FormRequest

from blackwidow.items import HeelsItem

class SeeModelSpider(CrawlSpider):
    name = 'seemodel'
    allowed_domains = ['www.seemodel.com', ]
    login_page = 'http://www.seemodel.com/member.php?mod=logging&action=login'
    start_urls = [
        'http://www.seemodel.com/forum.php?mod=forumdisplay&fid=41&filter=heat&orderby=heats',
        'http://www.seemodel.com/forum.php?mod=forumdisplay&fid=42&filter=heat&orderby=heats',
    ]

    rules = (
        Rule(
            SgmlLinkExtractor(allow=r'forum\.php\?mod=viewthread&tid=\d+'),
            callback='parse_item',
            follow=False,
        ),
    )

    def start_requests(self):
        self.username = self.settings['SEEMODEL_USERNAME']
        self.password = self.settings['SEEMODEL_PASSWORD']

        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True,
        )

    def login(self, response):
        return FormRequest.from_response(
            response,
            formname='login',
            formdata={
                'username': self.username,
                'password': self.password,
                'cookietime': 'on',
            },
            callback=self.check_login_response,
        )

    def check_login_response(self, response):
        if self.username not in response.body:
            self.log("Login failed")
            return

        self.log("Successfully logged in")

        return [Request(url=url, dont_filter=True) for url in self.start_urls]

    def parse_item(self, response):
        item = HeelsItem()
        item['comment'] = response.xpath('//*[@id="thread_subject"]/text()').extract()
        item['image_urls'] = response.xpath('//ignore_js_op//img/@zoomfile').extract()
        item['source_url'] = response.url

        return item

ref:
https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-userlogin

Others

XPath node selection syntax
http://mi.hosp.ncku.edu.tw/km/index.php/dotnet/48-netdisk/57-xml-xpath

Avoiding getting banned
http://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned

The architecture of Scrapy, a Python scraping framework (in Chinese)
http://biaodianfu.com/scrapy-architecture.html

Download images
https://scrapy.readthedocs.org/en/latest/topics/images.html

Parse datetime in Python and JavaScript

Python

I recommend dateutil.

ref:
https://dateutil.readthedocs.org/en/latest/

import datetime
from dateutil import parser as dateutil_parser

>>> dateutil_parser.parse('2014-12-24T16:15:16')
datetime.datetime(2014, 12, 24, 16, 15, 16)

>>> datetime_obj = datetime.datetime.strptime('2014-12-24T16:15:16', '%Y-%m-%dT%H:%M:%S')
datetime.datetime(2014, 12, 24, 16, 15, 16)

>>> datetime_obj = datetime.datetime.strptime('201408282300', '%Y%m%d%H%M')
datetime.datetime(2014, 8, 28, 23, 0)

>>> datetime_obj.strftime('%Y-%m-%d %H:%M')

strftime: datetime -> str
strptime: str -> datetime

ref:
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior

Django Template

class DriverInfoForm(forms.ModelForm):
    service_time_start = forms.TimeField(
        widget=forms.TimeInput(format='%H:%M'),
        input_formats=['%H:%M', ]
    )

from django import template

register = template.Library()

@register.filter
def str_to_time(time_str, output_format):
    """
    Convert the string into a datetime object,
    then render it according to output_format, e.g.

    {{ news.modified_at|str_to_time:"%Y/%m/%d %H:%M" }}
    """

    from dateutil import parser

    datetime_obj = parser.parse(time_str, fuzzy=True)

    return datetime_obj.strftime(output_format)

Date: {{ withdraw.presented_at|date:"Y 年 n 月" }}
Contact hours: {{ driver.service_time_start|date:"H:i" }} - {{ driver.service_time_end|date:"H:i" }}

Note that Django does not seem to parse AM/PM, so prefer the 24-hour format.

ref:
https://docs.djangoproject.com/en/dev/ref/templates/builtins/#date

JavaScript

I recommend moment.js.

ref:
https://momentjs.com/

var today = new Date().toISOString().slice(0, 10);
// 2016-05-11

var t1 = new Date('2016-05-02T03:00:00.000+01:00');
// Mon May 02 2016 10:00:00 GMT+0800 (CST)

var t1_timestamp_ms = t1.getTime();
// note that JavaScript's getTime() returns milliseconds
// 1462154400000

var t1_timestamp = t1.getTime() / 1000;
// 1462154400

var t2 = new Date(1485596172 * 1000);
// Sat Jan 28 2017 17:36:12 GMT+0800 (CST)

var t3 = moment('201408292300', 'YYYYMMDDHHmm');

var t4 = moment('2018-02-02');
var timestamp = t4.unix();
// in seconds
// 1518192000

ref:
https://stackoverflow.com/questions/3552461/how-to-format-a-javascript-date