All Posts Tagged “elasticsearch”

Change index mappings with zero downtime using elasticsearch-py

Basically, you can't change existing mappings (the so-called "schema") in Elasticsearch. You can add new fields freely, but changing an existing field definition (its type or analyzer) is impossible. One way or another, you need to create a new index.

Steps:

  • Create an alias my_index which points to the old index my_index_v1
  • Use my_index instead of my_index_v1 in your application
  • Create a new index my_index_v2 with new mappings
  • Transfer documents from old index to new index - a.k.a. reindex
  • Associate the alias my_index with index my_index_v2
  • Delete the old index my_index_v1
from datetime import datetime

from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex

es_client = Elasticsearch(hosts=settings.ES_HOSTS)

# make sure that this alias doesn't conflict with any existing index name
alias = 'packer'

# CAUTION: if you have an index already, you should create an alias for it first
# es_client.indices.put_alias(index='your_current_index', name=alias)

# get_alias() may raise NotFoundError when the alias doesn't exist yet, so check first
if es_client.indices.exists_alias(name=alias):
    old_indexes = list(es_client.indices.get_alias(name=alias).keys())
else:
    old_indexes = []

if len(old_indexes) > 1:
    raise RuntimeError('Alias `{0}` points to {1} indexes, which may cause errors when writing data to `{0}`'.format(alias, len(old_indexes)))

old_index = old_indexes[0] if old_indexes else None

new_index = '{}_{}'.format(alias, datetime.now().strftime('%Y%m%d%H%M%S%f'))

available_types = [TrackDoc, AlbumDoc]
for my_doc_type in available_types:
    # create a new index with new mappings
    my_doc_type.init(index=new_index)

if old_index:
    # transfer documents from old index to new index
    reindex(es_client, source_index=old_index, target_index=new_index)

    es_client.indices.update_aliases({
        'actions': [
            {'remove': {'index': old_index, 'alias': alias}},
            {'add': {'index': new_index, 'alias': alias}},
        ],
    })
else:
    es_client.indices.update_aliases({
        'actions': [
            {'add': {'index': new_index, 'alias': alias}},
        ],
    })
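
The two update_aliases branches above differ only in whether a remove action is present, and the last step from the list (deleting the old index) is not shown. Here is a minimal sketch of one way to fold the two branches together. build_alias_swap is a hypothetical helper name for illustration, not part of elasticsearch-py, and the index names are placeholders:

```python
def build_alias_swap(alias, new_index, old_index=None):
    """Build a request body for the _aliases API.

    All actions sent in one request are applied atomically, so the alias
    never points to zero or two indexes in between.
    """
    actions = []
    if old_index:
        # only needed when the alias already points somewhere
        actions.append({'remove': {'index': old_index, 'alias': alias}})
    actions.append({'add': {'index': new_index, 'alias': alias}})
    return {'actions': actions}


body = build_alias_swap('packer', 'packer_v2', old_index='packer_v1')
print(body)
```

The resulting body can be passed to es_client.indices.update_aliases(body). Once the alias has moved and the old data is no longer needed, es_client.indices.delete(index=old_index) would remove the old index.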

ref:
https://www.elastic.co/blog/changing-mapping-with-zero-downtime
https://blog.codecentric.de/en/2014/09/elasticsearch-zero-downtime-reindexing-problems-solutions/
http://elasticsearch-py.readthedocs.org/en/master/helpers.html#elasticsearch.helpers.reindex

An alias can point to multiple indexes. In that case, reading (searching) through the alias works fine, but writing (indexing) to it raises an exception: Alias [my_index] has more than one indices associated with it [[my_index_v1, my_index_v2]], can't execute a single index op.

Pointing one alias at multiple indexes is not recommended unless you always write to a specific index explicitly rather than to the alias.

ref:
https://www.elastic.co/guide/en/elasticsearch/guide/current/multiple-indices.html

Create aliases

# list all indexes and their aliases
$ curl 'http://127.0.0.1:9200/_aliases'

# create an alias
$ curl -XPOST 'http://127.0.0.1:9200/_aliases' -d '
{
    "actions" : [
        { "add" : { "index" : "dps", "alias" : "packer" } }
    ]
}
'

# delete all indexes and aliases (CAUTION: this cannot be undone)
$ curl -XDELETE 'http://127.0.0.1:9200/*/'

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-aliases.html

Update index settings

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-put-mapping.html
https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-update-settings.html
https://gist.github.com/nicolashery/6317643

Use elasticsearch-dsl with Python

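Query DSL is Elasticsearch's domain-specific language for queries.
You can think of it as the SQL of Elasticsearch,
except that it is really just a pile of JSON.
elasticsearch-dsl is the officially released Python package for working with the Query DSL.
Using it feels a bit like Django's ORM.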

ref:
https://github.com/elastic/elasticsearch-dsl-py

Installation

$ pip install elasticsearch-dsl

ref:
https://elasticsearch-dsl.readthedocs.org/en/latest/index.html

Indices and Types

in app/documents.py

from elasticsearch_dsl import DocType, String, Boolean
from elasticsearch_dsl.connections import connections
connections.create_connection(hosts=['127.0.0.1', ])


class AlbumDoc(DocType):
    upc = String(index='not_analyzed')
    title = String(analyzer='ik', fields={'raw': String(index='not_analyzed')})
    artist = String(analyzer='ik')
    is_ready = Boolean()

    class Meta:
        index = 'dps'
        doc_type = 'album'

    @classmethod
    def sync(cls, album):
        # copy the fields of an Album model instance into a document and index it
        album_doc = cls(meta={'id': album.id})
        album_doc.upc = album.get_upcs(output_str=False)
        album_doc.title = album.name
        album_doc.artist = album.artist.name
        album_doc.is_ready = album.is_ready
        album_doc.save()

    def save(self, *args, **kwargs):
        return super(AlbumDoc, self).save(*args, **kwargs)

    def get_model_obj(self):
        from svapps.dps.models import Album
        return Album.objects.get(id=self.meta.id)

# to create mappings
AlbumDoc.init()

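You must run YourDocType.init() once
so that Elasticsearch creates the mapping from your DocType.
Otherwise Elasticsearch builds the mapping from the data types of the first documents you index,
so settings such as the analyzer fall back to the default, standard.
You can check the result through the _mapping API: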
http://127.0.0.1:9200/dps/_mapping/track
http://127.0.0.1:9200/dps/_mapping/album

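Fields that need full-text search should be analyzed (string fields are analyzed by default).
Fields that don't need full-text search, that is, fields that require exact matching such as username, email address, or zip code, can be set to not_analyzed.
However, a term query won't behave as expected on an analyzed field,
unless you add an extra raw sub-field for that field and query it instead.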
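
To make the raw sub-field idea concrete, here is a sketch written as plain Python dicts. It shows roughly the mapping that the title definition in AlbumDoc above corresponds to, in the Elasticsearch 1.x string syntax used in this post, plus the two query bodies that go with it. The helper names full_text_query and exact_query are made up for illustration:

```python
import json

# An analyzed field with a not_analyzed "raw" sub-field
title_mapping = {
    'title': {
        'type': 'string',
        'analyzer': 'ik',
        'fields': {
            'raw': {'type': 'string', 'index': 'not_analyzed'},
        },
    },
}


def full_text_query(text):
    # match analyzes `text` first, so it targets the analyzed field
    return {'query': {'match': {'title': text}}}


def exact_query(value):
    # term does not analyze `value`, so it targets the raw sub-field
    return {'query': {'term': {'title.raw': value}}}


print(json.dumps(full_text_query('abc'), sort_keys=True))
print(json.dumps(exact_query('abc'), sort_keys=True))
```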

ref:
https://elasticsearch-dsl.readthedocs.org/en/latest/persistence.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html#CO59-2

Store Data

album_doc = AlbumDoc(meta={'id': 42})
album_doc.upc = ['887375000619', '887375502069']
album_doc.title = 'abc'
album_doc.artist = 'xyz'
album_doc.is_ready = True
album_doc.save()

# you can query as usual, regardless of whether the field holds a list
search = AlbumDoc.search().filter('term', upc='887375000619')
response = search.execute()

Because Elasticsearch is schemaless,
even if you declared a String field,
you can still store a list in it.

Search Data

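from elasticsearch_dsl import Q, Search

# TrackDoc is assumed to be a DocType defined like AlbumDoc above
search = TrackDoc.search() \
    .filter('term', is_ready=True) \
    .query('match', title=u'沒有的啊')

# Q objects can be combined with & (AND)
search = TrackDoc.search() \
    .filter('term', is_ready=True) \
    .query(
        Q('match', title=u'沒有的啊') &
        Q('match', artist=u'那我懂你意思了') &
        Q('match', album=u'沒有的, 啊!?')
    )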

q = Q(
    'bool',
    must=[
        Q('match', title={'query': track_name, 'fuzziness': 'AUTO'}),
    ],
    should=[
        Q('match', album={'query': album_name, 'minimum_should_match': '60%'}),
        Q('match', artist={'query': artist_name, 'minimum_should_match': '80%'}),
    ],
    minimum_should_match=1
)
search = TrackDoc.search().filter('term', is_ready=True).query(q)

# `keyword` is assumed to hold the raw search string entered by the user
q = Q(
    'bool',
    should=[
        Q('term', isrc=keyword),
        Q('term', upc=keyword),
        Q('match', **{'title.raw': {'query': keyword}}),
        Q('multi_match', query=keyword, fields=['title', 'artist', 'album']),
    ],
)
search = Search(index='dps', doc_type=['track', 'album']).query(q)
search = search[:20]

# print the raw Query DSL
import uniout
from pprint import pprint
pprint(search.to_dict())

response = search.execute()

print(response.hits.total)
print(response[0].title)
print(response[0].artist)
print(response[0].album)
print(response[0].is_ready)

ref:
https://elasticsearch-dsl.readthedocs.org/en/latest/search_dsl.html

Elasticsearch notes

Elasticsearch is a schemaless, document-oriented search engine with a bunch of powerful querying APIs. It's also a pretty good NoSQL database.

ref:
https://www.elastic.co/products/elasticsearch

Glossary

  • cluster: a cluster contains one or more nodes
  • node: one server is one node
  • index: somewhat like a database in an RDBMS; strictly speaking, an index is just a namespace
  • shard: each index is split into multiple shards that are placed on different nodes. Each shard is either a primary or a replica
  • type: similar to a table in an RDBMS
  • document: similar to a row in an RDBMS; documents are actually stored in the individual shards
  • field: similar to a column in an RDBMS
  • mapping: how the data in each field is interpreted; similar to a table schema in an RDBMS
  • analysis: how full text is processed to make it searchable

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html

http://es.xiaoleilu.com/020_Distributed_Cluster/15_Add_an_index.html

Mapping (Schema)

Data in Elasticsearch can be broadly divided into two types: exact values and full text.

  • Exact values are exactly what they sound like. Examples are a date or a user ID, but can also include exact strings such as a username or an email address. The exact value Foo is not the same as the exact value foo. The exact value 2014 is not the same as the exact value 2014-09-15.
  • Full text, on the other hand, refers to textual data, usually written in some human language, like the text of a tweet or the body of an email.

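The first time data is indexed,
Elasticsearch automatically creates a mapping based on the data types of that data.
However, some field properties (the analyzer, for example) may not be what you expect,
so it is better to create the mapping yourself.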
https://www.elastic.co/guide/en/elasticsearch/guide/master/mapping-intro.html

For some fields you want full text when searching
but an exact value when aggregating.
In that case you can add a raw sub-field to get both.
https://www.elastic.co/guide/en/elasticsearch/guide/current/top-hits.html
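
As a rough sketch of the raw-field approach, the request body below searches the analyzed title field but aggregates on title.raw. It assumes a mapping in which title has a not_analyzed raw sub-field, as in the AlbumDoc example elsewhere on this page. The function name and the aggregation name by_title are arbitrary:

```python
def search_with_title_buckets(text, size=10):
    """Full-text search on `title`, bucketed by the exact title value."""
    return {
        # full text: the query string is analyzed before matching
        'query': {'match': {'title': text}},
        # exact value: buckets are built from the untouched raw string
        'aggs': {
            'by_title': {
                'terms': {'field': 'title.raw', 'size': size},
            },
        },
    }


body = search_with_title_buckets('abc', size=5)
print(body)
```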

RESTful API

show useful info for humans
http://127.0.0.1:9200/_cat

list all indices and their aliases
http://127.0.0.1:9200/_aliases

list types under an index
http://127.0.0.1:9200/dps/_mapping

list all documents under a type
http://127.0.0.1:9200/dps/track/_search

get mapping for an index or type
http://127.0.0.1:9200/dps/_mapping/track

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/cat.html

Query DSL

The Query DSL can be divided into queries and filters.
The query is the main subject of your search.
The filter is a precondition applied to that search.
https://www.elastic.co/guide/en/elasticsearch/guide/master/search-in-depth.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html

To query for an exact value,
use term.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html

To query full text,
use match.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html

To query multiple fields at once,
use multi_match.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html

To search with AND (must), OR (should), and NOT (must_not) conditions,
use bool.
https://www.elastic.co/guide/en/elasticsearch/guide/master/bool-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
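
For illustration, a bool query is just nested JSON. The sketch below assembles one from plain dicts; bool_query is a hypothetical helper, and the field names title, artist, and is_ready are borrowed from the document classes used elsewhere on this page:

```python
def bool_query(must=None, should=None, must_not=None, minimum_should_match=None):
    """Assemble a bool query body, leaving out empty clause lists."""
    body = {}
    if must:
        body['must'] = list(must)          # AND: every clause has to match
    if should:
        body['should'] = list(should)      # OR: matching clauses raise the score
    if must_not:
        body['must_not'] = list(must_not)  # NOT: no clause may match
    if minimum_should_match is not None:
        body['minimum_should_match'] = minimum_should_match
    return {'bool': body}


query = bool_query(
    must=[{'match': {'title': 'abc'}}],
    should=[{'match': {'artist': 'xyz'}}],
    must_not=[{'term': {'is_ready': False}}],
)
print(query)
```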

To combine a filter and a query,
use filtered (this is the one you will usually use).
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
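
A minimal sketch of what a filtered request body looks like in the Elasticsearch 1.x syntax these notes were written against. Note that the filtered query was deprecated in Elasticsearch 2.0, where a bool query with a filter clause takes its place. filtered_search is a hypothetical helper name:

```python
def filtered_search(query, filter_):
    """Wrap a query and a filter into a 1.x-style filtered request body."""
    return {
        'query': {
            'filtered': {
                'query': query,     # scored: decides relevance
                'filter': filter_,  # not scored: only includes or excludes documents
            },
        },
    }


body = filtered_search(
    query={'match': {'title': 'abc'}},
    filter_={'term': {'is_ready': True}},
)
print(body)
```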

To boost specific fields:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-boosting-query.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_boosting_query_clauses.html
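
One common way to weight fields at query time is the caret syntax in a multi_match field list, for example title^3. The sketch below builds such a query. boosted_multi_match is a hypothetical helper, and the boost values are arbitrary examples rather than recommendations:

```python
def boosted_multi_match(text, boosts):
    """Build a multi_match query from (field, boost) pairs.

    A boost of 1 is the default, so the caret suffix is omitted for it.
    """
    fields = [
        field if boost == 1 else '{}^{}'.format(field, boost)
        for field, boost in boosts
    ]
    return {'multi_match': {'query': text, 'fields': fields}}


query = boosted_multi_match('abc', [('title', 3), ('artist', 2), ('album', 1)])
print(query)
```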

Multi-index, Multi-type

Besides searching a single type, you can also search across indices and across types.

  • /_search Search all types in all indices
  • /gb/_search Search all types in the gb index
  • /gb,us/_search Search all types in the gb and us indices
  • /g*,u*/_search Search all types in any indices beginning with g or beginning with u
  • /gb/user/_search Search type user in the gb index
  • /gb,us/user,tweet/_search Search types user and tweet in the gb and us indices
  • /_all/user,tweet/_search Search types user and tweet in all indices
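
The URL patterns above follow one simple rule, which can be sketched as a small path builder. search_path is a hypothetical helper written only to illustrate the pattern:

```python
def search_path(indices=(), types=()):
    """Build a _search path for the given index and type names."""
    parts = []
    if indices:
        parts.append(','.join(indices))
    elif types:
        # types without an explicit index need the _all placeholder
        parts.append('_all')
    if types:
        parts.append(','.join(types))
    parts.append('_search')
    return '/' + '/'.join(parts)


for args in [((), ()), (('gb', 'us'), ()), (('gb',), ('user',)), ((), ('user', 'tweet'))]:
    print(search_path(*args))
```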

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html#search-multi-index-type

Analyzer

Put simply, an analyzer is what splits text into tokens.

Chinese word segmentation:
https://github.com/medcl/elasticsearch-analysis-ik/
https://github.com/medcl/elasticsearch-analysis-mmseg

Elasticsearch Chinese word segmentation analysis plugin: ik

$ pip install httpie

# test.txt contains the text you want to tokenize
$ http 127.0.0.1:9200/_analyze?analyzer=cjk < test.txt