Elasticsearch: More than a Search Engine

Elasticsearch is a schemaless, document-oriented search engine, has a bunch of powerful quering APIs. It's also a great NoSQL database.

ref:
https://www.elastic.co/products/elasticsearch

Glossary

以下的定義以 Elasticsearch 5.6 為準,可能跟舊版的定義不同。在新版的 Elasticsearch 中,每個 index 只能有一個 mapping type,之前的版本則可以有多個。

  • cluster:一個 cluster 包含一個或多個 nodes,會自動選出一個 master node
  • node:一台跑著 Elasticsearch 的機器就是一個 node
  • index:類似關聯式資料庫裡的 table
  • mapping:類似關聯式資料庫裡的 table schema
  • field:類似關聯式資料庫裡的 column
  • document:類似關聯式資料庫裡的 row
  • text:任意的非結構化文字,text 會被 analyze 變成 term,然後才能被搜尋
  • term:實際上存在 Elasticsearch 裡的東西
  • analysis: 把 text 變成 term 的過程,例如 normalize、tokenize 和 stopword remove

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html

Mapping

Mapping 可以顯式地指定,但是如果沒有指定,Elasticsearch 就會在第一次有資料進去的時候,自動根據資料建立對應的 mapping,所以一些欄位的屬性(例如 analyzer)可能不會符合你的預期,所以最好還是手動指定 mapping。

常用的 field data types:

  • text:表示 string 類型,用於 full-text search
  • keyword:表示 string 類型,用於 exact value 的 filter、sort 或 aggregate(就是舊版的 not_analyzed

在 Elasticsearch 中,同一個欄位可以被 index 成不同的 data types,例如 location 欄位,可以透過 fields 屬性,同時 index 成 textkeyword,用來全文搜索和 exact value 過濾。也可以分別指定不同的 analyzer。

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

Analysis

一個 analyzer 由三個部分組成:

  • Character filters
  • Tokenizers
  • Token filters

你可以自己組合出你的 analyzer,以 elasticsearch-dsl-py 為例:

from elasticsearch_dsl import DocType, Date, Integer, Keyword, Text, Boolean
from elasticsearch_dsl import analyzer, tokenizer

text_analyzer = analyzer('text_analyzer',
    char_filter=["html_strip"],
    tokenizer="standard",
    filter=["asciifolding", "lowercase", "snowball", "stop"]
)

cjk_analyzer = analyzer('text_analyzer',
    char_filter=["html_strip"],
    tokenizer=tokenizer('trigram', 'nGram', min_gram=2, max_gram=3),
    filter=["asciifolding", "lowercase", "snowball", "stop"]
)

ref:
http://elasticsearch-dsl.readthedocs.io/en/latest/persistence.html

Testing Analyzer

測試某段文字在某個 analyzer 下的效果:

POST http://127.0.0.1:9200/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Is this déja vu?"
}

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

Chinese Words Segmentation

ik 之類的分詞 plugin 的效果都不是很好,內建的 cjk 加上 NGram 可能會是比較好的選擇(可以用 multi-field index)。另外一個作法是,把資料餵進去 Elasticsearch 之前就先分好詞,可以用 Jieba,分詞完的文本以空格分隔,然後用 Elasticsearch 的 whitespace tokenizer。

中文搜尋經驗分享
https://blog.liang2.tw/2015Talk-Chinese-Search/

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#cjk-analyzer
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html << 不推薦
https://github.com/medcl/elasticsearch-analysis-ik << 堪用

RESTful APIs

Show useful information for humans
http://127.0.0.1:9200/_cat
https://www.elastic.co/guide/en/elasticsearch/reference/current/cat.html

List all indices and aliases
http://127.0.0.1:9200/_aliases

List mappings under am index
http://127.0.0.1:9200/repo/_mapping

List documents under an index
http://127.0.0.1:9200/repo/_search

Query DSL

query 就是你要搜索的主體
filter 則是這個搜索的前置條件
https://www.elastic.co/guide/en/elasticsearch/guide/master/search-in-depth.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html

要做 exact value 的 query
請用 term
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html

要做 full text 的 query
請用 match
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html

要一次 query 多個欄位
請用 multi_match
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html

要用 AND (must), OR (should), NOT (must_not) 的條件搜索
請用 bool
https://www.elastic.co/guide/en/elasticsearch/guide/master/bool-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

要結合 filter 和 query
請用 filtered
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html

More Like This query
除了可以輸入文字之外,還可以直接指定 document id 找出相似的結果
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html

對特定欄位加權
https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_boosting_query_clauses.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-boosting-query.html

Multi-index, Multi-type

除了可以搜尋單一 type,也可以跨 index、跨 type

  • /_search: Search all types in all indices
  • /gb/_search: Search all types in the gb index
  • /gb,us/_search: Search all types in the gb and us indices
  • /g*,u*/_search: Search all types in any indices beginning with g or beginning with u
  • /gb/user/_search: Search type user in the gb index
  • /gb,us/user,tweet/_search: Search types user and tweet in the gb and us indices
  • /_all/user,tweet/_search: Search types user and tweet in all indices

ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html#search-multi-index-type