Elasticsearch is a schemaless, document-oriented search engine with a rich set of querying APIs. It also works well as a NoSQL document store.
ref:
https://www.elastic.co/products/elasticsearch
Glossary
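The definitions below follow Elasticsearch 5.6 and may differ from those of older versions. In newer versions of Elasticsearch (indices created in 6.0 and later), each index can have only one mapping type; earlier versions allowed several.
- cluster: a cluster consists of one or more nodes and automatically elects a master node
- node: a running instance of Elasticsearch is a node, usually one per machine
- index: similar to a table in a relational database
- mapping: similar to a table schema in a relational database
- field: similar to a column in a relational database
- document: similar to a row in a relational database
- text: arbitrary unstructured text; text is analyzed into terms before it can be searched
- term: the exact value that is actually stored in the Elasticsearch index
- analysis: the process of turning text into terms, for example normalization, tokenization, and stopword removal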
ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html
Mapping
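A mapping can be specified explicitly. If it is not, Elasticsearch creates one automatically from the data the first time a document is indexed, so some field properties (the analyzer, for example) may not be what you expect. It is better to define the mapping by hand.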
Commonly used field data types:
- text: a string type, used for full-text search
- keyword: a string type, used to filter, sort, or aggregate on exact values (the not_analyzed string of older versions)
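In Elasticsearch, the same field can be indexed as several data types. For example, a location field can use the fields property to be indexed as both text and keyword, one for full-text search and the other for exact-value filtering. Sub-fields of type text can also each use a different analyzer.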
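As a sketch of the multi-field setup described above, the mapping body below indexes location as text with a keyword sub-field. The mapping type name doc and the sub-field name raw are assumptions chosen for illustration; Elasticsearch 5.6 still expects a type name under mappings.

```python
import json

# Hypothetical names: the mapping type "doc" and the sub-field "raw"
# are illustrative choices, not taken from the notes above.
mapping_body = {
    "mappings": {
        "doc": {
            "properties": {
                "location": {
                    "type": "text",  # analyzed, for full-text search
                    "fields": {
                        # exact value, for filter / sort / aggregate
                        "raw": {"type": "keyword"}
                    },
                }
            }
        }
    }
}

# Body to send with: PUT http://127.0.0.1:9200/repo
print(json.dumps(mapping_body, indent=2))
```

With a mapping like this, full-text queries would target location and exact-value filters would target location.raw.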
ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html
Analysis
An analyzer is made of three parts, applied in this order:
- Character filters (zero or more)
- Tokenizer (exactly one)
- Token filters (zero or more)
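The three stages can be imitated in plain Python to see how they chain together. This is only a rough stand-in, not how Elasticsearch implements them: the regular expressions below are simplifications, and the function names are made up for this sketch.

```python
import re
import unicodedata


def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split into runs of word characters (a crude stand-in for 'standard')."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def ascii_fold(tokens):
    """Token filter: strip diacritics by decomposing and dropping combining marks."""
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def analyze(text):
    """Run character filter -> tokenizer -> token filters, in that order."""
    text = strip_html(text)
    tokens = tokenize(text)
    for token_filter in (lowercase, ascii_fold):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>Is this déja vu?</p>"))
```

The order matters: character filters see the raw string, the tokenizer produces the tokens, and token filters only ever see tokens.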
You can compose your own analyzer. With elasticsearch-dsl-py, for example:
from elasticsearch_dsl import DocType, Date, Integer, Keyword, Text, Boolean
from elasticsearch_dsl import analyzer, tokenizer

# General-purpose analyzer: strip HTML, split on word boundaries, then normalize.
text_analyzer = analyzer('text_analyzer',
    char_filter=["html_strip"],
    tokenizer="standard",
    filter=["asciifolding", "lowercase", "snowball", "stop"]
)

# CJK-oriented analyzer: overlapping 2- and 3-character grams instead of word tokens.
cjk_analyzer = analyzer('cjk_analyzer',
    char_filter=["html_strip"],
    tokenizer=tokenizer('trigram', 'nGram', min_gram=2, max_gram=3),
    filter=["asciifolding", "lowercase", "snowball", "stop"]
)
ref:
http://elasticsearch-dsl.readthedocs.io/en/latest/persistence.html
Testing Analyzer
To test what an analyzer does to a piece of text:
POST http://127.0.0.1:9200/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Is this déja vu?"
}
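The same request can be built from Python with only the standard library. This is a minimal sketch: build_analyze_request and analyze_tokens are hypothetical helper names, and the second one only works with a node reachable at the given address.

```python
import json
import urllib.request


def build_analyze_request(text, tokenizer="standard",
                          filters=("lowercase", "asciifolding"),
                          base_url="http://127.0.0.1:9200"):
    """Build (but do not send) a POST request for the _analyze API."""
    body = {"tokenizer": tokenizer, "filter": list(filters), "text": text}
    return urllib.request.Request(
        url=base_url + "/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_tokens(request):
    """Send the request and return the token strings; needs a running node."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [t["token"] for t in payload["tokens"]]


request = build_analyze_request("Is this déja vu?")
print(request.get_method(), request.full_url)
```

Calling analyze_tokens(request) against a live node returns the list of tokens the chain produced.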
ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html
Chinese Words Segmentation
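Segmentation plugins such as ik do not give very good results; the built-in cjk analyzer combined with NGram may be a better choice (a multi-field index works here). Another approach is to segment the text before feeding it into Elasticsearch, for example with Jieba: join the segmented tokens with spaces, then use Elasticsearch's whitespace tokenizer.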
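Both approaches can be illustrated in plain Python. The sketch below is an approximation, not Elasticsearch code: char_ngrams imitates what an n-gram tokenizer with min_gram=2 and max_gram=3 would emit (ignoring options such as token_chars), and presegment takes any segmenter function, which is an assumption made so the example runs without Jieba installed.

```python
def char_ngrams(text, min_gram=2, max_gram=3):
    """Return overlapping character n-grams, ordered by start position, then length."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def presegment(text, segment):
    """Join the tokens produced by `segment` with spaces, ready for a whitespace tokenizer."""
    return " ".join(token for token in segment(text) if token.strip())


print(char_ngrams("中文搜尋"))

# With Jieba installed, the segmenter could be jieba.lcut:
#   import jieba
#   presegment("我來到北京清華大學", jieba.lcut)
```

The n-gram route needs no dictionary but produces many more terms; pre-segmentation keeps the index smaller at the cost of depending on the segmenter's quality.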
Talk: 中文搜尋經驗分享 (Sharing Experience with Chinese Search)
https://blog.liang2.tw/2015Talk-Chinese-Search/
ref:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#cjk-analyzer
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html << not recommended
https://github.com/medcl/elasticsearch-analysis-ik << usable
RESTful APIs
Show useful information for humans
http://127.0.0.1:9200/_cat
https://www.elastic.co/guide/en/elasticsearch/reference/current/cat.html
List all indices and aliases
http://127.0.0.1:9200/_aliases
List mappings under an index
http://127.0.0.1:9200/repo/_mapping
List documents under an index
http://127.0.0.1:9200/repo/_search
Query DSL
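query is the main thing you are searching for; it determines the relevance score
filter is a precondition for the search; it only includes or excludes documents and does not affect the score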
https://www.elastic.co/guide/en/elasticsearch/guide/master/search-in-depth.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html
To query an exact value
use term
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
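A minimal sketch of a term query body. The field name language is hypothetical and is assumed to be mapped as keyword, since term compares against the stored term without analyzing the input.

```python
import json

# Hypothetical field: "language" is assumed to be a keyword field.
term_query = {
    "query": {
        "term": {"language": "python"}
    }
}

# Body to send with: POST http://127.0.0.1:9200/repo/_search
print(json.dumps(term_query))
```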
To run a full-text query
use match
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
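A minimal sketch of a match query body. The field name description is hypothetical and is assumed to be mapped as text; the query string goes through that field's analyzer before matching.

```python
import json

# Hypothetical field: "description" is assumed to be a text field.
match_query = {
    "query": {
        "match": {"description": "document-oriented search engine"}
    }
}

# Body to send with: POST http://127.0.0.1:9200/repo/_search
print(json.dumps(match_query))
```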
To query several fields at once
use multi_match
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html
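A minimal sketch of a multi_match query body. The field names title and description are hypothetical.

```python
import json

# Hypothetical fields: "title" and "description".
multi_match_query = {
    "query": {
        "multi_match": {
            "query": "search engine",
            "fields": ["title", "description"],
        }
    }
}

# Body to send with: POST http://127.0.0.1:9200/repo/_search
print(json.dumps(multi_match_query))
```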
To search with AND (must), OR (should), and NOT (must_not) conditions
use bool
https://www.elastic.co/guide/en/elasticsearch/guide/master/bool-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
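A minimal sketch of a bool query that uses all three clause types. The field names and values are hypothetical.

```python
import json

# Hypothetical fields: "description", "title", "language".
bool_query = {
    "query": {
        "bool": {
            "must": [{"match": {"description": "search"}}],        # AND
            "should": [{"match": {"title": "elasticsearch"}}],     # OR
            "must_not": [{"term": {"language": "php"}}],           # NOT
        }
    }
}

# Body to send with: POST http://127.0.0.1:9200/repo/_search
print(json.dumps(bool_query, indent=2))
```

Note that when a must clause is present, should clauses become optional by default and only raise the score of documents that match them.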
To combine a filter with a query
use bool with a filter clause (the filtered query was deprecated in 2.0 and removed in 5.0)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
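A minimal sketch of a query combined with a filter in the form Elasticsearch 5.x accepts, a bool query with a filter clause, since the filtered query no longer exists in 5.0 and later. The field names description and stars are hypothetical.

```python
import json

# Hypothetical fields: "description" (text) and "stars" (numeric).
filtered_search = {
    "query": {
        "bool": {
            # scored part: what the user is searching for
            "must": {"match": {"description": "search engine"}},
            # unscored part: a precondition every hit has to satisfy
            "filter": {"range": {"stars": {"gte": 100}}},
        }
    }
}

# Body to send with: POST http://127.0.0.1:9200/repo/_search
print(json.dumps(filtered_search, indent=2))
```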
More Like This query
Besides free text, you can also pass a document id directly to find similar results
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
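A minimal sketch of a more_like_this query that mixes both kinds of input. The field names, the type name doc, and the document id are hypothetical; the _type key reflects the 5.x request format.

```python
import json

# Hypothetical names: fields "title"/"description", type "doc", id "1".
mlt_query = {
    "query": {
        "more_like_this": {
            "fields": ["title", "description"],
            "like": [
                "document-oriented search engine",               # free text
                {"_index": "repo", "_type": "doc", "_id": "1"},  # an indexed document
            ],
            "min_term_freq": 1,
            "max_query_terms": 12,
        }
    }
}

# Body to send with: POST http://127.0.0.1:9200/repo/_search
print(json.dumps(mlt_query, indent=2))
```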
Boosting specific fields
https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_boosting_query_clauses.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-boosting-query.html
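Two common ways to weight one field more than another, sketched with hypothetical field names: the ^ suffix in a multi_match field list, and the boost parameter on an individual clause. The factor 3 is an arbitrary choice.

```python
import json

# Hypothetical fields: "title" and "description"; the factor 3 is arbitrary.
boosted_multi_match = {
    "query": {
        "multi_match": {
            "query": "search engine",
            "fields": ["title^3", "description"],  # title counts three times as much
        }
    }
}

boosted_clauses = {
    "query": {
        "bool": {
            "should": [
                {"match": {"title": {"query": "search engine", "boost": 3}}},
                {"match": {"description": "search engine"}},
            ]
        }
    }
}

print(json.dumps(boosted_multi_match))
print(json.dumps(boosted_clauses))
```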
Multi-index, Multi-type
Besides searching a single type, you can also search across indices and across types
- /_search: Search all types in all indices
- /gb/_search: Search all types in the gb index
- /gb,us/_search: Search all types in the gb and us indices
- /g*,u*/_search: Search all types in any indices beginning with g or beginning with u
- /gb/user/_search: Search type user in the gb index
- /gb,us/user,tweet/_search: Search types user and tweet in the gb and us indices
- /_all/user,tweet/_search: Search types user and tweet in all indices
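The path patterns above can be generated by a small helper. search_path is a hypothetical function written for this sketch; it follows the URL scheme with types that these notes describe.

```python
def search_path(indices=None, types=None):
    """Build a _search path from optional lists of index and type names."""
    if types and not indices:
        # A type segment must follow an index segment, so fall back to _all.
        indices = ["_all"]
    parts = []
    if indices:
        parts.append(",".join(indices))
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


print(search_path())
print(search_path(["gb", "us"], ["user", "tweet"]))
print(search_path(types=["user", "tweet"]))
```

Index names may contain wildcards such as g*, which the helper passes through unchanged.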