MongoDB cookbook: indexes

MongoDB cookbook: indexes

Indexes are crucial for the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection.

Types of Indexes

ref:
https://docs.mongodb.com/manual/indexes/
https://docs.mongodb.com/manual/applications/indexes/

Single Field Index

For a single field index and sort operations, the sort order (i.e. ascending or descending) of the index key doesn't matter. With index intersetion, single field indexs could be powerful.

ref:
https://docs.mongodb.com/manual/core/index-single/

Compound Index

The order of the fields listed in a compound index is very important.

ref:
https://docs.mongodb.com/manual/core/index-compound/
https://docs.mongodb.com/manual/tutorial/create-indexes-to-support-queries/

Index Intersection

MongoDB can use multiple single field indexes to fulfill queries.

db.orders.createIndex({item: 1})
db.orders.createIndex({qty: 1})

db.orders.find({item: 'abc123', qty: {$gt: 15}})

ref:
https://docs.mongodb.com/manual/core/index-intersection/

Sort with Indexes

ref:
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/

Covered Queries

ref:
https://docs.mongodb.com/manual/core/query-optimization/#read-operations-covered-query

List Indexes

db.message.getIndexes()

// show collection statistics
db.message.stats()
db.message.stats().indexSizes

ref:
https://docs.mongodb.com/manual/tutorial/manage-indexes/

Add Index

The order of fields in an index matters, you must consider Index Cardinality and Selectivity. Instead, the order of fields in a find() query doesn't affect whether it can use an index or not.

ref:
https://stackoverflow.com/questions/5245737/mongodb-indexes-order-and-query-order-must-match

An index which contains array fields might consume a lot of disk space.

db.message.createIndex({
    '_cls': 1,
    'sender': 1,
    'posted_at': 1
}, {'background': true, 'sparse': true})

db.message.createIndex({
    '_cls': 1,
    'includes': 1,
    'posted_at': 1
}, {'background': true, 'sparse': true})

db.getCollection('message').find({
    '$or': [
        // sent by cp
        {
            '_cls': 'Message.ChatMessage',
            'sender': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        },
        // sent by payer
        {
            '_cls': 'Message.GiftMessage',
            'includes': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        }
    ]
})

You can't index two arrays together, in this example: includes and unlocks.

// it doesn't work
db.message.createIndex({
    '_cls': 1,
    'sender': 1,
    'includes': 1,
    'unlocks': 1
}, {'background': true, 'sparse': true})

The Order of Fields

The order of fields in a compound index should be:

  • First, fields on which you will query for exact values.
  • Second, fields on which you will sort.
  • Finally, fields on which you will query for a range of values.

ref:
https://blog.mlab.com/2012/06/cardinal-ins/
https://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
https://stackoverflow.com/questions/33545339/how-does-the-order-of-compound-indexes-matter-in-mongodb-performance-wise

Partial v.s. Sparse Indexes

Partial indexes should be preferred over sparse indexes. However, partial indexes only support a very small set of filter operators.

ref:
https://docs.mongodb.com/manual/core/index-partial/
https://docs.mongodb.com/manual/core/index-sparse/

Drop Index

db.message.dropIndex({
    'includes': 1
})

db.message.dropIndex({
    '_cls': 1,
    'posted_at': 1,
    'includes': 1
})

Remove Unused Indexes

You can use db.getCollection('message').aggregate({$indexStats: {}}) to find unused indexes, there is a accesses.ops which means the number of operations that have used the index. Also, you should remove indexes which have the same prefix.

db.getCollection('message').aggregate(
    {
        '$indexStats': {}
    },
    {
        '$match': {
            'accesses.ops': {'$gt': 0}
        }
    }
)

Result:

{
    "name" : "_cls_1_sender_1_posted_at_1",
    "key" : {
        "_cls" : 1,
        "sender" : 1,
        "posted_at" : 1
    },
    "host" : "a6ea11893605:27017",
    "accesses" : {
        "ops" : 3,
        "since" : "2018-01-26T07:04:51.137Z"
    }
}

ref:
https://blog.mlab.com/2017/01/using-mongodb-indexstats-to-identify-and-remove-unused-indexes/
https://scalegrid.io/blog/how-to-find-unused-indexes-in-mongodb/

Profiling

// enable
db.setProfilingLevel(2)

// disable
db.setProfilingLevel(0)

// see profiling data after you issues some queries
db.system.profile.find().limit(10).sort( { ts : -1 } ).pretty()

// delete profiling data
db.system.profile.drop()

Query Explain

There are both collection.find().explain() and collection.explain().find(). It's recommended to use collection.find().explain('executionStats') for getting more information, like total documents examined.

db.getCollection('message').find({
    '$or': [
        // sent by cp
        {
            '_cls': 'Message.ChatMessage',
            'sender': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        },
        {
            '_cls': 'Message',
            'sender': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        },
        // sent by payer
        {
            '_cls': 'Message.ChatMessage',
            'includes': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        },
        {
            '_cls': 'Message.ReplyMessage',
            'includes': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        },
        {
            '_cls': 'Message.GiftMessage',
            'includes': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        }
    ]
})
// .explain()
// .explain('allPlansExecution')
.explain('executionStats')

ref:
https://docs.mongodb.com/manual/reference/method/cursor.explain/
https://docs.mongodb.com/manual/reference/method/db.collection.explain/#db.collection.explain

Stages are descriptive of the operation; e.g.

  • COLLSCAN: scanning the entire collection
  • IXSCAN: scanning index keys
  • FETCH: retrieving documents
  • SHARD_MERGE: merging results from shards

Some important fields to look at in the result of explain():

  • scanAndOrder: sorting documents in memory ins

ref:
https://docs.mongodb.com/manual/reference/explain-results/

Aggregation Explain

db.getCollection('message').explain().aggregate()

ref:
https://stackoverflow.com/questions/12702080/mongodb-explain-for-aggregation-framework

MongoDB cookbook: queries

MongoDB cookbook: queries

MongoDB Shell in JavaScript

Check If a Document Exists

It is significantly faster to use find() + limit() because findOne() will always read + return the document if it exists. find() just returns a cursor (or not) and only reads the data if you iterate through the cursor.

db.getCollection('message').find({_id: ObjectId("585836504b287b5022a3ae26", delivered: false)}, {_id: 1}).limit(1)

ref:
https://stackoverflow.com/questions/8389811/how-to-query-mongodb-to-test-if-an-item-exists
https://blog.serverdensity.com/checking-if-a-document-exists-mongodb-slow-findone-vs-find/

Find Documents

db.getCollection('user').find({username: 'nanababy520'})

db.getCollection('message').find({_id: ObjectId("5a6383b8d93d7a3fadf75af3")})
db.getCollection('message').find({sender: ObjectId("57aace67ac08e72acc3b265f"), pricing: {$ne: 0}})
db.getCollection('message').find({_cls: 'Message'}).sort({posted_at: -1})

Find Documents with Regular Expression

db.getCollection('user').find({'username': /vicky/})

ref:
https://docs.mongodb.com/manual/reference/operator/query/regex/

Find Documents with an Array Field

db.getCollection('message').find({includes: ObjectId("5a4bb448af9c462c610d0cc7")})

db.getCollection('user').find({gender: 'F', tags: 'promoted'})
db.getCollection('user').find({gender: 'F', 'tags.1': {$exists: true}})

Find Documents Where a Field Has Value

db.test.insert({"num": 1, "check": "check value"})
db.test.insert({"num": 2, "check": null})
db.test.insert({"num": 3})

db.test.find({"check":{$exists: true}})
// return 1 and 2

db.test.find({"check": {$ne: null}});
// return 1

db.test.find({"check": null})
// return 2 and 3

ref:
https://stackoverflow.com/questions/4057196/how-do-you-query-this-in-mongo-is-not-null

Find Documents Where an Array Field Is Not Empty

db.getCollection('message').find({unlocks: {$exists: true}})
db.getCollection('user').find({inbox: {$exists: true, $not: {$size: 0}}})

ref:
https://stackoverflow.com/questions/14789684/find-mongodb-records-where-array-field-is-not-empty

Find Documents Where an Array Field Doesn't Contain a Certain Value

db.getCollection('user').update({_id: ObjectId("579994ac61ff217f96a585d9"), tags: {$ne: 'tag_to_add'}}, {$push: {tags: 'tag_to_add'}})

db.getCollection('user').update({_id: ObjectId("579994ac61ff217f96a585d9"), tags: {$nin: ['tag_to_add']}}, {$push: {tags: 'tag_to_add'}})

ref:
https://stackoverflow.com/questions/16221599/find-documents-with-arrays-not-containing-a-document-with-a-particular-field-val

Find Documents Where an Array Field's Size Is Greater Than 1

db.getCollection('message').find({_cls: 'Message.ChatMessage', sender: ObjectId("582ee32a5b9c861c87dc297e"), 'unlocks': {$exists: true, $not: {$size: 0}}})

ref:
https://stackoverflow.com/questions/7811163/query-for-documents-where-array-size-is-greater-than-1/15224544#15224544

Insert

db.getCollection('feature.launch').insert({
    'url': '//asia.public.swag.live/launchs/5a06b88aaf9c462c6146ce12.jpg',
    'user': {
        'id': ObjectId("5a06b88aaf9c462c6146ce12"),
        'username': 'luke0804',
        'tags': ["gender:male"]
    }
})

db.getCollection('feature.launch').insert({
    'url': '//asia.public.swag.live/launchs/57c16f5bb811055b66d8ef46.jpg',
    'user': {
        'id': ObjectId("57c16f5bb811055b66d8ef46"),
        'username': 'riva',
        'tags': ["gender:female"]
    }
})

MongoEngine in Python

ref:
http://docs.mongoengine.org/guide/querying.html

Check If a Document Exists

now = datetime.datetime.now(datetime.timezone.utc)
if TagSchedule.objects.no_dereference().only('id').filter(user=user_id, tag=tag, started_at__gt=now):
    return 'exists'

Update an Array Field

user_id = '582ee32a5b9c861c87dc297e'
tag = 'my_tag'

# add
User.objects.filter(id=user_id, tags__ne=tag).update(push__tags=tag)

# remove
User.objects.filter(id=user_id).update(pull__tags=tag)

ref:
http://docs.mongoengine.org/guide/querying.html#atomic-updates

Upsert

You must use upsert=true with uniquely indexed fields.

tag_schedule = TagSchedule.objects \
    .filter(user=user_id, tag=tag) \
    .modify(
        started_at=started_at,
        ended_at=ended_at,
        upsert=True
    )

ref:
https://docs.mongodb.com/manual/reference/method/db.collection.update/#update-with-unique-indexes

Two-phase Commit

ref:
https://docs.mongodb.com/manual/tutorial/perform-two-phase-commits/

Run a Celery task at a specific time

Run a Celery task at a specific time

Schedule Tasks

You are able to run any Celery task at a specific time through eta (means "Estimated Time of Arrival") parameter.

import datetime

import celery

@celery.shared_task(bind=True)
def add_tag(task, user_id, tag):
    User.objects.filter(id=user_id, tags__ne=tag).update(push__tags=tag)
    return True

user_id = '582ee32a5b9c861c87dc297e'
tag = 'new_tag'
started_at = datetime.datetime(2018, 3, 12, tzinfo=datetime.timezone.utc)
add_tag.apply_async((user_id, tag), eta=started_at)

ref:
http://docs.celeryproject.org/en/master/userguide/calling.html#eta-and-countdown

Revoke Tasks

Revoked tasks will be discarded until their eta.

from celery.result import AsyncResult

AsyncResult(task_id).revoke()

ref:
http://docs.celeryproject.org/en/latest/reference/celery.result.html#celery.result.AsyncResult.revoke

Revoking tasks works by sending a broadcast message to all the workers, the workers then keep a list of revoked tasks in memory. When a worker starts up it will synchronize revoked tasks with other workers in the cluster.

The list of revoked tasks is in-memory so if all workers restart the list of revoked ids will also vanish. If you want to preserve this list between restarts you need to specify a file for these to be stored in by using the –statedb argument to celery worker.

ref:
http://docs.celeryproject.org/en/latest/userguide/workers.html#worker-persistent-revokes