MongoDB cookbook: indexes

MongoDB cookbook: indexes

Indexes are crucial for the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection.

Types of Indexes

ref:
https://docs.mongodb.com/manual/indexes/
https://docs.mongodb.com/manual/applications/indexes/

Single Field Index

For a single field index and sort operations, the sort order (i.e. ascending or descending) of the index key doesn't matter. With index intersetion, single field indexs could be powerful.

ref:
https://docs.mongodb.com/manual/core/index-single/

Compound Index

The order of the fields listed in a compound index is very important.

ref:
https://docs.mongodb.com/manual/core/index-compound/
https://docs.mongodb.com/manual/tutorial/create-indexes-to-support-queries/

Index Intersection

MongoDB can use multiple single field indexes to fulfill queries.

db.orders.createIndex({item: 1})
db.orders.createIndex({qty: 1})

db.orders.find({item: 'abc123', qty: {$gt: 15}})

ref:
https://docs.mongodb.com/manual/core/index-intersection/

Sort with Indexes

ref:
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/

Covered Queries

ref:
https://docs.mongodb.com/manual/core/query-optimization/#read-operations-covered-query

List Indexes

db.message.getIndexes()

// show collection statistics
db.message.stats()
db.message.stats().indexSizes

ref:
https://docs.mongodb.com/manual/tutorial/manage-indexes/

Add Index

The order of fields in an index matters, you must consider Index Cardinality and Selectivity. Instead, the order of fields in a find() query doesn't affect whether it can use an index or not.

ref:
https://stackoverflow.com/questions/5245737/mongodb-indexes-order-and-query-order-must-match

An index which contains array fields might consume a lot of disk space.

db.message.createIndex({
    '_cls': 1,
    'sender': 1,
    'posted_at': 1
}, {'background': true, 'sparse': true})

db.message.createIndex({
    '_cls': 1,
    'includes': 1,
    'posted_at': 1
}, {'background': true, 'sparse': true})

db.getCollection('message').find({
    '$or': [
        // sent by cp
        {
            '_cls': 'Message.ChatMessage',
            'sender': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        },
        // sent by payer
        {
            '_cls': 'Message.GiftMessage',
            'includes': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        }
    ]
})

You can't index two arrays together, in this example: includes and unlocks.

// it doesn't work
db.message.createIndex({
    '_cls': 1,
    'sender': 1,
    'includes': 1,
    'unlocks': 1
}, {'background': true, 'sparse': true})

The Order of Fields

The order of fields in a compound index should be:

  • First, fields on which you will query for exact values.
  • Second, fields on which you will sort.
  • Finally, fields on which you will query for a range of values.

ref:
https://blog.mlab.com/2012/06/cardinal-ins/
https://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
https://stackoverflow.com/questions/33545339/how-does-the-order-of-compound-indexes-matter-in-mongodb-performance-wise

Partial v.s. Sparse Indexes

Partial indexes should be preferred over sparse indexes. However, partial indexes only support a very small set of filter operators.

ref:
https://docs.mongodb.com/manual/core/index-partial/
https://docs.mongodb.com/manual/core/index-sparse/

Drop Index

db.message.dropIndex({
    'includes': 1
})

db.message.dropIndex({
    '_cls': 1,
    'posted_at': 1,
    'includes': 1
})

Remove Unused Indexes

You can use db.getCollection('message').aggregate({$indexStats: {}}) to find unused indexes, there is a accesses.ops which means the number of operations that have used the index. Also, you should remove indexes which have the same prefix.

db.getCollection('message').aggregate(
    {
        '$indexStats': {}
    },
    {
        '$match': {
            'accesses.ops': {'$gt': 0}
        }
    }
)

Result:

{
    "name" : "_cls_1_sender_1_posted_at_1",
    "key" : {
        "_cls" : 1,
        "sender" : 1,
        "posted_at" : 1
    },
    "host" : "a6ea11893605:27017",
    "accesses" : {
        "ops" : 3,
        "since" : "2018-01-26T07:04:51.137Z"
    }
}

ref:
https://blog.mlab.com/2017/01/using-mongodb-indexstats-to-identify-and-remove-unused-indexes/
https://scalegrid.io/blog/how-to-find-unused-indexes-in-mongodb/

Profiling

// enable
db.setProfilingLevel(2)

// disable
db.setProfilingLevel(0)

// see profiling data after you issues some queries
db.system.profile.find().limit(10).sort( { ts : -1 } ).pretty()

// delete profiling data
db.system.profile.drop()

Query Explain

There are both collection.find().explain() and collection.explain().find(). It's recommended to use collection.find().explain('executionStats') for getting more information, like total documents examined.

db.getCollection('message').find({
    '$or': [
        // sent by cp
        {
            '_cls': 'Message.ChatMessage',
            'sender': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        },
        {
            '_cls': 'Message',
            'sender': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        },
        // sent by payer
        {
            '_cls': 'Message.ChatMessage',
            'includes': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        },
        {
            '_cls': 'Message.ReplyMessage',
            'includes': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        },
        {
            '_cls': 'Message.GiftMessage',
            'includes': ObjectId('582ee32a5b9c861c87dc297e'),
            'posted_at': {
                '$gte': ISODate('2018-01-08T00:00:00.000Z'),
                '$lt': ISODate('2018-01-14T00:00:00.000Z')
            }
        }
    ]
})
// .explain()
// .explain('allPlansExecution')
.explain('executionStats')

ref:
https://docs.mongodb.com/manual/reference/method/cursor.explain/
https://docs.mongodb.com/manual/reference/method/db.collection.explain/#db.collection.explain

Stages are descriptive of the operation; e.g.

  • COLLSCAN: scanning the entire collection
  • IXSCAN: scanning index keys
  • FETCH: retrieving documents
  • SHARD_MERGE: merging results from shards

Some important fields to look at in the result of explain():

  • scanAndOrder: sorting documents in memory ins

ref:
https://docs.mongodb.com/manual/reference/explain-results/

Aggregation Explain

db.getCollection('message').explain().aggregate()

ref:
https://stackoverflow.com/questions/12702080/mongodb-explain-for-aggregation-framework

Speed up Python and Node.js builds on Travis CI

Speed up Python and Node.js builds on Travis CI

Travis CI's caching archives all directories listed in the configuration and uploads them to Amazon S3. Cached contents are available to any build on the repository, including Pull Requests. For Python and Node.js projects, you could cache both site-packages and node_modules directories in every Travis CI build.

Here is an example of .travis.yml:

sudo: false

language: python

python:
  - "2.7"

node_js: 4

cache:
  directories:
    - $HOME/.cache/pip
    - $HOME/virtualenv/python2.7.9/lib/python2.7/site-packages
    - node_modules

before_install:
  - pip install -U pip

install:
  - pip install -r requirements.txt
  - pip install coverage --ignore-installed
  - npm install

script:
  - coverage run manage.py test

In the case of mine, after applying these changes, the installation time of pip and npm reduces from 180 seconds to 5 seconds.

One thing should be mentioned here: Since we didn't specify any bin folder in the configuration (and I don't think that's necessary), any execution file that being installed by pip such as coverage or django-admin.py will not exist in subsequent builds. If you need those commands, you could just force install them by adding pip install some_package --ignore-installed.

References:

Caching Dependencies and Directories
https://docs.travis-ci.com/user/caching/

How to cache requirements for a Django project on Travis-CI?
http://stackoverflow.com/questions/19422229/how-to-cache-requirements-for-a-django-project-on-travis-ci

如何在 Travis CI 加快 Python 單元測試速度
https://tzangms.com/how-to-speed-up-python-unit-test-on-travis-ci/

Integrate with webpages using CasperJS (built on top of PhantomJS)

Integrate with webpages using CasperJS (built on top of PhantomJS)

PhantomJS is a headless and scriptable WebKit runtime (aka browser) with JavaScript API.

Usage

in script.js

Login and delete spare movie tags on Douban.

var casper = require('casper').create({
  pageSettings: {
    loadImages: true,
    loadPlugins: false
  },
  logLevel: 'debug',
  verbose: true
});

// save session cookies
var fs = require('fs');
var page = require('webpage').create();

var cookieFile = 'cookies.json';

var saveSessionCookie = function() {
  try {
    fs.statSync(cookieFile);
  } catch (e) {
    fs.write(cookieFile, JSON.stringify(phantom.cookies), 'w');
  }
}

if (fs.isFile(cookieFile)) {
  Array.prototype.forEach.call(JSON.parse(fs.read(cookieFile)), function(x) {
    phantom.addCookie(x);
  });
}

// script
var loginUrl = 'https://accounts.douban.com/login';
var startUrl = 'https://movie.douban.com/people/vinta/all';

var tags_do_not_delete = [
  '丹麦', '新西兰', '新加坡', '以色列', '印度', '意大利', '瑞典', '墨西哥', '俄罗斯', '西班牙', '比利时'
];

casper.start(loginUrl, function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  this.capture('login.png');

  var data = {
    form_email: 'xxx',
    form_password: 'xxx'
  };

  // 可能會被豆瓣要求輸入驗證碼
  // 可以用 casperjs script.js --remote-debugger-port=9000
  // 先打開 login.png 看驗證碼是什麼
  // 到 http://127.0.0.1:9000/ 的 console 手動輸入驗證碼
  // data['captcha-solution'] = '123';

  this.waitForSelector('form#lzform');
  this.fill('form#lzform', data, true);
});

casper.then(function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  saveSessionCookie();

  this.capture('all.png');

  this.open(startUrl).then(function() {
    this.waitForSelector('#open_tags', function() {
      this.click('#open_tags');
    });

    this.waitWhileSelector('#open_tags');
  });
});

casper.then(function() {
  this.echo(this.getCurrentUrl());
  this.echo(this.getTitle());

  var links = this.evaluate(function() {
    var tagList = document.querySelectorAll('ul.tag-list li a');
    var theLinks = Array.prototype.map.call(tagList, function(elem) {
        return {
          tag: elem.textContent.trim(),
          href: elem.getAttribute('href'),
          count: parseInt(elem.nextElementSibling.textContent, 10)
        };
    });

    return theLinks;
  });

  var filteredLinks = links.filter(function(link) {
    if (link.count < 5 && tags_do_not_delete.indexOf(link.tag) == -1) {
      return true;
    }
    return false;
  });

  this.each(filteredLinks, function(self, link) {
    this.echo(link.tag + ', ' + link.count);

    self.thenOpen(link.href, function() {
      this.echo(this.getCurrentUrl());
      this.echo(this.getTitle());

      this.waitForSelector('#tag-del', function() {
        this.click('#tag-del');
      });

      this.waitForSelector('input[name="del_submit"]', function() {
        this.click('input[name="del_submit"]');
      });
    });
  });
});

casper.run();

To evaluate JavaScript code in the context of the webpage, you must use evaluate() function. The context is a sandbox.

ref:
http://docs.casperjs.org/en/latest/modules/index.html

ref:
https://github.com/vinta/playground/blob/master/casperjs/script.js

Save session cookies

--cookies-file=xxx.txt only store non-session cookies (which remain your logged-in or authenticated status). You have to save every cookie manually.

var casper = require('casper').create();

// save session cookies
var fs = require('fs');
var page = require('webpage').create();

var cookieFile = 'cookies.json';

var saveSessionCookie = function() {
  try {
    fs.statSync(cookieFile);
  } catch (e) {
    fs.write(cookieFile, JSON.stringify(phantom.cookies), 'w');
  }
}

if (fs.isFile(cookieFile)) {
  Array.prototype.forEach.call(JSON.parse(fs.read(cookieFile)), function(x) {
    phantom.addCookie(x);
  });
}

casper.start('yourUrl', function() {
  // do your shit
});

ref:
http://stackoverflow.com/questions/18739354/how-can-i-use-persisted-cookies-from-a-file-using-phantomjs

Run

$ docker run --rm -v `pwd`:/data vinta/casperjs:1.1.3 script.js

# or

$ brew install casperjs
$ casperjs script.js --disk-cache=true

ref:
https://hub.docker.com/r/vinta/casperjs/
https://hub.docker.com/r/zopanix/casperjs/

ref:
http://phantomjs.org/api/command-line.html

Run in debugging mode

$ casperjs script.js --remote-debugger-port=9000
$ open http://127.0.0.1:9000/
  • Click the first link (something like "file:///usr/local/Cellar/xxx").
  • In Sources tab, press "Enable Debugging" button.
  • In Console tab, type "__run();" to start.
  • Once breakpoints worked, you could go to Console tab to debug.

ref:
http://phantomjs.org/troubleshooting.html

for loop in JavaScript

有 index 的 for loop

for (var i = 0; i < 10; i++) {
   console.log(i);
}

for in 用在 object

不能用在 array

var obj = {a:1, b:2, c:3};

for (var prop in obj) {
  console.log("obj." + prop + " = " + obj[prop]);
}

ref:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...in

for of 用在 array(或 iterator 物件)

ES6 才有的語法

let iterable = [10, 20, 30];

for (const value of iterable) {
  console.log(value);
}

不然也可以用 Array.forEach()

[2, 5, , 9].forEach(function (element, index, array) {
  console.log('a[' + index + '] = ' + element);
});

ref:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...of

ES6 Promise 筆記

new Promise() 接受一個 function 做為初始化參數
這個 function 又接受兩個參數 resolvereject
它們都是 function
resolve() 的作用是把 Promise object 的狀態從 pending 改成 resolved / fulfilled(成功)
reject() 則是把狀態從 pending 改成 rejected(失敗)

Promise object 產生之後
可以用 Promise object 的 then() 來指定 resolved 狀態的 callback function
然後 catch() 來指定 rejected 狀態的 callback function

每個 then()catch() 都會 return 一個新的 Promise objecct
如果你在 then() 裡 return 的不是 Promise object
它會隱式地用 Promise.resolve() 幫你轉換

const yourPromiseFunc = function (params) {
  return new Promise((resolve, reject) => {
    doYourAsyncShit(params, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(data);
      }
    });
  });
};

yourPromiseFunc('some parameter')
.then((data) => {
  console.log('success', data);
})
.catch((err) => {
  console.log('fail', data);
});

嚴格來說你只能在 then() 裡做三件事:

  • return 另外一個 Promise object
  • return 一個 synchronous value(字串、數字或其他 object)
  • throw 一個 Error()

如果你沒有顯式地 return 的話,JavaScript 會自動幫你 return undefined;

ref:
http://www.html5rocks.com/zh/tutorials/es6/promises/
http://es6.ruanyifeng.com/#docs/promise#基本用法

使用 reject() 而不是 throw

在 Promise 裡
如果可以用 reject(new Error('your error message')) 就用
不要用 throw new Error('your error message');

基本上就是用 reject() 來表示我們有意識地拋出的錯誤

ref:
http://liubin.org/promises-book/#not-throw-use-reject

把任意 object 轉換成 Promise object

Promise.resolve(xxx); 就是把 xxx 包裝成 Promise object
然後 resolve() 它
如果 xxx 已經是 Promise object 了,則會 clone 一個新的

Promise.resolve(42);

# equals to

new Promise((resolve) => {
  resolve(42);
});

因為所有 Promise 操作都一定是 async 的(這是規格裡規定的)
所以就算是 Promise.resolve(42); 的 42 也不會馬上被執行到

var promise = new Promise(function (resolve) {
    console.log("inner promise"); // 執行順序 1
    resolve(42);
});

promise.then(function(value) {
    console.log(value); // 執行順序 3
});

console.log("outer promise"); // 執行順序 2

ref:
http://liubin.org/promises-book/#chapter2-how-to-write-promise

Promise chains

一律使用 then().catch() 的方式分別指定 resolved 和 rejected 的 callback functions
通常會在 promise chains 的最後放一個 catch()

因為每個 then() 執行完都會 return 一個新的 Promise object(注意!是新的 Promise object,不是你最一開始 new 出來的那個)
所以你可以一直用很多個 then() 串起來
你也可以在 then() 裡 return 某個值作為下一個 then() 的參數

get('story.json')
.then(function(response) {
  return JSON.parse(response);
})
.then(function(data) {
  console.log(data);
});

ref:
http://liubin.org/promises-book/#then-return-new-promise

// 會照順序由上往下執行(除了 .catch() 之外)
Promise.resolve()
  .then(functionA)
  .then(functionB)
  .then(functionC)
  .catch(errorHandler)
  .then(finalFunction);

errorHandler 只能 catch 到 functionA、functionB 和 functionC 中拋出的錯誤

你可以在 then() 裡 return 某個值
他會被包裝成 Promise object(透過 Promise.resolve(某個值))然後傳給下一個 then()

如果你想在 functionC 裡同時使用 funtionA 和 functionB 的結果
你可以這麼寫

firstThingAsync()
  .then(function(result1) {
    return Promise.all([result1, secondThingAsync(result1)]);
  })
  .then(function(results) {
    // do something with results array: results[0], results[1]
  })
  .catch((err) => {
    doErrorHandling();
  });

Anti-patterns

// 這種寫法會造成你的 badAsyncCall() 得不到 newVar 的返回值
function badAsyncCall() {
    var promise = Promise.resolve();
    promise.then(function() {
        return newVar;
    });

    return promise;
}

// 應該要寫成
function badAsyncCall() {
    var promise = Promise.resolve();
    return promise.then(function() {
        return newVar;
    });
}

ref:
https://pouchdb.com/2015/05/18/we-have-a-problem-with-promises.html
http://www.datchley.name/promise-patterns-anti-patterns/

等到所有 promises 都執行完才執行某個動作

你可以用 Promise.all()

Promise.all([promise1, promise2, promise3])
.then((results) => {
  // 這裡會在 promise1, promise2, promise3 的狀態都是 fulfilled 時執行
  // results 的順序跟 .all() 的順序一定會是一致的
  // results[0] 就是 promise1 的回傳值,results[1] 則是 promise2,以此類推
  // 你也可以寫成 .then(([data1, data2, data3]) => {}),不過 Node.js v4.3 還不支援這個語法就是了
})
.catch((err) => {
  // 這裡會在任一個 promise 變成 rejected 時執行
});

ref:
http://www.datchley.name/es6-promises/
https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Promise/all
http://liubin.org/promises-book/#ch2-promise-all