python

Calculate the similarity of two vectors

2017-02-012019-10-22VintaMachine Learning, Python

scipy.spatial.distance
https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

sklearn.metrics
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Distance

Euclidean distance 歐幾里德距離

from sklearn.metrics.pairwise import euclidean_distances

euclidean_distances([0, 0, 0, 0], [0, 0, 0, 0])
# array([[ 0.]])

euclidean_distances([1, 0, 1, 0], [1, 0, 1, 0])
# array([[ 0.]])

euclidean_distances([0, 1, 0, 1], [1, 0, 1, 0])
# array([[ 2.]])

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html

Manhattan Distance 曼哈頓距離

from sklearn.metrics.pairwise import manhattan_distances

manhattan_distances([0, 0, 0, 0], [0, 0 , 0, 0])
# array([[ 0.]])

manhattan_distances([1, 1, 1, 0], [1, 0, 0, 0])
# array([[ 2.]])

manhattan_distances([0, 1, 0, 1], [1, 0, 1, 0])
# array([[ 4.]])

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.manhattan_distances.html

Similarity

Cosine similarity 餘弦相似度

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import cosine_distances
from sklearn.metrics.pairwise import pairwise_distances
from scipy.spatial.distance import pdist, squareform

cosine_similarity(matrix) == \
1 - cosine_distances(matrix) == \
1 - pairwise_distances(matrix, metric='cosine') == \
1 - squareform(pdist(matrix, 'cosine'))

cosine_similarity([0, 0, 0, 0], [0, 0, 0, 0])
# array([[ 0.]])

cosine_similarity([1, 0, 0, 0], [1, 0, 0, 0])
# array([[ 1.]])

cosine_similarity([1, 0, 1, 0], [0, 1, 0, 1])
# array([[ 0.]])

cosine_similarity([1, 0, 0, 1], [1, 0, 0, 0])
# array([[ 0.70710678]])

cosine_similarity([1, 0, 0, 1], [1, 0, 1, 0])
# array([[ 0.5]])

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html

Jaccard similarity coefficient score

from sklearn.metrics import jaccard_similarity_score

jaccard_similarity_score([0, 0, 0, 0], [0, 0, 0, 0])
# 1.0

jaccard_similarity_score([0, 0, 0, 0], [1, 0, 0, 0])
# 0.75

jaccard_similarity_score([1, 0, 0, 0], [1, 0, 0, 0])
# 1.0

jaccard_similarity_score([1, 0, 1, 0], [0, 1, 0, 1])
# 0.0

ref:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html

http://datascience.stackexchange.com/questions/5121/applications-and-differences-for-jaccard-similarity-and-cosine-similarity

Log-Likelihood similarity

TODO

Pearson correlation coefficient 皮爾森相關係數

It has a value between +1 and −1 inclusive, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. You should only calculate Pearson Correlations when the number of items in common between two users is > 1, preferably greater than 5/10. Only calculate the Pearson Correlation for two users where they have commonly rated items.

For hign-dimensional binary attributes, the performances of Pearson correlation coefficient and Cosine similarity
are better than Jaccard similarity coefficient score.

from scipy.stats import pearsonr

pearsonr([1, 0, 1, 1], [0, 0, 0, 0])
# (nan, 1.0)

pearsonr([1, 0, 1, 1], [1, 0, 0, 0])
# (0.33333333333333331, 0.66666666666666607)

pearsonr([1, 0, 1, 0], [0, 1, 0, 1])
# (-1.0, 0.0)

ref:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
http://stackoverflow.com/questions/11429604/how-is-nan-handled-in-pearson-correlation-user-user-similarity-matrix-in-a-recom

Dissimilarity

Dice dissimilarity

from scipy.spatial.distance import dice
import numpy as np

v1 = np.array([0, 0, 0, 0])
v2 = np.array([0, 0, 0, 0])

try:
    sim = 1.0 - dice(v1.astype(bool), v2.astype(bool))
except ZeroDivisionError:
    sim = 0

ref:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.dice.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.kulsinski.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.sokalsneath.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.yule.html

Parallel tasks in Python: concurrent.futures

2017-01-252019-10-22VintaPython

TL;DR: concurrent.futures is well suited to Embarrassingly Parallel tasks. You could write concurrent code with a simple for loop.

executor.map() runs the same function multiple times with different parameters and executor.submit() accepts any function with arbitrary parameters.

Install

concurrent.futures is part of the standard library in Python 3.2+. If you're using an older version of Python, you need to install the futures package.

$ pip install futures

ref:
https://docs.python.org/3/library/concurrent.futures.html

`executor.map()`

You should use the ProcessPoolExecutor for CPU intensive tasks and the ThreadPoolExecutor is suited for network operations or I/O. The ProcessPoolExecutor uses the multiprocessing module, which is not affected by GIL (Global Interpreter Lock) but also means that only picklable objects can be executed and returned.

In Python 3.5+, executor.map() receives an optional argument: chunksize. For very long iterables, using a large value for chunksize can significantly improve performance compared to the default size of 1. With ThreadPoolExecutor, chunksize has no effect.

from concurrent.futures import ThreadPoolExecutor
import time

import requests

def fetch(a):
    url = 'http://httpbin.org/get?a={0}'.format(a)
    r = requests.get(url)
    result = r.json()['args']
    return result

start = time.time()

# if max_workers is None or not given, it will default to the number of processors, multiplied by 5
with ThreadPoolExecutor(max_workers=None) as executor:
    for result in executor.map(fetch, range(42)):
        print('response: {0}'.format(result))

print('time: {0}'.format(time.time() - start))

You might want to change the value of max_workers to 1 and observe the difference.

ref:
https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures
https://www.blog.pythonlibrary.org/2016/08/03/python-3-concurrency-the-concurrent-futures-module/
http://masnun.com/2016/03/29/python-a-quick-introduction-to-the-concurrent-futures-module.html

`executor.submit()`

executor.submit() returns a Future object. A Future is basically an object that encapsulates an asynchronous execution of a function that will finish (or raise an exception) in the future.

The main difference between map and as_completed is that map returns the results in the order in which you pass iterables. On the other hand, the first result from the as_completed function is from whichever future completed first. Besides, iterating a map() returns results of futures; iterating a as_completed(futures) returns futures themselves.

from concurrent.futures import ThreadPoolExecutor, as_completed
import time

import requests

def fetch(url, timeout):
    r = requests.get(url, timeout=timeout)
    data = r.json()['args']
    return data

start = time.time()

with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {}
    for i in range(42):
        url = 'https://httpbin.org/get?i={0}'.format(i)
        future = executor.submit(fetch, url, 60)
        futures[future] = url

    for future in as_completed(futures):
        url = futures[future]
        try:
            data = future.result()
        except Exception as exc:
            print(exc)
        else:
            print('fetch {0}, get {1}'.format(url, data))

print('time: {0}'.format(time.time() - start))

ref:
https://docs.python.org/3/library/concurrent.futures.html#future-objects

Discussion

ref:
https://news.ycombinator.com/item?id=16737129

Create a custom migration in Django

2017-01-232019-10-22VintaDatabase, Python

# create an empty migration file
$ ./manage.py makemigrations --empty --name convert_to_utf8mb4 your_app

in your_app/migrations/0002_convert_to_utf8mb4.py

from __future__ import unicode_literals

from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [
        ('your_app', '0001_initial'),
    ]

    operations = [
        migrations.RunSQL(
            'ALTER TABLE app_repostarring CHANGE repo_description repo_description VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;'
        ),
    ]

ref:
https://docs.djangoproject.com/en/dev/ref/migration-operations/#runsql

Lazy evaluation in Django middlewares

2016-12-292019-10-22VintaPython, Web Development

Attach a lazy evaluated function as a property of request in a middleware.

from django.contrib.gis.geoip import GeoIP
from django.utils.functional import SimpleLazyObject

def get_country_code(request):
    g = GeoIP()
    location = g.country(request.META['REMOTE_ADDR'])
    country_code = location.get('country_code', 'TW')

    return country_code

class CountryAndSiteMiddleware(object):

    def process_request(self, request):
        request.COUNTRY_CODE = SimpleLazyObject(lambda: get_country_code(request))

Then you could use request.COUNTRY_CODE whenever you want.

Setup Jupyter and other Machine Learning tools on macOS

2016-11-152019-10-22VintaMachine Learning, Python

Jupyter Notebook is an interactive environment for running code in the browser. It allows you to create interactive documents that contain live code, rich text elements and visualizations. It's also a widely used tool for Data Scientists to make prototypes or demonstrations.

ref:
http://jupyter.org/

Install

$ brew install freetype gcc libffi libpng openssl pkg-config
$ pip install -U \
  cython \
  numpy \
  scipy \
  matplotlib \
  bokeh \
  seaborn \
  scikit-learn \
  surprise \
  gensim \
  nltk \
  pandas \
  jupyter

$ pip install jupyter_contrib_nbextensions && \
  jupyter contrib nbextension install --user

# install kernels for Python 2 and 3
$ python2 -m pip install ipykernel && \
  python2 -m ipykernel install --user

# list kernels
$ jupyter kernelspec list

# remove kernel
$ jupyter kernelspec uninstall apache_toree_scala

# start your notebook server
$ jupyter notebook
$ jupyter notebook --ip 0.0.0.0 --allow-root --no-browser

ref:
https://ipython.readthedocs.io/en/latest/install/kernel_install.html
https://jupyter.readthedocs.io/en/latest/running.html#running
https://github.com/ipython-contrib/jupyter_contrib_nbextensions

Or you could just download Anaconda and install it.
https://www.continuum.io/downloads#osx

Configuration

# ~/.ipython/profile_default/ipython_config.py    
c = get_config()

c.InteractiveShell.ast_node_interactivity = 'all'

# c.InteractiveShellApp.matplotlib = 'notebook'
c.InteractiveShellApp.matplotlib = 'inline'

Usage

Automatic module reload

%load_ext autoreload
%autoreload 2
import your_module

ref:
https://blog.3blades.io/jupyter-notebook-little-known-tricks-b0866a558017

Show media in notebook

# show image
from IPython.display import Image
Image('iris.png')

# show pdf
from IPython.display import IFrame
IFrame('iris.pdf', width='100%', height=700)

Django integration with django-extension

$ python manage.py shell_plus --notebook

ref:
https://stackoverflow.com/questions/35483328/how-do-i-set-up-jupyter-ipython-notebook-for-django