Pandas cookbook

Pandas is a Python library providing easy-to-use data structures and data analysis tools.

ref:
http://pandas.pydata.org/

Read a csv file

A DataFrame is a tabular data structure composed of rows and columns. You can also think of a DataFrame as a group of Series objects (the columns) that share an index (the row labels).
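A minimal sketch of that idea, using made-up data:

```python
import pandas as pd

# two Series that share the same row index
ages = pd.Series([22, 38, 26], index=['alice', 'bob', 'carol'])
fares = pd.Series([7.25, 71.28, 7.92], index=['alice', 'bob', 'carol'])

# the DataFrame is built from them; each column is still a Series
df = pd.DataFrame({'Age': ages, 'Fare': fares})

print(type(df['Age']))                    # a pandas Series
print(df['Age'].index.equals(df.index))   # True: columns share the row index
```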

import pandas as pd

# specify columns you need to import
columns = ['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']
diamonds = pd.read_csv('datasets/diamonds/diamonds.csv', usecols=columns)

# add a custom header row (columns)
df = pd.read_csv('movielens/u.data', sep='\t', names=['UserID', 'ItemId', 'Rating', 'Timestamp'])

df.index
df.columns

# show summary
df.info()

# show statistics summary, numeric columns only
df.describe()

ref:
http://pandas.pydata.org/pandas-docs/stable/10min.html
http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/

Convert multiple lists into a DataFrame

pd.DataFrame({
    'dice': list_1[:50],
    'pearsonr': list_2[:50],
    'cosine_similarity': list_3[:50],
})
# output:
#   cosine_similarity   dice   pearsonr
# 0 u3                  u1     u2
# 1 u3                  u1     u2
# 2 u3                  u1     u2
# 3 u3                  u1     u2
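A self-contained sketch of the same pattern; the lists here are made-up stand-ins for list_1/list_2/list_3 above:

```python
import pandas as pd

# hypothetical per-metric rankings (stand-ins for the real lists)
dice = ['u1', 'u4', 'u2']
pearsonr = ['u2', 'u1', 'u4']
cosine = ['u3', 'u2', 'u1']

# each list becomes a column; all lists must have the same length
df = pd.DataFrame({
    'dice': dice,
    'pearsonr': pearsonr,
    'cosine_similarity': cosine,
})
print(df)
```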

Create a DataFrame from a MySQL table

url = 'mysql://root:password@127.0.0.1:3306/albedo'
sql = """
SELECT from_user_id AS user, repo_id AS item, 1 AS rating
FROM app_repostarring
WHERE stargazers_count >= 10;
"""
user_item_df = pd.read_sql(sql, con=url)

url = 'mysql://root:password@127.0.0.1:3306/albedo'
user_item_df = pd.read_sql_table('app_repostarring', con=url, columns=['from_user_id', 'repo_id'])

Create a DataFrame from a Django queryset

from django.db import connection

query, params = your_django_queryset.query.sql_with_params()
pd.io.sql.read_sql_query(query, connection, params=params)

ref:
https://www.iwoca.co.uk/blog/2016/09/02/using-pandas-django-faster/

Drop columns

# axis=0 means rows
# axis=1 means columns
train = df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)

ref:
http://stackoverflow.com/questions/22149584/what-does-axis-in-pandas-mean

Drop elements by indexes of a series

user_starred = df.loc['vinta', :][df.loc['vinta', :] == 1]
user_unstarred = df.loc['vinta', :].drop(user_starred.index)
user_unstarred.sort_values(ascending=False)

Select columns

# select all rows of Fare column
# return a Series
df['Fare']

# select the first 5 rows of repo_description column
df['repo_description'][:5]

# select the value at index label 152 of the Fare column
df['Fare'][152]
# equals to
df.Fare[152]

# select specific columns
df[['Survived', 'Age', 'Sex']]
# equals to
df.loc[:, ['Survived', 'Age', 'Sex']]

Select rows

Explicitly use df.loc[] for label-based indexing and df.iloc[] for positional indexing; don't use df.ix[], which is deprecated and was removed in pandas 1.0.
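A small demonstration of the difference, using a frame whose integer labels do not match the row positions:

```python
import pandas as pd

# labels 10/20/30 deliberately differ from positions 0/1/2
df = pd.DataFrame({'score': [0.9, 0.5, 0.1]}, index=[10, 20, 30])

df.loc[10]    # label-based: the row labeled 10
df.iloc[0]    # position-based: the first row (same row here)

# label slices include both endpoints ...
print(df.loc[10:20])   # rows labeled 10 and 20
# ... positional slices exclude the endpoint
print(df.iloc[0:2])    # rows at positions 0 and 1
```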

# return a DataFrame
df.loc['row_a':'row_z']

df.loc[['row_a', 'row_b', 'row_c'], ['column_1', 'column_2']]

# here the row labels happen to be integers
df.loc[0]
df.loc[0:10, ['repo_description']]

# select the first row, it's a Series object
df.iloc[0]

# you can only use positional indexes for both row and column selections with iloc
# df.iloc[row_selection, column_selection]
df.iloc[0, 1]

# select rows at positions 3, 4, 5 (the endpoint is exclusive)
df.iloc[3:6]

# select rows at positions 2, 5, 10
df.iloc[[2, 5, 10]]

df.iloc[2]['similarity']

ref:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing
http://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/

Conditionally select

df[df['Age'] >= 70]
df[(df['Age'] >= 60) & (df['Sex'] == 'female')]
df[(df['Age'] >= 60) | (df['Age'] <= 20)]

df['Sex'].unique()
# output:
# array(['male', 'female'], dtype=object)

# show counts of unique values
df['Embarked'].value_counts()
# output:
# S    646
# C    168
# Q     77

df['Embarked'].value_counts(normalize=True)
# output:
# S    0.725028
# C    0.188552
# Q    0.086420

df['Embarked'].value_counts().max()
# output:
# 646

df['Embarked'].value_counts().idxmax()
# output:
# S

df['Survived'][df['Sex'] == 'male'].value_counts()
df['Survived'][df['Sex'] == 'female'].value_counts(normalize=True)

df['Age'].nlargest(3)
# output:
# 630    80.0
# 851    74.0
# 96     71.0

# select rows where the passenger class (Pclass) is first (1) or second (2)
df[df['Pclass'].isin([1, 2])]
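The same selection patterns on a tiny made-up frame (the column names mirror the Titanic examples above):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [70, 65, 15, 30],
    'Sex': ['female', 'male', 'female', 'male'],
    'Pclass': [1, 3, 2, 3],
})

# combine conditions with & / |, wrapping each condition in parentheses
older_women = df[(df['Age'] >= 60) & (df['Sex'] == 'female')]

# value_counts() tallies unique values, most frequent first
counts = df['Sex'].value_counts()

# isin() tests membership against a list of values
first_or_second = df[df['Pclass'].isin([1, 2])]
```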

Select only non-NaN rows

df[df['similarity'].notnull()]
# equals to
df[pd.isnull(df['similarity']) == False]

Get distinct values of a column

item_ids = user_item_df['item'].unique()

Rename columns

df = df.rename(columns={'from_user_id': 'user', 'repo_id': 'item'})

Convert a DataFrame into a utility matrix (rating matrix)

matrix_df = user_item_df.pivot(index='user', columns='item', values='rating')

ref:
http://stackoverflow.com/questions/37576594/rearrange-a-pandas-data-frame-to-create-a-2d-ratings-matrix
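A runnable miniature of the pivot above; missing user/item pairs become NaN in the resulting matrix:

```python
import pandas as pd

# a tiny user-item-rating frame, mirroring user_item_df above
user_item_df = pd.DataFrame({
    'user': ['u1', 'u1', 'u2'],
    'item': ['i1', 'i2', 'i1'],
    'rating': [1, 1, 1],
})

matrix_df = user_item_df.pivot(index='user', columns='item', values='rating')
# item   i1   i2
# user
# u1    1.0  1.0
# u2    1.0  NaN
```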

Convert a utility matrix into a sparse SciPy COO matrix

from scipy.sparse import coo_matrix

temp_df = user_item_df.set_index(['user', 'item'])
# .codes replaced .labels in pandas >= 0.24
star_matrix = coo_matrix((
    user_item_df['rating'], (temp_df.index.codes[0], temp_df.index.codes[1])
))
# <10003x423 sparse matrix of type '<class 'numpy.float64'>'
# with 245579 stored elements in COOrdinate format>
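The same conversion on a tiny made-up frame, so the shapes are easy to check by hand:

```python
import pandas as pd
from scipy.sparse import coo_matrix

user_item_df = pd.DataFrame({
    'user': ['u1', 'u1', 'u2'],
    'item': ['i1', 'i2', 'i1'],
    'rating': [1, 1, 1],
})

# the MultiIndex codes give integer row/column coordinates
temp_df = user_item_df.set_index(['user', 'item'])
star_matrix = coo_matrix((
    user_item_df['rating'], (temp_df.index.codes[0], temp_df.index.codes[1])
))
# a 2x2 sparse matrix with 3 stored elements
```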

Group by

df = pd.read_csv('movielens/u.data', sep='\t', names=['UserID', 'ItemId', 'Rating', 'Timestamp'])

df.groupby(['Rating'])['UserID'].count()

ref:
https://zhuanlan.zhihu.com/p/25184830
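A self-contained version of the same groupby, on a toy frame with the MovieLens column names:

```python
import pandas as pd

df = pd.DataFrame({
    'UserID': [1, 1, 2, 3],
    'ItemId': [10, 20, 10, 30],
    'Rating': [5, 3, 5, 4],
})

# count how many rows fall into each rating value
counts = df.groupby(['Rating'])['UserID'].count()
# Rating
# 3    1
# 4    1
# 5    2
```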

Iterate a DataFrame

for index, row in df.iterrows():
    username = index
    score = row['similarity']
    print('{0} / https://github.com/{1}'.format(score, username))

ref:
http://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas

Handle missing values

# keep only rows with at least 4 non-NaN values; rows with fewer are dropped
df.dropna(thresh=4)

# show whether each column contains any missing values
df.isnull().any()

# fill missing values
df['Embarked'] = df['Embarked'].fillna('S')

# replace missing values with median
df['Age'] = df['Age'].fillna(df['Age'].median())

# replace missing values with the most common value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].value_counts().idxmax())
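The fill patterns above, runnable on a small made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
})

# replace missing ages with the median (here: 24.0)
df['Age'] = df['Age'].fillna(df['Age'].median())

# replace missing ports with the most common value (here: 'S')
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].value_counts().idxmax())
# no missing values remain afterwards
```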

Sort a DataFrame by values

# sort by age, from the largest to the smallest
sorted_df = df.sort_values(by='Age', ascending=False)

Operate

df['Sex'] = df['Sex'].astype('category')

# group by
df.loc[:, ['Pclass', 'Fare']].groupby('Pclass').sum()
df.loc[:, ['Pclass', 'Survived', 'Fare']].groupby(['Pclass', 'Survived']).sum()

# replace
df.loc[df['Sex'] == 'male', 'Sex'] = 0
df.loc[df['Sex'] == 'female', 'Sex'] = 1
# equals to
df['Sex'].replace(['male', 'female'], [0, 1], inplace=True)
# don't do this: chained assignment triggers SettingWithCopyWarning
# and may not modify df at all
df['Sex'][df['Sex'] == 'male'] = 0
df['Sex'][df['Sex'] == 'female'] = 1

# add a column
df['Child'] = 0
df.loc[df['Age'] < 18, 'Child'] = 1

ref:
http://tomaugspurger.github.io/modern-1.html

Convert a Series into a DataFrame

data = {
    0: 1.0,
    1: 0.15502648795530585,
    2: 0.07641502252040476,
    3: 0.09437745470567728,
    4: 0.026112204840046276,
    5: 0.090005329668190928,
}
sdf = pd.Series(data).to_frame('similarity')

Iterate a Series

# note: .iteritems() was removed in pandas 2.0; .items() works in all versions
for index, value in my_non_starred.items():
    print(index, value)

my_non_starred.index

Sort a Series

# argsort() returns the integer positions that would sort the Series,
# not the sorted values themselves
similarities[0].argsort()
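A small contrast between sorting values and argsort, on a made-up similarity Series:

```python
import pandas as pd

scores = pd.Series([0.3, 0.9, 0.1], index=['u1', 'u2', 'u3'])

# sort_values() returns the Series itself, reordered
ranked = scores.sort_values(ascending=False)
# u2    0.9
# u1    0.3
# u3    0.1

# argsort() returns the integer positions that would sort it ascending
positions = scores.argsort()
```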