Python For Data Science Cheat Sheet

Numpy

linalg

import numpy as np

a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 10]])  # square, non-singular matrix
np.diag(a)             # vector -> diagonal matrix; np.diag(b) extracts the diagonal
np.trace(b)            # sum of the diagonal elements
np.dot(b, a)           # matrix-vector product
np.linalg.det(b)       # determinant (requires a square matrix)
np.linalg.eig(b)       # returns (eigenvalues, eigenvectors)
np.linalg.inv(b)       # matrix inverse
np.linalg.solve(b, a)  # solve b @ x = a

statistics

np.sum(a, axis=None)      # sum; axis=None reduces everything, axis=0/1 reduces per column/row
np.mean(a, axis=None)     # mean
a.min(axis=None), a.max(axis=None)    # minimum / maximum
np.std(a, axis=None)      # standard deviation
np.var(a, axis=None)      # variance
np.argmin(a, axis=None), np.argmax(a, axis=None)  # index of the min / max value
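A small sketch of how the axis argument behaves on a 2-D array; the array values are made up for illustration:

```Python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

np.sum(m)             # 21      -> all elements
np.sum(m, axis=0)     # [5 7 9] -> collapse rows, one sum per column
np.sum(m, axis=1)     # [ 6 15] -> collapse columns, one sum per row
np.argmax(m, axis=1)  # [2 2]   -> index of the max in each row
```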

Pandas

The name comes from "panel data" and "data analysis".
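A minimal pandas sketch; the column names and values are made up for illustration:

```Python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cai"],
                   "score": [90, 85, 78]})

df.head()                                 # first rows
df.describe()                             # summary statistics of numeric columns
df[df["score"] > 80]                      # boolean filtering
df.groupby("name")["score"].mean()        # group and aggregate
df.sort_values("score", ascending=False)  # sort by a column
```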

Scipy

scipy.optimize

scipy.linalg

scipy.stats
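A brief sketch of what each of the three submodules above is for; the functions and toy inputs are illustrative choices, not an exhaustive tour:

```Python
import numpy as np
from scipy import optimize, linalg, stats

# scipy.optimize: minimize a scalar function
res = optimize.minimize_scalar(lambda x: (x - 2) ** 2)   # res.x is close to 2

# scipy.linalg: dense linear algebra (overlaps with np.linalg, adds more routines)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
lu, piv = linalg.lu_factor(A)                       # LU factorization
x = linalg.lu_solve((lu, piv), np.array([9.0, 8.0]))  # solve A @ x = b

# scipy.stats: probability distributions and statistical tests
rvs = stats.norm.rvs(loc=0, scale=1, size=100)      # sample from N(0, 1)
t_stat, p_value = stats.ttest_1samp(rvs, popmean=0.0)
```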

Scikit-Learn

  1. Classification:
    • Supervised learning
    • Identify the category of a given object (a usage sketch follows this list).
    • Implemented algorithms: SVM, nearest neighbors, logistic regression, random forests, decision trees, and multi-layer perceptron (MLP) neural networks.
  2. Regression:
    • Predict a continuous value associated with a given object.
    • Implemented algorithms: Support Vector Regression (SVR), Lasso regression, Bayesian regression, random forest regression, etc.
  3. Clustering:
    • Unsupervised learning
    • Group given objects by the similarity of their features.
    • Implemented algorithms: k-means, spectral clustering, hierarchical clustering, and DBSCAN.
  4. Data dimensionality reduction:
    • Used when the feature space is too large to compute on or too sparse to be informative.
    • Techniques such as principal component analysis (PCA), non-negative matrix factorization (NMF), or feature selection reduce the number of random variables.
  5. Model selection:
    • Compare models and parameter settings to find the one that works best.
    • The main purpose is to run the model with different parameters and pick the optimal ones from the results, improving the final accuracy.
    • Implemented modules: grid search, cross-validation, and various metrics for evaluating prediction error.
  6. Data preprocessing:
    • Feature extraction converts raw text or image data into numeric variables.
    • Feature selection removes constant, redundant, or otherwise statistically uninformative features.
    • Normalization converts the input data into new variables with zero mean and unit variance.
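A minimal scikit-learn sketch covering preprocessing, classification, model selection, and clustering on a bundled toy dataset; the estimator, parameter grid, and dataset are illustrative choices, not prescriptions from this cheat sheet:

```Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# preprocessing: zero mean, unit variance
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# classification + model selection via grid search with cross-validation
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train_s, y_train)
print(grid.best_params_, grid.score(X_test_s, y_test))

# clustering (unsupervised: the labels y are ignored)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```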

matplotlib

import numpy as np
import matplotlib.pyplot as plt


def limit(c, n=25, p=2):
    """Iterate z -> z**p + c; return inf once |z| exceeds 2, else the final |z|."""
    z = 0
    for i in range(1, n):
        z = z ** p + c
        if np.abs(z) > 2:
            return np.inf
    return np.abs(z)


def mandelbrot(n=25, size=250, xlim=(-2, 2), ylim=(-2, 2), p=2):
    """Binary membership mask of the Mandelbrot set on a size x size grid."""
    x = np.linspace(xlim[0], xlim[1], size)
    y = np.linspace(ylim[0], ylim[1], size)
    m = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            z = limit(x[i] + y[j] * 1j, n, p)
            if z < 2:
                m[j, i] = 1  # point did not escape: inside the set
    return m


def mandelbrot_color(n=25, size=250, xlim=(-2, 2), ylim=(-2, 2), p=2):
    """Escape-iteration counts, which give a smoother image when colormapped."""
    x = np.linspace(xlim[0], xlim[1], size)
    y = np.linspace(ylim[0], ylim[1], size)
    m = np.zeros((size, size))

    for i in range(size):
        for j in range(size):
            c = x[i] + y[j] * 1j
            z = 0
            for k in range(n):
                z = z ** p + c
                if np.abs(z) < 2:
                    m[j, i] += 1  # count iterations before escape
                else:
                    break
    return m


plt.imshow(mandelbrot(), cmap='plasma')
plt.show()

uuid

uuid1()

Timestamp based.
Generates the UUID from the host ID, a sequence number, and the current time, which guarantees global uniqueness.
However, because the resulting UUID contains the host's network address, it may compromise privacy.
The function takes two parameters: if node is not given, getnode() is called automatically to obtain the host's hardware (MAC) address.
If clock_seq is not given, a randomly generated 14-bit sequence number is used instead.

uuid2()

Same algorithm as uuid1(), except that the leading part of the timestamp is replaced by the POSIX UID; rarely used in practice (and not provided by Python's uuid module).

uuid3()

Based on the MD5 hash of a name.
Generates the UUID from the MD5 hash of a namespace identifier and a name.
Different names in the same namespace, and the same name in different namespaces, map to different UUIDs,
but the same namespace and name always produce the same UUID.

uuid4()

Based on random numbers.
Generates the UUID from pseudo-random numbers, so there is a (very small) probability of collision.

uuid5()

Based on the SHA-1 hash of a name.
Generates the UUID from the SHA-1 hash of a namespace identifier and a name; otherwise the same algorithm as uuid.uuid3().

If the requirement is name-based uniqueness, prefer uuid3() or uuid5().
In a globally distributed environment, uuid1() is the better choice.
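A quick sketch of the four variants provided by Python's uuid module; "example.com" is just an illustrative name:

```Python
import uuid

print(uuid.uuid1())                                    # timestamp + MAC address
print(uuid.uuid3(uuid.NAMESPACE_DNS, "example.com"))   # MD5 of namespace + name (deterministic)
print(uuid.uuid4())                                    # random
print(uuid.uuid5(uuid.NAMESPACE_DNS, "example.com"))   # SHA-1 of namespace + name (deterministic)
```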

base64

import base64

s1 = b"hehehe"
print(base64.b64encode(s1))   # b'aGVoZWhl'
s2 = b'aGVoZWhl'
print(base64.b64decode(s2))   # b'hehehe'
Use urlsafe_b64encode() and urlsafe_b64decode() when the encoded data has to appear in a URL; they substitute - and _ for + and /.
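A small sketch of the URL-safe variants; the input bytes are arbitrary, chosen so the standard encoding contains + and /:

```Python
import base64

data = b"\xfb\xef\xff"
print(base64.b64encode(data))              # b'++//'
print(base64.urlsafe_b64encode(data))      # b'--__'
print(base64.urlsafe_b64decode(b'--__'))   # b'\xfb\xef\xff'
```
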
hashlib

import hashlib

s1 = b"hehehe"
m = hashlib.md5()      # or hashlib.sha1(), hashlib.sha256(), ...
m.update(s1)
res = m.hexdigest()
print(res)
# update() can be called repeatedly when the data is large; the chunks are hashed as one stream.

hmac

import hmac

s1 = b"hehehe"
key = b"he"
h = hmac.new(key, s1, digestmod="MD5")   # keyed hash; digestmod names any hashlib algorithm
res = h.hexdigest()
print(res)
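A short verification sketch, assuming both sides share the same key; compare_digest() compares in constant time to avoid timing side channels:

```Python
import hmac

key = b"he"
msg = b"hehehe"
sent_digest = hmac.new(key, msg, digestmod="SHA256").hexdigest()

# the receiver recomputes the digest and compares it with the one sent
expected = hmac.new(key, msg, digestmod="SHA256").hexdigest()
print(hmac.compare_digest(sent_digest, expected))   # True
```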

time

time.time()                  # current time as a timestamp (seconds since the epoch)
time.gmtime([t])             # timestamp -> UTC time tuple (struct_time)
time.localtime([t])          # timestamp -> local time tuple
time.mktime(tt)              # local time tuple -> timestamp
time.asctime([tt])           # time tuple -> string
time.ctime([t])              # timestamp -> string
time.strftime(format[, tt])  # time tuple -> formatted string
time.strptime(str, format)   # formatted string -> time tuple
time.sleep(secs)             # pause execution for secs seconds
time.perf_counter()          # high-resolution timer for measuring elapsed time
                             # (time.clock() was removed in Python 3.8)

Format

%a locale's abbreviated weekday name
%A locale's full weekday name
%b locale's abbreviated month name
%B locale's full month name
%c locale's date and time representation
%d day of the month (01-31)
%H hour, 24-hour clock (00-23)
%I hour, 12-hour clock (01-12)
%j day of the year (001-366)
%m month (01-12)
%M minute (00-59)
%p locale's AM/PM
%S second (00-59)
%U week number of the year, weeks starting on Sunday
%w weekday as a number (0-6, Sunday is 0)
%W week number of the year, weeks starting on Monday
%x locale's date representation
%X locale's time representation
%y year without century (00-99)
%Y year with century
%Z time zone name (empty string if there is none)
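A short example putting a few of these directives together; the exact output depends on the current time, locale, and timezone:

```Python
import time

now = time.localtime()
s = time.strftime("%a %d %b %Y %H:%M:%S %Z", now)   # e.g. 'Fri 27 Jan 2023 01:01:01 CST'
print(s)

tt = time.strptime("2023-01-27 01:01:01", "%Y-%m-%d %H:%M:%S")
print(tt.tm_year, tt.tm_yday)                       # 2023 27
```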

datetime

import datetime

t1 = datetime.datetime.now()                              # current local datetime
t2 = datetime.datetime(2023, 1, 27, 1, 1, 1)              # construct a datetime
t3 = t2.strftime("%Y-%m-%d %H:%M:%S")                     # datetime -> string
t4 = datetime.datetime.strptime(t3, "%Y-%m-%d %H:%M:%S")  # string -> datetime
# get the time interval by subtracting two datetimes (see the sketch below)
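A short timedelta sketch; the added offset of 7 days and 3 hours is arbitrary:

```Python
import datetime

t1 = datetime.datetime.now()
t2 = datetime.datetime(2023, 1, 27, 1, 1, 1)

delta = t1 - t2                                     # datetime.timedelta
print(delta.days, delta.total_seconds())

later = t2 + datetime.timedelta(days=7, hours=3)    # shift a datetime by an interval
print(later)                                        # 2023-02-03 04:01:01
```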