Unsupervised learning

PCA

touvlo.unsupv.pca.pca(X)[source]

Runs Principal Component Analysis on dataset

Parameters:
  • X (numpy.array) – Features’ dataset
Returns:

A 2-tuple of U, eigenvectors of covariance matrix, and S, eigenvalues (on diagonal) of covariance matrix.

Return type:

(numpy.array, numpy.array)

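
As a rough illustration of the computation described above, the following sketch eigen-decomposes the covariance matrix with NumPy's SVD. The name `pca_sketch` is hypothetical, and the sketch assumes rows of X are examples and that X has already been mean-normalized; it is not taken from touvlo's source.

```python
import numpy as np

def pca_sketch(X):
    """Hypothetical stand-in: eigen-decompose the covariance matrix of X."""
    m = X.shape[0]
    # Covariance matrix (n x n), assuming X is already mean-normalized.
    sigma = (X.T @ X) / m
    # For a symmetric positive semi-definite matrix, SVD yields the
    # eigenvectors (columns of U) and the eigenvalues (s), largest first.
    U, s, _ = np.linalg.svd(sigma)
    # Place the eigenvalues on the diagonal, as the Returns field describes.
    return U, np.diag(s)

X = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
U, S = pca_sketch(X)
```

Here all the variance lies along the direction (1, 1), so the first diagonal entry of S carries it and the second is numerically zero.
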
touvlo.unsupv.pca.project_data(X, U, k)[source]

Computes reduced data representation (projected data)

Parameters:
  • X (numpy.array) – Normalized features’ dataset
  • U (numpy.array) – eigenvectors of covariance matrix
  • k (int) – Number of features in reduced data representation
Returns:

Reduced data representation (projection)

Return type:

numpy.array

touvlo.unsupv.pca.recover_data(Z, U, k)[source]

Recovers an approximation of original data using the projected data

Parameters:
  • Z (numpy.array) – Reduced data representation (projection)
  • U (numpy.array) – eigenvectors of covariance matrix
  • k (int) – Number of features in reduced data representation
Returns:

Approximated features’ dataset

Return type:

numpy.array
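
The projection and recovery steps can be sketched as two matrix products against the first k eigenvectors. The function names below are hypothetical, and the sketch assumes the eigenvectors are stored as columns of U; the library's actual implementation may differ.

```python
import numpy as np

def project_data_sketch(X, U, k):
    # Keep the first k eigenvectors (columns of U) and project onto them.
    return X @ U[:, :k]

def recover_data_sketch(Z, U, k):
    # Map the k-dimensional projection back into the original space.
    return Z @ U[:, :k].T

# Orthonormal basis whose first column points along (1, 1).
U = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)
X = np.array([[1.0, 1.0], [2.0, 2.0]])
Z = project_data_sketch(X, U, 1)
X_approx = recover_data_sketch(Z, U, 1)
```

Because both examples lie exactly on the first eigenvector's direction, the recovered data matches X; in general the recovery is only an approximation.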

K-means

touvlo.unsupv.kmeans.compute_centroids(X, idx, K)[source]

Computes each centroid as the mean of its cluster’s members; a centroid with no members is returned as an array of nan.

Parameters:
  • X (numpy.array) – Features’ dataset
  • idx (numpy.array) – Column vector of assigned centroids’ indices.
  • K (int) – Number of centroids.
Returns:

Column vector of newly computed centroids

Return type:

numpy.array
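
A minimal sketch of this update step, assuming zero-based indices in idx and one row per centroid in the result. `compute_centroids_sketch` is a hypothetical name used only for illustration.

```python
import numpy as np

def compute_centroids_sketch(X, idx, K):
    n = X.shape[1]
    # Start from nan so that centroids without members stay nan.
    centroids = np.full((K, n), np.nan)
    labels = np.asarray(idx).ravel()
    for k in range(K):
        members = X[labels == k]
        if len(members) > 0:
            centroids[k] = members.mean(axis=0)
    return centroids

X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
idx = np.array([[0], [0], [1]])
centroids = compute_centroids_sketch(X, idx, K=3)
```

The third centroid has no members in this example, so its row is all nan.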

touvlo.unsupv.kmeans.cost_function(X, idx, centroids)[source]

Calculates the cost function for K-means.

Parameters:
  • X (numpy.array) – Features’ dataset
  • idx (numpy.array) – Column vector of assigned centroids’ indices.
  • centroids (numpy.array) – Column vector of centroids.
Returns:

Computed cost

Return type:

float
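
One common definition of the K-means cost is the mean squared distance between each example and its assigned centroid. The sketch below assumes that definition and zero-based indices; whether the library averages or sums is not stated here, so treat `cost_function_sketch` as a hypothetical illustration.

```python
import numpy as np

def cost_function_sketch(X, idx, centroids):
    labels = np.asarray(idx).ravel().astype(int)
    # Difference between every example and the centroid it is assigned to.
    diffs = X - centroids[labels]
    # Mean of the squared Euclidean distances (assumed normalization).
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

X = np.array([[0.0, 0.0], [2.0, 0.0]])
idx = np.array([[0], [0]])
centroids = np.array([[1.0, 0.0]])
cost = cost_function_sketch(X, idx, centroids)
```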

touvlo.unsupv.kmeans.elbow_method(X, K_values, max_iters, n_inits)[source]

Calculates the cost for each given K.

Parameters:
  • X (numpy.array) – Features’ dataset
  • K_values (list(int)) – List of candidate numbers of centroids.
  • max_iters (int) – Number of times the algorithm will be fitted.
  • n_inits (int) – Number of random initializations.
Returns:

A list of cost values, one for each K.

Return type:

list(float)

touvlo.unsupv.kmeans.euclidean_dist(p, q)[source]

Calculates Euclidean distance between 2 n-dimensional points.

Parameters:
  • p (numpy.array) – First n-dimensional point.
  • q (numpy.array) – Second n-dimensional point.
Returns:

Distance between 2 points.

Return type:

float
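
For reference, the Euclidean distance is the square root of the summed squared coordinate differences. A short sketch with the hypothetical name `euclidean_dist_sketch`:

```python
import numpy as np

def euclidean_dist_sketch(p, q):
    # sqrt(sum_i (p_i - q_i)^2)
    return float(np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2)))

d = euclidean_dist_sketch(np.array([0.0, 0.0]), np.array([3.0, 4.0]))
```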

touvlo.unsupv.kmeans.find_closest_centroids(X, initial_centroids)[source]

Assigns to each example the index of the closest centroid.

Parameters:
  • X (numpy.array) – Features’ dataset
  • initial_centroids (numpy.array) – List of initialized centroids.
Returns:

Column vector of assigned centroids’ indices.

Return type:

numpy.array
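
The assignment step can be sketched with NumPy broadcasting. This version assumes zero-based indices and Euclidean distance; `find_closest_centroids_sketch` is a hypothetical name, not the library's code.

```python
import numpy as np

def find_closest_centroids_sketch(X, centroids):
    # Pairwise differences via broadcasting: (m, 1, n) - (1, K, n) -> (m, K, n)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Index of the nearest centroid per example, shaped as a column vector.
    return np.argmin(dists, axis=1).reshape(-1, 1)

X = np.array([[0.0, 0.0], [9.0, 9.0], [1.0, 1.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
idx = find_closest_centroids_sketch(X, centroids)
```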

touvlo.unsupv.kmeans.init_centroids(X, K)[source]

Randomly picks K examples from the dataset to serve as initial centroids.

Parameters:
  • X (numpy.array) – Features’ dataset
  • K (int) – Number of centroids.
Returns:

Column vector of centroids randomly picked from dataset

Return type:

numpy.array

touvlo.unsupv.kmeans.run_intensive_kmeans(X, K, max_iters, n_inits)[source]

Applies K-means using multiple random initializations.

Parameters:
  • X (numpy.array) – Features’ dataset
  • K (int) – Number of centroids.
  • max_iters (int) – Number of times the algorithm will be fitted.
  • n_inits (int) – Number of random initializations.
Returns:

A 2-tuple of centroids, a column vector of centroids, and idx, a column vector of assigned centroids’ indices.

Return type:

(numpy.array, numpy.array)

touvlo.unsupv.kmeans.run_kmeans(X, K, max_iters)[source]

Applies K-means using a single random initialization.

Parameters:
  • X (numpy.array) – Features’ dataset
  • K (int) – Number of centroids.
  • max_iters (int) – Number of times the algorithm will be fitted.
Returns:

A 2-tuple of centroids, a column vector of centroids, and idx, a column vector of assigned centroids’ indices.

Return type:

(numpy.array, numpy.array)
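
A single K-means run alternates an assignment step and an update step. The sketch below is a self-contained illustration under stated assumptions: `run_kmeans_sketch` and its `seed` argument are hypothetical, max_iters is treated as the number of assignment/update rounds, and an empty cluster keeps its previous centroid rather than becoming nan.

```python
import numpy as np

def run_kmeans_sketch(X, K, max_iters, seed=0):
    """Hypothetical K-means loop; `seed` is added here for reproducibility."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as K distinct examples drawn from the dataset.
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    idx = np.zeros((X.shape[0], 1), dtype=int)
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every example.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        idx = np.argmin(dists, axis=1).reshape(-1, 1)
        # Update step: move each centroid to the mean of its members.
        for k in range(K):
            members = X[idx.ravel() == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids, idx

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, idx = run_kmeans_sketch(X, K=2, max_iters=10)
```

Repeating such a run with several seeds and keeping the lowest-cost result is the usual idea behind multiple random initializations.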

Anomaly Detection

touvlo.unsupv.anmly_detc.cov_matrix(X, mu)[source]

Calculates the covariance matrix for matrix X (m x n).

Parameters:
  • X (numpy.array) – Features’ dataset.
  • mu (numpy.array) – Mean of each feature/column of X.
Returns:

Covariance matrix (n x n)

Return type:

numpy.array
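
A minimal sketch of the covariance computation, assuming normalization by the number of examples m (not m - 1). `cov_matrix_sketch` is a hypothetical name; the library may normalize differently.

```python
import numpy as np

def cov_matrix_sketch(X, mu):
    m = X.shape[0]
    # Center every feature/column on its mean.
    centered = X - mu
    # (n x m) @ (m x n) -> (n x n), divided by m (assumed normalization).
    return (centered.T @ centered) / m

X = np.array([[1.0, 2.0], [3.0, 6.0]])
mu = X.mean(axis=0)
sigma = cov_matrix_sketch(X, mu)
```

Under this normalization the result agrees with `np.cov(X, rowvar=False, bias=True)`.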

touvlo.unsupv.anmly_detc.estimate_multi_gaussian(X)[source]

Estimates parameters for Multivariate Gaussian distribution.

Parameters:
  • X (numpy.array) – Features’ dataset.
Returns:

A 2-tuple of mu, the mean of each feature/column of X, and sigma, the covariance matrix for X.

Return type:

(numpy.array, numpy.array)

touvlo.unsupv.anmly_detc.estimate_uni_gaussian(X)[source]

Estimates parameters for Univariate Gaussian distribution.

Parameters:
  • X (numpy.array) – Features’ dataset.
Returns:

A 2-tuple of mu, the mean of each feature/column of X, and sigma2, the variance of each feature/column of X.

Return type:

(numpy.array, numpy.array)

touvlo.unsupv.anmly_detc.is_anomaly(p, threshold=0.5)[source]

Predicts whether a probability falls into class 1 (anomaly).

Parameters:
  • p (numpy.array) – Probability that example belongs to class 1 (is anomaly).
  • threshold (float) – point below which an example is considered of class 1.
Returns:

Binary value to denote class 1 or 0

Return type:

int

touvlo.unsupv.anmly_detc.multi_gaussian(X, mu, sigma)[source]

Estimates probability that examples belong to Multivariate Gaussian.

Parameters:
  • X (numpy.array) – Features’ dataset.
  • mu (numpy.array) – Mean of each feature/column of X.
  • sigma (numpy.array) – Covariance matrix for X.
Returns:

Probability density function for each example

Return type:

numpy.array
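
The multivariate Gaussian density can be sketched directly from its formula, p(x) = exp(-0.5 (x - mu)^T sigma^-1 (x - mu)) / sqrt((2 pi)^n |sigma|). The function name below is hypothetical, and the sketch assumes sigma is a non-singular (n x n) matrix and returns a 1-D array with one density per example.

```python
import numpy as np

def multi_gaussian_sketch(X, mu, sigma):
    n = mu.size
    centered = X - mu
    # Solve sigma @ y = centered.T instead of forming the explicit inverse.
    solved = np.linalg.solve(sigma, centered.T).T
    # Squared Mahalanobis distance of each example from the mean.
    mahalanobis = np.sum(centered * solved, axis=1)
    norm_const = np.sqrt(((2 * np.pi) ** n) * np.linalg.det(sigma))
    return np.exp(-0.5 * mahalanobis) / norm_const

X = np.array([[0.0, 0.0], [1.0, 0.0]])
p = multi_gaussian_sketch(X, np.array([0.0, 0.0]), np.eye(2))
```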

touvlo.unsupv.anmly_detc.predict(X, epsilon, gaussian, **kwargs)[source]

Predicts whether examples are anomalies.

Parameters:
  • X (numpy.array) – Features’ dataset.
  • epsilon (float) – Value below which an example is considered of class 1 (anomaly).
  • gaussian (function) – Function that estimates the probability that each example belongs to the distribution.
Returns:

Column vector of classifications

Return type:

numpy.array
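
The thresholding described above might look like the following sketch. It assumes the keyword arguments are forwarded unchanged to the density function; `predict_sketch` and `density_sketch` are hypothetical names introduced only for this example.

```python
import numpy as np

def density_sketch(X, mu, sigma2):
    """Hypothetical density: product of independent per-feature Gaussians."""
    dens = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return np.prod(dens, axis=1)

def predict_sketch(X, epsilon, gaussian, **kwargs):
    # Evaluate the density, forwarding the distribution parameters.
    p = gaussian(X, **kwargs)
    # Examples whose density falls below epsilon are flagged as class 1.
    return (np.asarray(p) < epsilon).astype(int).reshape(-1, 1)

X = np.array([[0.0], [5.0]])
labels = predict_sketch(X, 0.01, density_sketch,
                        mu=np.array([0.0]), sigma2=np.array([1.0]))
```

The example at the mean has a high density and is labeled 0, while the one five standard deviations away is labeled 1.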

touvlo.unsupv.anmly_detc.uni_gaussian(X, mu, sigma2)[source]

Estimates probability that examples belong to Univariate Gaussian.

Parameters:
  • X (numpy.array) – Features’ dataset.
  • mu (numpy.array) – Mean of each feature/column of X.
  • sigma2 (numpy.array) – Variance of each feature/column of X.
Returns:

Probability density function for each example

Return type:

numpy.array
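
A sketch of univariate estimation and density evaluation, assuming features are treated as independent so that the per-feature densities are multiplied, and that the variance is normalized by m. Both function names are hypothetical illustrations rather than the library's code.

```python
import numpy as np

def estimate_uni_gaussian_sketch(X):
    # Per-feature mean and variance (np.var divides by m by default).
    return X.mean(axis=0), X.var(axis=0)

def uni_gaussian_sketch(X, mu, sigma2):
    # Per-feature Gaussian density for every example.
    dens = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    # Independence assumption: multiply densities across features.
    return np.prod(dens, axis=1)

X = np.array([[1.0, 2.0], [3.0, 6.0]])
mu, sigma2 = estimate_uni_gaussian_sketch(X)
p = uni_gaussian_sketch(X, mu, sigma2)
```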