Unsupervised learning

PCA

touvlo.unsupv.pca.pca(X)[source]

Runs Principal Component Analysis on dataset

Parameters:
  • X (numpy.array) – Features’ dataset
Returns:

A 2-tuple of U, the eigenvectors of the covariance matrix, and S, the eigenvalues (on the diagonal) of the covariance matrix.

Return type:

(numpy.array, numpy.array)

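
A minimal NumPy sketch of what a function like this might compute follows. It is not touvlo's source: the name pca_sketch is hypothetical, X is assumed to be mean-normalized with one example per row, and the decomposition is obtained from an SVD of the covariance matrix.

```python
import numpy as np

def pca_sketch(X):
    """Hypothetical stand-in, not the touvlo implementation."""
    m = X.shape[0]
    sigma = (X.T @ X) / m            # covariance matrix (n x n); assumes zero-mean columns
    U, s, _ = np.linalg.svd(sigma)   # columns of U are eigenvectors, s holds eigenvalues
    S = np.diag(s)                   # eigenvalues placed on the diagonal
    return U, S

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_norm = X - X.mean(axis=0)          # mean-normalize before calling
U, S = pca_sketch(X_norm)
```
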
touvlo.unsupv.pca.project_data(X, U, k)[source]

Computes reduced data representation (projected data)

Parameters:
  • X (numpy.array) – Normalized features’ dataset
  • U (numpy.array) – eigenvectors of covariance matrix
  • k (int) – Number of features in reduced data representation
Returns:

Reduced data representation (projection)

Return type:

numpy.array

touvlo.unsupv.pca.recover_data(Z, U, k)[source]

Recovers an approximation of original data using the projected data

Parameters:
  • Z (numpy.array) – Reduced data representation (projection)
  • U (numpy.array) – eigenvectors of covariance matrix
  • k (int) – Number of features in reduced data representation
Returns:

Approximated features’ dataset

Return type:

numpy.array
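
The projection and recovery steps can be sketched as two matrix products with the first k eigenvectors. The function names below are hypothetical and the code only assumes NumPy; touvlo's own implementation may differ.

```python
import numpy as np

def project_data_sketch(X, U, k):
    """Hypothetical: project X onto the first k eigenvectors (columns of U)."""
    return X @ U[:, :k]

def recover_data_sketch(Z, U, k):
    """Hypothetical: map the projection back to the original feature space."""
    return Z @ U[:, :k].T

# Points lying exactly on one line through the origin, so one component suffices
X_norm = np.array([[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0]])
U, _, _ = np.linalg.svd(X_norm.T @ X_norm / X_norm.shape[0])
Z = project_data_sketch(X_norm, U, 1)
X_rec = recover_data_sketch(Z, U, 1)
```

In general the recovered data is only an approximation; it matches the input here because the example data has a single direction of variance.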

K-means

touvlo.unsupv.kmeans.compute_centroids(X, idx, K)[source]

Computes each centroid as the mean of its cluster’s members.

Parameters:
  • X (numpy.array) – Features’ dataset
  • idx (numpy.array) – Column vector of assigned centroids’ indices.
  • K (int) – Number of centroids.
Returns:

Column vector of newly computed centroids

Return type:

numpy.array
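
As an illustration, the centroid update might look like the sketch below. The name compute_centroids_sketch is hypothetical, indices in idx are assumed to be 0-based, and leaving an empty cluster at zero is an assumption of this example rather than documented behaviour.

```python
import numpy as np

def compute_centroids_sketch(X, idx, K):
    """Hypothetical: mean of the examples assigned to each centroid."""
    centroids = np.zeros((K, X.shape[1]))
    assigned = idx.ravel()
    for k in range(K):
        members = X[assigned == k]
        if len(members) > 0:         # assumption: empty clusters stay at zero
            centroids[k] = members.mean(axis=0)
    return centroids

X = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
idx = np.array([[0], [0], [1]])
centroids = compute_centroids_sketch(X, idx, 2)
```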

touvlo.unsupv.kmeans.cost_function(X, idx, centroids)[source]

Calculates the cost function for K means.

Parameters:
  • X (numpy.array) – Features’ dataset
  • idx (numpy.array) – Column vector of assigned centroids’ indices.
  • centroids (numpy.array) – Centroids to which the examples are assigned.
Returns:

Computed cost

Return type:

float
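
One common definition of the K-means cost is the mean squared distance between each example and its assigned centroid. The sketch below uses that definition as an assumption; the function name is hypothetical and touvlo's exact normalization may differ.

```python
import numpy as np

def cost_function_sketch(X, idx, centroids):
    """Hypothetical: mean squared distance to the assigned centroid."""
    assigned = centroids[idx.ravel()]            # centroid of each example (m x n)
    return float(np.mean(np.sum((X - assigned) ** 2, axis=1)))

X = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
idx = np.array([[0], [0], [1]])
centroids = np.array([[2.0, 2.0], [10.0, 10.0]])
cost = cost_function_sketch(X, idx, centroids)
```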

touvlo.unsupv.kmeans.elbow_method(X, K_values, max_iters, n_inits)[source]

Calculates the cost for each given K.

Parameters:
  • X (numpy.array) – Features’ dataset
  • K_values (list(int)) – List of possible number of centroids.
  • max_iters (int) – Number of iterations to run for each fit of the algorithm.
  • n_inits (int) – Number of random initializations.
Returns:

A 2-tuple of K_values, a list of possible numbers of centroids, and cost_values, a computed cost for each K.

Return type:

(list(int), list(float))

touvlo.unsupv.kmeans.euclidean_dist(p, q)[source]

Calculates Euclidean distance between 2 n-dimensional points.

Parameters:
  • p (numpy.array) – First n-dimensional point.
  • q (numpy.array) – Second n-dimensional point.
Returns:

Distance between 2 points.

Return type:

float
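
For reference, the Euclidean distance is the square root of the summed squared coordinate differences. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def euclidean_dist_sketch(p, q):
    """Hypothetical: Euclidean distance between two n-dimensional points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

d = euclidean_dist_sketch(np.array([0.0, 0.0]), np.array([3.0, 4.0]))
```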

touvlo.unsupv.kmeans.find_closest_centroids(X, initial_centroids)[source]

Assigns to each example the index of the closest centroid.

Parameters:
  • X (numpy.array) – Features’ dataset
  • initial_centroids (numpy.array) – List of initialized centroids.
Returns:

Column vector of assigned centroids’ indices.

Return type:

numpy.array
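
The assignment step can be sketched with NumPy broadcasting. The function name is hypothetical, indices are assumed to be 0-based, and ties go to the lowest index because that is how numpy.argmin behaves.

```python
import numpy as np

def find_closest_centroids_sketch(X, centroids):
    """Hypothetical: index of the nearest centroid for every example."""
    # Pairwise squared distances, shape (m examples, K centroids)
    diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.argmin(sq_dists, axis=1).reshape(-1, 1)   # column vector

X = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
centroids = np.array([[2.0, 2.0], [10.0, 10.0]])
idx = find_closest_centroids_sketch(X, centroids)
```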

touvlo.unsupv.kmeans.init_centroids(X, K)[source]

Randomly picks K examples from the dataset to serve as initial centroids.

Parameters:
  • X (numpy.array) – Features’ dataset
  • K (int) – Number of centroids.
Returns:

Column vector of centroids randomly picked from dataset

Return type:

numpy.array

touvlo.unsupv.kmeans.run_intensive_kmeans(X, K, max_iters, n_inits)[source]

Applies kmeans using multiple random initializations.

Parameters:
  • X (numpy.array) – Features’ dataset
  • K (int) – Number of centroids.
  • max_iters (int) – Number of iterations to run for each fit of the algorithm.
  • n_inits (int) – Number of random initializations.
Returns:

A 2-tuple of centroids, a column vector of centroids, and idx, a column vector of assigned centroids’ indices.

Return type:

(numpy.array, numpy.array)

touvlo.unsupv.kmeans.run_kmeans(X, K, max_iters)[source]

Applies kmeans using a single random initialization.

Parameters:
  • X (numpy.array) – Features’ dataset
  • K (int) – Number of centroids.
  • max_iters (int) – Number of iterations to run for each fit of the algorithm.
Returns:

A 2-tuple of centroids, a column vector of centroids, and idx, a column vector of assigned centroids’ indices.

Return type:

(numpy.array, numpy.array)
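
To show how the single-run and multi-initialization variants relate, here is a compact sketch of both. All names are hypothetical, and the seed parameter, the fixed iteration count without a convergence check, and keeping the run with the lowest mean squared distance are assumptions of this example, not documented touvlo behaviour.

```python
import numpy as np

def _assign(X, centroids):
    # Index of the nearest centroid for each example
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

def run_kmeans_sketch(X, K, max_iters, rng):
    """Hypothetical: K-means from one random initialization."""
    # Initialize with K distinct examples picked at random
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    for _ in range(max_iters):
        idx = _assign(X, centroids)
        for k in range(K):
            members = X[idx == k]
            if len(members) > 0:     # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids, _assign(X, centroids).reshape(-1, 1)

def run_intensive_kmeans_sketch(X, K, max_iters, n_inits, seed=0):
    """Hypothetical: repeat K-means and keep the lowest-cost run."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_inits):
        centroids, idx = run_kmeans_sketch(X, K, max_iters, rng)
        cost = np.mean(np.sum((X - centroids[idx.ravel()]) ** 2, axis=1))
        if best is None or cost < best[0]:
            best = (cost, centroids, idx)
    return best[1], best[2]

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, idx = run_intensive_kmeans_sketch(X, K=2, max_iters=10, n_inits=5)
```

Repeating the fit matters because a single random initialization can settle in a poor local optimum; comparing costs across runs is a simple way to choose among them.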

Anomaly Detection

touvlo.unsupv.anmly_detc.cov_matrix(X, mu)[source]

Calculates the covariance matrix for matrix X (m x n).

Parameters:
  • X (numpy.array) – Features’ dataset.
  • mu (numpy.array) – Mean of each feature/column of X.
Returns:

Covariance matrix (n x n)

Return type:

numpy.array
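
A sketch of the covariance computation, under the assumption that it normalizes by the number of examples m rather than m - 1. The function name is hypothetical.

```python
import numpy as np

def cov_matrix_sketch(X, mu):
    """Hypothetical: covariance matrix (n x n) of an (m x n) dataset."""
    m = X.shape[0]
    centered = X - mu                      # subtract each feature's mean
    return (centered.T @ centered) / m     # assumption: divide by m, not m - 1

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mu = X.mean(axis=0)
sigma = cov_matrix_sketch(X, mu)
```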

touvlo.unsupv.anmly_detc.estimate_multi_gaussian(X)[source]

Estimates parameters for Multivariate Gaussian distribution.

Parameters:
  • X (numpy.array) – Features’ dataset.
Returns:

A 2-tuple of mu, the mean of each feature/column of X, and sigma, the covariance matrix for X.

Return type:

(numpy.array, numpy.array)

touvlo.unsupv.anmly_detc.estimate_uni_gaussian(X)[source]

Estimates parameters for Univariate Gaussian distribution.

Parameters:
  • X (numpy.array) – Features’ dataset.
Returns:

A 2-tuple of mu, the mean of each feature/column of X, and sigma2, the variance of each feature/column of X.

Return type:

(numpy.array, numpy.array)

touvlo.unsupv.anmly_detc.is_anomaly(p, threshold=0.5)[source]

Predicts whether a probability falls into class 1 (anomaly).

Parameters:
  • p (numpy.array) – Estimated probability of each example under the fitted distribution; low values suggest an anomaly.
  • threshold (float) – Value below which an example is considered class 1 (anomaly).
Returns:

Binary value to denote class 1 or 0

Return type:

int
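
The thresholding rule described above can be sketched in one comparison: values below the threshold are flagged as class 1. The function name is hypothetical, and returning an integer array for array input is an assumption of this example.

```python
import numpy as np

def is_anomaly_sketch(p, threshold=0.5):
    """Hypothetical: 1 where p falls below the threshold, else 0."""
    return (p < threshold).astype(int)

p = np.array([[0.9], [0.02], [0.4]])
flags = is_anomaly_sketch(p, threshold=0.1)
```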

touvlo.unsupv.anmly_detc.multi_gaussian(X, mu, sigma)[source]

Estimates probability that examples belong to Multivariate Gaussian.

Parameters:
  • X (numpy.array) – Features’ dataset.
  • mu (numpy.array) – Mean of each feature/column of X.
  • sigma (numpy.array) – Covariance matrix for X.
Returns:

Probability density evaluated at each example

Return type:

numpy.array
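
For orientation, the standard multivariate Gaussian density can be evaluated as in the sketch below. The function name is hypothetical, sigma is assumed to be invertible, and a linear solve is used in place of an explicit inverse as a design choice of this example.

```python
import numpy as np

def multi_gaussian_sketch(X, mu, sigma):
    """Hypothetical: multivariate Gaussian density for each row of X."""
    n = X.shape[1]
    centered = X - mu
    # Solve sigma @ y = centered.T rather than forming the inverse of sigma
    solved = np.linalg.solve(sigma, centered.T).T
    mahalanobis_sq = np.sum(centered * solved, axis=1)
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    return (norm_const * np.exp(-0.5 * mahalanobis_sq)).reshape(-1, 1)

X = np.array([[0.0, 0.0], [1.0, 1.0]])
mu = np.zeros(2)
sigma = np.eye(2)
p = multi_gaussian_sketch(X, mu, sigma)
```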

touvlo.unsupv.anmly_detc.predict(X, epsilon, gaussian, **kwargs)[source]

Predicts whether examples are anomalies.

Parameters:
  • X (numpy.array) – Features’ dataset.
  • epsilon (float) – Value below which an example is considered class 1 (anomaly).
  • gaussian (function) – Function that estimates the probability that each example belongs to the distribution.
Returns:

Column vector of classification

Return type:

numpy.array

touvlo.unsupv.anmly_detc.uni_gaussian(X, mu, sigma2)[source]

Estimates probability that examples belong to Univariate Gaussian.

Parameters:
  • X (numpy.array) – Features’ dataset.
  • mu (numpy.array) – Mean of each feature/column of X.
  • sigma2 (numpy.array) – Variance of each feature/column of X.
Returns:

Probability density evaluated at each example

Return type:

numpy.array
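
The sketch below ties the univariate pieces together: estimate per-feature parameters, evaluate the density as a product of independent one-dimensional Gaussians, and flag low-density examples. All names are hypothetical, and forwarding the keyword arguments to the density function is an assumption about how predict is meant to be used.

```python
import numpy as np

def estimate_uni_gaussian_sketch(X):
    """Hypothetical: per-feature mean and variance."""
    return X.mean(axis=0), X.var(axis=0)

def uni_gaussian_sketch(X, mu, sigma2):
    """Hypothetical: product of per-feature Gaussian densities."""
    densities = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return np.prod(densities, axis=1).reshape(-1, 1)

def predict_sketch(X, epsilon, gaussian, **kwargs):
    """Hypothetical: 1 (anomaly) where the density is below epsilon."""
    p = gaussian(X, **kwargs)          # assumption: kwargs carry the fitted parameters
    return (p < epsilon).astype(int)

X_train = np.array([[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [2.0, 11.0]])
mu, sigma2 = estimate_uni_gaussian_sketch(X_train)

# First example sits at the mean; the second is far from the training data
X_new = np.array([[2.0, 11.0], [9.0, 30.0]])
labels = predict_sketch(X_new, 1e-3, uni_gaussian_sketch, mu=mu, sigma2=sigma2)
```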