Unsupervised learning¶
PCA¶
-
touvlo.unsupv.pca.
pca
(X)[source]¶ Runs Principal Component Analysis on dataset
Parameters: X (numpy.array) – Features’ dataset
Returns: A 2-tuple of U, eigenvectors of covariance matrix, and S, eigenvalues (on diagonal) of covariance matrix.
Return type: (numpy.array, numpy.array)
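A minimal NumPy sketch of what a routine like this typically computes, not touvlo's actual implementation. It assumes X has already been mean-normalized and that the covariance matrix is divided by the number of examples m; `pca_sketch` is a hypothetical name.

```python
import numpy as np

def pca_sketch(X):
    """Return (U, S) for a mean-normalized dataset X of shape (m, n)."""
    m = X.shape[0]
    sigma = (X.T @ X) / m              # covariance matrix, assumes centered X
    U, s, _ = np.linalg.svd(sigma)     # columns of U are the eigenvectors
    return U, np.diag(s)               # eigenvalues placed on the diagonal of S

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_norm = X - X.mean(axis=0)            # center each feature before PCA
U, S = pca_sketch(X_norm)
```

Because the covariance matrix is symmetric and positive semi-definite, its singular values coincide with its eigenvalues, which is why the SVD can stand in for an eigendecomposition here.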
-
touvlo.unsupv.pca.
project_data
(X, U, k)[source]¶ Computes reduced data representation (projected data)
Parameters: - X (numpy.array) – Normalized features’ dataset
- U (numpy.array) – eigenvectors of covariance matrix
- k (int) – Number of features in reduced data representation
Returns: Reduced data representation (projection)
Return type: numpy.array
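The projection described above can be sketched as a single matrix product with the first k columns of U. This is an illustration under that assumption, not the library's code; `project_sketch` is a hypothetical name.

```python
import numpy as np

def project_sketch(X, U, k):
    """Project X (m, n) onto the first k columns of U, giving shape (m, k)."""
    return X @ U[:, :k]

# Orthonormal basis whose first column points along the (1, 1) direction.
U = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
Z = project_sketch(X, U, 1)
```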
-
touvlo.unsupv.pca.
recover_data
(Z, U, k)[source]¶ Recovers an approximation of original data using the projected data
Parameters: - Z (numpy.array) – Reduced data representation (projection)
- U (numpy.array) – eigenvectors of covariance matrix
- k (int) – Number of features in reduced data representation
Returns: Approximated features’ dataset
Return type: numpy.array
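Recovery is usually the transpose of the projection step. The sketch below assumes U has orthonormal columns and uses hypothetical names; it shows a round trip on data that lies exactly along the first component, so nothing is lost.

```python
import numpy as np

def recover_sketch(Z, U, k):
    """Map projected data Z (m, k) back to the original n-dimensional space."""
    return Z @ U[:, :k].T

U = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
Z = X @ U[:, :1]                       # projection onto the first component
X_rec = recover_sketch(Z, U, 1)
```

For data with variance along the discarded components, `X_rec` would only approximate `X`.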
K-means¶
-
touvlo.unsupv.kmeans.
compute_centroids
(X, idx, K)[source]¶ Computes each centroid as the mean of its cluster’s members.
Parameters: - X (numpy.array) – Features’ dataset
- idx (numpy.array) – Column vector of assigned centroids’ indices.
- K (int) – Number of centroids.
Returns: Column vector of newly computed centroids
Return type: numpy.array
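A short sketch of the centroid update, assuming 0-based indices in idx and leaving an empty cluster's centroid at zero. Both choices are assumptions for illustration, and `compute_centroids_sketch` is a hypothetical name.

```python
import numpy as np

def compute_centroids_sketch(X, idx, K):
    """Return a (K, n) array whose row k is the mean of the examples assigned to k."""
    idx = np.asarray(idx).ravel()
    centroids = np.zeros((K, X.shape[1]))
    for k in range(K):
        members = X[idx == k]
        if len(members) > 0:           # skip empty clusters (assumption)
            centroids[k] = members.mean(axis=0)
    return centroids

X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
idx = np.array([[0], [0], [1]])
centroids = compute_centroids_sketch(X, idx, 2)
```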
-
touvlo.unsupv.kmeans.
cost_function
(X, idx, centroids)[source]¶ Calculates the cost function for K-means.
Parameters: - X (numpy.array) – Features’ dataset
- idx (numpy.array) – Column vector of assigned centroids’ indices.
- centroids (numpy.array) – Column vector of centroids.
Returns: Computed cost
Return type:
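The K-means cost is commonly defined as the mean squared distance between each example and its assigned centroid. The sketch below assumes that definition and 0-based indices; the library may normalize differently, and `cost_sketch` is a hypothetical name.

```python
import numpy as np

def cost_sketch(X, idx, centroids):
    """Mean squared distance from each example to its assigned centroid."""
    idx = np.asarray(idx).ravel()
    diffs = X - centroids[idx]         # row i minus the centroid assigned to row i
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
idx = np.array([[0], [0], [1]])
centroids = np.array([[1.0, 1.0], [10.0, 10.0]])
cost = cost_sketch(X, idx, centroids)
```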
-
touvlo.unsupv.kmeans.
elbow_method
(X, K_values, max_iters, n_inits)[source]¶ Calculates the cost for each given K.
Parameters:
Returns: A 2-tuple of K_values, a list of possible numbers of centroids, and cost_values, a computed cost for each K.
Return type:
-
touvlo.unsupv.kmeans.
euclidean_dist
(p, q)[source]¶ Calculates Euclidean distance between 2 n-dimensional points.
Parameters: - p (numpy.array) – First n-dimensional point.
- q (numpy.array) – Second n-dimensional point.
Returns: Distance between 2 points.
Return type:
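For reference, the Euclidean distance between two points is the square root of the sum of squared coordinate differences. A small sketch with a hypothetical function name:

```python
import numpy as np

def euclidean_dist_sketch(p, q):
    """Euclidean distance between two n-dimensional points."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

dist = euclidean_dist_sketch([0, 0], [3, 4])
```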
-
touvlo.unsupv.kmeans.
find_closest_centroids
(X, initial_centroids)[source]¶ Assigns to each example the index of the closest centroid.
Parameters: - X (numpy.array) – Features’ dataset
- initial_centroids (numpy.array) – List of initialized centroids.
Returns: Column vector of assigned centroids’ indices.
Return type: numpy.array
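The assignment step can be sketched with NumPy broadcasting. The version below assumes 0-based indices and squared Euclidean distance, and the name `find_closest_sketch` is hypothetical.

```python
import numpy as np

def find_closest_sketch(X, centroids):
    """Return an (m, 1) column vector with the index of the nearest centroid."""
    # (m, 1, n) - (1, K, n) broadcasts to (m, K, n); sum over features gives (m, K).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1).reshape(-1, 1)

X = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
idx = find_closest_sketch(X, centroids)
```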
-
touvlo.unsupv.kmeans.
init_centroids
(X, K)[source]¶ Initializes centroids by randomly picking examples from the dataset.
Parameters: - X (numpy.array) – Features’ dataset
- K (int) – Number of centroids.
Returns: Column vector of centroids randomly picked from dataset
Return type: numpy.array
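One common way to pick initial centroids is to sample K distinct rows of X. The sketch below assumes sampling without replacement; the `rng` argument and the function name are hypothetical additions for reproducibility.

```python
import numpy as np

def init_centroids_sketch(X, K, rng=None):
    """Pick K distinct examples from X to serve as initial centroids."""
    rng = np.random.default_rng() if rng is None else rng
    rows = rng.choice(X.shape[0], size=K, replace=False)
    return X[rows]

X = np.arange(10, dtype=float).reshape(5, 2)
centroids = init_centroids_sketch(X, 3, rng=np.random.default_rng(0))
```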
-
touvlo.unsupv.kmeans.
run_intensive_kmeans
(X, K, max_iters, n_inits)[source]¶ Applies kmeans using multiple random initializations.
Parameters:
Returns: A 2-tuple of centroids, a column vector of centroids, and idx, a column vector of assigned centroids’ indices.
Return type: (numpy.array, numpy.array)
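A self-contained sketch of K-means with several random restarts, keeping the run with the lowest cost. It assumes mean squared distance as the cost and sampling without replacement for initialization. The helper names and the `seed` argument are hypothetical and not part of the documented signature.

```python
import numpy as np

def kmeans_once(X, K, max_iters, rng):
    """Run one K-means pass from a random start; return centroids, idx, cost."""
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)]
    for _ in range(max_iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        for k in range(K):
            if np.any(idx == k):       # keep the old centroid if a cluster empties
                centroids[k] = X[idx == k].mean(axis=0)
    # Final assignment and cost for the returned centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    cost = d2[np.arange(X.shape[0]), idx].mean()
    return centroids, idx.reshape(-1, 1), cost

def intensive_kmeans_sketch(X, K, max_iters, n_inits, seed=0):
    """Repeat K-means n_inits times and keep the lowest-cost result."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_inits):
        result = kmeans_once(X, K, max_iters, rng)
        if best is None or result[2] < best[2]:
            best = result
    return best[0], best[1]

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, idx = intensive_kmeans_sketch(X, K=2, max_iters=10, n_inits=5)
```

Restarting matters because a single run can settle in a poor local optimum. The same loop, repeated over a list of K values while recording each best cost, is the idea behind an elbow plot.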
Anomaly Detection¶
-
touvlo.unsupv.anmly_detc.
cov_matrix
(X, mu)[source]¶ Calculates the covariance matrix for matrix X (m x n).
Parameters: - X (numpy.array) – Features’ dataset.
- mu (numpy.array) – Mean of each feature/column of X.
Returns: Covariance matrix (n x n)
Return type:
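A sketch of the covariance computation, assuming the maximum-likelihood form that divides by m rather than m - 1. The library's normalization is not stated here, and `cov_matrix_sketch` is a hypothetical name.

```python
import numpy as np

def cov_matrix_sketch(X, mu):
    """Covariance matrix (n, n) of X (m, n) around the feature means mu."""
    m = X.shape[0]
    centered = X - mu
    return (centered.T @ centered) / m

X = np.array([[1.0, 2.0], [3.0, 6.0]])
mu = X.mean(axis=0)
sigma = cov_matrix_sketch(X, mu)
```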
-
touvlo.unsupv.anmly_detc.
estimate_multi_gaussian
(X)[source]¶ Estimates parameters for Multivariate Gaussian distribution.
Parameters: X (numpy.array) – Features’ dataset.
Returns: A 2-tuple of mu, the mean of each feature/column of X, and sigma, the covariance matrix for X.
Return type: (numpy.array, numpy.array)
-
touvlo.unsupv.anmly_detc.
estimate_uni_gaussian
(X)[source]¶ Estimates parameters for Univariate Gaussian distribution.
Parameters: X (numpy.array) – Features’ dataset.
Returns: A 2-tuple of mu, the mean of each feature/column of X, and sigma2, the variance of each feature/column of X.
Return type: (numpy.array, numpy.array)
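Parameter estimation for the univariate case reduces to per-column means and variances. The sketch assumes the population variance (division by m); `estimate_uni_gaussian_sketch` is a hypothetical name.

```python
import numpy as np

def estimate_uni_gaussian_sketch(X):
    """Return per-feature mean and variance of X (m, n)."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)             # ddof=0, i.e. divides by m
    return mu, sigma2

X = np.array([[1.0, 2.0], [3.0, 6.0]])
mu, sigma2 = estimate_uni_gaussian_sketch(X)
```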
-
touvlo.unsupv.anmly_detc.
is_anomaly
(p, threshold=0.5)[source]¶ Predicts whether a probability falls into class 1 (anomaly).
Parameters: - p (numpy.array) – Probability that example belongs to class 1 (is anomaly).
- threshold (float) – Value below which an example is considered class 1 (anomaly).
Returns: Binary value to denote class 1 or 0
Return type:
-
touvlo.unsupv.anmly_detc.
multi_gaussian
(X, mu, sigma)[source]¶ Estimates probability that examples belong to Multivariate Gaussian.
Parameters: - X (numpy.array) – Features’ dataset.
- mu (numpy.array) – Mean of each feature/column of X.
- sigma (numpy.array) – Covariance matrix for X.
Returns: Probability density function for each example
Return type: numpy.array
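The multivariate Gaussian density can be evaluated for all examples at once. This sketch assumes sigma is a full, invertible covariance matrix and uses a hypothetical function name; it is an illustration of the standard formula rather than the library's code.

```python
import numpy as np

def multi_gaussian_sketch(X, mu, sigma):
    """Multivariate Gaussian density for each row of X (m, n)."""
    n = mu.shape[0]
    diff = X - mu
    inv = np.linalg.inv(sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    # Row-wise quadratic form diff_i^T inv diff_i.
    expo = -0.5 * np.sum((diff @ inv) * diff, axis=1)
    return norm * np.exp(expo)

mu = np.array([0.0, 0.0])
sigma = np.diag([1.0, 4.0])
X = np.array([[0.0, 0.0], [0.0, 2.0]])
p = multi_gaussian_sketch(X, mu, sigma)
```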
-
touvlo.unsupv.anmly_detc.
predict
(X, epsilon, gaussian, **kwargs)[source]¶ Predicts whether examples are anomalies.
Parameters: - X (numpy.array) – Features’ dataset.
- epsilon (float) – Value below which an example is considered class 1 (anomaly).
- gaussian (function) – Function that estimates the probability that an example belongs to the distribution.
Returns: Column vector of classifications
Return type: numpy.array
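A sketch of threshold-based prediction: evaluate the density with the supplied function, then flag examples whose density falls below epsilon. The density helper and all names here are hypothetical stand-ins, and the extra keyword arguments are assumed to be forwarded to the density function.

```python
import numpy as np

def density(X, mu, sigma2):
    """Stand-in density: product of independent per-feature Gaussians."""
    dens = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return dens.prod(axis=1)

def predict_sketch(X, epsilon, gaussian, **kwargs):
    """Return an (m, 1) column of 1 (anomaly) or 0 (normal)."""
    p = gaussian(X, **kwargs)          # forward distribution parameters
    return (p < epsilon).astype(int).reshape(-1, 1)

X = np.array([[0.0], [5.0]])
labels = predict_sketch(X, 0.01, density,
                        mu=np.array([0.0]), sigma2=np.array([1.0]))
```

The point at 5 standard deviations from the mean has a density far below 0.01, so it is the one flagged.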
-
touvlo.unsupv.anmly_detc.
uni_gaussian
(X, mu, sigma2)[source]¶ Estimates probability that examples belong to Univariate Gaussian.
Parameters: - X (numpy.array) – Features’ dataset.
- mu (numpy.array) – Mean of each feature/column of X.
- sigma2 (numpy.array) – Variance of each feature/column of X.
Returns: Probability density function for each example
Return type: numpy.array
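For the univariate case, a common formulation treats features as independent and multiplies the per-feature densities to get one value per example. The sketch below assumes that formulation; `uni_gaussian_sketch` is a hypothetical name.

```python
import numpy as np

def uni_gaussian_sketch(X, mu, sigma2):
    """Product of per-feature Gaussian densities for each row of X (m, n)."""
    dens = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return dens.prod(axis=1)

X = np.array([[0.0, 0.0], [1.0, 0.0]])
mu = np.array([0.0, 0.0])
sigma2 = np.array([1.0, 1.0])
p = uni_gaussian_sketch(X, mu, sigma2)
```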