Unsupervised learning
PCA
touvlo.unsupv.pca.pca(X)
Runs Principal Component Analysis on a dataset.
Parameters: X (numpy.array) – Features’ dataset
Returns: A 2-tuple of U, the eigenvectors of the covariance matrix, and S, the eigenvalues (on the diagonal) of the covariance matrix.
Return type: (numpy.array, numpy.array)
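The computation described in this entry can be sketched with plain NumPy. This is an illustration only, not touvlo's implementation: `pca_sketch` is a hypothetical name, and the sketch assumes X is already mean-normalized and that the covariance is scaled by 1/m.

```python
import numpy as np

def pca_sketch(X):
    """Hypothetical stand-in: eigen-decompose the covariance of X via SVD."""
    m = X.shape[0]
    # Covariance matrix (n x n); assumes X is already mean-normalized.
    sigma = (X.T @ X) / m
    # For a symmetric positive semi-definite matrix, the SVD gives the
    # eigenvectors (columns of U) and eigenvalues (s) in decreasing order.
    U, s, _ = np.linalg.svd(sigma)
    # Place the eigenvalues on the diagonal, as the entry describes for S.
    return U, np.diag(s)

X = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
U, S = pca_sketch(X)
```

For this toy dataset all variance lies along the direction (1, 1), so only the first eigenvalue is non-zero.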
touvlo.unsupv.pca.project_data(X, U, k)
Computes a reduced representation of the data (the projected data).
Parameters: - X (numpy.array) – Normalized features’ dataset
- U (numpy.array) – eigenvectors of covariance matrix
- k (int) – Number of features in reduced data representation
Returns: Reduced data representation (projection)
Return type: numpy.array
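A rough sketch of the projection step, assuming the eigenvectors are stored as columns of U and the first k of them are kept. `project_data_sketch` is a hypothetical name used for illustration and may differ from what the library does internally.

```python
import numpy as np

def project_data_sketch(X, U, k):
    """Hypothetical stand-in: project X onto the first k columns of U."""
    U_reduce = U[:, :k]   # (n x k) basis of the reduced space
    return X @ U_reduce   # (m x k) projected data

X = np.array([[1.0, 2.0], [3.0, 4.0]])
U = np.eye(2)  # identity basis, chosen only to keep the arithmetic obvious
Z = project_data_sketch(X, U, k=1)
```

With the identity basis, the projection simply keeps the first feature of each example.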
touvlo.unsupv.pca.recover_data(Z, U, k)
Recovers an approximation of the original data from the projected data.
Parameters: - Z (numpy.array) – Reduced data representation (projection)
- U (numpy.array) – eigenvectors of covariance matrix
- k (int) – Number of features in reduced data representation
Returns: Approximated features’ dataset
Return type: numpy.array
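Recovery is the reverse mapping from the reduced space back to the original feature space. The sketch below assumes the same column layout for U as in the projection step; `recover_data_sketch` is a hypothetical name, not the library's code.

```python
import numpy as np

def recover_data_sketch(Z, U, k):
    """Hypothetical stand-in: map projected data back to n dimensions."""
    U_reduce = U[:, :k]    # (n x k) basis used for the projection
    return Z @ U_reduce.T  # (m x n) approximation of the original data

Z = np.array([[1.0], [3.0]])
U = np.eye(2)  # identity basis, chosen only to keep the arithmetic obvious
X_approx = recover_data_sketch(Z, U, k=1)
```

The discarded second feature comes back as zero, which is why the result is only an approximation.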
K-means
touvlo.unsupv.kmeans.compute_centroids(X, idx, K)
Computes each centroid as the mean of its cluster’s members if the centroid has any members; otherwise it returns an array of nan.
Parameters: - X (numpy.array) – Features’ dataset
- idx (numpy.array) – Column vector of assigned centroids’ indices.
- K (int) – Number of centroids.
Returns: Column vector of newly computed centroids
Return type: numpy.array
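The mean-or-nan behaviour described above might look like the following sketch. It assumes zero-based cluster indices in idx, which the entry does not state; `compute_centroids_sketch` is a hypothetical name.

```python
import numpy as np

def compute_centroids_sketch(X, idx, K):
    """Hypothetical stand-in: mean of each cluster, nan for empty clusters."""
    idx = np.asarray(idx).ravel()  # accept a column vector of indices
    centroids = np.full((K, X.shape[1]), np.nan)
    for k in range(K):
        members = X[idx == k]
        if len(members) > 0:       # leave nan when the cluster is empty
            centroids[k] = members.mean(axis=0)
    return centroids

X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
idx = np.array([[0], [0], [1]])
centroids = compute_centroids_sketch(X, idx, K=3)
```

Here the third cluster has no members, so its row stays nan.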
touvlo.unsupv.kmeans.cost_function(X, idx, centroids)
Calculates the cost function for K-means.
Parameters: - X (numpy.array) – Features’ dataset
- idx (numpy.array) – Column vector of assigned centroids’ indices.
Returns: Computed cost
Return type:
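One common definition of the K-means cost is the mean squared distance between each example and its assigned centroid. The sketch below uses that definition as an assumption; the entry does not say which normalization the library applies, and `kmeans_cost_sketch` is a hypothetical name.

```python
import numpy as np

def kmeans_cost_sketch(X, idx, centroids):
    """Hypothetical stand-in: mean squared distance to assigned centroid."""
    idx = np.asarray(idx).ravel()   # zero-based indices (an assumption)
    diffs = X - centroids[idx]      # offset of each example from its centroid
    return float(np.sum(diffs ** 2) / X.shape[0])

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
idx = np.array([[0], [0], [1]])
centroids = np.array([[1.0, 0.0], [10.0, 10.0]])
cost = kmeans_cost_sketch(X, idx, centroids)
```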
touvlo.unsupv.kmeans.elbow_method(X, K_values, max_iters, n_inits)
Calculates the cost for each given K.
Returns: A list of cost values, one for each K.
Return type:
touvlo.unsupv.kmeans.euclidean_dist(p, q)
Calculates the Euclidean distance between two n-dimensional points.
Parameters: - p (numpy.array) – First n-dimensional point.
- q (numpy.array) – Second n-dimensional point.
Returns: Distance between 2 points.
Return type:
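For reference, the Euclidean distance is the square root of the summed squared coordinate differences. A minimal NumPy sketch, with `euclidean_dist_sketch` as a hypothetical name:

```python
import numpy as np

def euclidean_dist_sketch(p, q):
    """Hypothetical stand-in: straight-line distance between p and q."""
    diff = np.asarray(p) - np.asarray(q)
    return float(np.sqrt(np.sum(diff ** 2)))

d = euclidean_dist_sketch(np.array([0.0, 0.0]), np.array([3.0, 4.0]))
```

The two points form a 3-4-5 right triangle, so the distance is 5.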
touvlo.unsupv.kmeans.find_closest_centroids(X, initial_centroids)
Assigns to each example the index of its closest centroid.
Parameters: - X (numpy.array) – Features’ dataset
- initial_centroids (numpy.array) – List of initialized centroids.
Returns: Column vector of assigned centroids’ indices.
Return type: numpy.array
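The assignment step can be sketched with broadcasting: compute every example-to-centroid squared distance, then take the argmin per example. Zero-based indices and the name `find_closest_centroids_sketch` are assumptions made for illustration.

```python
import numpy as np

def find_closest_centroids_sketch(X, centroids):
    """Hypothetical stand-in: index of the nearest centroid per example."""
    # Pairwise squared distances, shape (m, K), via broadcasting.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Return a column vector, matching the shape the entry describes.
    return d2.argmin(axis=1).reshape(-1, 1)

X = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
idx = find_closest_centroids_sketch(X, centroids)
```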
touvlo.unsupv.kmeans.init_centroids(X, K)
Initializes centroids by randomly picking examples from the dataset.
Parameters: - X (numpy.array) – Features’ dataset
- K (int) – Number of centroids.
Returns: Column vector of centroids randomly picked from dataset
Return type: numpy.array
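Random initialization from the dataset could be sketched as below. Sampling without replacement and the extra `rng` argument are assumptions added here for reproducibility; neither appears in the documented signature, and `init_centroids_sketch` is a hypothetical name.

```python
import numpy as np

def init_centroids_sketch(X, K, rng=None):
    """Hypothetical stand-in: pick K distinct rows of X as centroids."""
    # `rng` is not part of the documented signature; it only makes
    # the example repeatable.
    rng = np.random.default_rng() if rng is None else rng
    rows = rng.choice(X.shape[0], size=K, replace=False)
    return X[rows]

X = np.arange(10.0).reshape(5, 2)
centroids = init_centroids_sketch(X, K=2, rng=np.random.default_rng(0))
```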
touvlo.unsupv.kmeans.run_intensive_kmeans(X, K, max_iters, n_inits)
Applies K-means using multiple random initializations.
Returns: A 2-tuple of centroids, a column vector of centroids, and idx, a column vector of assigned centroids’ indices.
Return type: (numpy.array, numpy.array)
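The idea of multiple random initializations is to run K-means several times and keep the run with the lowest cost. The sketch below is one way to do that under stated assumptions: the helper `kmeans_once`, the `seed` argument, the mean-squared-distance cost, and the choice to leave an empty cluster's centroid unchanged are all hypothetical and may not match the library.

```python
import numpy as np

def kmeans_once(X, K, max_iters, rng):
    """Hypothetical helper: one K-means run from a random start."""
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)]
    for _ in range(max_iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        for k in range(K):
            if np.any(idx == k):  # empty clusters keep their old centroid
                centroids[k] = X[idx == k].mean(axis=0)
    # Final assignment and cost for the converged centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    cost = d2[np.arange(X.shape[0]), idx].mean()
    return centroids, idx, cost

def run_intensive_kmeans_sketch(X, K, max_iters, n_inits, seed=0):
    """Hypothetical stand-in: keep the lowest-cost run out of n_inits."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_inits):
        centroids, idx, cost = kmeans_once(X, K, max_iters, rng)
        if best is None or cost < best[2]:
            best = (centroids, idx, cost)
    return best[0], best[1].reshape(-1, 1)

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, idx = run_intensive_kmeans_sketch(X, K=2, max_iters=10, n_inits=5)
```

On this toy dataset the two well-separated pairs of points end up in separate clusters.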
Anomaly Detection
touvlo.unsupv.anmly_detc.cov_matrix(X, mu)
Calculates the covariance matrix for matrix X (m x n).
Parameters: - X (numpy.array) – Features’ dataset.
- mu (numpy.array) – Mean of each feature/column of X.
Returns: Covariance matrix (n x n)
Return type:
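A covariance matrix of this shape can be sketched by centering X with mu and averaging the outer products. Dividing by m rather than m - 1 is an assumption; the entry does not specify the normalization. `cov_matrix_sketch` is a hypothetical name.

```python
import numpy as np

def cov_matrix_sketch(X, mu):
    """Hypothetical stand-in: (n x n) covariance of the columns of X."""
    m = X.shape[0]
    centered = X - mu                  # subtract each feature's mean
    return (centered.T @ centered) / m # 1/m scaling is an assumption

X = np.array([[1.0, 2.0], [3.0, 6.0]])
mu = X.mean(axis=0)
sigma = cov_matrix_sketch(X, mu)
```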
touvlo.unsupv.anmly_detc.estimate_multi_gaussian(X)
Estimates parameters for a Multivariate Gaussian distribution.
Parameters: X (numpy.array) – Features’ dataset.
Returns: A 2-tuple of mu, the mean of each feature/column of X, and sigma, the covariance matrix for X.
Return type: (numpy.array, numpy.array)
touvlo.unsupv.anmly_detc.estimate_uni_gaussian(X)
Estimates parameters for a Univariate Gaussian distribution.
Parameters: X (numpy.array) – Features’ dataset.
Returns: A 2-tuple of mu, the mean of each feature/column of X, and sigma2, the variance of each feature/column of X.
Return type: (numpy.array, numpy.array)
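Per-feature parameter estimation amounts to a column-wise mean and variance. The sketch assumes the population variance (division by m), which the entry does not confirm; `estimate_uni_gaussian_sketch` is a hypothetical name.

```python
import numpy as np

def estimate_uni_gaussian_sketch(X):
    """Hypothetical stand-in: per-column mean and variance of X."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # population variance (ddof=0), an assumption
    return mu, sigma2

X = np.array([[1.0, 10.0], [3.0, 30.0]])
mu, sigma2 = estimate_uni_gaussian_sketch(X)
```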
touvlo.unsupv.anmly_detc.is_anomaly(p, threshold=0.5)
Predicts whether a probability falls into class 1 (anomaly).
Parameters: - p (numpy.array) – Probability that example belongs to class 1 (is anomaly).
- threshold (float) – Point below which an example is considered to be of class 1.
Returns: Binary value to denote class 1 or 0
Return type:
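Thresholding can be sketched as an element-wise comparison. Reading "below" as a strict less-than and returning integers are assumptions; `is_anomaly_sketch` is a hypothetical name.

```python
import numpy as np

def is_anomaly_sketch(p, threshold=0.5):
    """Hypothetical stand-in: 1 where p is below the threshold, else 0."""
    # Strict comparison is an assumption about how "below" is interpreted.
    return (np.asarray(p) < threshold).astype(int)

p = np.array([0.01, 0.5, 0.9])
labels = is_anomaly_sketch(p)
```

Only the first value is below the default threshold, so only it is flagged.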
touvlo.unsupv.anmly_detc.multi_gaussian(X, mu, sigma)
Estimates the probability that examples belong to a Multivariate Gaussian.
Parameters: - X (numpy.array) – Features’ dataset.
- mu (numpy.array) – Mean of each feature/column of X.
- sigma (numpy.array) – Covariance matrix for X.
Returns: Probability density function for each example
Return type: numpy.array
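The standard multivariate Gaussian density can be evaluated per example as in the sketch below. It is an illustration of the textbook formula, assuming sigma is a full, invertible covariance matrix; `multi_gaussian_sketch` is a hypothetical name and the library's numerical approach may differ.

```python
import numpy as np

def multi_gaussian_sketch(X, mu, sigma):
    """Hypothetical stand-in: multivariate Gaussian density per example."""
    n = X.shape[1]
    diff = X - mu
    # Squared Mahalanobis distance per example; solve() avoids forming
    # the explicit inverse of sigma.
    mahal = np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)
    # Normalizing constant: sqrt((2*pi)^n * det(sigma)).
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    return np.exp(-0.5 * mahal) / norm

X = np.array([[0.0, 0.0], [1.0, 0.0]])
p = multi_gaussian_sketch(X, mu=np.zeros(2), sigma=np.eye(2))
```

With zero mean and identity covariance, the density at the origin is 1 / (2π).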
touvlo.unsupv.anmly_detc.predict(X, epsilon, gaussian, **kwargs)
Predicts whether examples are anomalies.
Parameters: - X (numpy.array) – Features’ dataset.
- epsilon (float) – Point below which an example is considered to be of class 1.
- gaussian (function) – Function that estimates the membership probability.
Returns: Column vector of classification
Return type: numpy.array
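Conceptually, prediction composes a density function with a threshold. The sketch below assumes the extra keyword arguments are forwarded to the density function and that "below" means strict less-than; `predict_sketch` and `toy_density` are hypothetical names invented for the example.

```python
import numpy as np

def toy_density(X, mu):
    """Hypothetical density-like score: decays with distance from mu."""
    return np.exp(-np.sum((X - mu) ** 2, axis=1))

def predict_sketch(X, epsilon, gaussian, **kwargs):
    """Hypothetical stand-in: flag examples whose score is below epsilon."""
    p = gaussian(X, **kwargs)  # forward extra parameters (an assumption)
    return (p < epsilon).astype(int).reshape(-1, 1)

X = np.array([[0.0, 0.0], [3.0, 3.0]])
labels = predict_sketch(X, 0.01, toy_density, mu=np.zeros(2))
```

The point far from the mean scores well below epsilon and is the only one flagged.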
touvlo.unsupv.anmly_detc.uni_gaussian(X, mu, sigma2)
Estimates the probability that examples belong to a Univariate Gaussian.
Parameters: - X (numpy.array) – Features’ dataset.
- mu (numpy.array) – Mean of each feature/column of X.
- sigma2 (numpy.array) – Variance of each feature/column of X.
Returns: Probability density function for each example
Return type: numpy.array
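One way to obtain a single density value per example from per-feature Gaussians is to multiply the feature densities, which treats features as independent. That combination rule is an assumption here, and `uni_gaussian_sketch` is a hypothetical name.

```python
import numpy as np

def uni_gaussian_sketch(X, mu, sigma2):
    """Hypothetical stand-in: product of per-feature Gaussian densities."""
    per_feature = (
        np.exp(-((X - mu) ** 2) / (2 * sigma2))
        / np.sqrt(2 * np.pi * sigma2)
    )
    # Multiply across features, assuming they are independent.
    return per_feature.prod(axis=1)

X = np.array([[0.0, 0.0], [1.0, 0.0]])
p = uni_gaussian_sketch(X, mu=np.zeros(2), sigma2=np.ones(2))
```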