Machine learning, a branch of artificial intelligence, enables systems to learn from data and improve without explicit programming. It addresses various data-driven challenges using different methodologies. This blog explores three primary types of machine learning: supervised, unsupervised, and reinforcement learning, along with key techniques and algorithms.
Supervised Learning:
Supervised learning uses labeled datasets to train algorithms, enabling them to predict outputs for new data. By learning the relationship between dependent (target) and independent (feature) variables, these models can make accurate predictions. Supervised learning tasks are divided into two categories: regression and classification.
Regression (Continuous):
Predicting a continuous value. For example, predicting house prices based on features like size and location.
Linear Regression
Linear Regression is useful for finding the linear relationship between the independent and dependent variables of a dataset.
Y' = b0 + b1·X

where Y' is the predicted (dependent) variable, X is the independent variable, b0 is the intercept, and b1 is the slope. This equation is called the Linear Regression model. The above explanation is demonstrated in the picture below:
To assess how reliably the Linear Regression model performs on unseen data, several metrics are commonly used:
Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values.
Mean Squared Error (MSE): Calculates the average of the squared differences between predicted and actual values.
R-squared (R²): Indicates the proportion of the variance in the dependent variable predictable from the independent variable(s).
These metrics help assess the accuracy and goodness of fit of the model.
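As a rough illustration of these metrics, here is a minimal sketch using scikit-learn (assumed available); the feature names and price data are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy data: house size (sq. ft.) and number of rooms vs. price
rng = np.random.default_rng(0)
X = rng.uniform([500, 1], [3500, 6], size=(200, 2))                 # features: size, rooms
y = 50 * X[:, 0] + 10_000 * X[:, 1] + rng.normal(0, 20_000, 200)    # price with noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # learns intercept b0 and slopes b1, b2
y_pred = model.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```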
Assumptions of Linear Regression:
Linear regression operates under key assumptions that must be satisfied for reliable results:
Linearity: The relationship between dependent and independent variables must be linear.
No Multicollinearity: Independent variables should not be highly correlated with each other.
Homoscedasticity: The variance of errors should be constant across all levels of the independent variables.
Normal Distribution of Errors: Error terms should be normally distributed to ensure accurate confidence intervals.
No Endogeneity: Independent variables should not be correlated with the error terms, to avoid biased estimates.
Violating these assumptions can lead to unreliable model performance and conclusions.
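The sketch below shows one way to spot-check two of these assumptions, no multicollinearity and normally distributed errors, assuming statsmodels and scipy are installed; the data here is synthetic and illustrative:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative data: two features plus a noisy linear target.
rng = np.random.default_rng(0)
X = pd.DataFrame({"size": rng.uniform(500, 3500, 200), "rooms": rng.integers(1, 6, 200)})
y = 50 * X["size"] + 10_000 * X["rooms"] + rng.normal(0, 20_000, 200)

# No multicollinearity: Variance Inflation Factors well above ~5-10 are a warning sign.
for i, col in enumerate(X.columns):
    print(col, "VIF:", variance_inflation_factor(X.values, i))

# Normal distribution of errors: Shapiro-Wilk test on the residuals
# (a very small p-value suggests the errors are not normally distributed).
residuals = y - LinearRegression().fit(X, y).predict(X)
print("Shapiro-Wilk p-value:", stats.shapiro(residuals).pvalue)
```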
Classification (Categorical):
Predicting a category or class label. For example, determining whether an email is spam or not.
Two of the most common classification algorithms are Logistic Regression and K-Nearest Neighbors (K-NN).
Logistic Regression
As we saw in Linear Regression, we fit a line to our data. However, linear regression is not ideal for the situation in which we only need to output binary values like 0 and 1, as seen in the image below.
To overcome this problem, we add non-linearity to our regression equation. This is done using the Sigmoid function, which is given as:

P = 1 / (1 + e^-(b0 + b1·X))

where X is the set of input features and P is the predicted probability.
The values of a Sigmoid function range from 0 to 1 which makes it suitable to represent probability.
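A minimal sketch of the sigmoid and of logistic regression with scikit-learn (assumed available); the toy data and labels are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Squashes any real number into the (0, 1) range, so it can act as a probability."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ~[0.007, 0.5, 0.993]

# Toy binary classification: one feature, label 1 when the feature is large.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 1))
y = (X[:, 0] + rng.normal(0, 0.5, 200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[1.5]]))   # probabilities of class 0 and class 1
```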
K-Nearest Neighbors
K-NN can be used for both classification and regression problems, though it is more widely used for classification. It classifies a data point based on the classes of its closest neighbors.
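A minimal K-NN sketch using scikit-learn (assumed available); the Iris dataset and k = 5 are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # each point is labeled by a vote of its 5 nearest neighbors
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```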
Unsupervised Learning
Unlike supervised learning, unsupervised learning is a category of techniques that train computers on unlabeled data and let them learn by themselves. The algorithm is provided with a large volume of data and is expected to identify hidden patterns. Since no label is available for any data instance, there is no predefined correct outcome; the machine only needs to determine whether there are any patterns in the given data.
Clustering:
In clustering, the key idea is to divide the data into groups such that each group shares similar properties and each group is as dissimilar as possible to the other groups. An example is k-means clustering, where data is partitioned into k clusters based on feature similarities.
K-means
K-means Clustering is an unsupervised learning algorithm. Like other clustering algorithms, it tries to aggregate similar objects into groups called clusters. In K-means Clustering, K refers to the number of clusters required. The concept of a centroid, which is the geometric center of a cluster, is used to determine the clusters that K-means finds in the dataset.
Let’s understand this using an example. Suppose you go to a vegetable shop to buy some vegetables. There, you'll see different kinds of vegetables, and you may notice that they are arranged in groups by type. The carrots and radishes will probably be kept together in one place, onion and garlic in another, potatoes in their own pile, and so on. This arrangement resembles a group or a cluster, where each vegetable is kept with others of its kind, forming the clusters.
The image on the left is before clustering, where all the categories or groups appear to be mixed up (same color), while the image on the right is after clustering where groups of similar data points seem to be clustered together and depicted with different colors.
To human eyes, it can be difficult to analyze or understand a mixed-up group like the one represented by the image on the left. So we apply clustering techniques like K-means Clustering to convert this into a more visually distinct dataset with separate groups or categories of data.
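A minimal K-means sketch using scikit-learn (assumed available); the synthetic blobs and K = 3 are illustrative choices:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled data drawn from three hidden groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels for the first 10 points:", kmeans.labels_[:10])
print("Centroids (geometric centers of the clusters):\n", kmeans.cluster_centers_)
```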
Gaussian Mixture Models (GMM)
Gaussian Mixture Models (GMMs) are an unsupervised machine learning technique used for clustering data. Whereas K-means performs clustering taking only the mean of the data into consideration, Gaussian Mixture Models use both the mean and the variance of the data for clustering.
Gaussian Mixture Models assume that the underlying data represents a mixture of Gaussians, as shown in the image below.
Gaussian Mixture Models are probabilistic models for representing normally distributed subpopulations within an overall population. The Gaussian Mixture Model tries to learn the parameters (mean, variance) of each Gaussian in the dataset in order to cluster the data points. The parameters mean (µ) and variance (σ²) of each Gaussian are learned using the Expectation-Maximization (EM) algorithm.
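A minimal GMM sketch using scikit-learn (assumed available), where the EM algorithm fits the mean and covariance of each component; the synthetic data is illustrative:

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.5, 1.0], random_state=0)

# EM learns the mean and covariance (variance) of each Gaussian component.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
print("Means:\n", gmm.means_)
print("Soft cluster probabilities for the first point:", gmm.predict_proba(X[:1]))
```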
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN is another type of clustering algorithm. It works based on the concept of density difference. Density is interpreted as the concentration of the samples present in a region. The algorithm differentiates between high and low-density regions and clusters the points accordingly.
The algorithm has two parameters that are used to define what counts as dense:
epsilon (eps): It specifies how close points should be to each other to be considered as part of a cluster.
Two points are considered neighbors if the distance between the two points is below the threshold epsilon.
minPoints: The minimum number of points to form a dense region.
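A minimal DBSCAN sketch using scikit-learn (assumed available); note that scikit-learn calls the minPoints parameter min_samples, and the values used here are illustrative:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples is the minimum number of points for a dense region.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print("Cluster labels found:", set(db.labels_))   # -1 marks points treated as noise
```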
Dimensionality Reduction:
In real-world situations, we often deal with high-dimensional data, with lots of columns or "features" that represent the information collected about each observation. High-dimensional data is disadvantageous for a couple of reasons:
It is generally difficult to analyze or visualize high-dimensional data and identify hidden patterns.
Not all the features or dimensions of the data are equally important.
Therefore, we need to reduce the dimensionality of the dataset so that, while losing only a minimal amount of information, we can visualize the data and identify patterns more easily with a smaller number of features.
Two important techniques that we can use for dimensionality reduction are:
PCA (Principal Component Analysis)
The main idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of many variables correlated with each other, either heavily or lightly, while retaining the variation present in the dataset.
This is done by transforming the variables to a new coordinate space of variables, which are known as principal components (or simply, the PCs), and are orthogonal to each other.
The selection of principal components is such that the retention of variation present in the original variables is the maximum for the first principal component and decreases as we move down in the order. The principal components are the eigenvectors of the covariance matrix, and hence they are orthogonal.
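A minimal PCA sketch using scikit-learn (assumed available), keeping the first two principal components of the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scale

pca = PCA(n_components=2).fit(X_scaled)
X_2d = pca.transform(X_scaled)

print("Variance retained by each component:", pca.explained_variance_ratio_)
print("Shape after reduction:", X_2d.shape)     # (150, 2)
```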
t-SNE (t-Distributed Stochastic Neighbor Embedding)
The t-SNE algorithm calculates a similarity measure between pairs of instances in the high-dimensional space and in the low-dimensional space. It then tries to make the probabilities in the low-dimensional space match those in the high-dimensional space. The difference between the two probability distributions is measured using Kullback-Leibler divergence, which the algorithm tries to minimize.
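A minimal t-SNE sketch using scikit-learn (assumed available); the digits dataset and perplexity value are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)            # 64-dimensional digit images

# t-SNE minimizes the KL divergence between neighbor distributions
# in the high- and low-dimensional spaces.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print("Shape of the 2-D embedding:", X_2d.shape)   # (1797, 2)
```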
Anomaly Detection:
Identifying data points that deviate significantly from the norm. This is crucial in fraud detection, network security, and quality control. Common methods for anomaly detection include clustering algorithms like DBSCAN and statistical approaches such as Gaussian Mixture Models (GMM).
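As a rough sketch, one simple approach is to fit a GMM and flag the least likely points as anomalies (scikit-learn assumed available; the synthetic data and the 1% threshold are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 2))          # typical observations
outliers = rng.uniform(-8, 8, size=(10, 2))       # rare, far-away points
X = np.vstack([normal, outliers])

gmm = GaussianMixture(n_components=1, random_state=0).fit(X)
scores = gmm.score_samples(X)                     # log-likelihood of each point
threshold = np.percentile(scores, 1)              # flag the least likely 1% as anomalies
print("Number of flagged anomalies:", np.sum(scores < threshold))
```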
Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative rewards. The agent receives feedback in the form of rewards or penalties and adjusts its actions accordingly. This learning process is iterative, as the agent continuously refines its strategy based on the outcomes of previous actions. RL is widely used in robotics, gaming, and autonomous systems.
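A rough Q-learning sketch on a tiny one-dimensional corridor environment; all names and hyperparameters here are illustrative assumptions, not from the original post:

```python
import numpy as np

n_states, n_actions = 5, 2        # states 0..4; actions: 0 = left, 1 = right
goal = n_states - 1               # reaching the rightmost state gives a reward
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != goal:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))
        next_state = max(0, state - 1) if action == 0 else min(goal, state + 1)
        reward = 1.0 if next_state == goal else 0.0
        # Update the action-value estimate from the observed reward and next state.
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state

print("Learned greedy action per state:", np.argmax(q_table, axis=1))  # expect mostly 1 (move right)
```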
Machine learning, as a key component of AI, utilizes supervised, unsupervised, and reinforcement learning to solve data-driven problems. Understanding these learning types and their techniques, such as linear regression and k-means clustering, is essential for creating effective models and harnessing the power of data.