Understanding the Basic Machine Learning Terminology
2023-09-18 | By Maker.io Staff
As with any other topic, understanding basic machine learning terminology helps you grasp the more advanced concepts required to build an ML model and use it to make predictions. This article summarizes a few key terms everyone encounters when learning about AI and ML.
The Difference Between a Machine Learning Algorithm and a Model
Differentiating between a machine learning algorithm and a model is essential. An algorithm is a set of rules and instructions that trains and generates the machine learning model. As discussed in another article, machine learning aims to identify data patterns and similarities between samples to make predictions when observing new data. In contrast to an algorithm, a model captures and encodes these statistical patterns and allows the ML program to make predictions when observing new samples.
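To make the distinction concrete, here is a minimal sketch in plain Python: the training algorithm (a simple least-squares fit, chosen only for illustration) produces a model, which is just the captured pattern (here, two learned numbers) and can make predictions on its own.

```python
def train_linear_model(xs, ys):
    """Training algorithm: derives slope and intercept from the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept  # the model: the encoded statistical pattern


def predict(model, x):
    """Inference: the model alone makes predictions on new inputs."""
    slope, intercept = model
    return slope * x + intercept


model = train_linear_model([1, 2, 3, 4], [2, 4, 6, 8])
print(predict(model, 5))  # the model generalizes to an unseen input
```

Note that once `train_linear_model` has run, the algorithm is no longer needed: the model alone carries everything required for prediction.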
Understanding Machine Learning Training and Inference
Training is the process of building an ML model from existing data samples, yet not all machine learning algorithms require training. In this context, a lazy learner is an ML algorithm that doesn’t require a training phase. Instead, it makes predictions on the fly when observing a new sample. The k-Nearest-Neighbors algorithm (kNN) is one of the most prominent lazy learners. In contrast, an eager learner requires a training phase where it uses the training data to build a model. Examples of such algorithms include neural networks.
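The lazy-learner idea can be sketched in a few lines of plain Python. This toy kNN classifier (with made-up 2D sample points) simply stores the training samples and defers all work until a new query arrives; there is no training phase at all.

```python
from collections import Counter
import math

def knn_predict(samples, labels, query, k=3):
    """Lazy learner: all computation happens at prediction time."""
    # measure the distance from the query to every stored sample
    dists = sorted(
        (math.dist(s, query), lab) for s, lab in zip(samples, labels)
    )
    # let the k nearest samples vote on the label
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]


samples = [(1, 1), (1, 2), (8, 8), (9, 8)]
labels = ["small", "small", "large", "large"]
print(knn_predict(samples, labels, (2, 1)))  # prints "small"
```

An eager learner such as a neural network would instead spend its effort up front, compressing these samples into model parameters during training.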
In machine learning, you’ll often encounter supervised and unsupervised learning. Aside from reinforcement learning, these are the two main ways you can extrapolate patterns from data. In supervised learning, the training data samples contain not only the input values but also the expected output. The model learns to predict the label from observable patterns in the data. In unsupervised learning, this output label is missing in the data, and the model learns to discover relationships and patterns in the input variables without a known output.
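A toy illustration of the difference, using invented values: supervised data pairs each input with its expected output, while unsupervised data contains only inputs, leaving the algorithm to discover structure itself (here, a naive one-dimensional grouping around the mean).

```python
# Supervised: each sample carries the expected output (a label).
labeled = [((5.1, 3.5), "class_a"), ((6.7, 3.1), "class_b")]

# Unsupervised: inputs only; the algorithm must find structure on its
# own, e.g. by splitting the samples around their overall mean.
unlabeled = [1.0, 1.2, 0.9, 8.1, 7.9, 8.3]
center = sum(unlabeled) / len(unlabeled)
clusters = {"low": [x for x in unlabeled if x < center],
            "high": [x for x in unlabeled if x >= center]}
print(clusters)  # two groups discovered without any labels
```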
Once trained, the finished model (or the algorithm, in the case of a lazy learner) performs a classification or prediction task when it receives a new input typically not seen before. The process of making a prediction or classifying samples is referred to as inference.
Dataset Terminology in Machine Learning
As mentioned, the algorithm builds a model based on training data that can either be labeled or unlabeled. This dataset consists of individual samples, and each sample contains one or multiple features that serve as input variables. If the data is labeled, the output is also referred to as the label of a sample.
The features generally come in one of four forms, namely nominal, ordinal, interval, and ratio. These differ in what arithmetic operations can be performed and what level of preprocessing is required before the data can be used to train a model. Preprocessing, in this context, describes cleaning the data and bringing it into a format that the used ML algorithm can understand.
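As a small sketch of why the feature type matters for preprocessing (the feature names are invented for illustration): a nominal feature such as color has no ordering, so it is typically one-hot encoded, while an ordinal feature such as size can map directly to ranked integers.

```python
def one_hot(value, categories):
    """Encode a nominal value as a vector of 0s with a single 1."""
    return [1 if value == c else 0 for c in categories]


colors = ["red", "green", "blue"]                  # nominal: no order
size_rank = {"small": 0, "medium": 1, "large": 2}  # ordinal: ranked

sample = {"color": "green", "size": "large"}
encoded = one_hot(sample["color"], colors) + [size_rank[sample["size"]]]
print(encoded)  # [0, 1, 0, 2]
```

Encoding a nominal feature as plain integers instead would invent an ordering (e.g. "blue > red") that the data does not contain, which can mislead many algorithms.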
Various problems may occur during training, and these problems affect the model’s ability to generalize and make predictions once deployed. When a model overfits the data, it follows individual training samples too closely, effectively memorizing them instead of learning the underlying patterns. Overfitting leads to problems during inference, as the model’s capability to generalize to unseen data is limited. Usually, overfitting is a symptom of an overly complex model, resulting in low bias and high variance. Underfitting, in contrast, occurs when a machine learning model is too simple or lacks the capacity to capture the underlying data patterns, resulting in high bias and poor performance on both training and test data. Cross-validation is one technique to detect and tackle overfitting.
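The core of k-fold cross-validation can be sketched in a few lines: the dataset is split into k folds, and each fold serves once as held-out test data while the remaining folds are used for training. A model that scores well on training folds but poorly on held-out folds is likely overfitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = []
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for i in range(k):
        # distribute any remainder across the first folds
        stop = start + fold_size + (1 if i < remainder else 0)
        test = list(range(start, stop))
        train = [j for j in range(n_samples) if j < start or j >= stop]
        folds.append((train, test))
        start = stop
    return folds


for train_idx, test_idx in k_fold_indices(6, 3):
    print("train:", train_idx, "test:", test_idx)
```

In practice, the data is usually shuffled before splitting so that each fold is representative of the whole dataset.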
How Evaluation Metrics Help Build Better Machine Learning Models
Once training is done, you will typically want to assess how well the model performs on data it has never seen before. You can use quantitative measures to determine a model’s inference performance for that purpose. Depending on the task, various metrics exist — examples include accuracy, precision, recall, F1 score, mean squared error (MSE), and area under the curve (AUC). Read this article to learn more about evaluation metrics in machine learning.
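As a hedged sketch (with invented predictions), the classification metrics mentioned above can all be derived from counts of true/false positives and negatives for a binary task:

```python
def evaluate(y_true, y_pred):
    """Compute common classification metrics for binary labels (0/1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


print(evaluate([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

Which metric matters depends on the task: precision penalizes false alarms, recall penalizes missed positives, and F1 balances the two.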
Summary
Machine learning algorithms and models differ in their roles. An algorithm is a set of rules that trains and generates a model, which captures statistical patterns and enables predictions on new data.
Training is the process of building an ML model from existing data, while lazy learners like k-Nearest Neighbors (kNN) make predictions on the fly without a training phase. Supervised learning uses labeled data with input-output pairs, while unsupervised learning discovers patterns in unlabeled data. After training, the model performs inference, making predictions on unseen data.
Datasets comprise multiple samples with input features and output labels. Features can be nominal, ordinal, interval, or ratio, impacting preprocessing requirements.
Overfitting occurs when a model closely follows training data but fails to generalize, while underfitting arises from a model lacking the complexity to capture patterns. Evaluation metrics like accuracy, precision, recall, and mean squared error help assess model performance.
Have questions or comments? Continue the conversation on TechForum, DigiKey's online community and technical resource.
Visit TechForum