Common Tasks in Machine Learning
2023-09-20 | By Maker.io Staff
A recent article explored how machine learning (ML) involves teaching a computer system to make predictions on unfamiliar data. The system achieves this by drawing on the observations it made during training on a set of either labeled or unlabeled data. As diverse as the field of machine learning is, most ML projects share a great deal of common structure. This article discusses the seven steps found in nearly every machine learning project.
Data Collection and Preprocessing
Machine learning formalizes the task of learning patterns from existing data. Regardless of input data quality, the process will always yield a model. However, a model's performance on new, unseen data depends on the quality of the training data used. Therefore, the first two steps in the ML process are widely considered the most important, as inadequate training data inevitably produces an inadequate model.
The first step in the ML process is data collection. Data samples come from various sources, including user inputs, sensor values, or data collected on the Internet, to name a few. There is no standardized solution for data collection and preprocessing, as data can come from a limitless variety of sources and be stored and used in many different ways. However, once the data is collected and stored, developers perform several steps to ensure it is of acceptable quality.
First, the dataset should not contain any missing values. If a data sample contains blank fields, it can be removed from the dataset entirely, or the missing value can be replaced with either a placeholder or an average value. Additional preprocessing steps include normalizing values, removing special characters, and converting text to numeric data.
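As a concrete illustration of two of these cleanup steps, the short sketch below fills a missing value with the column mean and then min-max normalizes the column. The numbers are invented for demonstration and are not from any real dataset.

```python
def impute_mean(values):
    """Replace None entries with the mean of the present values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Linearly scale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Invented calorie readings with one missing entry:
calories = [250, None, 500, 430, 120]
filled = impute_mean(calories)        # None becomes the mean, 325.0
scaled = min_max_normalize(filled)    # every value now lies in [0, 1]
```

Dropping the incomplete sample instead of imputing it would simply mean filtering out the None entry before normalizing.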
Extracting Relevant Features
Data used for training ML models often have high dimensionality and are commonly represented in a table. Each column of the table corresponds to a variable, frequently referred to as an attribute, and each row represents one data sample:
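The original table is not reproduced here, but a dataset with the shape just described might look like the following sketch. The rows and values are invented to match the discussion that follows:

```python
# Invented dataset: three attributes (taste, color, calories) and one
# output label (type), with one sample per row.
samples = [
    {"taste": "savory", "color": "light", "calories": 270, "type": "bread"},
    {"taste": "savory", "color": "dark",  "calories": 250, "type": "bread"},
    {"taste": "sweet",  "color": "light", "calories": 310, "type": "bread"},
    {"taste": "sweet",  "color": "dark",  "calories": 480, "type": "cake"},
    {"taste": "sweet",  "color": "light", "calories": 520, "type": "cake"},
]

def predict_type(sample):
    """Classify a food item using only its caloric content."""
    return "cake" if sample["calories"] > 400 else "bread"

# The calorie threshold alone labels every sample correctly, whereas
# "dark" appears for both bread and cake, making color ambiguous.
correct = sum(predict_type(s) == s["type"] for s in samples)
```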
This example contains five samples with three variables (taste, color, calories) and a single output label (type). Real-world datasets are considerably larger, containing hundreds or even thousands of samples and columns. In this small example dataset, it's feasible to identify a single variable that accurately predicts the food type: the caloric content is an ideal indicator of whether an item is bread or cake, as anything above 400 calories can be classified as a cake. In contrast, the color variable is ambiguous, as the dataset contains too little information to determine whether a dark food item is bread or cake.
Some attributes are better suited than others to determining the output label of a data sample. Therefore, most ML applications aim to identify the features most useful for determining the output when classifying new observations. Doing so reduces the dimensionality of the data, allowing ML algorithms to learn more effectively. The feature extraction step can substantially affect the outcome of certain algorithms. For example, when using decision trees, this optimization makes the difference between small, simple trees and large, convoluted ones.
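One crude but illustrative way to rank attributes is to measure how well a majority-vote rule based on a single attribute predicts the label; real projects typically use measures such as information gain or correlation instead. The dataset below is invented for demonstration, with a boolean high_cal attribute standing in for the calorie threshold:

```python
from collections import Counter, defaultdict

# Invented rows: a dict of attributes plus an output label.
rows = [
    ({"taste": "savory", "color": "light", "high_cal": False}, "bread"),
    ({"taste": "savory", "color": "dark",  "high_cal": False}, "bread"),
    ({"taste": "sweet",  "color": "light", "high_cal": False}, "bread"),
    ({"taste": "sweet",  "color": "dark",  "high_cal": True},  "cake"),
    ({"taste": "sweet",  "color": "light", "high_cal": True},  "cake"),
]

def attribute_score(attr):
    """Fraction of rows predicted correctly by a majority-vote rule
    that looks at this single attribute only."""
    by_value = defaultdict(Counter)
    for features, label in rows:
        by_value[features[attr]][label] += 1
    # For each attribute value, the majority label gets its count right.
    hits = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return hits / len(rows)

scores = {a: attribute_score(a) for a in ("taste", "color", "high_cal")}
# high_cal separates the two classes perfectly; color is a weak predictor.
```

Keeping only the highest-scoring attributes is exactly the dimensionality reduction described above.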
Model Selection and Training
Just as there is no single optimal way to preprocess a dataset, there is no single model type that is optimal for all applications. Therefore, you must carefully evaluate your data to determine which model is best suited to it. You can choose from a range of algorithms, including decision trees, KNN, Bayesian networks, and neural networks, to name a few.
Every algorithm has advantages and drawbacks, with some requiring more preprocessing steps than others. For example, distance-aware algorithms like KNN often require data normalization and the removal of outlier values. Other algorithms, such as neural networks, are less sensitive to large value differences in the dataset, so normalization may be optional.
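The following toy 1-nearest-neighbor sketch, using invented values, shows why scaling matters for distance-aware algorithms: an unscaled feature with a large numeric range dominates the Euclidean distance.

```python
import math

# Invented 2D training points: feature 0 spans roughly 0-1, while
# feature 1 spans 900-905, so raw Euclidean distances are dominated
# by feature 1.
train = [([0.1, 900.0], "A"), ([0.9, 905.0], "B")]
query = [0.125, 904.0]

def nearest_label(points, q):
    """Return the label of the point closest to q (1-NN)."""
    return min(points, key=lambda item: math.dist(item[0], q))[1]

raw = nearest_label(train, query)  # feature 1 dominates: classified "B"

def scaled(vec):
    """Min-max scale each feature using the ranges over train + query."""
    pts = [p for p, _ in train] + [query]
    out = []
    for i, v in enumerate(vec):
        col = [p[i] for p in pts]
        lo, hi = min(col), max(col)
        out.append((v - lo) / (hi - lo))
    return out

scaled_train = [(scaled(p), label) for p, label in train]
norm = nearest_label(scaled_train, scaled(query))  # now classified "A"
```

After scaling, the query's closeness to point "A" on feature 0 is no longer drowned out by feature 1's large absolute values.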
Moreover, certain algorithms are lazy learners, meaning they have no separate training phase and instead defer most of their work until prediction time. In contrast, eager learners require training time, although their inference is typically faster. The exact training process differs from one ML algorithm to another, and you may even want to combine multiple ML algorithms through ensemble learning. Don't worry if these terms don't mean much to you now, as another article discusses common ML-related terminology in more detail.
Evaluation, Deployment, and Improvement
Once the model is trained and ready to make predictions, it's necessary to evaluate how well it performs on previously unseen data. You can choose from a range of techniques to assess model performance, and as with the terminology above, another article dives deeper into the topic of ML model evaluation.
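One of the simplest evaluation techniques is a holdout split: keep part of the labeled data out of training entirely, then measure accuracy on it afterward. The sketch below applies this idea to an invented one-feature threshold classifier:

```python
import random

# Invented one-feature dataset: items above 400 calories are "cake".
random.seed(42)
data = [(x, "cake" if x > 400 else "bread") for x in range(100, 700, 10)]
random.shuffle(data)

# Hold out 20% of the samples; the model never sees them in training.
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]

def accuracy(threshold, samples):
    """Fraction of samples the threshold rule classifies correctly."""
    hits = sum(("cake" if x > threshold else "bread") == y for x, y in samples)
    return hits / len(samples)

# "Training" here means searching for the best threshold on train_set.
best = max((x for x, _ in train_set), key=lambda t: accuracy(t, train_set))
test_acc = accuracy(best, test_set)  # performance on unseen samples
```

The held-out accuracy, not the training accuracy, is the number that estimates how the model will behave after deployment.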
Once deployed, the ML system analyzes new data samples and makes predictions on its own, which it can then present to users in a dashboard. Source: https://pixabay.com/photos/digital-marketing-technology-1433427/
Once a model performs well enough, it is ready for real-world deployment. Here, the ML algorithm receives real-world inputs and generates decisions on its own or assists users in their decision-making process. During this phase, continual model re-evaluation is crucial, as the surrounding world changes and presents new scenarios unseen during initial training. Periodic retraining of the model is essential, particularly when the environment around the device undergoes significant changes.
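As one possible (purely illustrative) way to notice when retraining is due, the sketch below tracks accuracy over a sliding window of recent labeled predictions and flags the model once accuracy falls below a chosen floor; the window size and threshold here are arbitrary:

```python
from collections import deque

# Arbitrary illustrative choices: watch the last 50 labeled outcomes
# and flag the model once rolling accuracy drops below 85%.
WINDOW, FLOOR = 50, 0.85
recent = deque(maxlen=WINDOW)

def record_outcome(prediction, actual):
    """Log whether a deployed prediction was correct; return True once
    the rolling accuracy indicates the model should be retrained."""
    recent.append(prediction == actual)
    return len(recent) == WINDOW and sum(recent) / WINDOW < FLOOR

# Simulated drift: the environment changes after step 40, and the
# model (which always predicts "A") starts missing.
needs_retrain = False
for step in range(100):
    actual = "A" if step < 40 else "B"
    needs_retrain = record_outcome("A", actual) or needs_retrain
```

In practice the retraining trigger would feed into the same pipeline of collection, preprocessing, and training described earlier.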
Summary
From an outside perspective, machine learning can appear intimidating. However, it's important to remember that, at its core, the concept simply describes algorithms and techniques for making decisions based on statistics and patterns derived from large datasets.
There are several common steps that practically every ML project follows. These are data collection, preprocessing, feature extraction, model selection, training, evaluation, and deployment.
During data collection and preprocessing, engineers ensure the data meets specific quality standards. Feature extraction de-clutters the data and identifies its most important attributes. During model selection and training, you choose an algorithm that aligns with your data and then run it in model training. Next, you evaluate how well the trained model performs on the data and make necessary adjustments to the implementation or its parameters. Finally, deployment releases the ML model into the real world to generate predictions based on real-world inputs.
Have questions or comments? Continue the conversation on TechForum, DigiKey's online community and technical resource.
Visit TechForum