Machine Learning Tutorial - Random Learning
The key to understanding machine learning is to break it down to first principles. At its core, machine learning is about automatically making, updating, and validating predictions. While there are many elegant ways to accomplish this, it is helpful to start with a simplified model and build from there.
Introduction
This tutorial walks through the process of making and testing predictions for a data set by using random weights to generate predictions and then testing those predictions against the labels. To construct the dataset, the following weights were applied to the input features to generate the labels:
If the weighted sum of a row's features (each feature multiplied by its weight, then summed) is greater than the threshold of 0.5, that row's label is set to true; otherwise it is set to false.
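The labeling rule can be sketched in a few lines of plain Python. The weights below are placeholders chosen only for illustration; they are not the weights used to build the tutorial's dataset, and `label_row` is a hypothetical helper name.

```python
THRESHOLD = 0.5  # threshold stated in the text


def label_row(features, weights, threshold=THRESHOLD):
    """Return True when the weighted sum of the features exceeds the threshold."""
    score = sum(w * x for w, x in zip(weights, features))
    return score > threshold


# Placeholder weights for five features (illustrative only).
example_weights = [0.5, 0.0, 0.25, 0.0, 0.0]

positive = label_row([1, 0, 1, 0, 0], example_weights)  # weighted sum is 0.75
negative = label_row([1, 0, 0, 1, 1], example_weights)  # weighted sum is 0.5, not above the threshold
```

Note that the comparison is strict: a row whose weighted sum lands exactly on 0.5 is labeled false.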
Wrangling
The code and data for this tutorial are available here. The provided sample data contains six columns. The first five columns represent the features while the sixth column represents the labels. The visualization above demonstrates how the process_data method (shown below) splits the features and the labels into separate objects.
The next stage of the process_data method transposes the inputs and labels so that the tools of linear algebra can be applied.
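The tutorial's own process_data code is not reproduced in this text, so the following is only a sketch of what such a function might look like. It assumes each row is a list of six numbers (five features, then the label) and uses plain Python lists rather than a linear algebra library; the sample rows are made up.

```python
def process_data(rows):
    """Split six-column rows into features and labels, then transpose the features."""
    features = [row[:5] for row in rows]  # first five columns
    labels = [row[5] for row in rows]     # sixth column
    # Transpose: one inner list per feature instead of one per sample.
    features_t = [list(column) for column in zip(*features)]
    return features_t, labels


# Made-up sample rows, for illustration only.
sample_rows = [
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 1],
]
features_t, labels = process_data(sample_rows)
```

After the call, `features_t` holds five lists (one per feature), each with one entry per sample, and `labels` is a flat list with one entry per sample.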
Analysis
Taking the expected value of the labels gives a probability of about 50%, so if we always guess true, we will be right about half of the time.
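This baseline is easy to check: the accuracy of always guessing true is simply the fraction of labels that are true. The sketch below uses a made-up list of labels and a hypothetical helper name, `always_true_accuracy`.

```python
def always_true_accuracy(labels):
    """Accuracy of predicting true for every row: the fraction of true labels."""
    return sum(1 for y in labels if y) / len(labels)


toy_labels = [1, 0, 1, 0, 0, 1, 1, 0]  # made-up labels, half of them true
baseline = always_true_accuracy(toy_labels)
```

Any set of weights worth keeping should beat this number.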
Training
After wrangling the data into a form to which we can apply the tools of linear algebra, the next step is to train the model: generate random weights, use them to make predictions, and measure the accuracy of those predictions against the labels. If the accuracy is better than that of the best weights found so far, the best weights are replaced with the new ones.
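The training loop described here can be sketched as a simple random search. This is a minimal illustration under stated assumptions, not the tutorial's actual code: the function names, the sampling range of [-1, 1], the iteration count, and the toy dataset (built from made-up "true" weights) are all hypothetical. For brevity it keeps one list per sample instead of the transposed layout.

```python
import random


def predict(weights, features, threshold=0.5):
    """Classify one row: True when the weighted sum exceeds the threshold."""
    return sum(w * x for w, x in zip(weights, features)) > threshold


def accuracy(weights, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    hits = sum(predict(weights, f) == bool(y) for f, y in zip(rows, labels))
    return hits / len(labels)


def random_learning(rows, labels, n_features, iterations=1000, seed=0):
    """Try random weight vectors and keep the most accurate one seen so far."""
    rng = random.Random(seed)
    best_weights, best_accuracy = None, -1.0
    for _ in range(iterations):
        # Assumed sampling range; the original range is not stated.
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
        score = accuracy(candidate, rows, labels)
        if score > best_accuracy:  # update only on strict improvement
            best_weights, best_accuracy = candidate, score
    return best_weights, best_accuracy


# Toy dataset generated from made-up weights (illustrative only).
data_rng = random.Random(42)
true_weights = [0.6, -0.2, 0.4, 0.1, -0.3]
rows = [[data_rng.randint(0, 1) for _ in range(5)] for _ in range(50)]
labels = [predict(true_weights, r) for r in rows]

best_weights, best_accuracy = random_learning(rows, labels, n_features=5)
```

Because the loop only ever replaces the stored weights when accuracy strictly improves, the reported accuracy can never decrease as more candidates are tried.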
Results
Conclusion
Even though there are better techniques available, using random learning to make predictions against a simple data set can yield impressive results. It is also clear that, even when we know the ground truth, there is more than one explanation that can perfectly fit the data.