Let's opt for the Linear Regression model to train on our dataset. Why choose Linear Regression? It is one of the most fundamental machine-learning algorithms, and for beginners its inner workings are easy to understand. Basically, Linear Regression finds a straight-line pattern in the data and uses that pattern to make predictions.
Training Linear Regression
We can easily train Linear Regression using scikit-learn.
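Here is a minimal training sketch, assuming the features and target have already been split into X_train/y_train (and X_test/y_test) as in the earlier parts of this series:

```python
from sklearn.linear_model import LinearRegression

# Assumption: X_train and y_train come from the 80/20 train/test split done earlier
model = LinearRegression()
model.fit(X_train, y_train)  # learn the best-fit line from the training data
```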
Here, the model.fit() method takes the training features and the training target as parameters.
Congratulations 🎉, you have just successfully trained your first machine-learning model. Now, let's evaluate how this model performs on our test dataset. We'll feed our model X_test (the test features) and expect it to return the predicted target values. To achieve this, we use the model's predict() method:
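A short sketch of the prediction step (the variable names are assumptions carried over from the split above):

```python
y_pred = model.predict(X_test)  # predicted house prices for the test features
print(y_pred[:5])               # peek at the first few predictions
```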
Here, y_pred holds an array of predicted house prices for the test features (X_test). Here is a graph showing the entire process.
Now, there are various methods available for comparing our predicted values (y_pred) to our actual values (y_test). We can utilize the model.score() method to evaluate our model's performance. For a regression model, model.score() returns the R² score, typically a value between 0 and 1, with higher scores indicating better model performance. For instance, a score of 0.75 signifies that 75% of the variance in the dependent variable is explained by the model, while the remaining 25% remains unexplained or is attributed to factors not considered in the model.
Let's evaluate the model score:
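A sketch of the evaluation call, assuming the same test split as above:

```python
score = model.score(X_test, y_test)  # R² on the held-out test set
print(score)                         # roughly 0.03 here, i.e. about 3%
```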
Hold on, what's this❗ Only 3%? It seems like something has gone wrong. Let's delve into our dataset to gain a clearer understanding of what's happening.
Let's plot the actual prices and the predicted prices and see what is going on:
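One way to draw this comparison with matplotlib (a sketch; sorting both series to make the two lines easy to compare is an assumption, and the original chart may look different):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(sorted(y_test), color="green", label="Actual prices")
plt.plot(sorted(y_pred), color="red", label="Predicted prices")
plt.xlabel("Houses (sorted by price)")
plt.ylabel("Price")
plt.legend()
plt.show()
```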
Here, the 🔴 red line represents our predicted prices, but our goal is to achieve a line resembling the 🟢 green line. So, let's investigate what's happening with our data. A more effective way to analyze the data is by creating a heatmap. Heatmaps are frequently used in machine learning to visualize the pairwise correlations between features (columns) in a dataset. These correlation heatmaps help identify which features exhibit strong positive or negative correlations with each other.
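A minimal correlation-heatmap sketch with seaborn, assuming the data lives in a pandas DataFrame named df (categorical columns such as country would need to be numerically encoded to appear in the correlation matrix; numeric_only requires a recent pandas):

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)  # pairwise correlations of the numeric columns

plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```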
The first thing we can see is that the country column has no effect on prices, because the entire dataset contains only one country (USA), so we can drop it from our dataframe.
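A one-line sketch of that cleanup (assuming the column is named country in df):

```python
# "country" is constant (only USA), so it carries no information for the model
df = df.drop(columns=["country"])
```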
Let's check our price range:
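One way to inspect the price distribution (the column name price is an assumption):

```python
print(df["price"].describe())    # min, max, mean and quartiles of house prices
print((df["price"] == 0).sum())  # how many houses are listed at $0
```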
From this analysis, we can observe that there are houses priced at $0, which are clearly outliers. To address this issue, let's replace the price in all rows where the house price is $0 with the calculated average price:
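A sketch of that replacement; computing the mean over the non-zero prices only is an assumption:

```python
# Use the mean of the non-zero prices so the $0 rows don't drag it down
mean_price = df.loc[df["price"] != 0, "price"].mean()
df.loc[df["price"] == 0, "price"] = mean_price
```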
We can apply a similar approach to investigate the bedroom counts; the output below shows how many houses have each number of bedrooms.
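A sketch of the check, assuming the column is named bedrooms:

```python
print(df["bedrooms"].value_counts())
```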
# Output (bedrooms, count)
# 3 2032
# 4 1531
# 2 566
# 5 353
# 6 61
# 1 38
# 7 14
# 8 2
# 0 2
# 9 1
In our dataset, we can identify potential outliers by examining the number of bedrooms. Specifically, there are only 2 rows with 8 bedrooms, 2 rows with 0 bedrooms, and a single row with 9 bedrooms. We can simply remove these rows:
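A sketch of the filter, keeping 1 to 7 bedrooms per the counts above:

```python
# Drop the 0-, 8- and 9-bedroom rows, which are too rare to learn from
df = df[(df["bedrooms"] >= 1) & (df["bedrooms"] <= 7)]
```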
Now, let's re-run our model and assess whether it shows improvement.
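A sketch of the re-run; the exact feature columns are an assumption, so use whichever features you trained on before:

```python
from sklearn.linear_model import LinearRegression

# Rebuild features/target from the cleaned dataframe, then retrain
X = df.drop(columns=["price"])
y = df["price"]

split = int(len(df) * 0.8)                 # same 80/20 split as before, still unshuffled
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))         # roughly 0.20 now, i.e. about 20%
```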
20%? Not bad, not bad. By analyzing just two columns (price and bedrooms), we've managed to improve our model's performance from 3% to 20%.
Another approach we can try is shuffling our dataset. Currently, our model is trained on the first 80% of the data and tested on the remaining 20%. This may pose an issue. For example, consider a math exam where there are 100 problems in the syllabus. If you've only practiced the first 60 problems, there's a higher chance that the exam questions come from the remaining 40 problems. However, if you had practiced problems drawn randomly from all 100, your chances of encountering familiar problems would be higher. Similarly, our model might perform better if it is trained on a shuffled dataset. So, let's shuffle our dataset and evaluate whether it improves our model's performance.
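One way to do it, sketched with pandas (alternatively, scikit-learn's train_test_split shuffles by default):

```python
# One line: put the rows in random order before splitting into train/test
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
```

After this line, rebuild X, y and the 80/20 split exactly as before.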
Now, let's run the model again. What! It's now performing at 60%. We've managed to boost our model's accuracy from 20% to 60% with just one line of code, simply by shuffling our dataset. This underscores the significance of data preprocessing. Typically, data preprocessing is conducted prior to model training. There are various data preprocessing techniques, such as data cleaning, data transformation, data reduction, feature selection, feature scaling, data imputation, and more. These techniques can often have a substantial impact on the accuracy and robustness of our machine-learning models. While it's impossible to cover all of them in a single blog post, exploring these tricks and techniques is up to you; they can be instrumental in making your machine-learning models more accurate and reliable.
Here is the full code: practical-machine-learning-code
This concludes the "Practical Machine Learning" series. I hope that this series has provided valuable insights to kickstart your machine-learning journey. I wish you all the best in your future endeavors!