Practical Machine Learning: Part-4
Iftekhar Ahmed · Sep 4, 2023

Let's opt for the Linear Regression model to train our dataset. Why choose Linear Regression? It is one of the most fundamental machine-learning algorithms, and for beginners its inner workings are straightforward to understand. Basically, Linear Regression finds a straight-line pattern in the data and uses that pattern to make predictions.

Training Linear Regression

We can easily train Linear Regression using scikit-learn.

from sklearn.linear_model import LinearRegression

# Create a Linear Regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

Here the model.fit() method takes the training features (X_train) and the training target (y_train) as parameters.
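Once fitted, the model has learned one weight per feature plus an intercept: these are the parameters of the straight-line pattern mentioned above. Here is a minimal sketch that prints them, assuming X_train is a pandas DataFrame so the feature names line up with the coefficients:

# The learned parameters: one weight per feature plus an intercept
for feature, weight in zip(X_train.columns, model.coef_):
    print(f"{feature}: {weight:.4f}")
print("Intercept:", model.intercept_)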

Congratulations 🎉, you have just successfully trained your first machine-learning model. Now, let's evaluate how this model performs on our test dataset. We'll feed our model X_test (the test features) and expect it to return the predicted target values. To achieve this,

# Predict house prices on the test data
y_pred = model.predict(X_test)

y_pred

Here y_pred is an array of house prices predicted from the test features (X_test). Here is a graph showing the entire process.

[Figure: the full train-and-predict workflow]

Now, there are various methods available for comparing our predicted values (y_pred) to our actual values (y_test). We can use the model.score() method to evaluate our model's performance. For a regression model, model.score() returns the R² score (the coefficient of determination), where higher scores indicate better performance; it is usually between 0 and 1, though it can even be negative for a very poor fit. For instance, a score of 0.75 means that 75% of the variance in the target variable is explained by the model, while the remaining 25% is left unexplained or attributed to factors the model does not capture.
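Under the hood, for a regression model this is the same R² you would get by comparing the actual and predicted prices directly with scikit-learn's r2_score. A minimal sketch, assuming y_pred has been computed as above:

from sklearn.metrics import r2_score

# R^2 computed directly from actual vs. predicted prices;
# this matches model.score(X_test, y_test)
print("R^2:", r2_score(y_test, y_pred))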

Let’s evaluate the model score,

score = model.score(X_test, y_test)
print("Model Score:", score*100)

Hold on, what's this❗ Only 3%? It seems like something is going wrong. Let's delve into our dataset to gain a clearer understanding of what's happening.

Let’s plot actual prices and predicted prices and see what is going on

import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred, alpha=1)
# Red line: the rough trend of our predicted prices
plt.plot([min(y_test), max(y_test)], [min(y_pred), max(y_pred)], color='red', linewidth=2)
# Green line: the ideal fit, where predicted price equals actual price
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='green', linewidth=2)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual Prices vs. Predicted Prices')
plt.grid(True)
plt.show()
[Figure: Actual Prices vs. Predicted Prices scatter plot]

Here, the 🔴red line represents the trend of our predicted prices, but our goal is for it to resemble the 🟢green line, where predicted prices match actual prices. So, let's investigate what's happening with our data. A more effective way to analyze the data is by creating a heatmap. Heatmaps are frequently used in machine learning to visualize the pairwise correlations between features (columns) in a dataset. These correlation heatmaps help identify which features have strong positive or negative correlations with each other.

import seaborn as sns

plt.figure(figsize=(16, 10))
sns.heatmap(dataframe.corr(), annot=True)
plt.title('Heat Map', size=20)
plt.yticks(rotation=0)
plt.show()
[Figure: correlation heatmap of the dataset]

The first thing we can see is that the country column has no effect on price, because the entire dataset contains only one country (USA), so we can drop it from our DataFrame.
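For example, assuming the column is literally named country, dropping it is a one-liner:

# Drop the country column, since every row has the same value (USA)
dataframe = dataframe.drop(columns=['country'])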

Let’s check our price range

# Calculate the price range
max_price = dataframe['price'].max()
min_price = dataframe['price'].min()

print("Max price:", max_price)
print("Min price:", min_price)

# Output
# Max price: 26590000.0
# Min price: 0.0

From this analysis, we can observe that there are houses priced at $0, which are clearly invalid entries (outliers). To address this issue, let's replace all rows where the house price is $0 with the average price calculated over the non-zero rows:

# Calculate the average price excluding rows where the price is 0
average_price = dataframe[dataframe['price'] != 0]['price'].mean()

# Replace 0 values with the calculated average price
dataframe['price'] = dataframe['price'].replace(0, average_price)

We can take a similar approach to investigate the bedroom counts.

dataframe['bedrooms'].value_counts()

# Output
# 3    2032
# 4    1531
# 2     566
# 5     353
# 6      61
# 1      38
# 7      14
# 8       2
# 0       2
# 9       1

In our dataset, we can identify potential outliers by examining the number of bedrooms. Specifically, there are only 2 rows with 8 bedrooms, 2 rows with 0 bedrooms, and 1 row with 9 bedrooms. We can simply remove these rows:

# Indexes of rows whose bedroom count is an outlier (0, 8, or 9 bedrooms)
rows_to_remove = dataframe[(dataframe['bedrooms'] == 0) | (dataframe['bedrooms'] == 8) | (dataframe['bedrooms'] == 9)].index

# Remove the outlier rows using the drop method
dataframe = dataframe.drop(rows_to_remove)

Now, let's re-run our model and assess whether it shows improvement or not.
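Here is a minimal sketch of that re-run. It assumes, as in the earlier parts of this series, that the features are every column except price and that the first 80% of rows are used for training and the remaining 20% for testing:

# Rebuild features and target from the cleaned DataFrame
X = dataframe.drop(columns=['price'])
y = dataframe['price']

# Same 80/20 split as before: first 80% for training, last 20% for testing
split_index = int(len(dataframe) * 0.8)
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

# Re-fit and re-score the model
model = LinearRegression()
model.fit(X_train, y_train)
print("Model Score:", model.score(X_test, y_test) * 100)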

20%? Not bad, not bad. By analyzing just two columns (price and bedrooms), we've managed to improve our model's performance from 3% to 20%.

Another approach we can try is shuffling our dataset. Currently, our model is trained on the first 80% of the data and tested on the remaining 20%. This may pose an issue. For example, consider a math exam where there are 100 problems in the syllabus. If you've only practiced the first 60 problems, there's a higher chance that the exam questions come from the remaining 40 problems. However, if you had practiced problems chosen randomly from all 100, your chances of encountering familiar problems would be higher. Similarly, our model might perform better if it is trained on a shuffled dataset. So let's shuffle our dataset and see whether that improves our model's performance.

# Shuffle the DataFrame
dataframe = dataframe.sample(frac=1, random_state=42)

Now, let's run the model again. What! It's now performing at 60%. We've managed to boost our model's accuracy from 20% to 60% with just one line of code—simply by shuffling our dataset. This underscores the significance of data preprocessing. Typically, data preprocessing is conducted prior to model training. There are various data preprocessing techniques, such as data cleaning, data transformation, data reduction, feature selection, feature scaling, data imputation, and more. These techniques can often have a substantial impact on improving the accuracy and robustness of our machine-learning models. While it's impossible to cover all of them in a single blog, the exploration of these tricks and techniques is up to you. They can be instrumental in making your machine-learning models more accurate and reliable.
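As a taste of one of those techniques, here is a minimal sketch of feature scaling with scikit-learn's StandardScaler, reusing the X_train/X_test split from above. (Plain Linear Regression's predictions are largely unaffected by scaling, but many other models benefit from it.)

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train and score the model on the scaled features
model = LinearRegression()
model.fit(X_train_scaled, y_train)
print("Model Score:", model.score(X_test_scaled, y_test) * 100)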

Here is the full implementation: practical-machine-learning-code

This concludes the "Practical Machine Learning" series. I hope that this series has provided valuable insights to kickstart your machine-learning journey. I wish you all the best in your future endeavors!