Before we start training our machine learning model, we need to divide the dataset into two parts: one part is used to train the model, while the other is held out to test its performance.
Splitting the dataset into training and testing sets is essential in machine learning because it lets us evaluate the model on unseen data. If we used the entire dataset for training, the model could simply memorize it, resulting in overfitting.
Overfitting means the model fits the training data too closely and may not perform well on new or unseen data. It’s like memorizing answers to specific exam questions without actually understanding the concepts behind them: if a new question comes up outside what was memorized, the student won’t be able to answer it. Splitting the dataset into training and testing sets lets us measure how well our model generalizes and guard against overfitting.
A commonly recommended ratio for this partition is 80:20, where 80% of the data is used for training and the remaining 20% for testing.
First, we need to separate the features from the target. In our case, price is the target (what we want to predict) and the rest of the columns are the features.
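As a minimal sketch, assuming the data lives in a pandas DataFrame called df with a price column (the DataFrame and column names here are stand-ins; yours may differ):

```python
import pandas as pd

# Hypothetical stand-in for the housing DataFrame used in this post.
df = pd.DataFrame({
    "area": [1200, 1500, 900, 2000],
    "bedrooms": [2, 3, 2, 4],
    "price": [150000, 200000, 110000, 320000],
})

X = df.drop(columns=["price"])  # features: every column except the target
y = df["price"]                 # target: the column we want to predict
```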
Now, for the splitting, we will use scikit-learn’s train_test_split function. It randomly divides a dataset into two separate sets: one for training the machine learning model and one for testing its performance.
The function takes a few arguments, the most important of which are ‘X’ and ‘y’, which represent the dataset features and the target variable, respectively. The test_size parameter determines the proportion of the data to be used for testing; in our case it’s 0.2 because we want an 80/20 split. The random_state parameter sets a seed for the random number generator so that the same split is produced each time the code is run.
Now, if we run the ‘head()’ function on X_train, X_test, y_train, and y_test, we will see that ‘X_train’ contains 80% of the feature rows and ‘X_test’ the remaining 20%, while ‘y_train’ and ‘y_test’ contain the target values corresponding to ‘X_train’ and ‘X_test’, respectively.
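Putting the split together, here is a hedged sketch with a tiny stand-in dataset (in the real notebook, X and y come from the housing data prepared earlier):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: 10 rows so an 80/20 split gives whole numbers.
df = pd.DataFrame({
    "area": range(800, 1800, 100),
    "price": range(100000, 200000, 10000),
})
X = df.drop(columns=["price"])
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # 8 rows for training, 2 for testing
print(X_train.head())               # inspect the first training rows
```

Because the rows are shuffled before splitting, X_train and y_train share the same (shuffled) index, so each feature row stays paired with its target value.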
From the dataset, you can already tell that house price prediction is a supervised learning problem. Supervised learning models can be broadly categorized into two types based on the nature of the target variable: classification and regression.
In the case of our dataset, the task is to predict the price of houses based on various features. Since the target variable (price) is continuous, house price prediction falls under regression, so we need a regression model to estimate house prices. Keep in mind that this is a practical machine-learning blog, so this section gives you only a black-box understanding of these algorithms. If you want to dive deep into them, Andrew Ng’s machine learning course on Coursera is a great resource, and if you prefer Bangla, you can check out Nafis Nihal’s book on machine learning.
Given the nature of the dataset, the following regression models can be considered for house price prediction:
The size of the dataset, the computational resources available, the interpretability of the model, the complexity of the relationships in the dataset, and other factors all influence the choice of the supervised learning model. It is recommended to experiment with various models, assess their performance using suitable metrics, and choose the model that yields the best predictions for the given house price dataset.
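One way to run such an experiment is to train several candidate regressors on the same split and compare them with a single metric, here mean absolute error. The specific models and the synthetic data below are illustrative choices, not the ones used in the original post:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data: price roughly proportional to area, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 100 * X[:, 0] + rng.normal(0, 5000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate regressors to compare on the same held-out test set.
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[name] = mean_absolute_error(y_test, preds)

for name, mae in results.items():
    print(f"{name}: MAE = {mae:,.0f}")
```

Trying another model, say GradientBoostingRegressor, only requires adding one more entry to the dictionary, which keeps the comparison fair since every model sees the same split.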