Loan prediction - simple classification algorithm
Hello! I built this algorithm as my final project for the first cohort of the mentoring program run by the She Code Africa community. The program ran for three months, and this was my final project before graduation. Building the algorithm took three and a half weeks in total.
To carry out this project successfully, I followed the basic steps of the data science methodology: Business Understanding, Analytic Approach, Data Collection, Data Understanding, Data Preparation, Model Building, Evaluation and Conclusion.
First, a problem had to be identified, because the problem to be solved determines what data should be used. The problem was to identify which individuals a bank should give loans to, based on the information available about them. The aim of solving it is to help the bank reduce its losses on loans.
With the problem defined, the next step was to select a suitable analytic approach. Since the outcome is a loan being approved or denied (1 or 0), it was clear that this was a classification problem. I decided to use logistic regression because my experiment relied on a binary dependent variable.
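To illustrate why logistic regression suits a binary outcome, here is a minimal sketch of the idea behind it: a score is squashed into a probability between 0 and 1, and a cut-off turns that probability into a label. The scores and the 0.5 threshold below are assumptions chosen for illustration, not values from my project.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(z: float, threshold: float = 0.5) -> int:
    """Return 1 (approved) if the probability reaches the threshold, else 0 (denied)."""
    return 1 if sigmoid(z) >= threshold else 0


# Hypothetical scores; in a real model they come from the fitted weights.
scores = [-2.0, 0.0, 3.0]
labels = [predict_label(z) for z in scores]
print(labels)
```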
As I am in Nigeria, I searched for a dataset that represented my location, but I was not successful. I ended up with a dataset from Kaggle owned by a company called Dream Housing Finance, which deals in home loans and is present across urban, semi-urban and rural areas. The company allows clients to apply for a home loan after it has validated the customer's eligibility. The dataset contains 13 columns and 614 rows; a description of the dataset is shown in fig 1 below.
To understand the data, I loaded the dataset's CSV file into a Jupyter notebook; for this project I used Google Colab. I began by importing all the necessary libraries and my training dataset into the notebook. I then printed out the dataset's info, its description and some of its rows.
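A first look at the data could be sketched roughly like this with pandas. The three rows and the column names here are a made-up stand-in so the snippet runs on its own; with the real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV; rows and column names are assumed.
csv_text = """Loan_ID,Gender,Married,ApplicantIncome,LoanAmount,Loan_Status
LP001,Male,No,5849,,Y
LP002,Male,Yes,4583,128,N
LP003,Female,Yes,3000,66,Y
"""
df = pd.read_csv(io.StringIO(csv_text))

df.info()             # column types and non-null counts
print(df.describe())  # summary statistics for the numeric columns
print(df.head())      # the first few rows

# Count of missing values per column, useful before data preparation.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```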
To understand the distribution of the data, I plotted all the categorical variables against the loan status. (Categorical variables are variables that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group based on some property.)
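One way to compare a categorical variable with the loan status is a cross-tabulation, which is the table behind a grouped bar chart. This is only a sketch on invented rows; the column names `Married` and `Loan_Status` are assumptions about how the dataset is laid out.

```python
import pandas as pd

# Invented example rows, not taken from the real dataset.
df = pd.DataFrame({
    "Married": ["Yes", "Yes", "Yes", "Yes", "No", "No"],
    "Loan_Status": ["Y", "Y", "Y", "N", "Y", "N"],
})

# Share of approved (Y) and denied (N) loans within each category.
rates = pd.crosstab(df["Married"], df["Loan_Status"], normalize="index")
print(rates)
```

Calling `rates.plot(kind="bar")` in a notebook would draw the same table as a chart.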
I discovered that two-thirds of the applicants had been granted a loan. There were more males than females requesting loans, and two-thirds of the applicants were married; married applicants also had a better chance of getting a loan. I also discovered that applicants with zero dependents, applicants with an education, and applicants living in semi-urban areas were more likely to have their loans approved.
Next, I decided to replace the missing values in the dataset. Normally, the mean of each variable is used to replace its missing values, but I was not sure that the mean was the best option, so I drew box-and-whisker plots to check. Fact: outliers have a significant effect on the mean of a dataset, so if there are a lot of outliers, the mean is not a safe choice. It turned out that I had a lot of outliers, so I decided to fill the missing values with the mode instead.
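The check can also be sketched in numbers. A boxplot usually flags points lying more than 1.5 times the interquartile range beyond the quartiles, and the snippet below applies that same rule before filling the gap with the mode. The loan amounts are invented for illustration.

```python
import pandas as pd

# Invented loan amounts with one missing value and one extreme value.
amounts = pd.Series([110.0, 120.0, 120.0, 130.0, None, 900.0])

# The usual boxplot rule: points beyond 1.5 * IQR from the quartiles.
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
outliers = amounts[(amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)]

mean_value = amounts.mean()     # pulled upwards by the extreme value
mode_value = amounts.mode()[0]  # the most frequent value

# Fill the missing entry with the mode instead of the mean.
filled = amounts.fillna(mode_value)
print(outliers.tolist(), mean_value, mode_value)
```

Here a single extreme value is enough to drag the mean far above the typical amounts, while the mode stays at the most common one.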
In preparation for building my model, I changed the loan status values from Y and N to 1 and 0, as I felt this would help the algorithm I had decided to work with (LogisticRegression). After that, I imported scikit-learn's model_selection module to help me split my data into training and test sets before training the model.
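These two preparation steps might look something like the sketch below. The ten rows, the feature names and the `random_state` value are assumptions made so the example runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented rows; the feature names are assumed for illustration.
df = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006, 12841],
    "Credit_History": [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y", "Y", "Y", "N", "Y", "N"],
})

# Encode the target: Y becomes 1 (approved), N becomes 0 (denied).
df["Loan_Status"] = df["Loan_Status"].map({"Y": 1, "N": 0})

X = df.drop(columns="Loan_Status")  # features
y = df["Loan_Status"]               # target

# Hold back 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```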
Evaluation and Conclusion
After that, I fit the model and started predicting. At first, with a test size of 0.4, my model had an accuracy of 60%; when I changed the test size to 0.3, the accuracy became 75%. When I changed the test size to 0.1, I got an accuracy of 85%. I stopped there in order to avoid overfitting the data.
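The experiment with different test sizes can be sketched as a small loop. This version uses synthetic data from scikit-learn's `make_classification` rather than the loan dataset, so its scores will not match the ones I reported; `max_iter`, `stratify` and the seeds are assumptions added for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the loan dataset.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

accuracies = {}
for test_size in (0.4, 0.3, 0.1):
    # Split, fit and score once for each test size.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    accuracies[test_size] = accuracy_score(y_test, model.predict(X_test))

for test_size, accuracy in accuracies.items():
    print(f"test_size={test_size}: accuracy={accuracy:.2f}")
```

Because a smaller test set leaves fewer rows to score, repeating the loop with a few different `random_state` values can show how much of the change comes from the split itself.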
I was able to carry out this project successfully with the help of my mentor, Becky Mashaido, who was assigned to me during the She Code Africa mentoring program. The program began on the first of January 2020 and is slated to end by March 31st. I have really learnt a lot and made new friends as a result of the mentoring program. I would recommend it to all ladies who have imposter syndrome, as I suffered from it before joining. Now I am more confident than ever that I can become a data scientist.
Thank you for reading.