Sunday, 12 September 2021

Loan Approval Prediction using Python


 

 

Introduction 


Background


    Loan eligibility depends primarily on the income and repayment capacity of the applicant(s). Other factors also affect eligibility, such as age, financial position, credit history, credit score, and other financial obligations.


Problem Description

    Loan Approval Prediction automates the loan eligibility process using customer details: Gender, Marital status, Education, Number of dependents, Self-employed, Monthly income, Loan Application amount and Credit history. The process identifies customer segments and the customers who are eligible for the loan amount.

Interest

    Financial companies and banks need to automate the loan eligibility process based on the customer details provided in the online application form. This project automates the loan eligibility process using historical customer data.


Data acquisition and cleaning


Data set

Range Index: 614 entries (0 to 613)

Data columns: 13 columns


Data type


The data set contains three data types: Object, Float and Integer. The number of categorical columns is shown in figure 1.1.

Figure 1.1


For more information about the data, see Loan Approval Data.

Data cleaning


1. First, I downloaded the data from the source and examined its information, description and shape. The data set had missing values because of gaps in record keeping. Many columns were also of object type, which makes them hard to use as features for prediction, so I converted them from object to numeric values.

2. The data set had several problems, so I started cleaning the data.

3. Many columns contain object-type data, and some other columns contain complicated values such as dates, floats and negative values. I dropped the unwanted columns based on further analysis.

4. After fixing these problems, I checked for outliers in the data. I found some extreme outliers, mostly caused by small sample sizes.
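
The cleaning steps above can be sketched roughly as follows. This is only an illustration under assumptions: the small DataFrame is made-up sample data, the column names are guesses at the loan data set's layout, and filling with the mode or median is one common choice rather than necessarily the method used in this project.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the real loan data (assumption).
df = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "Married": ["Yes", "No", "Yes", None],
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
    "Credit_History": [1.0, 0.0, np.nan, 1.0],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Inspect the shape and the number of missing values per column.
print(df.shape)
print(df.isnull().sum())

# Fill gaps: median for numeric columns, most frequent value otherwise.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

# Convert every non-numeric column to integer category codes.
for col in df.select_dtypes(exclude="number").columns:
    df[col] = df[col].astype("category").cat.codes

print(df.dtypes)
```

Category codes are assigned in sorted order of the labels, so 'Female' becomes 0 and 'Male' becomes 1 in this sample.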


Feature Selection


 Feature selection is the final step of data acquisition and is important for predictive modelling. The data set has 614 entries and 13 columns. Upon examining the meaning of each feature, it was clear that there was some redundancy in the features. However, all features in this data set are important for prediction.


Exploratory Data Analysis

    The problem is loan approval prediction, so the target variable is Loan_Status. The Loan_Status column records whether the loan was approved. It holds categorical (Yes / No) values, as shown in figure 1.2.


Figure 1.2


    Gender is a common feature in personal data. It is the range of characteristics pertaining to, and differentiating between, masculinity and femininity. This column has two values, male and female. There are 489 male and 112 female applicants. The plot in figure 1.3 shows how many male and female applicants were approved for a loan.

Figure 1.3



    Marital status is one of the important features. It describes a person’s relationship with a significant other, and a change in marital status can affect the target variable. The ‘Married’ column takes the values ‘Yes’ and ‘No’. There are 398 ‘Yes’ and 213 ‘No’ values. The plot in figure 1.4 shows marital status against loan approval.

Figure 1.4

    Education plays a major role in loan approval prediction. Pursuing a graduate or postgraduate degree is a loan eligibility criterion. The column contains two values, ‘Graduate’ and ‘Not Graduate’. There are 480 ‘Graduate’ and 134 ‘Not Graduate’ values. The plot in figure 1.5 shows education against loan approval.


Figure 1.5

    Self-employment also plays a major role in loan approval prediction. Most applicants in this data set are not self-employed. The ‘Self-Employed’ column takes the values ‘Yes’ and ‘No’. There are 82 ‘Yes’ and 500 ‘No’ values. The plot in figure 1.6 shows self-employment against loan approval.



Figure 1.6
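
The counts and approval breakdowns described for the categorical features can be computed with pandas. The sketch below uses a made-up six-row sample (an assumption, not the real data), so its numbers do not match the totals quoted above. The same two calls would apply to the Married, Education and Self-Employed columns.

```python
import pandas as pd

# Made-up sample rows (assumption); the real data set has 614 entries.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

# How many applicants fall into each category.
gender_counts = df["Gender"].value_counts()
print(gender_counts)

# Approved / not approved counts for each category.
approval_by_gender = pd.crosstab(df["Gender"], df["Loan_Status"])
print(approval_by_gender)
```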

    The distributions below are shown with a log transformation. A log transformation can be used to make highly skewed distributions less skewed. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.

    Figure 1.7 compares the distribution of total income before and after the log transformation.

Figure 1.7


    Figure 1.8 compares the distribution of loan amount before and after the log transformation.

Figure 1.8
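
The effect of a log transformation on skewness can be checked numerically. This is a minimal sketch: the income values are invented for illustration, and using the natural log via `np.log` is an assumption about how the transformation was applied.

```python
import numpy as np
import pandas as pd

# Invented right-skewed income values (assumption, not the real data).
income = pd.Series(
    [2500, 3000, 3200, 4000, 4500, 6000, 9000, 81000], dtype=float
)

# The natural log compresses the long right tail.
income_log = np.log(income)

# Skewness closer to zero means a more symmetric distribution.
print("skew before:", income.skew())
print("skew after: ", income_log.skew())
```

If a column can contain zeros, `np.log1p` is a common alternative because the log of zero is undefined.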


Box plot of major features

    The box plot shows the shape of a distribution, its central value and its variability. It helps in understanding the data, as shown in figure 1.9.

Figure 1.9
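
The quantities a box plot draws (quartiles, median and the whisker limits) can also be computed directly, which gives a simple way to list outliers. The loan amounts below are invented, and the 1.5 × IQR rule is the usual box plot convention, assumed here rather than taken from the project.

```python
import pandas as pd

# Invented loan amounts (assumption, not the real data).
loan_amount = pd.Series(
    [66, 95, 100, 110, 120, 128, 141, 158, 267, 650], dtype=float
)

# Central value and spread as drawn by a box plot.
q1 = loan_amount.quantile(0.25)
median = loan_amount.quantile(0.50)
q3 = loan_amount.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = loan_amount[(loan_amount < lower_fence) | (loan_amount > upper_fence)]

print("median:", median, "IQR:", iqr)
print("outliers:", outliers.tolist())
```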


Correlation between each data


    Correlation describes the relationship between each pair of columns in the data. It helps in deciding how to handle the data and its columns. The plot uses a color scale covering the range -0.2 to 1.0. Colors at the negative end of the range indicate a low (negative) correlation. For example, Education and Loan Amount have a low correlation, with a value of -0.2.

    Colors near zero indicate little or no correlation. Most pairs of columns in this data set fall in this range, for example Gender and Self_Employed, or ApplicantIncome and Loan_Amount_Term.




Figure 1.10


    Colors at the positive end of the range indicate a high correlation. Each column is perfectly correlated with itself, so the cells where the x-axis and y-axis name the same column have a value of 1.0. Some pairs of different columns are also highly correlated. For example, Total income and Applicant Income have a high correlation, with a value of 1.0. The correlations are visualized in figure 1.10.
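
A correlation matrix like the one in the figure can be produced with `DataFrame.corr`. The sketch below uses invented values, and it assumes that total income is the sum of applicant and co-applicant income. The `CoapplicantIncome` and `TotalIncome` column names are hypothetical.

```python
import pandas as pd

# Invented numeric columns (assumption, not the real data).
df = pd.DataFrame({
    "ApplicantIncome": [2500, 3000, 4000, 6000, 9000],
    "CoapplicantIncome": [0, 1500, 0, 2000, 0],
    "LoanAmount": [66, 100, 120, 141, 267],
})

# Assumed definition of total income for this sketch.
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

# Pairwise Pearson correlation; the diagonal is always 1.0.
corr = df.corr()
print(corr.round(2))
```

A heatmap of `corr` (for example with seaborn or matplotlib) would give a colored view similar to the figure.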


Relationship between major features


    The matrix plot shows the relationships between the major features using scatter plots and histograms. The histograms appear on the diagonal, and the scatter plots off the diagonal show the relationship between each pair of major columns. The matrix plot is shown in figure 1.11.


 



Figure 1.11

Predictive Modelling

    Predictive modelling uses statistics to predict outcomes. It is the general concept of building a model that is capable of making predictions. Typically, such a model includes a machine learning algorithm that learns certain properties from a training data set in order to make those predictions.


Models

    Models can use one or more classifiers to determine the probability that a set of data belongs to another set. There are two types of models, Regression and Classification.
    
    Regression is a supervised learning task in which the output is a continuous value. The goal is to predict a value as close as possible to the actual output, and the model is evaluated by calculating the error. The smaller the error, the greater the accuracy of the regression model.

    Classification is a supervised learning task in which the output is one of a set of defined labels (discrete values). The goal is to predict the class each sample belongs to, and the model is evaluated on the basis of accuracy. Classification can be binary or multi-class. In binary classification the model predicts one of two classes (0 or 1, yes or no), whereas in multi-class classification the model chooses among more than two classes.

    In this project the target value is categorical (discrete), so I chose a classification model.



Applying standard Classification algorithms 

    Classification in machine learning and statistics is a supervised learning approach in which the computer program learns from the data given to it and then classifies new observations. A classification model attempts to draw some conclusion from observed values. Given one or more inputs, a classification model will try to predict the value of one or more outcomes. Outcomes are labels that can be applied to a data set.


    There are a number of classification models, including K nearest neighbor, Naive Bayes, Logistic regression, Decision tree and Random forest.
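
The five classifiers listed above can be compared on a held-out test split with scikit-learn. This is a sketch under assumptions: the data is generated with `make_classification` as a stand-in for the cleaned loan data, and the hyperparameters and the 80/20 split are illustrative choices, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the loan data.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "K nearest neighbor": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and record its accuracy on the unseen test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Scores on synthetic data say nothing about which model is best for the loan data; the loop only shows the comparison procedure.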



Performance of Models

    Model evaluation metrics are required to quantify model performance. The choice of metrics depends on the machine learning task, which here is classification. Precision and recall are useful for many tasks.

    I applied some classification metrics for model evaluation: classification accuracy and the confusion matrix. Classification accuracy is the number of correct predictions as a ratio of all predictions made. The confusion matrix provides a more detailed breakdown of correct and incorrect classifications for each class, comparing actual and predicted values (true and false).

    The estimated performance of a model tells us how well it performs on unseen data. Based on the table, the best classifier for this problem is the decision tree. It has high accuracy and a better confusion matrix result. The table explains the performance of the different models.
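
As a small worked example of these metrics, the sketch below scores ten invented predictions (1 = approved, 0 = not approved) with scikit-learn. The labels are assumptions made up for illustration, not results from this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented actual and predicted labels (assumption).
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# Accuracy: correct predictions divided by all predictions.
accuracy = accuracy_score(y_true, y_pred)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Precision and recall follow from the confusion matrix cells.
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("accuracy:", accuracy)
print(cm)
print("precision:", precision, "recall:", recall)
```

Here 8 of the 10 predictions are correct, so the accuracy is 0.8, with one false positive and one false negative.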

Conclusion 


    Finally, I predicted loan approval based on this analysis. The classification algorithms achieved above 80% accuracy, which helps to identify loan eligibility. The analysis of the major features also helped to get a better result on this problem.

The purpose of this project was to predict loan approval. The company wants to automate the loan eligibility process based on customer details: Gender, Marital status, Education, Number of dependents, Self-employed, Monthly income, Loan Application amount and Credit history. The process identifies customer segments and the customers who are eligible for the loan amount.



If you want to explore the project:


