Big Data in Credit Scoring

The Global Findex (Global Financial Inclusion Database) report, released in 2018, shows that only 48.9 percent of adults in Indonesia own a bank account. Millions of unbanked Indonesian adults work in the private sector and are paid in cash. What is the main reason these adults do not have a bank account? Distance. Surprisingly, 69 percent of this population segment own a mobile phone. Banking institutions have made some efforts to reach the unbanked population, but these are not enough; a wide gap remains.

This helps explain why so many technology-driven multi-finance companies have emerged in recent years: they fill the gap. They know the characteristics of the unbanked population, and by using technology they can reach more of it. But reaching this population is not without risk.

Multi-finance companies compete with each other to capture the market, offering many products to attract clients. Some focus on lending primarily to people with little or no credit history. In that setting, asymmetric information, also known as "information failure," is bound to arise. In borrowing and lending, asymmetric information occurs when the borrower has more information about their financial state than the lender does.

Have you ever wondered how a bank or financing company decides to approve or reject a client's credit application? Most financing companies use the services of credit rating agencies (CRAs) to measure the credit worthiness of their clients. The client's credit score is measured each time the client applies for credit, which also helps reduce the asymmetric information.

The process of generating the credit score is called credit scoring. It is widely applied in many industries, especially banking. Generally, it consists of two main parts: building a statistical model, and applying that model to assign a score to a credit application or an existing credit account. The statistical model for credit scoring is called a Scorecard model, and most of the time it is based on Logistic Regression.
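To make the two parts concrete, here is a minimal sketch in plain Python: one function builds a logistic regression model, another applies it to a new application. The two predictors (late payments and debt-to-income ratio), the toy data, and the training settings are all made up for illustration; a real Scorecard model would be fitted on actual credit history with a proper statistics package.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.1, epochs=5000):
    """Part 1: build the model with batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log-loss w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_default_probability(x, weights, bias):
    """Part 2: apply the fitted model to one credit application."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical toy data: [late payments in the last year, debt-to-income ratio]
applications = [
    [0, 0.10], [0, 0.20], [1, 0.30], [0, 0.15],
    [3, 0.60], [4, 0.70], [2, 0.55], [5, 0.80],
]
defaulted = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = defaulted, 0 = repaid

weights, bias = fit_logistic(applications, defaulted)
p_low = predict_default_probability([0, 0.12], weights, bias)
p_high = predict_default_probability([4, 0.75], weights, bias)
print(f"low-risk applicant:  P(default) = {p_low:.3f}")
print(f"high-risk applicant: P(default) = {p_high:.3f}")
```

The fitted weights are what makes this kind of model easy to explain: each one says how much a predictor pushes the log-odds of default up or down.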

Why Logistic Regression? It is well suited to finding relationships between variables and assessing the significance of those relationships. Most of the time it is more stable and easier to interpret than advanced or black-box models. Interpretability matters because finance companies should have a clear explanation of why a client is accepted or rejected. On the other hand, a less advanced model like Logistic Regression often sacrifices some predictive power in exchange for that interpretability.



The score from the statistical model usually represents the probability that the client will default, that is, be unable to repay the credit in the future. In this raw form, the higher the score, the more likely the client is to default. Most CRAs, however, convert this default probability into a range of values that expresses credit worthiness, which is easier for the public to interpret: the higher the converted score, the better the client.
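One common way to do this conversion is to scale the log of the good-to-bad odds so that a fixed number of points doubles the odds. The sketch below assumes a base score of 600 at odds of 50:1 and 20 points to double the odds; these calibration values are illustrative only and are not taken from any particular CRA.

```python
import math

# Assumed calibration (illustrative, not from any particular CRA)
BASE_SCORE = 600.0   # score assigned at the base odds
BASE_ODDS = 50.0     # good:bad odds of 50:1 at the base score
PDO = 20.0           # points needed to double the odds

FACTOR = PDO / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)

def probability_to_score(p_default):
    """Convert a default probability into a score that rises with credit worthiness."""
    if not 0.0 < p_default < 1.0:
        raise ValueError("p_default must be strictly between 0 and 1")
    odds_good = (1.0 - p_default) / p_default  # odds of repaying vs defaulting
    return OFFSET + FACTOR * math.log(odds_good)

for p in (0.50, 0.10, 1 / 51, 0.01):
    print(f"P(default) = {p:.4f} -> score = {probability_to_score(p):.0f}")
```

With this scaling the direction is flipped compared with the raw model output: a lower default probability gives higher odds of repayment and therefore a higher score.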

Utilizing Big Data for Credit Scoring

The ideal Scorecard model should capture all of the clients' behaviours. CRAs usually have access to a client's credit history, but sometimes that is not enough. Nowadays, some technology-driven CRAs have started to use big data. It has been estimated that 2.5 quintillion bytes of data are generated each day. One way to visualize this: that much data would fill 10 million Blu-ray discs, which, stacked, would reach the height of four Eiffel Towers placed on top of each other. Data at this scale is often referred to as big data.

The ability of a financial institution to use all of its data, whether structured or semi-structured, is crucial in the age of big data analytics. Using data to make decisions across the entire institution can make it more efficient and drive an increase in revenue. As stated above, 69 percent of the unbanked population own a mobile phone. All the activities they carry out on their phones are captured somewhere. This is valuable data that can be turned into predictors for the Scorecard model.

We can see patterns in, or even form wild hypotheses from, the big data that comes from mobile phones. For example, fraudsters may tend to use a WiFi connection when they apply for credit through a mobile application, or clients who default may tend to visit betting websites excessively before applying for credit. A client's mobile phone brand, combined with other data, can also (loosely) approximate their economic condition. A CRA may hypothesize that clients who use an unpopular phone usually come from the lower-income population and are likely to have difficulty repaying credit. The total main storage of a client's phone, on the other hand, can indicate whether the client possesses a high-end phone, which again says something about their economic condition. Many other hypotheses can be derived from big data, and in the end the Scorecard model will show whether each hypothesis holds (is statistically significant) or not.
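Before a candidate predictor even reaches the Scorecard model, a quick significance check can be run on a simple 2x2 table. The sketch below applies a Pearson chi-square test (without continuity correction) to the betting-website hypothesis. The counts are invented for illustration, and this test is a stand-in for, not a replacement of, the significance tests on the coefficients of the fitted model.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for a 2x2 table.

    Table layout:
                    defaulted   repaid
        flag = yes      a         b
        flag = no       c         d
    Returns (chi2 statistic, p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the chi-square tail probability reduces to erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: clients who visited betting websites before applying
# (40 defaulted, 60 repaid) versus clients who did not (60 defaulted, 340 repaid).
chi2, p_value = chi_square_2x2(40, 60, 60, 340)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.2e}")
if p_value < 0.05:
    print("The flag is associated with default at the 5% level.")
else:
    print("No significant association found.")
```

A small p-value only says that the flag and default are associated in this sample; whether the predictor still matters alongside the others is decided when the full model is fitted.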
