Optimal Feature Selection for Credit Scoring and Classification

From some recent work by Andrew Milne, Max Rounds and Phil Goddard at 1QBit.

Feature Selection with a Quantum Annealer

The growing power of quantum annealers gives us new reasons to look at previously studied quadratic unconstrained binary optimization (QUBO) problems.

Over the last few months, we have been looking at a technique we call QUBO Feature Selection, applied to a well-known classification problem in the open literature: the German Credit Data from the University of California Irvine (UCI) Machine Learning Repository.

QUBO Feature Selection is surprisingly effective. When applied with a variable parameter representing the desired balance between feature independence and feature influence, the method returned some 50 candidate subsets out of 32 trillion possibilities. . The best of these was clearly better than other nearby subsets, as we show in the graph above and later in the post.

A complete description of QUBO Feature Selection is given in our white paper, Optimal Feature Selection in Credit Scoring and Classification Using a Quantum Annealer, along with extensive references to work by other researchers.

As you read through this post, remember that the problem of selecting a “good” subset from a large feature set is not limited to credit scoring and classification. The problem of optimal feature selection can be relevant in any field where the volumes and types of data are growing.

Credit Scoring as a Problem in Computational Finance

Credit scoring and classification is a significant problem in computational finance. U.S. household debt alone is estimated to be around \$14 trillion, with 2% of these loans being non-performing.  Superficially, this indicates that lenders are making good decisions 98% of the time.  However, according to the U.S. Federal Deposit Insurance Corporation,  547 U.S. institutions have failed in the 15 years since 2001.

According to the 2016 Credit Access Survey by the U.S. Federal Reserve Bank of New York, approximately 40% of U.S. credit applications are rejected.  Moreover, between 20% and 40% of consumers expect to be rejected (it depends on the type of credit), and many do not even apply.  Yet among these people there may well be qualified customers for the right kind of lender.

In a recent article in Forbes magazine, consumer lending veteran Matt Harris offered an interesting perspective:

“Most start-up originators focus on the opportunity to innovate in credit decisioning. The headline appeal of new data sources and new data science seem too good to be true as a way to compete with stodgy old banks. And, in fact, they are indeed too good to be true…

… The recipe for success here starts with picking a truly under-served segment. Then figure out some new methods for sifting the gold from what everyone else sees as sand; this will end up being a combination of data sources, data science and hard won credit observations. …

… progressive and thoughtful traditional lenders like Capital One have mined most of the segments large enough to build a business around. The only way to build a durable competitive moat based on Customer Acquisition is to become integrated into a channel that is proprietarycontextual and data-rich.”

If Harris is correct, the key success criterion for new Credit Scoring and Classification tools will not be their speed and accuracy in large-scale applications, for which tools like FICO already exist, but in their flexibility and ease of integration into specialized applications.

Formulation of the Credit Scoring and Classification Problem

In this analysis, credit data is comprised of features, where each feature may be:

• an integer, where the order (lower to higher) has potential meaning, e.g. bank balances, years of education, etc.
• a category, such as a geographic region code, where higher and lower values are arbitrary.  May also include “missing” or “refused to answer”.
• a decimal number, such as age, dollar amounts, interest rates and so on, where the order has meaning, i.e. data such as the latitude and longitude of the applicant’s home would be better represented as a category.
• a Boolean (yes/no), which may be considered as an integer or a category.
• A credit observation, or sample, consists of an observation for each feature, and some means of identifying the applicant and the outcome.

The raw data may come from many sources.  For analysis, however, we clean the raw data to create a row of “feature variables” for each observation.  For example, we may convert some or all of the categorial variables to binary indicators, a step described in a later section of this post.  We may also scale the data and (in some cases) replace missing values with inferences.  In the description that follows, we will assume that these steps have been taken and the data is in a form suitable for input to our feature selector and classifier.

For convenience, we organize the clean data as a feature matrix of  m  rows and  n  columns.  Each column represents a feature, while each row represents the specific data values for a specific past credit applicant.

$$U=\begin{bmatrix} u_{11} & u_{12} & u_{13} & \dots & u_{1n} \\ u_{21} & v_{22} & u_{23} & \dots & u_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ u_{m1} & u_{m2} & u_{m3} & \dots & u_{mn} \end{bmatrix}$$

We also have a record of the decisions that were made.  We represent these as the  m-element class vector,

$$V=\begin{bmatrix} v_{1} \\ v_{2} \\ \vdots \\ v_{m} \end{bmatrix}$$

The  vi  will be constrained to take on the values  0  and  1, where  represents the acceptance and 1 represents the rejection of  credit application  i.  Conceptually, the classifier will be a “credit risk detector” that signals when an applicant should not be granted credit.  This is consistent with the U.S. Fed data showing that rejections are less common than acceptances, i.e. acceptance is the rule, rejection is the exception.

Feature Selection

Assume that from the original set of n features we now want to select a subset of  K  features to use in making a credit decision.  Data costs money, and there’s no point in paying for data that we don’t need.  However, the number of possible subsets for each  K  is given by the combinatorial function C( n, K ), and this can make the search space very large. Finding a good subset can be a challenge.

Mathematically, our goal will be to find the columns of  U  that are correlated with  V,  but which are not correlated with each other.  We have not yet defined what correlation means here, but we assume that such a calculation is possible and that the value of the correlation coefficient can take on values from  -1  to 1.  Note that we can interpret “correlation” quite liberally: a “hard-won credit observation” might appear as a “hard requirement” that an attribute be present.  Taking this a step further, the conversion of categorical variables to binary indicators (see below) can be expanded to include corroborations that the lender sees as necessary.

Let  ρij  represent the correlation between column  i  and column  j  of the matrix  U, and let  ρVj  represent the correlation between column  j  of  U  and the single column of V.

We now introduce  n  binary variables  xj , which have the property
$$x_j = \begin{cases} 1, & \text{if feature }j \text{ is in the subset} \\ 0, & \text{otherwise} \end{cases}$$
The  objective function is then constructed from two components.  The first of these represents the influence that features have on the marked class, shown here in a form that increases as more terms are included:

$$\sum_{j=1}^{n} x_j |\rho_{Vj}|$$

The second component represents the independence. In the form shown below, the sum increases as more cross-correlated terms are included, which is the opposite of independence.  The negative sign is added at the front so that the term increases as independence increases. Generally speaking, we will want the  xj  values to be zero for the larger cross-correlation terms.
$$-\sum_{j=1}^{n}\sum_{k=1,\\k\neq j\enspace }^n x_j x_k|\rho_{jk}|$$
We combine these using a parameter  α  ( 0 ≤  α ≤ 1 ) to represent the relative weighting of independence (greatest at α=0) and  influence (greatest at α=1).

Combining the two terms and negating the expression overall (to optimize at the minimum), we obtain
$$f(\mathbf{x})= – \Bigg[ \alpha\sum_{j=1}^{n} x_j |\rho_{Vj}| – (1-\alpha) \sum_{j=1}^{n}\sum_{k=1,\\k\neq j\enspace }^n x_j x_k|\rho_{jk}| \Bigg]$$
We can make use of the property that xj × xj = xj  for binary variables, which  allows us to rewrite the summation as a vector product,
$$f(\mathbf{x})= -\mathbf{x}^TQ\mathbf{x}$$
From this we can express the problem in terms of the  argmin  operator, which returns the vector  x  for which its function argument is minimized:
$$\mathbf{x}= \underset{\mathbf{x}}{\textrm{argmin}}\Bigg[ -\mathbf{x}^TQ\mathbf{x} \phantom{i} \Bigg]$$

Classification

The classification problem is conceptually separate from the feature selection problem described above.  From here on, we assume that our “vector of observations” consists only of features from the subset we have decided to use.

The classification problem may then be stated as follows:  given a vector of observations for a new applicant, determine whether the applicant belongs to the creditworthy class.  More specifically, find a classification function  that returns 0 for acceptance and 1 for rejection.

In our study, we used a simple classifier based on logistic regression.  Unlike linear regression, where one can imagine two continuous variables and fitting a line to a scattered set of points, logistic regression assumes that the dependent variable is a category,  e.g. the zero and one of a binary classifier.  The fitted line no longer predicts the value of the dependent variable, but rather the probability that the dependent variable will have a specific value.  It is a well-established technique with a long pedigree.

Binarizing, Scaling and Correlating the German Credit Data

The German Credit Data under consideration was originally published in 1994 by Hans Hofmann at the Institute for Statistics and Econometrics the University of Hamburg.  It has been studied extensively.

The data consists of 20 features (7 numerical, 13 categorical) and a binary classification (good credit or bad credit).  There are 1000 rows, of which 700 are “good” and 300 are “bad”.  The data is intended for use with a cost matrix, i.e. giving credit to a bad applicant is five times worse than not giving credit to a good applicant. In this study, however, we were mainly concerned with how the separability of the data varied across the feature subsets, so the cost matrix was not used.

The data was prepared as follows:

• The  german.data  file from UCI was imported into a Jupyter (iPython) notebook as a Pandas DataFrame and given column headers with names from the accompanying  german.doc file.
• The categorial variables were converted to “one-hot” binary indicators using the DictVectorizer class from scikit-learn
• The first binary indicator in each group was removed (for k indicators only k-1 are independent)
• All of the numerical features were scaled to mean-zero, variance-one.
• The classification variable was transformed to 0=good, 1=bad.

A variety of correlation methods were studied.  These led to small differences the feature subsets, but no real difference in the accuracy of the classifications.  It was noted, however, that methods with a “smooth” distribution of coefficients (Spearman, Pearson, etc.) worked better with the quadratic objective function than correlation methods with sharper jumps, such as Mutual Information scores.

The binarization and scaling procedure transformed the 20 features in the German Credit Data into 48 feature variables.  For example, Attribute 1 (“Status of existing checking account”) which had four possible values in the original data, was converted into three binary indicators.  (In the resulting DataFrame these appeared as columns “ChqStat=A12”, “ChqStat=A13” and “ChqStat=A14”.).  These 48 feature variables are the input to the feature selection algorithm.  The feature subset at the output of the feature selector then forms the input to the classifier.

Coding for the 1QBit SDK versus Other Machine Learning

The 1QBit SDK provides a toolkit for solving Quadratic Unconstrained Binary Optimization (QUBO) problems. The details of the underlying quantum hardware are abstracted away, so that the code used by the analyst is no more complex than the code needed to use machine learning packages such as Weka or scikit-learn.

The RFECV class from scikit-learn is a good example.  Recursive feature elimination is “wrapped” with a classifier chosen by the user, e.g.

   featureMatrix = U.values
classVector   = V.values
estimator     = LogisticRegression()
selector      = RFECV(estimator, step=1, cv=3)
selector      = selector.fit( featureMatrix, classVector )
indexList     = selector.get_support()
featureList   = np.where(indexList)[0]


The 1QBit SDK allows LogisticRegression()  and other classes from scikit-learn and other packages to be wrapped with its solution to the QUBO problem, along with a testing and scoring class such as scikit-learn’s ShuffleSplit or StratifiedShuffleSplit.

Inside the wrapper,  a search is made for the best value of  α  (for the QUBO) and the best values for the parameters used by the classifier.

At the core of the search loops, or wherever the QUBO problem must be solved to obtain a candidate feature set, we construct a  Q  matrix  using the QDK’s QuadraticBinaryPolynomialBuilder  class, which returns a polynomial object representing the objective function seen earlier.  The poly object is passed to a solver and the solution is returned as a list of ones and zeros, referred to as a configuration.

We convert this to a list of integer indices that can be used to extract columns from a Pandas DataFrame or NumPy array, which is typically how feature matrices and class vectors are passed to machine learning methods.    A simplified example is shown below.


##                 ... Some code to construct the Q matrix as in the equations above...
poly=              builder.build_polynomial()
solver=            qdk.MultiTabu1OptSolver()
solutionList=      solver.minimize(poly)
lowEnergySolution= solutionList.get_minimum_energy_solution()
config=            lowEnergySolution.configuration
featureList=       np.argwhere(config.values()).flatten().tolist()
featureMatrix=     U.iloc[:, featureList].values
classVector=       V.values
estimator=         LogisticRegression()  # and so on...


In the work that led to this post, we fixed the parameters for the classifier, and focused on studying how the value of   α  affected the feature subset returned by QUBO Feature Selection, and how that subset affected the accuracy scores returned by the logistic regression classifier.  A full holistic optimization across all of the available parameters is a topic for future work.

Evaluation Metrics

The evaluation of QUBO Feature Selection and Recursive Feature Elimination was performed by wrapping them with the LogisticRegression model from scikit-learn.  The evaluation metric was defined as the unweighted accuracy, i.e. the number of correct classifications  divided by the total number of classifications made.  Other metrics from scikit-learn were tried, but they always led to the same optimal alpha or RFE feature set.  Unweighted accuracy was kept for compatibility with other work and ease of understanding, e.g. (Huang op.cit.)

Testing and scoring was done with the StratifiedShuffleSplit cross-validation class from scikit-learn. Given the feature matrix, this class returns sets of row indices that can be used to divide the matrix rows into a training set and a test set.  The separation is done in folds, with the number of folds set by an argument.  For example, a shuffle and split with 5 folds will take 80% of the matrix for the training set and 20% for the test set, and repeat this process until all 5 of the possible 20% folds have been used as test sets.

The accuracy score for each split is slightly different.  Since these reflect a random selection of data for training and testing, it is conventional to report the mean of the individual accuracy scores.  We follow the convention used by scikit-learn and report the error bars at the 95% confidence level.

Establishing the Zero-Rule and Other Baseline Properties

The German Credit Data has 700 class 0 samples (“good credit”) and 300 class 1 samples (“bad credit”).  A zero-rule classifier that assigns all the samples to class 0 will therefore achieve a success rate of 70%.

We want our proposed feature selection and classification scheme to do better than the zero-rule.  We want the feature selection component to choose subsets that are better than randomly selected subsets and (for that matter) better than no selection at all.  These will form a baseline against which QUBO Feature Selection can be compared.

Feature selection is motivated by the intuitive concept that not all features are equally important.  For example, in Figure 1 below, we see the feature variables from the binarized German Credit data ranked by their Spearman correlation coefficient with the classification variable.  There are four relatively important features at the left hand end, followed by a gradual, almost linear decline.  It turns out that the four features on the left are not enough to form a predictive subset on their own, so the problem for the feature selector is to find where on the line “enough is enough”.

Figure 1.  Spearman correlation. The most influential features involve the status of the applicant’s chequing account, savings account and loans at other institutions, all which are included in all practical feature subsets.  As we move towards the right, however, the influence of each new feature declines.

It also turns out that the smooth decline of the coefficients works well with quadratic objective function used in QUBO Feature Selection, as was seen in comparisons of Pearson, Spearman and Kendall correlations (readily available in the Pandas package).  In contrast,  the use of a Mutual Information score in place of a correlation coefficient was not as successful.  The integer features were binned into ordered categories and generally behaved as expected.  However, the more arbitrary binarized categories (e.g. loan purpose) had uniformly low mutual information scores, and the objective function tended to cycle amongst features.  A plot of the ranked mutual information scores (features and the classification variable) is shown in Figure 2.

Figure 2.  Mutual Information Scores. The Age, Term and Credit Amount fields were binned into categories.  However, the net effect was fluctuation in the accuracy scores and a lower mean accuracy at a “best” feature subset with 34 elements.

The results shown in this post were all calculated using the Spearman correlation coefficient.  As mentioned previously, this is an area where more research is needed, e.g. using other data sets with a broader mix of feature variables.

Before we select any features, however, we first examine how well the logistic regression classifier performs on the full feature set, i.e. all 48 feature variables.  We do this using the StratifiedShuffleSplit() class from scikit-learn.  Figure 3 shows how the mean accuracy depends on how the data is shuffled, and on how the data is split between the training set and the test set.

Figure 3. Mean Accuracies measured for various numbers of shuffles (10 to 3000) and for different fractions of the data assigned to the test set.

The combination of 1000 shuffles and 20% test share was chosen arbitrarily as the standard for initial performance comparisons, being much more convenient than the larger numbers.  It avoids the fluctuations found below 500 shuffles, and is close to the converged scores at 3000 samples and above.  For the definitive score comparison, however, the full 3000 shuffles were used.  The results for 10%, 15% and 20% share were always very close and typically in the median position.  The 20% test share was therefore used throughout.

It can be seen in Figure 3 that the mean accuracy increases as the test share is reduced, i.e. a bigger training set yields a more accurate predictor. However, the difference is small in relation to the dispersion of scores from different shuffles, as can be seen in Figure 4.

In Figure 4, we see the distribution of accuracy scores for all 48 features at a 20% test share (800/200 train/test split) counted over 1000 shuffles. The “stratification” causes the training set and the test set to have the same 70/30 distribution of good and bad credit samples, so that a 200 sample test set will contain 60 bad credit samples chosen from 300 in the set overall.  In a large number of shuffles there will inevitably be some repetition (a point highlighted by scikit-learn in its documentation).  Thus, although the data looks “Gaussian” and fits into the Gaussian overlay, in practice there are certain scores that occur more frequently, and extreme values are not observed above a certain limit. Note that in comparison to the spread of accuracy scores from 0.7 to 0.8 seen in Figure 4,  the (converged) spread of mean scores from 0.75 to 0.76 in Figure 3 is relatively small.  So long as we avoid small test shares and low numbers of shuffles, the error bars from the accuracy scores will dominate the uncertainty overall.

Figure 4.  Accuracy scores for all 48 samples.  Using 1000 shuffles with 20% test share.  This is a typical distribution for  accuracy scores in the German Credit Data.

QUBO Feature Selection with Logistic Regression

QUBO feature selection and a logistic regression classifier were “wrapped” together.  Practically speaking, this means that the feature selector and the classifier were placed together at the interior of the loops used to optimize the selection and classification parameters.  For simplicity, the only optimization parameter was  α, which determines the relative weighting of independence (greatest at α=0) and  influence (greatest at α=1).

It was found that the cardinality of the feature subset tended to increase with α, as had been noted by other researchers such as Huang and Demirer (references below).  However, an advantage of the wrapper model is that optimization can be done “at the  α  level” without looking at the details of the subsets.

Figure 5 shows the full range of  α  from 0 to 1.  On the left hand side where  α  is close to zero, the emphasis is on feature independence.  This favors small subsets, and since their regression coefficients are often not large enough to push the classifier across the cutoff point of the logistic classifier (p ≥ 0.5 so that the class is 1), they classify almost all the samples as “good credit” and achieve the zero-rule’s 70% success rate.

Figure 5.  Accuracy increases as we choose influential features over independent features.

In the second chart, we look more closely at the region between α=0.9 and α=1.  Here, the emphasis is on feature influence, and the subsets eventually grow to include all 48 available feature variables.

The interesting thing is that accuracy increases with the size of the subset, reaches a peak at α=0.977 with 24 elements, and then declines gradually as more features are added.  The drop in accuracy to the left of  α=0.977  is quite sharp, and although it’s encouraging to see a global maximum so clearly defined, this may be due to the data, and should be further investigated.

Figure 6.  A closer look at QUBO Feature Selection in the  α=0.9  to  α=1  region.

Recursive Feature Elimination with Logistic Regression

Recursive feature elimination is a technique for pruning features from a feature list.  The procedure begins with a set of available features.  It fits the logistic regression model, and eliminates the feature with the lowest weight.  The fitting and elimination is continued until the desired number of features is reached,  In recursive feature elimination with cross-validation (RFECV), the fitting is accompanied by testing, using a training set and test set chosen according to a folding parameter.

Conceptually, RFE is like the QUBO starting at  α = 1  and working backwards, except that RFE is recursive and treats each iteration as a new feature set (this could be done with the QUBO too, of course, and would make an interesting topic for future work).   Unlike QUBO, RFE does not test explicitly for feature independence, nor does it allow a feature to “come back” after it has after it has been eliminated.

We begin with the direct version of RFE, where we specify the desired number of features and then measure the performance of the returned feature subset.

The mean accuracy also depended on the number of folds, and this is also shown in Figure 7 with a range of error bars.  The accuracies were measured with the same 1000 shuffles and 20% test share that was used earlier.  The best mean accuracy of 0.77 ± 0.05 was found at the target cardinality of 28.

Figure 7.  Recursive Feature Elimination for cardinality targets from 1 to 48.  Using 1000 shuffles, 20% test share. Mean accuracy 0.77 ± 0.05

The RFECV method (with built-in cross validation) was very fast, although it converged to different feature subsets as the cross validation settings and random seeds were varied.  In practice, howver,  it was easy to search these (and faster than running RFE for thousands of shuffles).  RFECV ultimately delivered a 31 element feature subset with an accuracy of  0.76 ± 0.05, comparable with the other methods.

Comparison of QUBO, RFE and Random Search

Figure 8 shows the mean accuracies for QUBO Feature Selection, Recursive Feature Elimination and Random Search over 1000 sample subsets. The differences between the methods are smaller than the error bars. A larger search over more random samples tends to push the random maxima upwards, but it is not an effective search technique. Its main value is to inform us whether our feature subset selection algorithms are genuinely “better than chance”.

Figure 8.  Comparison of QUBO Feature Selection, Recursive Feature Elimination and Random Search over all 48 cardinalities.

When the best subsets from each method are compared at 3000 shuffles, we obtain the following, table, ranked by the number of features in the subset.

   Method   Accuracy            F1-Score        Number of Features
QUBO     0.764 ± 0.049       0.54 ± 0.1      24
RFE28    0.767 ± 0.050       0.55 ± 0.1      28
RFECV    0.764 ± 0.050       0.54 ± 0.1      31
Rand10k  0.762 ± 0.050       0.54 ± 0.1      35


These are equivalent accuracies, with QUBO giving the smallest feature subset.  However, random feature selection with 10,000 samples per cardinality had the lowest score of all, which shows that 10,000 samples (out of several trillion) cannot tell us much about extreme values.  There may be other subsets “out there”  that were missed by both QUBO and RFE. To assess this possibility, we examine some of the feature subsets that are “near” the QUBO and RFE results, in the sense that they differ by one or two features.

The F1-Score was computed using the f1_score() method of the LogisticRegression() class from scikit-learn.  It is defined as the harmonic mean of the precision and the recall, where precision is the ratio of true positives to all predicted positives, and recall is the ratio of true positives to all actual positives.  The above values reflect the classification of the German Credit Data without the application of a cost matrix, and can only be compared with results in the literature that were calculated in the same way.

Comparison with Potentially Missed Subsets

Figure 9 shows the mean accuracy scores from  feature subsets that have the same cardinality as the best subsets, but which differ in one feature or two features. Gaussian curves showing the distribution of the accuracy scores for the best subsets have been overlaid on the histograms.

The dashed vertical lines represent the means of the best subsets.  The vertical bars represent the counted occurrences of mean accuracies for the perturbed feature subsets.   QUBO Feature Selection does a better job than RFE in finding a subset that is better than its neighbors.  This reflects the behavior of the quadratic objective function, which considers all of the feature sets simultaneously (especially in the quantum annealer implementation).  In contrast, once a feature has been eliminated by RFE, there is no possibility of bringing it back on a later iteration.

Figure 9.  Comparison of mean accuracies from the “best” QUBO (blue) and “best” RFECV (green) with mean accuracies from feature sets of the same cardinality but differing by one or two features.  Gaussian distributions with the corresponding “best”  standard deviations have been overlaid.

The same behavior is observed when adding or subtracting a feature from the best subsets.  In Figure 10, we see that the accuracy of the best QUBO subset is again better than the perturbed subsets.

Figure 10.  Comparison of mean accuracies from the “best” QUBO (blue) and “best” RFE (green) with mean accuracies from feature sets with one more feature and one less feature.  Gaussian distributions with the corresponding “best”  standard deviations have been overlaid.

Comparison with Previously Reported Results

The German Credit Data was originally published in 1994 by Hans Hofmann at the Institute for Statistics and Econometrics the University of Hamburg. Since that time it has been studied extensively.

At the time of writing, the most thorough survey of how quadratic optimization compares with other methods is given by Waad (see reference below).  Waad also used a publicly available machine learning package: Weka 3.7.0, and created a  “three-stage feature selection fusion” technique with QUBO Feature Selection as the first stage, which yielded very good accuracies.  Waad’s results are not directly comparable to the results presented here, since they were given in terms of precision and recall, which are affected by whether or not the 5:1 cost ratio prescribed by Hoffmann has been applied in the training set, i.e. it is 5 times worse to give credit to a non-creditworthy borrower than to reject a creditworthy borrower.  In the work reported in this post, we did not use the cost function and Waad does not mention it.  Also, the Weka package does not have a function like scikit-learn’s StratifiedShuffleSplit(), and in Waad’s reported experimental procedure, the division of the samples into a training set and a test set was done at an early stage with 10 folds but no shuffling.

Taken in total, however, Waad provides a strong motivation for studying how QUBO Feature Selection might be used as part of a larger procedure. For example, Figure 8 shows that none of the methods was the best at every subset cardinality, and an expanded view of the feature selection histograms in Figure 9 would show that a few subsets with two-feature changes were very slightly better than the best subset returned by QUBO Feature Selection.  Additional searching might uncover more.

For feature selection overall, Chen (see reference below) was able to achieve good results with 12 features from the original German Credit Data with its 20 categorical and  integer (or fixed decimal) features.  These were manually selected after comparing various correlation measures between the features and the classification.  Chen did not use the programmatic binarization of categorical variables that came into greater popularity between 1998 and the present day.  Chen’s results are primarily for Support Vector Machines and do not deliver notable accuracy, especially since Chen also reports performing only a single 10-fold cross validation.   The real lesson from this early work is that correlation makes a difference, and that automatic binarization might not always be a good idea.

The thrust of our work has been to show how QUBO Feature Selection can be used with common techniques from publicly available software packages, as a means of programmatically reducing the number of feature variables, including those created by the wholesale binarization of categorical variables.  This was done successfully, but the literature shows that there is still some progress to be made.

Conclusions

On the German Credit Data, a comparison of QUBO Feature Selection with Recursive Feature Elimination showed that the QUBO method could find a smaller feature subset with no loss in accuracy.   When a method like Logistic Regression is used as the classifier, it can also reveal the tradeoff between feature independence and feature influence, using correlation coefficients that can be related to real world understanding of the credit scoring problem.  A priority for future work is to study its behavior on different data sets, with different correlation methods, and with a multi-stage approach that could improve its performance.  The authors hope that the availability of QUBO Feature Selection via 1QBit’s Quantum-Ready SDK will make this method more accessible to researchers interested in studying its possibilities.

Acknowledgements

The authors would like to thank Majid Dadashi and the 1QBit Software Development Team for their assistance with the SDK.  Anna Levit provided useful comments on the draft. Jaspreet Oberoi contributed the idea that led to Recursive QUBO Feature Selection, a concept whose possibilities we have yet to explore.

References

1QBit (2017), qdk.1qbit.com/documentation/, accessed Jan 23, 2017.

Anscombe (1973),  F. J. Anscombe, Graphs in Statistical Analysis, The American Statistician,  Vol. 27, No. 1 (Feb., 1973), pp. 17-21. Stable URL: http://www.jstor.org/stable/2682899 (and the summary on the Wikipedia is excellent).

Board (2016), Board of Governors of the Federal Reserve System (US), Households and Nonprofit Organizations; Credit Market Instruments; Liability, Level [CMDEBT], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/CMDEBT, December 21, 2016.

Bouckaert (2009),  Bouckaert, R. R., E. Frank, M. Hall, R. Kirkby, P. Reutemann, A. Seewald, and D. Scuse (2009). Weka manual (3.7.1).  This is the citation given by Waad (2016), and describes the Weka features available at the time.  For additional details see www.cs.waikato.ac.nz/ml/weka/

Chen (2010),  Fei-Long Chen, Feng-Chia Li, Combination of feature selection approaches with SVM in credit scoring,  Expert Systems with Applications 37 (2010) 4902–4909,  www.elsevier.com/locate/eswa

Cox (1958). Cox, David, “The regression analysis of binary sequences (with discussion)”. J Roy Stat Soc B. 20: 215–242. JSTOR 2983890

Demirer (1998),   Demirer, Riza and Eksioglu, Burak, Subset Selection in Multiple Linear Regression: A New Mathematical Programming Approach (working paper) University of Kansas, School of Business.  A later version of the paper appeared in Computers and Industrial Engineering, Volume 49 Issue 1, August 2005.

FDIC (2017), Bank Failures in Brief, https://www.fdic.gov/bank/historical/bank/, accessed Jan 8, 2017.

FDIC (2017), Statistics at a Glance, https://www.fdic.gov/bank/statistical/stats/, accessed Jan 8, 2017

FICO (2017)  FICO (formerly the Fair Isaac Company) www.fico.com, accessed Jan 10, 2017

FRBNY (2016), SCE Credit Access Survey (October 2016), Center for Microeconomic Data, Federal Reserve Bank of New York, https://www.newyorkfed.org/microeconomics/sceIndex/credit-access.html#indicators/overall-credit-rates-t1-a/g18, accessed Jan 8, 2017

Goel (2016),  Goel, Vindu, Russian Cyberforgers Steal Millions a Day With Fake Sites, New York Times Online, December 20, 2016,   http://www.nytimes.com/2016/12/20/technology/forgers-use-fake-web-users-to-steal-real-ad-revenue.html

Harris (2016), Harris, Matt, The Short History And Long Future Of The Online Lending Industry, Forbes Valley Voices, www.forbes.com, accessed Jan 8, 2017.

Huang (2014), Huang, Jun, FEATURE SELECTION IN CREDIT SCORING- A QUADRATIC PROGRAMMING APPROACH SOLVING WITH BISECTION METHOD BASED ON TABU SEARCH,   PhD Dissertation by JUN HUANG at Texas A&M International University.

OECD (2016), Household debt (indicator). doi: 10.1787/f03b6469-en (Accessed on 21 December 2016)

OECD (2016), Household disposable income (indicator). doi: 10.1787/dd50eddd-en (Accessed on 21 December 2016)

Pedregosa (2011),  Pedregosa et al.Scikit-learn: Machine Learning in Python,  JMLR 12, pp. 2825-2830, 2011

Powers (2011),  Powers, D.M.W., EVALUATION: FROM PRECISION, RECALL AND F-MEASURE TO ROC, INFORMEDNESS, MARKEDNESS & CORRELATION, Journal of Machine Learning Technologies, ISSN: 2229-3981 & ISSN: 2229-399X, Volume 2, Issue 1, 2011, pp-37-63,  Available online at http://www.bioinfo.in/contents.php?id=51, accessed 11 Jan, 2017

Rao (2016), Rao, Malleswara, How to Evaluate Bank Credit Risk Prediction Accuracy based on SVM and Decision Tree Models, Capgemini “Capping IT Off” Blog, Nov 2, 2016,  https://www.capgemini.com/blog/capping-it-off/2016/11/how-to-evaluate-bank-credit-risk-prediction-accuracy-based-on-svm-and, accessed Jan 18, 2017

Rosenberg (2007),  Rosenberg, Andrew and Hirschberg, Julia, V-Measure: A conditional entropy-based external cluster evaluation measure, Department of Computer Science, Columbia University, New York, 2007, from http://www1.cs.columbia.edu/~amaxwell/pubs/v_measure-emnlp07.pdf, accessed Jan 18, 2017

scikit (2017), http://scikit-learn.org/, accessed Jan 12, 2017.  Also Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011 (citation requested by the publisher).

sklearn cv (2017)  3.1. Cross-validation: evaluating estimator performance, at http://scikit-learn.org/stable/modules/cross_validation.html, accessed Jan 12, 2017

UCI MLR (2016) University of California, Irvine.  Machine Learning Repository, http://archive.ics.uci.edu/ml/, accessed Jan 8, 2017.

Waad (2016), WAAD BOUAGUEL BEN N’CIR, On Feature Selection Methods for Credit Scoring, PhD Dissertation, Université de Tunis, Institut Supérieur de Gestion, Ecole Doctorale Sciences de Gestion LARODEC, January 2015

World Bank (2017). International Debt Statistics 2017. Washington, DC: World Bank. doi: 10.1596/978-1-4648-0994-1. License: Creative Commons Attribution CC BY 3.0 IGO

World Bank (2016), World Bank, Bank Non-Performing Loans to Gross Loans for United States [DDSI02USA156NWDB], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/DDSI02USA156NWDB, December 21, 2016.