A Novel Example-Dependent Cost-Sensitive Stacking Classifier to Identify Tax Return Defaulters

Tax evasion refers to an entity indulging in illegal activities to avoid paying its actual tax liability. A tax return statement is a periodic report comprising information about income, expenditure, etc. One of the most basic tax evasion methods is failing to file tax return statements or delaying their filing. Taxpayers who do not file their returns, or fail to do so within the stipulated period, are called tax return defaulters. The Government has to bear the financial loss caused by each default, and this loss varies from taxpayer to taxpayer. Therefore, while designing any statistical model to predict potential return defaulters, we have to consider the real financial loss associated with the misclassification of each individual. This paper proposes a framework for an example-dependent cost-sensitive stacking classifier that uses cost-insensitive classifiers as base generalizers to make predictions on the input space. These predictions are used to train an example-dependent cost-sensitive meta-generalizer. Based on the choice of meta-generalizer, we propose four variant models used to predict potential return defaulters for the upcoming tax-filing period. These models have been developed for the Commercial Taxes Department, Government of Telangana, India. Applying our proposed variant models to GST data, we observe a significant increase in savings compared to conventional classifiers. Additionally, we present an empirical study showing that our approach is more adept at identifying potential tax return defaulters than existing example-dependent cost-sensitive classification algorithms.


Introduction
Taxes can be classified into direct and indirect taxes. Direct taxes are payable directly to the government (e.g., income tax) and cannot be transferred to a third party. Indirect taxes can be shifted to a third party by the entity on which the tax is levied (e.g., VAT, excise duty). The Goods and Services Tax (GST) system is an indirect taxation system introduced in India in July 2017. This paper proposes a methodology to predict potential tax return defaulters for the GST system [1].

Working of the GST system
For demonstration purposes, we take a fictitious ornament manufacturer as an example, with 10% as the GST rate levied at every step (see Figure 1). Note that throughout the paper, we [...]. Suppose the manufacturer purchases raw material worth Rs. 2000 and pays Rs. 200 (10% of 2000) as tax on the purchase. Suppose further that the manufacturing process adds a value of Rs. 500 to the ornament, so that its value is now Rs. 2500. The total tax levied on the sale of this ornament to the retailer is Rs. 250 (10% of 2500). By setting off the tax already paid at the time of purchasing the raw material, the GST payable by the manufacturer is the tax collected minus the tax already paid, i.e., Rs. 50 (250 - 200). The retailer adds a margin of Rs. 500, increasing the total value to Rs. 3000, and sells the ornament to the consumer, who is levied Rs. 300 as tax on the purchase. Similarly, after setting off the Rs. 250 already paid when purchasing the ornament from the manufacturer, the retailer is liable to pay a GST of Rs. 50 (300 - 250). Finally, the total GST received by the Government is Rs. 300 (200 + 50 + 50), which is borne entirely by the end consumer.
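The set-off arithmetic in this example can be checked with a short sketch. The function name and the list of sale values are hypothetical; the raw material value of Rs. 2000 is an assumption implied by the Rs. 200 of tax already paid at a 10% rate.

```python
# Illustrative sketch: net GST remitted at each stage of a supply chain.
GST_RATE_PERCENT = 10  # flat 10% rate used in the example


def net_gst_per_stage(sale_values, rate_percent=GST_RATE_PERCENT):
    """Return the GST each seller remits: tax collected minus tax already paid."""
    remitted = []
    tax_already_paid = 0.0
    for value in sale_values:
        tax_collected = value * rate_percent / 100
        remitted.append(tax_collected - tax_already_paid)
        # The buyer at this stage sets this amount off at the next stage.
        tax_already_paid = tax_collected
    return remitted


# Sale values: raw material -> manufacturer's sale -> retailer's sale (Rs.)
sale_values = [2000, 2500, 3000]
remitted = net_gst_per_stage(sale_values)
total_received = sum(remitted)
print(remitted, total_received)
```

The total equals 10% of the final sale value, which is why the tax is ultimately borne by the end consumer.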

Motivation for this work
In the GST system, taxpayers are required to file their tax returns once a month. Defaulting on filing tax returns has the following consequences. First, defaulters have enough time at their disposal to manipulate their records. Second, the penalty imposed by the Government is negligible compared to the prevailing market interest rates, and is therefore not an effective deterrent. Lastly, retaining movable assets is always beneficial to businesses involving large monetary transactions. The motivation of this work is to construct a classification model that identifies potential return defaulters so that preventive measures, such as sending emails or SMS, or physically visiting their business premises, can be implemented. We are working with the Government of Telangana, India, and are using their data for analysis and for building models to increase tax return compliance.
To attack a classification problem such as the one presented here, one would be inclined to use conventional cost-insensitive classification algorithms such as Logistic Regression, K-Nearest Neighbors, etc., to design the classifier. However, this presents a significant problem. A conventional classifier assigns equal misclassification costs to every example. In practice, the cost of classifying a genuine taxpayer as a return defaulter might differ significantly from the cost of classifying a return defaulter as a genuine taxpayer. Moreover, the cost of misclassifying a return defaulter as a genuine taxpayer varies across individual taxpayers based on their respective turnovers. Hence, there is a trade-off between choosing a model with better cost savings and choosing a model with better predictive performance. To deal with this trade-off, we propose four variants of an example-dependent cost-sensitive stacking classifier. In a later section, we show that our Proposed Approach (PA) is adept at identifying potential tax return defaulters for the upcoming month with high accuracy. The approach adopted in this paper can be generalized to any indirect taxation system used globally.
Our paper is structured as follows. In Section 2, we briefly review existing works related to ours. Section 3 describes the data set and the feature extraction techniques used for designing the model. In Section 4, we describe the framework for the proposed variants of the example-dependent cost-sensitive stacking classifier. Section 5 discusses the performance of the PA on the data set and compares it with some commonly used example-dependent cost-sensitive and conventional cost-insensitive classifiers. Finally, in Section 6, we provide concluding remarks.

Related Work
In [2], Jasmien Lismont et al. used social network analysis concepts to develop a model to predict tax avoidance by including a wider variety of network features. In [3], Bianchi et al. use network measures of centrality to show that the taxpayers who collaborate with better-connected auditors are likely to have lower effective tax rates. In [4], Veronique Van Vlasselaer et al. worked on identifying entities that indulge in social security fraud by assigning a time-dependent exposure score to each business entity based on its involvement with known fraud business entities in the social network. In [5], Yusuf Sahin and Ekrem Duman have built classification models for detecting credit card fraud using Logistic Regression and Artificial Neural Networks, one of the first studies to compare the performance of Logistic Regression and ANNs for this use case. In [6], Charles X. Ling and Victor S. Sheng showed that cost-sensitive learning is a common approach to solve data imbalance problems. In [7], A. C. Bahnsen et al. proposed an example-dependent cost matrix for credit scoring. They proposed a cost function that introduces the example-dependent costs into logistic regression. In [8], A. C. Bahnsen et al. propose a framework for cost-sensitive classifiers, including Cost-Sensitive Decision Trees, Cost-Sensitive Random Forests, and ensembles of cost-sensitive models based on techniques such as majority voting and stacking Cost-Sensitive Logistic Regression generalizers. In [9], David H. Wolpert introduces Stacking or Stacked Generalization, an ensemble learning technique that aims to deduce generalizers' biases for the training set provided. In [10], Matjaz Kukar and Igor Kononenko designed a cost-sensitive analog for ANNs, with their study being the first to do so.

Description of the data set
We now proceed to briefly describe the data used to design our models. We were provided two types of data sets to prepare our models, namely: GSTR-1 data and month-wise GST returns data.

GSTR-1 Data
GSTR-1 is a financial statement that every taxpayer is required to submit monthly. This statement consists of details of all outward supplies, i.e., all sales done during the month corresponding to this statement. A fictitious sample of this statement is given in Table 1. Every row in Table 1 corresponds to one transaction. The data set contains several millions of such rows. The actual statement contains more information, such as the tax rate, the number of goods sold etc.

Creation of Network of taxpayers
In this model, we have attempted to quantify the amount of interaction between taxpayers. To compute this independent variable, we created a weighted, directed graph (social network) in which each vertex (node) corresponds to a taxpayer. The weight assigned to each vertex is the average tax paid per month (ATPM) by the corresponding taxpayer from July 2017 to November 2019. Vertex weights have been normalized using min-max normalization.
We have utilized the month-wise GST returns data explained in Table 2 to compute each taxpayer's vertex weight. We have placed a weighted, directed edge from taxpayer a to taxpayer b, where the weight of the edge is the amount of sales made by taxpayer a to taxpayer b during the period July 2017 to November 2019. To compute the edge weights, we have used the GSTR-1 data explained in Table 1. Like the vertex weights, the edge weights have been normalized using min-max normalization. This graph captures the scale of interaction between taxpayers.
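As a minimal sketch of the normalization step, the snippet below rescales vertex and edge weights of a toy graph to the interval [0, 1]. The taxpayer identifiers and amounts are invented for illustration, and the dictionary-based graph representation is an assumption rather than the representation used in this work.

```python
def min_max_normalize(weights):
    """Rescale a dict of raw weights to [0, 1] using min-max normalization."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        # All weights are equal: map them to 0.0 to avoid dividing by zero.
        return {key: 0.0 for key in weights}
    return {key: (value - lo) / (hi - lo) for key, value in weights.items()}


# Hypothetical toy data: vertex weights are ATPM values (Rs.),
# edge weights are total sales from the first taxpayer to the second.
atpm = {"a": 1000.0, "b": 5000.0, "c": 3000.0}
sales = {("a", "b"): 200.0, ("b", "c"): 1000.0, ("c", "a"): 600.0}

vertex_weight = min_max_normalize(atpm)
edge_weight = min_max_normalize(sales)
print(vertex_weight)
print(edge_weight)
```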

Ratio
This variable is extracted from the weighted, directed graph defined in subsection 3.3. The graph captures the degree of interaction and the monetary transactions between taxpayer b and other taxpayers, and the variable captures the influence of other taxpayers on b. If b has close ties with taxpayers who are known tax return defaulters, they will influence b not to file GST returns, and vice versa [2]. Let B be the set of all vertices corresponding to defaulters (who have filed at most one-fourth of their returns) and Y be the set of all vertices corresponding to taxpayers who have filed their returns in time (who have filed more than three-fourths of their returns in time).
Here, ω(υ) denotes the weight of vertex υ and ω(υb) denotes the weight of the directed edge υb.

Filed
This is the dependent variable in the model, with a binary outcome. It gives the GST return filing status of taxpayer b for December 2019, i.e., whether the taxpayer filed returns in time or not. Zero denotes that returns were filed in time (negative class), and one denotes that returns were not filed in time (positive class).

Not Filed Count
This is the number of GST returns that b did not file by the due date of the corresponding month, from July 2017 to November 2019.

Division-Name
The state of Telangana is divided into 12 geographic divisions to simplify administration. This independent variable gives the geographic division in which b is located.

ATPM
This is the average tax per month paid by b. We included the square, cube, log, and square-root terms of the ATPM in the model, as the relation between the ATPM and the log-odds of the Filed variable is polynomial.

MAD Value
This is the mean absolute deviation (MAD) of the first-digit distribution of the sales transactions of b from the distribution given by the first-digit Benford's Law (Section 3.1).
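A minimal sketch of such a computation is given below. It assumes the common definition of the first-digit MAD, namely the mean over the nine digits of the absolute difference between the observed first-digit proportion and the Benford proportion log10(1 + 1/d); the exact variant used in this work may differ, and the function names and sample amounts are hypothetical.

```python
import math
from collections import Counter


def first_digit(amount):
    """Return the leading non-zero digit of a non-zero amount."""
    for ch in str(abs(amount)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("amount has no non-zero digit")


def benford_mad(amounts):
    """Mean absolute deviation between observed and Benford first-digit proportions."""
    amounts = [a for a in amounts if a != 0]
    if not amounts:
        raise ValueError("at least one non-zero amount is required")
    counts = Counter(first_digit(a) for a in amounts)
    n = len(amounts)
    total_deviation = 0.0
    for digit in range(1, 10):
        expected = math.log10(1 + 1 / digit)  # Benford's first-digit law
        observed = counts.get(digit, 0) / n
        total_deviation += abs(observed - expected)
    return total_deviation / 9


# Hypothetical sales amounts (Rs.) for one taxpayer.
sample_sales = [1200.0, 1850.5, 230.0, 3100.0, 140.0, 9700.0, 1025.0, 460.0]
print(round(benford_mad(sample_sales), 4))
```

A larger MAD indicates a first-digit distribution that departs further from Benford's Law.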

Seasonality
Case A: Retailers selling a single commodity: In an actual market, the annual revenue of some businesses may show a seasonal trend. For example, for a taxpayer involved in the yogurt business, one might observe higher revenues in the peak summer (May-June in India) and lower revenues in the winter (November-February in India). To quantify this seasonality, we have calculated the cosine similarity between the output tax of each taxpayer selling a particular type of commodity and the mean of the output tax of all taxpayers selling that commodity.
$$\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
Here $A$ denotes the vector of the monthly output tax of a taxpayer selling a particular type of commodity, and $B$ denotes the vector of the monthly mean of the output tax of all taxpayers selling that commodity.

Case B: Retailers selling multiple commodities:
In a more general case, a retailer might generate revenue by selling multiple commodities, and each commodity might have its own seasonal trend. Consider a retailer who sells commodities from a set $I = \{A, B, C, D\}$. For this retailer, we calculate the seasonality parameter as follows:
$$\text{Seasonality} = \sum_{i \in I} \omega_i \, s_i$$
Here, $\omega_i$ is the weight associated with each commodity $i$, defined as
$$\omega_i = \frac{\text{Total revenue generated by sales of commodity } i \text{ (for that retailer)}}{\text{Total revenue generated by that retailer}},$$
and $s_i$ is the similarity of commodity $i$ for that retailer, calculated as in Case A.
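The two cases can be sketched together as follows. The sketch assumes that the multi-commodity seasonality parameter is the revenue-weighted sum of the per-commodity cosine similarities; the function names, dictionary layout, and quarterly toy figures are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero series
    return dot / (norm_a * norm_b)


def seasonality(retailer_tax, market_mean_tax, revenue):
    """Revenue-weighted sum of per-commodity similarities (assumed form)."""
    total_revenue = sum(revenue.values())
    score = 0.0
    for commodity, series in retailer_tax.items():
        weight = revenue[commodity] / total_revenue
        score += weight * cosine_similarity(series, market_mean_tax[commodity])
    return score


# Hypothetical output-tax series per commodity for one retailer.
retailer_tax = {"yogurt": [10, 40, 30, 5], "umbrellas": [0, 0, 8, 0]}
market_mean_tax = {"yogurt": [20, 80, 60, 10], "umbrellas": [1, 0, 0, 0]}
revenue = {"yogurt": 300.0, "umbrellas": 100.0}

score = seasonality(retailer_tax, market_mean_tax, revenue)
print(round(score, 4))
```

In this toy case the yogurt series is proportional to the market mean (similarity 1) and the umbrella series is orthogonal to it (similarity 0), so the score equals the yogurt revenue share.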

Class Imbalance
The ratio of genuine taxpayers to defaulters was noted to be 0.22, hence, the use of sampling techniques was not deemed necessary.

Framework for Example Dependent Stacking Classifier
In problems such as the identification of tax return defaulters, it is of paramount importance to minimize the government's losses on account of the defaulters. For this task, an example-dependent cost-sensitive classifier would be the most prudent choice, as opposed to cost-insensitive classifiers [7]. Intuitively, one can deduce that a cost-sensitive classifier would minimize the total cost (or increase the total savings) at the expense of overall model performance. On the other hand, a cost-insensitive classifier would aim for optimal model performance while leading to higher losses to the government. It follows that there is a trade-off between higher cost savings and better model performance. To alleviate this problem, we propose a novel framework for example-dependent cost-sensitive stacked classifiers that give a competitive model performance and increased savings compared to cost-insensitive classifiers.

Stacked Generalizers and General Framework
Stacked Generalization or stacking is an ensemble learning technique that aims to improve upon the performance of its constituent generalizers by deducing the biases of each of the individual generalizers. However, stacked generalizers do not always perform better than individual generalizers, and their efficacy depends on the choice of generalizers. While there is no defined architecture for a stacked generalizer, it is observed that stacking is most effective when the individual generalizers are as diverse as possible. We propose a framework for a two-level stacked generalizer constructed as follows. The first level, $G_1$, consists of a set of conventional cost-insensitive classifiers that deduce the biases of classifiers on the input space. The second level, $G_2$, consists of the meta-generalizer, which generalizes on the second space, consisting of the predictions of $G_1$. We consider four choices of generalizer for $G_2$, which give rise to four variants: Variant A ($G_2$ = Cost-Sensitive Decision Tree), Variant B ($G_2$ = Cost-Sensitive Bagging), Variant C ($G_2$ = Cost-Sensitive Random Forest), and Variant D ($G_2$ = Cost-Sensitive ANN). The choice of meta-generalizers was dictated by the savings score and the AUC-ROC score (Section 5.3.1). The models with the highest savings score and the highest AUC-ROC score were chosen as meta-generalizers.
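The construction of the second space can be sketched as follows. This is not the implementation used in this work: the base learners are toy decision stumps standing in for the cost-insensitive classifiers of the first level, and the use of out-of-fold predictions (so that no base learner predicts on data it was trained on) is an assumption in the spirit of [9]. All function names and data are hypothetical.

```python
def out_of_fold_predictions(fit_fns, X, y, n_folds=3):
    """Build the level-2 input space: one column of predictions per base learner.

    Each entry of fit_fns is a callable fit(X_train, y_train) -> predict,
    where predict(x) returns a label for a single example x.
    """
    n = len(X)
    meta_features = [[None] * len(fit_fns) for _ in range(n)]
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        for m, fit in enumerate(fit_fns):
            predict = fit(X_train, y_train)
            for i in test_idx:
                meta_features[i][m] = predict(X[i])
    return meta_features


def fit_stump(feature):
    """Toy base learner: threshold one feature at its training mean."""
    def fit(X_train, y_train):
        threshold = sum(x[feature] for x in X_train) / len(X_train)
        above = [yi for x, yi in zip(X_train, y_train) if x[feature] > threshold]
        # Predict 1 above the threshold if positives dominate there, else flip.
        positive_above = sum(above) * 2 >= len(above)

        def predict(x):
            return int((x[feature] > threshold) == positive_above)

        return predict
    return fit


X = [[0.1, 5], [0.9, 1], [0.2, 4], [0.8, 2], [0.15, 6], [0.95, 0]]
y = [0, 1, 0, 1, 0, 1]
level_one = [fit_stump(0), fit_stump(1)]
meta_features = out_of_fold_predictions(level_one, X, y)
print(meta_features)
```

The rows of `meta_features`, together with the per-example costs, would then form the training set of the cost-sensitive meta-generalizer.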

Cost function
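Let $S$ be a set of $N$ examples $x_i$, where each example is represented by the feature vector augmented with its associated costs, $x_i^* = [x_i, C_{TP_i}, C_{FP_i}, C_{FN_i}, C_{TN_i}]$, and labelled with the class label $y_i$. A classifier $f$, which generates the predicted label $c_i$ for each example $i$, is trained using the set $S$. Then the cost of using $f$ on $x_i^*$ is calculated by
$$\mathrm{Cost}(f(x_i^*)) = y_i \left( c_i C_{TP_i} + (1 - c_i) C_{FN_i} \right) + (1 - y_i) \left( c_i C_{FP_i} + (1 - c_i) C_{TN_i} \right), \qquad (1)$$
and the total cost is defined as
$$\mathrm{Cost}(f(S)) = \sum_{i=1}^{N} \mathrm{Cost}(f(x_i^*)).$$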
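A minimal sketch of this cost, assuming the standard example-dependent form of [7, 8] in which each example carries its own true-positive, false-positive, false-negative, and true-negative costs, is shown below. The tuple layout and the toy cost values are hypothetical.

```python
def example_cost(y, c, c_tp, c_fp, c_fn, c_tn):
    """Cost of predicting label c for one example whose true label is y."""
    return y * (c * c_tp + (1 - c) * c_fn) + (1 - y) * (c * c_fp + (1 - c) * c_tn)


def total_cost(y_true, y_pred, cost_rows):
    """Sum of example-dependent costs; each cost row is (C_TP, C_FP, C_FN, C_TN)."""
    return sum(
        example_cost(y, c, *row) for y, c, row in zip(y_true, y_pred, cost_rows)
    )


# Hypothetical costs in Rs.; only the false-negative cost differs per example.
cost_rows = [(150, 150, 1000, 0), (150, 150, 2500, 0),
             (150, 150, 400, 0), (150, 150, 700, 0)]
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]  # one TP, one FN, one FP, one TN

cost = total_cost(y_true, y_pred, cost_rows)
print(cost)
```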

Variant A
For variant A, we have $G_1$ as defined above, and we use $G_2$ = Cost-Sensitive Decision Tree Classifier (CSDT) [8]. In CSDTs, instead of using traditional splitting criteria such as Gini, entropy, or misclassification, the cost as defined in (1) is calculated for each node, and the gain of using each split is evaluated as the decrease in the total cost of the algorithm. The cost-based impurity measure is defined by comparing the costs when all the examples in a leaf are classified as negative ($f_0$) and as positive ($f_1$),
$$I_c(S) = \min\left\{ \mathrm{Cost}(f_0(S)), \mathrm{Cost}(f_1(S)) \right\}.$$
Then, using the cost-based impurity, the gain of using the splitting rule $(x^j, l^j)$, that is, the rule defined as splitting the set $S$ on feature $x^j$ at value $l^j$, is calculated as
$$\mathrm{Gain}_c(x^j, l^j) = I_c(S) - \frac{|S^l|}{|S|} I_c(S^l) - \frac{|S^r|}{|S|} I_c(S^r),$$
where $S^l = \{x_i^* \in S : x_i^j \le l^j\}$, $S^r = \{x_i^* \in S : x_i^j > l^j\}$, and $|\cdot|$ denotes the cardinality. Afterward, a decision tree is grown using the cost-based gain measure until no further splits can be made. After the tree is constructed, it is pruned using a cost-based pruning criterion that compares the cost of the tree classifier $f$ with that of $f^*$, where $f^*$ is the classifier of the tree without the pruned node.
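The cost-based impurity and gain can be illustrated with the sketch below. It assumes the cardinality-weighted gain described in [8], with the impurity of a node taken as the cheaper of labelling every example negative or positive; the function names and toy data are hypothetical, and the sketch evaluates a single candidate split rather than growing a tree.

```python
def cost_impurity(y, cost_rows):
    """min(cost of predicting all 0, cost of predicting all 1) on a node.

    Each cost row is (C_TP, C_FP, C_FN, C_TN).
    """
    all_negative = sum(yi * fn + (1 - yi) * tn
                       for yi, (_, _, fn, tn) in zip(y, cost_rows))
    all_positive = sum(yi * tp + (1 - yi) * fp
                       for yi, (tp, fp, _, _) in zip(y, cost_rows))
    return min(all_negative, all_positive)


def cost_gain(X, y, cost_rows, feature, threshold):
    """Decrease in cost-based impurity from splitting on X[i][feature] <= threshold."""
    left = [i for i in range(len(y)) if X[i][feature] <= threshold]
    right = [i for i in range(len(y)) if X[i][feature] > threshold]
    if not left or not right:
        return 0.0  # degenerate split: nothing is separated
    n = len(y)
    parent = cost_impurity(y, cost_rows)
    left_imp = cost_impurity([y[i] for i in left], [cost_rows[i] for i in left])
    right_imp = cost_impurity([y[i] for i in right], [cost_rows[i] for i in right])
    return parent - len(left) / n * left_imp - len(right) / n * right_imp


# Hypothetical node with one feature and per-example costs in Rs.
X = [[1], [2], [3], [4]]
y = [0, 0, 1, 1]
cost_rows = [(150, 150, 400, 0), (150, 150, 700, 0),
             (150, 150, 1000, 0), (150, 150, 2500, 0)]

gain = cost_gain(X, y, cost_rows, feature=0, threshold=2)
print(cost_impurity(y, cost_rows), gain)
```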

Variant B
For variant B, we have $G_1$ as defined above, and we use $G_2$ = Cost-Sensitive Bagging Classifier (CSB). Bagging or Bootstrap Aggregating is an ensemble technique that involves fitting base estimator(s) to random samples of the data set. The individual predictions are then aggregated using majority voting or weighted average to form a final prediction. To build our CSB, we have used the CSDTs mentioned above as base estimators and aggregated the individual predictions using majority voting [8].

Variant C
For variant C, we have $G_1$ as defined above, and we use $G_2$ = Cost-Sensitive Random Forest Classifier (CSRF). Cost-Sensitive Random Forest Classifiers are ensemble classifiers that work by creating multiple CSDTs and outputting the mode of the predictions made by the CSDTs as the final prediction of the ensemble classifier [8].

Variant D
Finally, we have implemented a Cost-Sensitive analog for an Artificial Neural Network [10]. To design our Cost-Sensitive ANN Classifier (CSANN), we have used the ReLU function as the activation function for the hidden layers and the logistic (sigmoid) function for the output layer. We have used equation (1) as the loss function for the neural network to incorporate the example-dependent cost-sensitive losses.
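One way to use equation (1) as a differentiable loss, sketched below, is to replace the hard label with the sigmoid output $p$ of the network, which gives the expected cost of an example. This substitution is an assumption made for illustration; the function names and cost values are hypothetical, and no training loop is shown.

```python
import math


def sigmoid(z):
    """Logistic function used for the output layer."""
    return 1.0 / (1.0 + math.exp(-z))


def expected_cost_loss(y, p, c_tp, c_fp, c_fn, c_tn):
    """Equation (1) with the predicted label replaced by the probability p."""
    return y * (p * c_tp + (1 - p) * c_fn) + (1 - y) * (p * c_fp + (1 - p) * c_tn)


def loss_gradient_wrt_p(y, c_tp, c_fp, c_fn, c_tn):
    """Derivative of the expected cost with respect to p (constant in p)."""
    return y * (c_tp - c_fn) + (1 - y) * (c_fp - c_tn)


# Hypothetical defaulter with a false-negative cost of Rs. 1000.
p = sigmoid(0.0)  # an undecided network output, p = 0.5
loss = expected_cost_loss(1, p, 150, 150, 1000, 0)
grad = loss_gradient_wrt_p(1, 150, 150, 1000, 0)
print(loss, grad)
```

The negative gradient for a defaulter with a large false-negative cost pushes $p$ towards one, and the size of the push grows with that taxpayer's cost.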

Software Used
All the models in this work have been designed using Python, as it is a high-level, open-source language with an extensive library ecosystem that handles large amounts of data well. Table 3 gives the different misclassification costs for a given taxpayer.

Cost Matrix
• True-negative cost ($C_{TN}$) is zero. We would not incur any cost for classifying an in-time return filer (actual class zero) as an in-time return filer (predicted class zero).
• True-positive cost ($C_{TP}$) is the expenditure towards sending SMS, calling the taxpayer, and other preventive measures, together with the cost of the associated manpower. This cost is the same for all taxpayers whose actual class is one and predicted class is one. This cost is Rs. 150.
• False-positive cost ($C_{FP}$) is the expenditure towards sending SMS, calling the taxpayer, and other preventive measures, together with the cost of the associated manpower. This cost is the same for all taxpayers whose actual class is zero and predicted class is one. This cost is also Rs. 150.
• False-negative cost ($C_{FN}$) depends on the ATPM of each taxpayer and the expected number of days of delay in filing the return. It is given by
$$C_{FN} = \left( \text{ATPM} \times \text{expected number of days of delay} \times \frac{18}{36500} \right) \times 3 + 100.$$
Here, ATPM × expected number of days of delay × 18/36500 is the loss incurred due to late filing of the return, where the interest rate is 18% per annum, i.e., 18/36500 per day. This cost is different for every taxpayer, as the ATPM and the expected number of days of delay in filing the return may vary for each individual taxpayer. We have multiplied this loss by three and added 100 to it in order to deter a defaulter from becoming a chronic defaulter.
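The cost matrix of a single taxpayer can be computed as in the sketch below. The constants are taken from the description above; the function names and the example taxpayer (an ATPM of Rs. 36,500 and an expected delay of 10 days) are hypothetical.

```python
INTEREST_RATE_PERCENT = 18   # annual interest rate on delayed tax
DAYS_PER_YEAR = 365
PREVENTIVE_COST = 150        # Rs., cost of SMS, calls, and manpower


def false_negative_cost(atpm, expected_delay_days):
    """C_FN: three times the interest lost on the delayed tax, plus Rs. 100."""
    interest_loss = (atpm * expected_delay_days * INTEREST_RATE_PERCENT
                     / (100 * DAYS_PER_YEAR))
    return 3 * interest_loss + 100


def cost_row(atpm, expected_delay_days):
    """Return (C_TP, C_FP, C_FN, C_TN) for one taxpayer."""
    return (PREVENTIVE_COST, PREVENTIVE_COST,
            false_negative_cost(atpm, expected_delay_days), 0)


row = cost_row(atpm=36500, expected_delay_days=10)
print(row)
```

For this taxpayer the interest loss is Rs. 180, so missing the default costs Rs. 640, against Rs. 150 for a preventive intervention.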

Performance of Proposed Variants
In this section, we have compared the four proposed variants (variants A, B, C, and D) on tax return data vis-à-vis each other. The models have been compared on the following metrics:

Savings score
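The savings score is defined as the relative improvement in cost obtained by using a classifier $f$ on $S$, compared to the cost of classifying all entries as class zero ($f_0$) or class one ($f_1$), whichever is lesser [7]:
$$\mathrm{Savings}(f(S)) = \frac{\mathrm{Cost}_l(S) - \mathrm{Cost}(f(S))}{\mathrm{Cost}_l(S)}, \qquad \mathrm{Cost}_l(S) = \min\left\{ \mathrm{Cost}(f_0(S)), \mathrm{Cost}(f_1(S)) \right\}.$$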
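A minimal sketch of the savings score, following the verbal definition above, is given below. The helper names, the tuple layout (C_TP, C_FP, C_FN, C_TN), and the toy costs are hypothetical.

```python
def total_cost(y_true, y_pred, cost_rows):
    """Sum of example-dependent costs; each cost row is (C_TP, C_FP, C_FN, C_TN)."""
    total = 0
    for y, c, (tp, fp, fn, tn) in zip(y_true, y_pred, cost_rows):
        total += y * (c * tp + (1 - c) * fn) + (1 - y) * (c * fp + (1 - c) * tn)
    return total


def savings_score(y_true, y_pred, cost_rows):
    """Relative cost improvement over the cheaper of the two constant classifiers."""
    n = len(y_true)
    baseline = min(total_cost(y_true, [0] * n, cost_rows),
                   total_cost(y_true, [1] * n, cost_rows))
    if baseline == 0:
        raise ValueError("baseline cost is zero; savings are undefined")
    return (baseline - total_cost(y_true, y_pred, cost_rows)) / baseline


# Hypothetical costs in Rs.
cost_rows = [(150, 150, 400, 0), (150, 150, 700, 0),
             (150, 150, 1000, 0), (150, 150, 2500, 0)]
y_true = [0, 0, 1, 1]

perfect = savings_score(y_true, [0, 0, 1, 1], cost_rows)
all_positive = savings_score(y_true, [1, 1, 1, 1], cost_rows)
print(perfect, all_positive)
```

Even a perfect classifier does not reach a savings score of one here, because every correctly flagged defaulter still incurs the preventive cost.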

Balanced Accuracy Score
The balanced accuracy score is a metric for models trained on imbalanced data sets, which avoids inflated performance metrics due to the abundance of one class (in a binary classification problem). It is defined as follows:
$$\text{Balanced accuracy} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right).$$

Recall
Recall, or the recall score, refers to the fraction of relevant records correctly classified by the model. It is defined as follows:
$$\text{Recall} = \frac{TP}{TP + FN}.$$
In the context of this paper, the recall score is the fraction of tax return defaulters correctly identified by the model.

F1-Score
The F1-Score is defined as the harmonic mean of the precision and recall of a model. Thus,
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$
The comparative performance of the four variants is summarized in Table 4. Of the four variants, we select Variant D ($G_2$ = Cost-Sensitive ANN) as our proposed approach (PA) for this data set, as it is the most adept at correctly predicting tax return defaulters, with the highest savings score and the highest AUC-ROC on both the training and test data sets among the four variants.
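The confusion-matrix metrics used in this comparison (balanced accuracy, recall, and F1-score) can be computed as in the sketch below. The labels are invented for illustration, and the sketch assumes that both classes are present and that at least one positive prediction is made, so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) with class one (defaulter) as the positive class."""
    tp = sum(1 for y, c in zip(y_true, y_pred) if y == 1 and c == 1)
    fp = sum(1 for y, c in zip(y_true, y_pred) if y == 0 and c == 1)
    fn = sum(1 for y, c in zip(y_true, y_pred) if y == 1 and c == 0)
    tn = sum(1 for y, c in zip(y_true, y_pred) if y == 0 and c == 0)
    return tp, fp, fn, tn


def recall(tp, fn):
    return tp / (tp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


def balanced_accuracy(tp, fp, fn, tn):
    """Mean of the recall on the positive class and on the negative class."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


# Hypothetical labels: 3 defaulters and 5 in-time filers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(recall(tp, fn), balanced_accuracy(tp, fp, fn, tn), f1_score(tp, fp, fn))
```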

Performance of Proposed Approach (PA) with existing algorithms
In this section, we have compared our proposed approach's performance with some cost-sensitive algorithms mentioned in [8]. Additionally, we compare the performance of the PA with a cost-sensitive ANN [10]. We have also compared the performance of our PA with a cost-insensitive ANN. We have chosen a cost-insensitive ANN as it gave the most promising results among the various cost-insensitive algorithms we experimented with, including KNNs, Random Forests, the XGBoost Classifier, the AdaBoost Classifier, and Logistic Regression. The performance has been compared using the same metrics described in section 5.3. The results are summarized in Table 5.

Confusion and Cost Matrices
Tables 6 and 7 are the training and testing confusion matrices for the PA. Tables 8 and 9 are the training and testing cost matrices for the PA. These give the true-positive cost, false-negative cost, true-negative cost, and false-positive cost for the training and testing data sets.

Training and Testing ROC Curves
Training and testing ROC curves for the PA are given in Figures 2 and 3. The AUC value of the training ROC curve is 0.94, and the AUC value of the testing ROC curve is 0.93. Since these values are high and close to each other, one can conclude that the model is neither under-fitting nor over-fitting.

Savings score
To measure an example-dependent cost-sensitive algorithm's performance, we use the savings score (Section 4.3). As observed in Table 5, the savings score for the PA is 0.520 and 0.582 for the training and test sets, respectively. Since the savings scores for the training and test sets are reasonably high and similar, we can conclude that our PA is performing well.

Conclusion
In this paper, we propose a framework for example-dependent cost-sensitive stacked generalization comprising four variant models. We show that our Proposed Approach (PA) outperforms commonly used example-dependent cost-sensitive classifiers. We use our PA to predict whether a given taxpayer is a potential tax return defaulter for the upcoming month. While this framework was designed on the GST returns data for Telangana, it can be generalized to predict potential tax return defaulters for any indirect taxation system around the world, using whichever of the four proposed variants performs best.