
“In this world nothing can be said to be certain, except death and taxes”

Benjamin Franklin

Department of Electrical Engineering, Computer Engineering and Informatics

MSc Data Science and Engineering

Semi-supervised Variational Autoencoders for Tax Evasion Risk Estimation

The Cyprus Tax Department is examining the use of advanced analytics as part of its effort to improve tax compliance and thereby reduce the VAT GAP. A machine learning project has been developed with the assistance of the Cyprus University of Technology to improve the selection of non-compliant or tax-evading taxpayers for audit. The model predicts whether a material audit yield will be generated in the event of a tax audit, with an accuracy of 78.4%, and is currently being evaluated for pilot deployment. A senior management willing to invest resources in state-of-the-art technology research, close collaboration with academia, statistical analytics and fit-for-purpose data are essential for the success of any such endeavor. The model was developed using off-the-shelf hardware and open source software, and uses a semi-supervised generative approach to target non-compliant taxpayers. This thesis illustrates how a state-of-the-art machine learning model can help tax departments around the world improve their efficiency. Tax department data can be used not only for assessing tax liabilities but also for modeling taxpayer behavior and highlighting cases of non-compliance. The resources available to a tax department for performing tax audits are both scarce and costly. An improvement in case selection can result in higher compliance, by increasing the detection of non-compliant taxpayers, and in additional revenue for the state, by targeting the taxpayers with the maximum audit yield. Thus, this thesis provides evidence that machine learning can assist in running more efficient tax departments.


Contents

Abstract
Acknowledgements
1 Introduction
    1.1 Value Added Tax
        1.1.1 VAT GAP
        1.1.2 Cyprus VAT GAP
    1.2 The Tax department
        1.2.1 Tax Compliance
        1.2.2 Tax compliance actions
    1.3 Classical audit case selection
    1.4 Machine Learning
        Networks
        Artificial Neural Network
        Supervised Machine Learning
        Unsupervised Machine Learning
        Semi Supervised Learning
2 Approach
3 Evaluation
    3.1 The confusion matrix
    3.2 The hyperparameters of the highest accuracy model
    3.3 All the hyperparameters experimented
4 Conclusions and future work
    4.1 Project Conclusions
    4.2 Future work

To my wife and children.


Chapter 1 Introduction

The European Commission uses the concept of the VAT GAP to estimate the amount of non-compliance with the VAT legislation. The VAT GAP measures the difference between the amount of VAT that should be paid and the actual VAT paid by the taxpayers. VAT undercollection is a problem all European Union member states have to face and solve.

The abundant data from filed tax returns and other sources can be exploited by machine learning in order to assess whether a taxpayer is compliant.

Semi-supervised learning is used for classification when only a fraction of the observations have corresponding class labels. This situation arises in many real-life classification problems, such as image search (Fergus et al., 2009), genomics (Shi and Zhang, 2011), natural language parsing (Liang, 2005), and speech analysis (Liu and Kirchhoff, 2013). Similarly, tax departments have abundant unlabeled data for taxpayers, but obtaining audit results (class labels) is expensive and cannot be done for all taxpayers.

To the best of the author's knowledge, deep learning based on the generative semi-supervised learning paradigm has not previously been used for taxpayer audit selection.

“Can a Tax Department use data of unaudited taxpayers to predict with high accuracy the tax yield in case of a tax audit?”

The answer is the development of probabilistic models for inductive and transductive semi-supervised learning that utilize an explicit model of the data density, following the recent advances in deep generative models and scalable variational inference (Kingma and Welling, 2014; Rezende et al., 2014).

The basic algorithm for semi-supervised learning is the self-training scheme (Rosenberg et al., 2005), in which a model acquires additional labelled data from its own predictions. The procedure is repeated until a preset goal is achieved. Because it relies on heuristics, poor predictions might be reinforced. Transductive SVMs (TSVM) (Joachims, 1999) extend SVMs with the aim of max-margin classification while ensuring that as few unlabeled observations as possible lie near the margin. Optimizing these approaches and applying them to large amounts of unlabeled data is difficult.
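The self-training scheme can be illustrated with a minimal sketch. The nearest-centroid base classifier, the margin-based confidence measure, and the names `self_train`, `min_margin` and `max_rounds` are assumptions chosen for illustration; they are not taken from Rosenberg et al. or from the model developed in this thesis.

```python
# Self-training sketch: a base classifier is fitted on the labelled data,
# its confident predictions on unlabelled data are added as pseudo-labels,
# and the procedure repeats. The base classifier here (nearest centroid)
# is an illustrative assumption.

def centroids(points, labels):
    """Mean feature vector per class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(cents, p):
    """Return (label, margin); margin is the distance gap between the two nearest centroids."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5, y) for y, c in cents.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def self_train(labelled, unlabelled, min_margin=1.0, max_rounds=10):
    """labelled: list of (point, label); unlabelled: list of points."""
    points = [p for p, _ in labelled]
    labels = [y for _, y in labelled]
    pool = list(unlabelled)
    for _ in range(max_rounds):
        cents = centroids(points, labels)
        confident, rest = [], []
        for p in pool:
            y, margin = predict(cents, p)
            (confident if margin >= min_margin else rest).append((p, y))
        if not confident:          # preset goal: stop when nothing new is confident
            break
        for p, y in confident:     # pseudo-labels join the training set
            points.append(p)
            labels.append(y)
        pool = [p for p, _ in rest]
    return centroids(points, labels), pool
```

The sketch also shows the weakness noted above: a wrong but confident pseudo-label is never revisited, so early mistakes can be reinforced in later rounds.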

Graph-based methods are popular. They create a graph connecting similar observations, and label information is propagated between labelled and unlabeled nodes by finding the minimum energy (MAP) configuration (Blum et al., 2004; Zhu et al., 2003). For graph-based approaches the graph structure is crucial and eigen-analysis of the graph Laplacian is required, which limits the scale to which these methods can be applied, though efficient spectral methods are now available (Fergus et al., 2009).

Neural network approaches combine supervised and unsupervised learning: feed-forward classifiers are trained with an additional penalty from an auto-encoder or other unsupervised embedding of the data (Ranzato and Szummer, 2008; Weston et al., 2012).

In the Manifold Tangent Classifier (MTC) (Rifai et al., 2011), contractive auto-encoders (CAE) are trained to learn the manifold on which the data lies. The most competitive model performance was achieved by combining manifold learning using graph-based methods with kernel (SVM) methods in the Atlas RBF model (Pitelis et al., 2014).

In this thesis, the power of generative models is exploited. They treat the semi-supervised learning problem as a missing data problem, in which the missing labels are replaced with substituted (imputed) values. The result is a framework in which probabilistic modeling and deep neural networks work together to perform semi-supervised learning with generative models.

Using a semi-supervised deep neural network, it was possible to achieve 78.4% accuracy in detecting whether a material tax yield will be generated from a VAT audit of a taxpayer. The model was run on off-the-shelf hardware and open source software.

This thesis has been prepared with the collaboration of the Cyprus Tax Department and the Cyprus University of Technology. To avoid compromising sensitive data and to protect the public interest, the information disclosed is limited.

1.1 Value Added Tax

The Value Added Tax (VAT) is a consumption tax assessed on the value added to goods and services in the European Union (EU). It is applicable to almost all goods and services that are bought and sold for use or consumption in the EU.

Goods and services which are not sold in the EU are exempt from VAT. Conversely, imports from outside the EU are subject to VAT so that EU producers can compete on equal terms with non-EU suppliers.

1.1.1 VAT GAP

EU Member States are losing billions of euros in VAT revenues due to tax fraud and inadequate tax collection systems. The VAT GAP is the difference between the expected VAT receipts and the actual VAT collected.

It is an estimate of the revenue lost due to tax fraud, tax evasion, tax avoidance, bankruptcies, insolvencies and miscalculations. The source of the VAT GAP is taxpayers' non-compliance.

1.1.2 Cyprus VAT GAP

According to the European Commission, the VAT GAP for Cyprus for 2015 (latest report) was estimated to be 122 million EUR or 7.44%. It is critical for the Tax department to reduce the VAT GAP.
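As a worked illustration of the definition, the two figures quoted above can be used to back out the expected VAT they imply. This sketch assumes that the percentage is expressed relative to the expected VAT receipts; the function name `vat_gap` and the derived amounts are illustrative and do not come from the European Commission report.

```python
def vat_gap(expected_vat, collected_vat):
    """Return the VAT GAP in absolute terms and as a share of expected VAT."""
    gap = expected_vat - collected_vat
    return gap, gap / expected_vat


# Figures quoted in the text for Cyprus, 2015.
gap_meur = 122.0     # million EUR
gap_share = 0.0744   # 7.44%, assumed to be relative to expected VAT

# Implied amounts (derived here, not reported in the source).
implied_expected = gap_meur / gap_share          # roughly 1,640 million EUR
implied_collected = implied_expected - gap_meur  # roughly 1,518 million EUR

print(round(implied_expected), round(implied_collected))
```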

1.2 The Tax department

The primary goal of a Tax Department is to collect the taxes payable according to the tax legislation, in a sustainable manner, by increasing public confidence in the tax system and its administration.

Non-compliance may arise because taxpayers are ignorant, careless, reckless or purposely evading taxes, or because of inefficiencies of the tax department. It is therefore impossible for all taxpayers to be compliant all the time. The tax administration's effort is placed into strategies and structures ensuring that compliance with the tax legislation is maximized.

The tax department, with a finite level of resources, needs to allocate them carefully so as to achieve the maximum possible outcome in terms of improved compliance with the tax legislation.

1.2.1 Tax Compliance

A taxpayer's obligations towards the tax department are in general:

• registration in the system with accurate data on time;

• timely filing of required information;

• complete and accurate information reporting; and

• on time payment of obligations.

A taxpayer meeting all the above may be considered as tax compliant. Each failure to meet any of the above obligations may result in an increase of the VAT GAP.

1.2.2 Tax compliance actions

A tax administration can utilize a plethora of measures to tackle non-compliance, based on its understanding of taxpayer behavior. For example, taxpayers who have been compliant in the past and have not filed the latest return may be sent a reminder letter. A taxpayer with a history of non-compliance can expect legal measures to be taken against them immediately.

The most expensive measure in terms of resource allocation is the tax audit. It requires highly trained and experienced auditors to perform a detailed, lengthy audit on just one taxpayer. The decision whether a taxpayer is audited, and is thus allocated these scarce resources, is of paramount importance.

1.3 Classical audit case selection

Traditionally a tax department would rely on rule based systems to analyze data, assess the risk and make predictions about taxpayer behavior.


The rule based approach is a laborious and time consuming process. Rules are created based on the experience and expectations (bias) of auditors. Whether a decision is made depends on the number of rules ‘fired’. For example, a rule based system will decide that a tax refund payment will be made to a taxpayer if the number of rules ‘fired’ does not exceed five (5).
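The rule-counting decision described above can be sketched as follows. The three rules, the field names (`refund_claimed`, `input_vat`, `late_filings`) and their thresholds are hypothetical examples; the actual rules used by the tax department are not disclosed.

```python
from typing import Callable, Dict, List

TaxReturn = Dict[str, float]
Rule = Callable[[TaxReturn], bool]

# Hypothetical rules and field names, for illustration only.
RULES: List[Rule] = [
    lambda r: r["refund_claimed"] > 10_000,                # unusually large refund
    lambda r: r["refund_claimed"] > 0.5 * r["input_vat"],  # refund large relative to input VAT
    lambda r: r["late_filings"] >= 2,                      # history of late filing
]


def rules_fired(tax_return: TaxReturn, rules: List[Rule] = RULES) -> int:
    """Count how many rules evaluate to True for this return."""
    return sum(1 for rule in rules if rule(tax_return))


def approve_refund(tax_return: TaxReturn, max_fired: int = 5) -> bool:
    """Pay the refund only if the number of fired rules does not exceed max_fired."""
    return rules_fired(tax_return) <= max_fired
```

Every rule has to be written and maintained by hand, which is the cost that the machine learning approach in the next section aims to avoid.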

1.4 Machine Learning

Machine Learning is a subset of artificial intelligence where computers change and improve their algorithms themselves through learning from data and information without being explicitly programmed.

Machine learning is more efficient than the rule based approach, since there is no need to create detailed rules for each and every model, saving time for the data analyst and the business expert. The model can be trained using previous results. In the case of a model that predicts whether the result of a tax audit will be material, the results of previous audits can be used to train the model.

Networks

One approach to solving complex problems is ‘divide and conquer’. A complex problem may be broken down into simpler, bite-size tasks and, vice versa, a collection of many small components can be used to create a complex system (Bar-Yam, 1997). Networks can be used to solve complex problems by simply using a set of nodes and connections between them.

The nodes can be computational units, receiving input and giving an output after processing. The processing of the input can vary from very simple tasks, like multiplication, to very complex ones if the node contains another network.

The connections represent the information exchange between nodes. The flow can be in one direction (unidirectional) or in either direction (bidirectional).

The behavior of the entire network depends on the individual interactions of nodes through the connections, which is not visible at the node level. The global behavior is emergent, since the characteristics of the entire network supersede the characteristics of individual nodes, resulting in a powerful tool.

As a consequence, it is common for networks to be used for modeling in many areas, including computer science.

Artificial Neural Network

An artificial neuron mimics biological neurons by performing computations, and an Artificial Neural Network (ANN) treats the nodes of a network as artificial neurons.

The natural neurons in the brain use synapses on the dendrites of the neuron to receive signals. If the signal exceeds a threshold, the neuron is activated and sends a signal to another synapse using the axon.

The artificial neurons have inputs and weights. A mathematical function determines whether the neuron will be activated, while a second function computes the output. The ANNs use the artificial neurons for information processing.

The weight of the neuron determines the strength of the output, which can be amended accordingly in order to get the desired outcome.
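A single artificial neuron of the kind described above can be sketched in a few lines. The weighted sum, the hard threshold and the choice to pass the activation through as the output are illustrative assumptions; practical networks use differentiable activation functions such as the ReLU mentioned in Chapter 3.

```python
def neuron(inputs, weights, bias=0.0, threshold=0.0):
    """Weighted sum of inputs; the neuron fires only above the threshold."""
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    fired = activation > threshold          # first function: is the neuron activated?
    return activation if fired else 0.0     # second function: the output value


# Changing a weight changes the strength of the output.
print(neuron([1.0, 2.0], [0.5, -0.1]))   # weak positive signal, neuron fires
print(neuron([1.0, 2.0], [0.5, -0.5]))   # negative sum, neuron stays silent
```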

Since the inception of ANNs (McCulloch and Pitts, 1943), they have evolved into new models of learning, like the back-propagation algorithm (Rumelhart and McClelland, 1986).

Supervised Machine Learning

Supervised machine learning models need an adequate number of outcomes, such as audit results, in order to perform the learning process. If the number of audits performed is inadequate, deep neural network models cannot be used.

In supervised learning, the machine learns the mapping function (f) after it has been provided with the input (X) and output (Y) variables. The machine is the student who is given both the question and the answer by the teacher (supervisor).

A model has been trained successfully if it can accurately predict the output (Y) using the mapping function (f) on unseen data (X):

$Y = f(X)$

Supervised learning problems fall into:

• Classification Problem: The output variable (Y) is a category, such as compliant taxpayer or non-compliant taxpayer.

• Regression Problem: The output variable is a real value, such as the amount of tax yield from tax audit.
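The two problem types are related: a real-valued audit yield (regression target) can be turned into a binary class label by comparing it with a materiality threshold. The sketch below assumes a hypothetical threshold of 5,000 EUR purely for illustration; the threshold actually used by the tax department is not disclosed.

```python
def to_class_label(audit_yield_eur, materiality_threshold_eur):
    """Return 1 if the audit yield is over the materiality threshold, else 0."""
    return 1 if audit_yield_eur > materiality_threshold_eur else 0


# Hypothetical yields and threshold, for illustration only.
yields = [0.0, 1200.0, 48000.0]
labels = [to_class_label(y, 5000.0) for y in yields]
print(labels)
```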

Unsupervised Machine Learning

Unsupervised learning means that the algorithm is provided only with input variables (X), without the corresponding output variables (Y).

The goal of the algorithm is to learn the underlying structure or distribution in the data. There are no correct answers and no teacher; it is a self-learning process.

Unsupervised Learning Problems:

• Clustering Problems: Find the groups (clusters) in the data, such as grouping taxpayers by compliance behavior.

• Association Problems: Find the rules which describe large parts of the data, such as whether taxpayers that do not file tax returns on time also tend to stop filing returns.

Semi Supervised Learning

A problem in which a plethora of input variables (X) is available but only a few output variables (Y) are labeled is called semi supervised. Only a few of the taxpayers have been audited (Y), while data for all taxpayers are available at no extra cost.

Semi supervised problems are very common because unlabeled input variables (X) are freely available, while labeling output variables (Y) is very time consuming and costly.

The unsupervised learning algorithm discovers the structure of the input variables (X), and the predictions it makes about the unlabeled data are used as input for the supervised learning algorithm. The semi supervised model then makes predictions on new unseen data.

Chapter 2 Approach

A semi supervised model uses the unlabeled data to extract latent features and pairs these with labels to learn an associated classifier.

A hidden (Latent) feature discriminative model (Model 1):

The model provides an embedding or feature representation of the data of all taxpayers. The features are then used to train a separate classifier. The information acquired allows for the clustering of related features in a hidden space.

A deep generative model of the data of both audited and not audited taxpayers provides a more robust set of hidden (latent) features. The generative model used is:

$p(z) = \mathcal{N}(z \mid 0, I); \quad p_\theta(x \mid z) = f(x; z, \theta), \quad (1)$

where $f(x; z, \theta)$ is a Gaussian distribution whose probabilities are formed by a non-linear function (a deep neural network), with parameters $\theta$, of a set of hidden (latent) variables z.
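The generative process in (1) can be read as a two-step sampling recipe: draw z from the standard normal prior, then draw x from a Gaussian whose mean is a non-linear function of z. The sketch below uses a single tanh layer as a stand-in for the deep network and a fixed observation noise `noise_std`; the weights, layer sizes and function name are illustrative assumptions, not the architecture used in this thesis.

```python
import math
import random


def sample_x(weights, bias, noise_std=0.1, rng=random):
    """Ancestral sampling: z ~ N(0, I), then x ~ N(f(z), noise_std^2 I).

    weights: one row per observed dimension, one column per latent dimension.
    """
    latent_dim = len(weights[0])
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]   # prior p(z)
    mean = [                                                # non-linear map of z
        math.tanh(sum(w * zi for w, zi in zip(row, z)) + b)
        for row, b in zip(weights, bias)
    ]
    x = [rng.gauss(m, noise_std) for m in mean]             # likelihood p(x | z)
    return z, x


# Toy decoder: 2 latent dimensions, 3 observed features.
W = [[0.5, -0.3], [0.1, 0.8], [1.0, 0.0]]
b = [0.0, 0.0, 0.0]
z, x = sample_x(W, b, rng=random.Random(0))
```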

Approximate samples from the posterior distribution over the hidden (latent) variables $p(z \mid x)$ (the probability distribution that represents the updated beliefs about the hidden variables after the model has seen the data) are used as features to train a classifier, such as a Support Vector Machine (SVM), that predicts whether a material audit yield will result if a taxpayer is audited (y). This approach enables the classification to be performed in a lower dimensional space, since we typically use hidden (latent) variables whose dimensionality is much less than that of the observations.

These low dimensional embeddings should now also be more easily separable, since we make use of independent hidden (latent) Gaussian posteriors whose parameters are formed by a sequence of non-linear transformations of the data.

Generative semi-supervised model (Model 2):

A probabilistic model describes the data as being generated by a hidden (latent) class variable y in addition to a continuous hidden (latent) variable z. The model used is:

$p(y) = \mathrm{Cat}(y \mid \pi); \quad p(z) = \mathcal{N}(z \mid 0, I); \quad p_\theta(x \mid y, z) = f(x; y, z, \theta), \quad (2)$

where $\mathrm{Cat}(y \mid \pi)$ is the multinomial distribution; the class labels y are treated as hidden (latent) variables if no class label is available, and z are additional hidden (latent) variables. These hidden (latent) variables are marginally independent.

As in Model 1, $f(x; y, z, \theta)$ is a Gaussian distribution, parameterized by a non-linear function (a deep neural network) of the hidden (latent) variables.

Since most labels y are unobserved, we integrate over the class of any unlabeled data during the inference process, thus performing classification as inference (deriving logical conclusions from premises known or assumed to be true). The inferred posterior distribution is used to obtain labels for any missing labels.

Stacked generative semi-supervised model: The two models can be stacked together. Model 1 first learns the new hidden (latent) representation $z_1$ using the generative model, and afterwards the generative semi-supervised Model 2 is learned using $z_1$ instead of the raw data (x).

The outcome is a deep generative model with two layers:

$p_\theta(x, y, z_1, z_2) = p(y)\, p(z_2)\, p_\theta(z_1 \mid y, z_2)\, p_\theta(x \mid z_1)$

where the priors $p(y)$ and $p(z_2)$ equal those of y and z above, and both $p_\theta(z_1 \mid y, z_2)$ and $p_\theta(x \mid z_1)$ are parameterized as deep neural networks.

The computation of the exact posterior distribution is intractable because of the nonlinear, non-conjugate dependencies between the random variables. To allow for tractable and scalable inference and parameter learning, the recent advances in variational inference (Kingma and Welling, 2014; Rezende et al., 2014) are utilized. A fixed-form distribution $q_\phi(z \mid x)$ with parameters $\phi$ is introduced that approximates the true posterior distribution $p(z \mid x)$.

The variational principle is used to derive a lower bound on the marginal likelihood of the model. Maximizing this variational bound ensures that the approximate posterior is as close as possible to the true posterior. The approximate posterior distribution $q_\phi(\cdot)$ is constructed as an inference or recognition model (Dayan, 2000; Kingma and Welling, 2014; Rezende et al., 2014; Stuhlmuller et al., 2013).

With the use of an inference network, a single set of global variational parameters $\phi$ is computed, rather than separate variational parameters for each data point. This allows for fast inference at both training and testing time, because the cost of inference is shared across the posterior estimates for all hidden (latent) variables through the parameters of the inference network. An inference network is introduced for each set of hidden (latent) variables, and each is parameterized as a deep neural network. Their outputs construct the parameters of the distribution $q_\phi(\cdot)$.

For the latent-feature discriminative model (Model 1), we use a Gaussian inference network $q_\phi(z \mid x)$ for the hidden (latent) variable z. For the generative semi-supervised model (Model 2), an inference model is introduced for the hidden (latent) variables z and y, which is assumed to have the factorized form $q_\phi(z, y \mid x) = q_\phi(z \mid x)\, q_\phi(y \mid x)$, specified as Gaussian and multinomial distributions respectively.

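Model 1: $q_\phi(z \mid x) = \mathcal{N}(z \mid \mu_\phi(x), \mathrm{diag}(\sigma_\phi^2(x))), \quad (3)$

Model 2: $q_\phi(z \mid y, x) = \mathcal{N}(z \mid \mu_\phi(y, x), \mathrm{diag}(\sigma_\phi^2(x))); \quad q_\phi(y \mid x) = \mathrm{Cat}(y \mid \pi_\phi(x)), \quad (4)$

where $\sigma_\phi(x)$ is a vector of standard deviations, $\pi_\phi(x)$ is a probability vector, and the functions $\mu_\phi(x)$, $\sigma_\phi(x)$ and $\pi_\phi(x)$ are represented as MLPs.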
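Given the outputs $\mu_\phi(x)$ and $\sigma_\phi(x)$ of the inference network in (3), two quantities are needed in practice: a sample of z and the divergence of the approximate posterior from the prior $\mathcal{N}(0, I)$. The sketch below uses the reparameterisation of Kingma and Welling (2014) for the sample and the standard closed-form Gaussian KL divergence; the function names and the toy values of `mu` and `sigma` are illustrative assumptions.

```python
import math
import random


def sample_z(mu, sigma, rng=random):
    """Reparameterised draw: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        m * m + s * s - 2.0 * math.log(s) - 1.0 for m, s in zip(mu, sigma)
    )


# Toy inference-network outputs for a 2-dimensional latent space.
mu, sigma = [0.3, -1.2], [0.9, 0.5]
z = sample_z(mu, sigma, rng=random.Random(0))
kl = kl_to_standard_normal(mu, sigma)
```

Writing the sample as a deterministic function of `mu`, `sigma` and independent noise is what lets gradients flow back into the inference network during training.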

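Generative Semi-supervised Model Objective

When the label corresponding to a data point is observed, the variational bound is:

$\log p_\theta(x, y) \geq \mathbb{E}_{q_\phi(z \mid x, y)}\left[\log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(z \mid x, y)\right] = -\mathcal{L}(x, y), \quad (5)$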

The objective function is minimized using AdaGrad, a gradient-descent based optimization algorithm. It automatically tunes the learning rate based on its observations of the data's geometry. AdaGrad is designed to perform well with datasets that have infrequently occurring features.
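The AdaGrad update can be sketched as follows: each parameter keeps a running sum of its squared gradients and its step is divided by the square root of that sum, so rarely updated parameters keep a larger effective learning rate. The default `lr=0.009` echoes the learning rate reported in Chapter 3; `eps` and the toy objective are assumptions for illustration, and this is not the TensorFlow implementation used for the experiments.

```python
import math


def adagrad_step(params, grads, accum, lr=0.009, eps=1e-8):
    """One in-place AdaGrad update with a per-parameter adaptive learning rate."""
    for i, g in enumerate(grads):
        accum[i] += g * g                                   # running sum of squared gradients
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)   # scaled gradient step
    return params, accum


# Toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta, accum = [1.0], [0.0]
for _ in range(100):
    grad = [2.0 * theta[0]]
    adagrad_step(theta, grad, accum, lr=0.5)
print(theta[0])
```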

Chapter 3 Evaluation

The model was used to analyze taxpayer data from the Cyprus Tax Department database in order to identify taxpayers who would yield material additional tax if a VAT audit were performed. The Deep Generative Models for Semi-supervised Learning approach is a solution that enables increased efficiency in the audit selection process. Its input includes both audited (supervised) and not audited (unsupervised) taxpayer data. Its output is a collection of labels, each of which corresponds to a taxpayer and takes one of two possible values (binary): good (1) or bad (0). If the taxpayer is expected to yield a material tax after audit, it is classified as good (1).

Nearly all the VAT returns of the last few years were processed in order to generate the features to be used by the model. These were selected based on the advice of experienced field auditors and data analysts, and on rules from rule based models. Some of the selected fields were further processed to generate extra fields. The features selected broadly relate to business characteristics, like the location and type of the business, and to features from its tax returns. For data preparation, the data was cleaned; for example, we removed taxpayers with little or no tax history, mainly new businesses.

The details of the criteria used to select the features, the feature processing, the newly generated features, the number of features and the cleansing process cannot be disclosed due to the confidential nature of the audit selection. Publication of the features could also compromise future audit selection, as well as being unlawful. For modelling, taxpayer data from the tax department registry (such as economic activity) and from the tax returns (X), together with actual audit results (Y), appear as pairs:

$(X, Y) = \{(x_1, y_1), \ldots, (x_N, y_N)\}$

with the i-th observation $x_i$ and the corresponding class label $y_i \in \{1, \ldots, L\}$ for the taxpayers audited.

For each observation we infer corresponding hidden (latent) variables, denoted by z. In semi-supervised classification, where both audited and not audited taxpayers are utilized, only a subset of all the taxpayers have corresponding class labels (audit results). The empirical distributions over the labelled (audited) and unlabeled (not audited) subsets are denoted $p(x, y)$ and $p(x)$ respectively.

The model was built with TensorFlow, an open source software library for high performance numerical computation, used from the Python programming language. The hardware used is a custom-built machine of the Cyprus Tax Department with an NVIDIA 10 series Graphics Processing Unit. The performance was measured using k-fold cross validation on the training data.
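The splitting step of k-fold cross validation can be sketched as below. The value of k, the absence of shuffling and the function name `k_fold_indices` are assumptions made for illustration; the thesis does not state the number of folds used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]                    # held-out fold
        train = indices[:start] + indices[start + size:]     # remaining folds
        yield train, val
        start += size


# Each sample is used for validation exactly once across the k folds.
for train_idx, val_idx in k_fold_indices(10, 3):
    print(len(train_idx), val_idx)
```

The model is trained k times, each time on the `train_idx` samples and scored on the `val_idx` samples, and the k scores are averaged.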

The model was trained on actual tax audit data collected from prior years (supervised) and on actual data of not audited taxpayers (unsupervised). The amount over which an audit yield is classified as material was set following internal guidelines. The same model was used for large, medium and small taxpayers, irrespective of the economic activity classification (NACE code). The predictions made by the model were compared to the actual audit findings and reached an accuracy of 78.4%. The results compare favorably to peer results obtained with Data Mining Based Tax Audit Selection, with a reported accuracy of 51% (Hsu et al., 2015).

3.1 The confusion matrix

The confusion matrix in Table 1 represents the classification of the model on the training data set. Columns correspond to predictions and rows to actual audit outcomes. The top-left element indicates correctly predicted good cases, and the top-right element indicates the tax audits lost (i.e. cases predicted as bad turning out to be good). The bottom-left element indicates tax audits incorrectly predicted as good, and the bottom-right element indicates correctly predicted bad tax audits. The confusion matrix indicates that the model is balanced. The actual numbers are not disclosed for confidentiality reasons; instead, they are presented as percentages.
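The usual summary metrics can be derived from the four percentages reported in Table 1. This is a minimal sketch; the variable names are illustrative, and the accuracy computed here need not coincide with the 78.4% quoted in the text, since the table cells are rounded (they sum to 99%) and the text does not state that both figures come from the same evaluation split.

```python
# Percentages from Table 1.
tp, fn = 39.0, 11.0   # actually good: predicted good / predicted bad
fp, tn = 11.0, 38.0   # actually bad:  predicted good / predicted bad

total = tp + fn + fp + tn
accuracy = (tp + tn) / total    # share of all cases classified correctly
precision = tp / (tp + fp)      # share of predicted-good cases that were good
recall = tp / (tp + fn)         # share of good cases that were found

print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f}")
```

Precision and recall being equal here reflects the balance noted above: the model loses as many good audits as it wrongly selects bad ones.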

Table 1

| | Predicted as good | Predicted as bad |
|---|---|---|
| Actually good | 39% | 11% |
| Actually bad | 11% | 38% |

3.2 The hyperparameters of the highest accuracy model

The hyperparameters resulting in the highest accuracy are disclosed in Table 2. There is a direct relation between the number of epochs and accuracy: the highest number of epochs results in the highest accuracy. Accuracy, however, is not the only indicator; it must always be assessed together with the confusion matrix. The batch size is 10,000 and the activation function is ReLU.

Table 2

| Latent Dimension | Intermediate Dimension | Epochs | Dropout Rate | Learning Rate | Accuracy |
|---|---|---|---|---|---|
| 200 | 400 | 500 | 0.01 | 0.009 | 78.4% |

3.3 All the hyperparameters experimented

The results of the thirty (30) experiments that were run with different hyperparameters (e.g. Latent Dimension, Epochs), excluding the best performing model, are disclosed in Table 3. One interesting aspect is that when the Dropout and Learning rate were set at 0.009 the accuracy was close to 78% with a minimal number of Epochs, namely two hundred (200) instead of the five hundred (500) of the highest performing settings. This observation translates to a reduction of the time required to run the model by more than 50%. For all the experiments the batch size is 10,000 and the activation function is ReLU.

Table 3

| Latent Dimension | Intermediate Dimension | Epochs | Dropout Rate | Learning Rate | Accuracy |
|---|---|---|---|---|---|
| 200 | 400 | 150 | 0.01 | 0.008 | 77.4% |
| 200 | 400 | 100 | 0.008 | 0.008 | 75.1% |
| 200 | 400 | 100 | 0.005 | 0.008 | 74% |
| 200 | 400 | 100 | 0.01 | 0.008 | 75.2% |
| 200 | 400 | 100 | 0.01 | 0.006 | 74.6% |
| 200 | 400 | 100 | 0.01 | 0.0075 | 75% |
| 200 | 400 | 100 | 0.01 | 0.007 | 74% |
| 200 | 400 | 100 | 0.01 | 0.009 | 77.1% |
| 200 | 400 | 100 | 0.01 | 0.02 | 67.1% |
| 200 | 400 | 100 | 0.01 | 0.01 | 75.4% |
| 200 | 400 | 100 | 0.01 | 0.009 | 76.7% |
| 100 | 400 | 100 | 0.01 | 0.009 | 71.3% |
| 300 | 400 | 100 | 0.01 | 0.009 | 77% |
| 400 | 400 | 100 | 0.01 | 0.009 | 72.3% |
| 200 | 800 | 100 | 0.01 | 0.009 | 70.7% |
| 200 | 600 | 100 | 0.01 | 0.009 | 71.4% |
| 200 | 450 | 100 | 0.01 | 0.009 | 69.6% |
| 200 | 450 | 100 | 0.05 | 0.009 | 70.6% |
| 200 | 400 | 100 | 0.05 | 0.009 | 74% |
| 200 | 400 | 200 | 0.01 | 0.009 | 78% |
| 200 | 400 | 300 | 0.01 | 0.009 | 77.4% |
| 200 | 400 | 100 | 0.01 | 0.009 | 74.7% |
| 200 | 400 | 100 | 0.009 | 0.009 | 77.8% |
| 200 | 400 | 100 | 0.008 | 0.009 | 75% |
| 200 | 400 | 100 | 0.008 | 0.008 | 76% |
| 200 | 400 | 250 | 0.009 | 0.009 | 77.8% |
| 100 | 800 | 100 | 0.009 | 0.009 | 77.5% |
| 100 | 800 | 200 | 0.009 | 0.009 | 77.1% |
| 100 | 1000 | 200 | 0.009 | 0.009 | 69.8% |
| 500 | 1000 | 100 | 0.009 | 0.009 | 73.2% |

Chapter 4 Conclusions and future work

4.1 Project Conclusions

Inaccurate selection of cases not only results in a waste of scarce and valuable resources, but also has an opportunity cost: the lost revenue.

The labor intensive rule based model with hundreds of complicated rules has been successful and relied upon to select high yield tax audit cases.

A successful tax risk model has the following characteristics:

• Fast and easy development, to address the risks as they arise

• High accuracy to generate maximum yield from audits

The semi-supervised model developed has both characteristics. There is no need for experts to create hundreds of detailed rules; instead, they can perform accurately selected field audits (78.4% accuracy), generating significant revenue.

For the first time, a semi-supervised learning model that improves prediction accuracy by using generative models to exploit the information in the data density has been used to estimate tax evasion risk.

The greatest achievement of the project was to stimulate interest in developing even more powerful semi-supervised classification methods based on generative models for the Cyprus Tax Department.

4.2 Future work

This thesis examined only the obligation of the taxpayer to file complete and accurate information. The obligations to file and pay on time will be addressed in the future. A return filing model would predict which taxpayers will miss the next filing deadline, while a payment model would predict which taxpayers will miss the next payment deadline. The VAT GAP, being the sum of the tax lost due to taxpayers' failure to meet their obligations (non-compliance), can be reduced through the increase in taxpayer compliance due to the use of the new models.

References

1 Bar-Yam, Y. (1997). Dynamics of Complex Systems. Perseus Books.

2 Blum, A., Lafferty, J., Rwebangira, M. R., and Reddy, R. (2004). Semi-supervised learning using randomized mincuts. In Proceedings of the International Conference on Machine Learning (ICML).

3 Cox, R. T. (1946). Probability, Frequency, and Reasonable Expectation. American Journal of Physics, 14: 1–10. doi:10.1119/1.1990764.

4 Dayan, P. (2000). Helmholtz machines and wake-sleep learning. Handbook of Brain Theory and Neural Network. MIT Press, Cambridge, MA, 44(0).

5 Fergus, R., Weiss, Y., and Torralba, A. (2009). Semi-supervised learning in gigantic image collections. In Advances in Neural Information Processing Systems (NIPS).

6 Jaynes, E. T. (1986). Bayesian Methods: General Background.

7 Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the International Conference on Machine Learning (ICML), volume 99, pages 200-209.

8 Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).

9 Hsu, K.-W., Pathak, N., Srivastava, J., Tschida, G., and Bjorklund, E. (2015). Data Mining Based Tax Audit Selection: A Case Study of a Pilot Project at the Minnesota Department of Revenue (Table 7 valuation of results in success rate, page 241).

10 Liang, P. (2005). Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology.

11 Liu, Y. and Kirchhoff, K. (2013). Graph-based semi-supervised learning for phone and segment classification. In Proceedings of Interspeech.

12 McCulloch, W. S. and Pitts, W. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. The Bulletin of Mathematical Biophysics, 5, 115-133.

13 OECD (2004). Compliance Risk Management: Managing and Improving Tax Compliance.

14 Pitelis, N., Russell, C., and Agapito, L. (2014). Semi-supervised learning using an unsupervised atlas. In Proceedings of the European Conference on Machine Learning (ECML), volume LNCS 8725, pages 565-580.

15 Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 792-799.

16 Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning (ICML), volume 32 of JMLR W&CP.

17 Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011). The manifold tangent classifier. In Advances in Neural Information Processing Systems (NIPS), pages 2294-2302.

18 Rosenberg, C., Hebert, M., and Schneiderman, H. (2005). Semi supervised self-training of object detection models. In Proceedings of the Seventh IEEE Workshops on Application of Computer Vision (WACV/MOTION 05).

19 Shi, M. and Zhang, B. (2011). Semi-supervised learning improves gene expression-based prediction of cancer recurrence. Bioinformatics, 27(21):3017-3023.

20 Stuhlmuller, A., Taylor, J., and Goodman, N. (2013). Learning stochastic inverses. In Advances in Neural Information Processing Systems, pages 3048-3056.

21 Weston, J., Ratle, F., Mobahi, H., and Collobert, R. (2012). Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639-655. Springer.

22 Zhu, X., Ghahramani, Z., Lafferty, J., et al. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the International Conference on Machine Learning (ICML), volume 3, pages 912-919.