Data mining of corporate financial fraud based on neural network model

In an active market economy, more and more companies are being listed. Because listed companies face many competing interests, some that are poorly managed or that want to inflate their apparent value falsify their financial reports by improper means. To identify false financial reports as accurately as possible, this paper briefly introduces the relevant indicators for judging whether the financial reports of listed companies are fraudulent and a recognition model for financial reports based on the back propagation (BP) neural network. The selection of the input indicators was then improved. The improved BP neural network was simulated and analyzed in MATLAB and compared with the traditional BP neural network and a support vector machine (SVM). The results showed that net profit margin of total assets, earnings per share, cash reinvestment rate, gross operating profit and pre-tax profit-to-debt ratio were the five most important of the 20 judgment indicators. In identifying the testing samples of financial reports, the accuracy, precision, recall rate and F value all showed that the improved BP neural network performed better than the traditional BP neural network and the SVM.


Introduction
Since the reform and opening up, China's market economy has become increasingly active, its growth rate has risen steadily, and more and more companies have been listed on the stock market [1]. For listed companies, regular financial reports [2] help managers, investors and creditors make reasonable judgments. However, no listed company operates smoothly all the time, and the financial reports it issues will inevitably reveal some problems. When a company tries to gain unfair benefits or to cover up poor operating performance, it may falsify its financial reports [3]. On the one hand, false financial reports mislead investors and creditors into wrong judgments and cause direct economic losses; on the other hand, the market risk predictions that the government's market supervision department makes on the basis of false financial reports are wrong, which makes it difficult to avoid the risks brought by corporate violations and undermines market fairness [4]. Therefore, effective identification of false financial reports helps maintain market stability and reduces the economic risks borne by investors and creditors.
Fanning et al. [5] designed a false financial report recognizer using a generalized adaptive neural network structure and an adaptive logic network method. The simulation results showed that the method could distinguish false from real financial reports and that its accuracy was higher than that of Bell's cascade logic method. Kanapickienė et al. [6] identified fraudulent financial reports based on financial ratio analysis; they applied a logistic regression model to the financial ratios of 40 sets of fraudulent financial reports and 125 sets of real financial reports and found that the method could effectively identify the fraudulent ones. Lin et al. [7] classified different fraud identification factors using expert questionnaires and data mining technology and ranked their importance. The data mining techniques included logistic regression, decision trees and artificial neural networks, among which the artificial neural network achieved higher classification accuracy than logistic regression and decision trees.
This paper briefly introduces the relevant indicators used for identifying fraudulent financial reports of listed companies and a recognition model for financial reports based on the back propagation (BP) neural network, and then carries out a simulation analysis in MATLAB on the regular and irregular financial reports in the CSMAR database.

Relevant analysis of corporate financial fraud
For a listed company, the annual financial reports are intuitive data reflecting the operating conditions of the company in different periods of a year. The main contents of the annual financial reports are the balance sheet, the cash flow statement and the profit statement. In addition to these measurable data statements, the reports also include non-data information that cannot be measured directly, such as the notes to the statements and integrity guarantees that affect operations [8]. As the financial statements reflect a company's operating conditions, they need to be authentic and provide an accurate reference for investors.
However, in actual operation, some companies make false statements in their financial reports for economic purposes, including meeting financing conditions, maliciously manipulating share prices, avoiding delisting penalties and dressing up performance. The objective reasons include unreasonable company rules and regulations and an imperfect securities market. The means of financial fraud usually include forging economic transaction vouchers, disguising irregular transactions between related parties and malicious use of debt restructuring. Fraudulent financial reports of listed companies seriously damage the operation of the market economy, so the government's regulatory authorities attach great importance to testing the authenticity of financial reports.
Many related indicators can be used to distinguish real from false financial reports. Generally speaking, the more indicators are used, the higher the recognition accuracy. However, this holds only if the indicators are independent of each other and each has a clear relationship with the object identified. In actual financial reports, the indicators are correlated with each other to different degrees, and the strength of the relationship between each indicator and the object identified also differs, so too many financial indicators will reduce the accuracy of identification [9]. As shown in Table 1, 20 indicators for identifying the authenticity of financial reports were selected after a correlation test of the indicators. $X_1$ ~ $X_2$ are non-financial indicators of corporate governance structure, which have a great impact on financial reports; $X_3$ ~ $X_4$ are long-term solvency indicators; $X_5$ ~ $X_6$ are company profitability indicators; $X_7$ ~ $X_8$ are shareholder profitability indicators; $X_9$ ~ $X_{12}$ are development ability indicators; $X_{13}$ is the risk level indicator; $X_{14}$ ~ $X_{17}$ are cash flow indicators; and $X_{18}$ ~ $X_{20}$ are short-term solvency indicators.

Fig. 1. BP neural network model
As shown in Fig. 1, the basic structure of the BP neural network [10] is divided into input layer i, hidden layer j and output layer k, where i, j and k denote the number of nodes in the corresponding layer. A BP neural network with a three-layer structure is used in this study. $X_1$ ~ $X_n$ are the input vectors, which in this study are the indicators for identifying the authenticity of financial reports; $d_1$ ~ $d_n$ are the output vectors, which in this study are the values used to determine the authenticity of financial reports. The network is trained with the error back-propagation algorithm, whose training principle [11] is as follows. First, the input and the predetermined output are set. Then the actual output is calculated forward layer by layer and compared with the predetermined output. When there is an error, the weights are adjusted backward through the network until the error between the actual output and the predetermined output falls within the specified range.
The basic training procedures of BP neural network are as follows.
1. The neural network is initialized, including the number of nodes in the input layer, hidden layer and output layer.
2. The training sample is input and propagated forward using the feed-forward formulas:
$$H_j = f\left(\sum_{i} \omega_{ij} X_i + a_j\right), \qquad d_k = f\left(\sum_{j} \omega_{jk} H_j + b_k\right) \tag{1}$$
where $H_j$ is the output of the j-th node of the hidden layer, $d_k$ is the k-th output of the output layer, $\omega_{ij}$ and $\omega_{jk}$ are the weights from the input layer to the hidden layer and from the hidden layer to the output layer, $a_j$ and $b_k$ are the bias terms of the hidden layer and output layer, and $f(\cdot)$ is the activation function. The sigmoid function was used as the activation function in this study.
3. Error calculation. The forward calculation above yields the result at the output layer. This result is compared with the preset expected result, and the error is calculated using the following formula [12]:
$$E = \frac{1}{2}\sum_{k=1}^{n}\left(y_k - t_k\right)^2 \tag{2}$$
where E is the error between the calculated output vector and the expected output vector, n is the number of nodes in the output layer, $y_k$ is the actual output of the k-th output node (that is, $d_k$ in equation (1)), and $t_k$ is the preset label, i.e. the expected output.
4. Reverse adjustment. The error is checked. If it is within the specified range, the result is output directly; if not, the weights and bias terms of the hidden layer and the output layer are adjusted in the reverse direction. The weight adjustment formulas, applied from the output layer back to the hidden layer, are:
$$\omega'_{jk} = \omega_{jk} + \eta\, e_k\, y_k(1-y_k)\, H_j, \qquad \omega'_{ij} = \omega_{ij} + \eta\, H_j(1-H_j)\, X_i \sum_{k=1}^{n} \omega_{jk}\, e_k\, y_k(1-y_k) \tag{3}$$
where $\eta$ is the learning rate, $\omega'_{ij}$ and $\omega'_{jk}$ are the weights after adjustment, and $e_k = t_k - y_k$ is the error between the k-th output node and its expected value. The adjustment formulas of the bias terms are:
$$b'_k = b_k + \eta\, e_k\, y_k(1-y_k), \qquad a'_j = a_j + \eta\, H_j(1-H_j) \sum_{k=1}^{n} \omega_{jk}\, e_k\, y_k(1-y_k) \tag{4}$$
where $a'_j$ and $b'_k$ are the bias terms after adjustment.
5. Stopping check. One pass from the forward calculation to the reverse adjustment of the weights and bias terms according to the error counts as one iteration. The iteration is repeated until the change in error between two adjacent iterations is smaller than the set threshold or the maximum number of iterations is reached.
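The five training steps above can be sketched compactly in Python (the study itself used MATLAB). This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes a sigmoid on both the hidden and output layers, a squared-error loss, sample-by-sample gradient-descent updates, weights drawn from (-1, 1) and biases starting at zero. The class name `TinyBP`, the function `train`, the stopping tolerance and the toy samples are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyBP:
    """Three-layer BP network: input -> hidden (sigmoid) -> output (sigmoid)."""

    def __init__(self, n_in, n_hidden, n_out, eta=0.1, seed=0):
        rng = random.Random(seed)
        # Step 1: initialization (random weights in (-1, 1), zero biases: assumptions).
        self.w_ih = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_in)]
        self.w_ho = [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_hidden)]
        self.a = [0.0] * n_hidden
        self.b = [0.0] * n_out
        self.eta = eta

    def forward(self, x):
        # Step 2: feed-forward calculation, layer by layer.
        h = [sigmoid(sum(x[i] * self.w_ih[i][j] for i in range(len(x))) + self.a[j])
             for j in range(len(self.a))]
        y = [sigmoid(sum(h[j] * self.w_ho[j][k] for j in range(len(h))) + self.b[k])
             for k in range(len(self.b))]
        return h, y

    def train_step(self, x, t):
        h, y = self.forward(x)
        # Step 3: squared error between the actual output y and the label t.
        error = 0.5 * sum((y[k] - t[k]) ** 2 for k in range(len(y)))
        # Step 4: reverse adjustment; each delta is the error times the sigmoid slope.
        delta_o = [(t[k] - y[k]) * y[k] * (1 - y[k]) for k in range(len(y))]
        delta_h = [h[j] * (1 - h[j]) * sum(self.w_ho[j][k] * delta_o[k] for k in range(len(y)))
                   for j in range(len(h))]
        for j in range(len(h)):
            for k in range(len(y)):
                self.w_ho[j][k] += self.eta * delta_o[k] * h[j]
        for k in range(len(y)):
            self.b[k] += self.eta * delta_o[k]
        for i in range(len(x)):
            for j in range(len(h)):
                self.w_ih[i][j] += self.eta * delta_h[j] * x[i]
        for j in range(len(h)):
            self.a[j] += self.eta * delta_h[j]
        return error

def train(net, samples, max_iter=2000, tol=1e-6):
    """Step 5: repeat until the error change between adjacent iterations
    drops below tol or max_iter is reached. Returns the error per iteration."""
    history = []
    for _ in range(max_iter):
        total = sum(net.train_step(x, t) for x, t in samples)
        if history and abs(history[-1] - total) < tol:
            history.append(total)
            break
        history.append(total)
    return history

# Toy samples (made up): two inputs, label 0 = real report, 1 = fake report.
toy_samples = [([0.0, 0.1], [0.0]), ([0.1, 0.0], [0.0]),
               ([0.9, 1.0], [1.0]), ([1.0, 0.9], [1.0])]
toy_net = TinyBP(n_in=2, n_hidden=3, n_out=1, eta=0.5, seed=1)
toy_history = train(toy_net, toy_samples)
print(f"iterations: {len(toy_history)}, first error: {toy_history[0]:.4f}, "
      f"last error: {toy_history[-1]:.4f}")
```

Updating after every sample keeps the sketch short; accumulating the adjustments over all samples before applying them would be an equally valid reading of the procedure.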

Fig. 2. The calculation process of the improved BP neural network
Although the traditional BP neural network described above can fit the underlying pattern reasonably well when trained to identify the authenticity of financial reports, it takes 20 identification variables as input, which is a relatively large number. A BP neural network can cope with many input indicators, but a large number of inputs still degrades its performance, mainly in calculation efficiency. In addition, the 20 identification variables are not equally related to the authenticity of a financial report. Weakly related variables contribute little to the training of the identification model and may even interfere with more effective variables, which can cause training to converge to a local optimum. Therefore, the traditional BP neural network is improved in this study: before training, the identification variables are ranked by importance, and the 10 most important indicators are selected as the input variables. The training process of the improved BP neural network is shown in Fig. 2.
1. The weights in the model are initialized, usually to 0, although other suitable values are also allowed in order to reduce the learning time.
2. The learning samples are input, including fake and non-fake financial reports.The identification variables needed by the model are relevant indexes in financial reports, and the actual judgement is the authenticity of the report.
3. As some indicators in the input data have different dimensions, they cannot be used directly for calculation. Therefore, the indicators must be standardized, and the related formula is:
$$X'_{ij} = \frac{X_{ij} - \bar{X}_i}{S_i} \tag{5}$$
where $X'_{ij}$ is the converted value of the j-th sample of the i-th indicator, $X_{ij}$ is the j-th sample of the i-th indicator, $\bar{X}_i$ is the mean value of all the samples of the i-th indicator, and $S_i$ is the standard deviation (mean square deviation) of the i-th indicator.
The authenticity of the report is expressed as 0 and 1, 0 as the true report and 1 as the false report.
4. The importance of each identification indicator after standardization is calculated using equation (6) [13], where N(i) is the distance of the i-th indicator between the true and false samples, $u_1(i)$ is the mean value of the i-th indicator in the false report sample, $u_0(i)$ is the mean value of the i-th indicator in the real report sample, $\sigma_1(i)$ is the standard deviation (mean square deviation) of the i-th indicator in the false report sample, and $\sigma_0(i)$ is the standard deviation of the i-th indicator in the real report sample. According to equation (6), the indicators are sorted in descending order of importance, and the first ten indicators are selected as the training input variables.
5. The input training samples are processed by forward calculation according to equation (1). Then the error is calculated according to equation (2), and the weights and bias terms are reversely adjusted according to equations (3) and (4).
6. The training is iterated until the error converges and becomes stable or the maximum number of iterations is reached.
Then the authenticity of a report is determined according to the output value of the trained network. A report is judged to be real if the value is smaller than 0.5; otherwise it is judged to be fake.
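The preprocessing steps of the improved model (standardization, importance ranking, selection of the top indicators) and the final 0.5 decision rule can be sketched in Python as follows. Because the importance formula is not legible in the text, the sketch assumes one common choice, the absolute difference of the class means divided by the sum of the class standard deviations; this form, the use of population standard deviations, the function names and the toy values are assumptions, not the paper's definition.

```python
import math

def mean_std(values):
    """Mean and population standard deviation of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std

def standardize(values):
    """z-score each value: subtract the indicator mean, divide by its standard deviation."""
    mean, std = mean_std(values)
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

def importance(values, labels):
    """ASSUMED distance between classes: |u1 - u0| / (sigma1 + sigma0).
    labels: 1 = false report, 0 = real report."""
    fake = [v for v, y in zip(values, labels) if y == 1]
    real = [v for v, y in zip(values, labels) if y == 0]
    u1, s1 = mean_std(fake)
    u0, s0 = mean_std(real)
    return abs(u1 - u0) / (s1 + s0) if (s1 + s0) > 0 else 0.0

def select_top(columns, labels, k=10):
    """Rank indicators by importance (descending) and keep the first k names."""
    scores = {name: importance(standardize(col), labels) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

def classify(output, threshold=0.5):
    """Decision rule from the text: below 0.5 -> real (0), otherwise fake (1)."""
    return 0 if output < threshold else 1

# Toy data (made up): three indicators over three real and three fake reports.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_columns = {
    "X1": [9, 7, 8, 8, 9, 7],                     # same mean in both classes
    "X6": [0.30, 0.32, 0.28, 0.05, 0.07, 0.03],   # clearly separated
    "X15": [0.5, 0.2, 0.4, 0.1, 0.3, 0.2],        # partly separated
}
top, toy_scores = select_top(toy_columns, toy_labels, k=2)
print(top, {name: round(s, 3) for name, s in toy_scores.items()})
```

With 20 real indicators, `select_top(columns, labels, k=10)` would return the ten names used as network inputs; any other separation measure could be swapped into `importance` without changing the rest of the pipeline.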

Experimental environment
The BP neural network model was implemented in MATLAB [14]. The experiment was carried out on a laboratory server running Windows 7 with an Intel Core i7 processor and 16 GB of memory.

Experimental data
The illegal and normal financial reports of listed companies between 2000 and 2010 were selected from the CSMAR database [15] as the training and testing samples. The training samples comprised 250 illegal financial reports randomly selected from these reports and 250 normal financial reports from the same years as the illegal ones, giving 500 training samples in total with a 1:1 ratio of normal to illegal reports. Then 500 illegal financial reports and 500 normal financial reports were selected from the remaining reports as the testing samples, with the years of the two kinds of reports again matched one to one. The selection criteria for fraudulent reports were fictitious profit, fictitious assets, fraudulent listing and delayed disclosure; a report with any of these violations was classed as fraudulent.
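The year-matched sampling described above can be sketched in Python as follows. This is only one plausible reading of the procedure: the record layout (dictionaries with `id` and `year` keys), the function name `matched_sample` and the toy records are hypothetical, and the actual CSMAR export format is not described in the paper.

```python
import random

def matched_sample(fraud, normal, n, seed=0):
    """Draw n fraud reports at random and, for each one, a distinct normal
    report from the same year, so that the two classes are balanced 1:1."""
    rng = random.Random(seed)
    chosen_fraud = rng.sample(fraud, n)

    # Group the normal reports by year and shuffle within each year.
    pool = {}
    for report in normal:
        pool.setdefault(report["year"], []).append(report)
    for reports in pool.values():
        rng.shuffle(reports)

    chosen_normal = []
    for report in chosen_fraud:
        candidates = pool.get(report["year"], [])
        if not candidates:
            raise ValueError(f"no normal report left for year {report['year']}")
        chosen_normal.append(candidates.pop())  # pop() prevents reuse
    return chosen_fraud, chosen_normal

# Toy records (made up) covering the years 2000 to 2010.
toy_fraud = [{"id": f"F{i}", "year": 2000 + i % 11} for i in range(22)]
toy_normal = [{"id": f"N{i}", "year": 2000 + i % 11} for i in range(44)]
sample_fraud, sample_normal = matched_sample(toy_fraud, toy_normal, n=10)
for f, m in zip(sample_fraud[:3], sample_normal[:3]):
    print(f["id"], f["year"], "<->", m["id"], m["year"])
```

The same function could be called a second time on the reports left over after the training draw to build the year-matched testing set.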

Experiment setup
The parameters of the traditional BP neural network were as follows. The number of nodes in the input layer was 20. The number of nodes in the output layer was 1. The number of nodes in the hidden layer was set to 6 after testing. The initial weights were generated randomly in (-1, 1). The learning rate was set to 0.1.
The parameters of the improved BP neural network were the same as those of the traditional BP neural network except for the number of nodes in the input layer, which was set to 10.
To better verify the performance of the improved BP neural network in recognizing fake reports, a support vector machine (SVM) was used for comparison. In the SVM, the penalty parameter was 10, the kernel function was the Gaussian function, and $\sigma^2$ of the kernel function was 2.
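To make the SVM kernel setting concrete, the Gaussian kernel with $\sigma^2 = 2$ can be written out in Python as below. The text does not state which parameterization was used, so the sketch assumes the common form $K(x, z) = \exp(-\lVert x - z\rVert^2 / (2\sigma^2))$; the constant and function names are hypothetical.

```python
import math

PENALTY_C = 10.0      # penalty parameter reported in the text
SIGMA_SQUARED = 2.0   # kernel parameter reported in the text

def gaussian_kernel(x, z, sigma_sq=SIGMA_SQUARED):
    """Gaussian (RBF) kernel, ASSUMING the form exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma_sq))

# Similarity of two standardized indicator vectors (toy values).
print(gaussian_kernel([0.0, 0.0], [0.0, 0.0]))
print(gaussian_kernel([0.0, 0.0], [2.0, 0.0]))
```

Under this assumed form, libraries that parameterize the kernel as $\exp(-\gamma \lVert x - z\rVert^2)$ would need $\gamma = 1/(2\sigma^2) = 0.25$; a different convention in the original software would give a different equivalent value.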

Criteria for judging the recognition effect of model
The recognition results of the model were compared with the actual results using a confusion matrix.
As shown in Table 2, TP is the number of reports that were actually real and were recognized as real, FN is the number of reports that were actually real and were recognized as fake, FP is the number of reports that were actually fake and were recognized as real, and TN is the number of reports that were actually fake and were recognized as fake. Accuracy, precision, recall rate and F value were used to measure the recognition effect of the model. Accuracy refers to the proportion of correctly classified reports, and its expression is:
$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{7}$$
Precision refers to the proportion of reports that were actually fake among the reports recognized as fake. With the labels of Table 2, in which fake reports recognized as fake are counted by TN and real reports recognized as fake by FN, its expression is:
$$P = \frac{TN}{TN + FN} \tag{8}$$
Recall rate is the proportion of reports recognized as fake among the reports that were actually fake, and its expression is:
$$R = \frac{TN}{TN + FP} \tag{9}$$
The F value comprehensively reflects the recognition effect of the model, and its expression is:
$$F = \frac{2PR}{P + R} \tag{10}$$
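The four measures can be computed from the confusion-matrix counts as in the Python sketch below. It follows the verbal definitions given above, in which precision and recall refer to fake reports, so with TN defined as fake reports recognized as fake they are expressed through TN rather than TP. The function name and the counts in the example are hypothetical.

```python
def recognition_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F value for the fake-report class.

    tp: actually real, recognized as real    fn: actually real, recognized as fake
    fp: actually fake, recognized as real    tn: actually fake, recognized as fake
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tn / (tn + fn)   # actually fake among those recognized as fake
    recall = tn / (tn + fp)      # recognized as fake among those actually fake
    f_value = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_value

# Made-up counts for 100 real and 100 fake reports.
acc, prec, rec, f_val = recognition_metrics(tp=90, fn=10, fp=20, tn=80)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} F={f_val:.4f}")
```

If fake reports were instead labelled as the positive class, the same values would be obtained from the usual TP-based forms of precision and recall.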

Fig. 3. Importance of indicators for judging whether a report is true or false
The importance ranking of the indicators used by the BP neural network to judge whether a report is true or false is shown in Fig. 3. Net profit margin of total assets, earnings per share, cash reinvestment rate, gross operating profit and pre-tax profit-to-debt ratio ranked in the top 5 of the 20 indicators, while council size, the first shareholder's shareholding ratio and long-term asset turnover ratio were of relatively low importance.

Fig. 4. Comparison of recognition effect of the model under different numbers of features
As shown in Fig. 4, the SVM achieved an accuracy of 86.63 %, a precision of 82.05 %, a recall rate of 87.67 % and an F value of 84.77 %; the traditional BP neural network achieved an accuracy of 90.70 %, a precision of 90.14 %, a recall rate of 87.67 % and an F value of 88.89 %; and the improved BP neural network achieved an accuracy of 92.44 %, a precision of 90.54 %, a recall rate of 91.78 % and an F value of 91.16 %. The improved BP neural network therefore had the highest accuracy in judging whether a financial report was fake, and the SVM had the lowest. In terms of precision, the improved BP neural network was slightly higher than the traditional BP neural network, and both were clearly higher than the SVM. In terms of recall rate, the SVM and the traditional BP neural network were identical at 87.67 %, while the improved BP neural network was about four percentage points higher than both. In terms of F value, the improved BP neural network was the highest and the SVM was the lowest. It was concluded that the improved BP neural network had the best recognition performance and the SVM had the worst in judging whether a financial report was fake.
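As a quick consistency check on the reported figures, the F value can be recomputed from the reported precision and recall of each model, assuming that the F value is the harmonic mean $2PR/(P+R)$. The short Python sketch below does this; the dictionary layout and function name are illustrative only, and the numbers are those quoted in the text.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)

# (precision %, recall %, reported F %) as quoted in the text.
reported = {
    "SVM": (82.05, 87.67, 84.77),
    "traditional BP": (90.14, 87.67, 88.89),
    "improved BP": (90.54, 91.78, 91.16),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: recomputed F = {f_value(p, r):.2f} %, reported F = {f:.2f} %")
```

For all three models the recomputed value agrees with the reported F value to two decimal places, which supports reading the F value as the harmonic mean of precision and recall.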

Conclusion
This paper briefly introduced the relevant indicators for identifying false financial reports of listed companies and a recognition model for financial reports based on the BP neural network. The selection of the input indicators was improved, and the improved BP neural network was then simulated and analyzed in MATLAB and compared with the traditional BP neural network and the SVM. The results are as follows. Net profit margin of total assets, earnings per share, cash reinvestment rate, gross operating profit and pre-tax profit-to-debt ratio ranked in the top 5 of the 20 judgment indicators. In judging whether a financial report was fake, the accuracy of the improved BP network was the highest and that of the SVM was the lowest; the precision of the improved BP network was close to that of the traditional BP network, and both were clearly higher than that of the SVM; the recall rates of the SVM and the traditional BP network were the same, and the recall rate of the improved BP network was clearly higher than that of the other two models; the F value of the improved BP network was the highest, and the F value of the SVM was the lowest.