Journal of Risk Analysis and Crisis Response, 2021, 11(4), 176-188
https://jracr.com/
ISSN Print: 2210-8491 ISSN Online: 2210-8505
DOI: https://doi.org/10.54560/jracr.v11i4.311

Article

Research on Enterprise Credit Risk Prediction Based on Text Information

Haonan Zhang 1,2, Hongmei Zhang 1,2,* and Mu Zhang 1

1 School of Big Data Application and Economics, Guizhou University of Finance and Economics, Guiyang (550025), Guizhou, China
2 Guizhou Institution for Technology Innovation & Entrepreneurship Investment, Guizhou University of Finance and Economics, Guiyang (550025), Guizhou, China
* Correspondence: zhm1035@qq.com; Tel.: +86-0851-88510575

Received: September 3, 2021; Accepted: December 26, 2021; Published: January 25, 2022

Abstract: This paper uses text data mining to extract the tone (intonation) of the annual reports of enterprises with and without credit risk, quantifies it, and studies the impact of annual report intonation on the effectiveness of credit risk prediction. In the empirical research, factor analysis is applied to a set of traditional financial variables, and the extracted components are used together with the intonation variable to predict credit risk through a logistic model. The results show that the tone of enterprises with credit risk is more negative, and that the degree of pessimism is significantly positively correlated with the probability of credit risk. A comparison of the ROC curves of the prediction results before and after adding the intonation variable shows that adding it to a credit risk prediction based on financial variables improves the effectiveness of the prediction.

Keywords: Credit Risk; Text Data Mining; Factor Analysis; Logistic Model; Text Intonation

1. Introduction

Credit is the foundation of financial development, and credit risk is a destabilizing factor capable of damaging the whole financial system.
Preventing and resolving credit risk is a necessary means of maintaining social stability and ensuring the healthy development of the economy. The rapid development of finance and the increasingly frequent financial exchanges among social actors have also created complex relationships of interest. Once credit risk materializes at one link in such a chain, the associated losses can be immeasurable. Therefore, scholars at home and abroad regard the prevention of credit risk as an important research topic.

Credit risk usually refers to the default caused by the reluctance or inability of the borrower, securities issuer, or transaction party to perform the contract [1]. Yang Lian and Shi Baofeng [2] introduced a cross-entropy loss function modified by focal loss into the credit risk evaluation model to predict the risk of several individual samples. The empirical results show that this method improves the identification of hard-to-classify samples. Wang Chongren and Han Dongmei [3] proposed combining Bayesian parameter optimization with the XGBoost algorithm for personal credit risk assessment in the Internet credit industry. The empirical results show that this method outperforms traditional prediction models such as the support vector machine. Luo Fangke and Chen Xiaohong [4] fitted a logistic model to the Internet finance personal microfinance data of commercial banks to screen out the factors that have a significant impact on credit risk. Because companies have greater influence than individual borrowers and investors, and the harm caused by their credit risk is more destructive, improving the prediction accuracy of corporate credit risk is also a hot issue in the field of risk management.
Zhang Tong and Chi Guotai [5] empirically analyzed the data of 2169 Chinese A-share listed companies from the perspective of credit characteristics, and concluded that the model based on feature division has higher discrimination accuracy. Compared with credit characteristics, more research on credit risk takes the perspective of the optimal combination of credit risk indicators. Zhou Ying and Su Xiaoting [6] found that different financial indicators have different effects on the prediction of long-term and short-term default status. Li Zhe and Chi Guotai [7] used the data of listed companies to screen, from 610 indicators, 31 indicators with a strong ability to distinguish default status. Li Meng and Wang Jin [8] investigated the impact of the level of enterprise internal control on debt default risk through traditional financial indicators. The results show that enterprises with high internal control quality tend to have lower debt default risk.

Previous studies mostly focused on the analysis of financial data. With the progress of computer technology and the rapid development of the Internet, more and more unstructured data are applied to the research of financial problems [9]. Structured data are data organized into field variables, whereas unstructured data, such as text, are not. For example, Wu Fei et al. [10] used crawler technology to collect the keywords related to "digital transformation" in enterprise annual reports in order to describe the intensity of enterprise digital transformation. Li Bin et al. [11] identified 29 important risk points in the insurance industry by mining 1682 financial report texts of listed insurance companies in the United States, and analyzed the trend of important risks in the insurance industry. Liang Kun and He Jun [12] believe that text information effectively alleviates information mismatch and significantly improves the predictive power of the credit evaluation model. Therefore, text big data can also be applied to the field of credit risk. Cecchini M, et al.
[13] extracted the effective information from the management discussion and analysis section of the annual report and integrated it with other financial data to improve the default prediction accuracy of the traditional prediction model. Liu Yishuang and Chen Yiyun [14] studied the relationship between text emotion and financial distress through the management tone in the company's annual report. Wang Xiaoyan et al. [15] constructed prior word frequencies of credit risk indicators by mining the text information in journal papers. The empirical results show that the classification performance of the credit risk model improves significantly after such prior word frequencies are used. Wang Z, et al. [16] believe that, in addition to traditional hard information, soft information can also enter the loan decision-making process, and that credit risk assessment improves significantly after semantic indicators are added. Zhang Yiwei and Gao Weihe [17] took the borrower's SMS data as the text mining object, analyzed the relationship between the use of "我" ("I") and "我们" ("we") and default, and found that cultural level moderated the role of these two words in credit risk prediction. Wang Shuxia et al. [18] identified the characteristics of the lender from the text description and used these characteristics to evaluate the credit risk of the loan. The empirical results show that text data can effectively replace traditional financial data, and that combining structured and unstructured data can improve the performance of the credit risk evaluation system.

It can be seen from the existing literature that research on the use of text information for credit risk prediction, at home and abroad, mostly focuses on individual investors, while research on corporate credit risk mainly relies on traditional structured data, even though public documents such as the company's annual report also contain a great deal of information.
Obtaining this kind of text information will help reduce the impact of information asymmetry and thereby improve the effectiveness of credit risk prediction. In view of this, this paper combines real default data, selects 25 listed companies with credit risk and 53 listed companies without credit risk from 2018 to 2020 as the total sample, extracts the tone from each company's annual report, quantifies it, and predicts the company's probability of credit risk in combination with traditional structured data. The research value of this paper is to expand the traditional credit risk identification indicators, show that adding intonation to the risk identification model is meaningful, provide new ideas for predicting credit risk, add to China's means of identifying credit risk, and enhance the monitoring of systemic financial risk.

2. Research Methods and Index Selection of Enterprise Credit Risk Identification

2.1. Logistic Model and Research Ideas

Logistic regression is a common machine learning method that is mainly used to classify samples and belongs to the family of generalized linear models. This model is often used in credit risk research, for example by Zhang Jie and Zhang Yuansheng [19], Bian Yuning et al. [20], and Liang Weisen and Wen Simei [21]. The reason is that the logistic model has the convenient properties that the value of the dependent variable lies between 0 and 1 and that it need not follow a normal distribution [22]. The expression of the logistic model is:

$$\ln\left(\frac{P}{1-P}\right) = \beta_0 + \sum_i \beta_i X_i \tag{1}$$

In this paper, an enterprise with credit risk is marked as 1. In formula (1), P represents the probability of credit risk, β0 is a constant term, Xi is an explanatory (independent) variable used to predict credit risk, and βi measures the influence of the corresponding explanatory variable on credit risk.
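Formula (1) gives the log-odds of credit risk; solving it for P makes explicit why the predicted value always lies between 0 and 1. The short derivation below is standard algebra that uses only the symbols of formula (1); z is introduced here purely as shorthand for the linear predictor and is not notation from the paper.

```latex
\begin{aligned}
z &= \beta_0 + \sum_i \beta_i X_i, \\
\ln\frac{P}{1-P} = z
\;&\Longrightarrow\; \frac{P}{1-P} = e^{z}
\;\Longrightarrow\; P = \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}} .
\end{aligned}
```

Because e^z is positive for every real z, P lies strictly between 0 and 1, and a one-unit increase in Xi multiplies the odds P/(1-P) by e^{βi}.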
The research idea of this paper is as follows: Firstly, the dimensionality of multiple financial indicators is reduced, and three main components are extracted by factor analysis method. Secondly, the logistic model is used to predict credit risk in two steps. In the first step, only three principal components are input to predict credit risk, and in the second step, three principal components and intonation variables are used as input data to predict credit risk. Finally, the ROC curve is used to compare the credit risk prediction effect of the model before and after adding intonation variables, and the BP neural network model is used to test the robustness of the empirical results. 2.2. Data Selection Since the financial status and annual report of the enterprise in the year of credit risk will not be known to investors in that year, the annual report and financial data of the year before the occurrence of credit risk are the main basis for investors to predict whether the enterprise has credit risk. When selecting the data of defaulting enterprises, the company's annual report and financial data of the year before the occurrence of credit risk are selected as risk identification indicators. When selecting the data of non-defaulting enterprises, the 2019 annual report and financial data are uniformly selected as risk identification indicators. The annual report data are from the public disclosure of listed companies on Shanghai Stock Exchange and Shenzhen Stock Exchange, and the financial data are from the RESSET financial research database. 2.3. 
Data Processing

Financial data processing: Referring to the selection of financial indicators in the credit risk identification systems constructed by Wang Qianhong and Zhang Min [22] and by Liu Xiangdong and Wang Weiqing [23], this paper divides the traditional financial indicators into 5 primary indicators and 12 secondary indicators, as shown in Table 1. Because some of the original financial data contain missing values, each missing value is filled with the mean of the corresponding indicator.

Table 1. Credit risk identification index.

Primary index            Secondary index                       Symbol
Debt service level       Quick ratio                           X1
                         Asset liability ratio                 X2
Profit level             Operating profit margin               X3
                         Net profit margin on sales            X4
                         Return on assets                      X5
Operational capability   Turnover rate of fixed assets         X6
                         Total asset turnover                  X7
                         Turnover rate of noncurrent assets    X8
Cash flow indicators     Cash content of operating income      X9
Growth index             Growth rate of net assets             X10
                         Growth rate of total assets           X11
                         Growth rate of main business income   X12
Unstructured indicators  Annual report intonation              TONE

Quantitative processing of text intonation: This paper uses the "dictionary model" to construct the text intonation of the annual report, and uses the HowNet dictionary [13] together with actual financial terms as the emotion dictionary. The dictionary is divided into a positive emotion dictionary and a negative emotion dictionary. To quantify the text intonation, the annual reports downloaded from the Shanghai Stock Exchange and the Shenzhen Stock Exchange are first converted from PDF to TXT format and then segmented into words with the Jieba word segmentation package in Python [24]. Stop words such as "的" and "了" are then removed, and word frequencies are counted against the emotion dictionary.
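The counting step just described, combined with the TONE ratio that formula (2) defines below, can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the three-word dictionaries reuse the example words quoted in the text and stand in for the much larger HowNet-based dictionaries, the input is assumed to be already segmented (in practice a segmenter such as Jieba would produce the token list), and returning None when no dictionary word occurs is an assumption, since the paper does not discuss that case.

```python
from collections import Counter

# Hypothetical miniature dictionaries built from the example words in the text;
# the paper's real dictionaries (HowNet plus financial terms) are far larger.
NEGATIVE_WORDS = {"怀疑", "难", "疑惑"}
POSITIVE_WORDS = {"奖励", "引领", "支持"}
STOP_WORDS = {"的", "了"}


def tone_score(tokens):
    """Return (NEG, POS, TONE) for one segmented annual report.

    TONE = NEG / (POS + NEG), following formula (2). TONE is None when no
    dictionary word occurs (an assumption made for this sketch).
    """
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    neg = sum(counts[w] for w in NEGATIVE_WORDS)
    pos = sum(counts[w] for w in POSITIVE_WORDS)
    total = pos + neg
    tone = neg / total if total else None
    return neg, pos, tone


# Assumed input: tokens already produced by a word segmenter.
example_tokens = ["公司", "的", "业务", "难", "支持", "了", "引领", "怀疑", "支持"]
neg, pos, tone = tone_score(example_tokens)
print(neg, pos, tone)
```

Keeping the dictionaries as sets makes membership tests cheap, and counting once with `Counter` avoids rescanning the report for every dictionary word.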
The statistical method is as follows: If words from the negative emotion dictionary appear in the annual report, such as "怀疑" (doubt), "难" (difficult) and "疑惑" (puzzlement), their occurrences are summed, and NEG denotes the total number of occurrences of negative words in an annual report. If words from the positive emotion dictionary appear in the annual report, such as "奖励" (reward), "引领" (lead) and "支持" (support), their occurrences are summed, and POS denotes the total number of occurrences of positive words in an annual report. Since negative intonation often has a greater impact on decision makers [14], this paper quantifies the text intonation with formula (2). TONE denotes the text intonation: the larger TONE is, the stronger the negative emotion revealed in the text; the smaller it is, the more positive the text intonation. To give readers an intuitive view of the positive and negative words in the annual reports, word clouds are generated from the high-frequency words of the two types, as shown in Figure 1 and Figure 2.

$$\mathrm{TONE} = \frac{\mathrm{NEG}}{\mathrm{POS} + \mathrm{NEG}} \tag{2}$$

Figure 1. Positive word cloud.

Figure 2. Negative word cloud.

3. Empirical Analysis

3.1. Descriptive Statistics and Inter Group Difference Test

This paper uses SPSS 21 to compute descriptive statistics and to run an independent-samples t-test on the means of the credit risk group and the non-credit risk group, in order to observe the characteristics of the two groups and whether they differ significantly [24]. An enterprise with credit risk is marked as 1 and an enterprise without credit risk is marked as 0. The descriptive statistics of the main variables and the results of the independent-samples t-test are shown in Table 2.
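The independent-samples t-test can be illustrated from summary statistics alone. The sketch below recomputes the t statistic for TONE from the group means and standard deviations reported in Table 2 and the group sizes given in the Introduction (53 enterprises without and 25 with credit risk). It assumes the pooled, equal-variance form of the test, which the paper does not state, and it computes only the statistic and its degrees of freedom, not the p value that SPSS reports.

```python
import math


def pooled_t(mean0, sd0, n0, mean1, sd1, n1):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    df = n0 + n1 - 2
    pooled_var = ((n0 - 1) * sd0 ** 2 + (n1 - 1) * sd1 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
    return (mean1 - mean0) / std_err, df


# TONE row of Table 2: group 0 = no credit risk, group 1 = credit risk.
t_stat, df = pooled_t(0.03484, 0.00662, 53, 0.03873, 0.00857, 25)
print(round(t_stat, 2), df)
```

With 76 degrees of freedom, a t statistic of about 2.2 corresponds to a two-sided p value of roughly 0.03, which is in line with the value reported for TONE in Table 2.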
It can be seen from Table 2 that the p values of four variables, X6 (turnover rate of fixed assets), X8 (turnover rate of noncurrent assets), X9 (cash content of operating income) and X10 (growth rate of net assets), are greater than 0.05; that is, the differences in these four indicators are not significant at the 5% level, and they do not adequately reflect the difference between the two types of samples. The other 9 variables, including TONE, passed the independent-samples t-test, which shows that they significantly reflect the differences between the groups. In addition, the descriptive statistics show that the mean, maximum and minimum values of TONE for enterprises with credit risk are all higher than those for enterprises without credit risk, which indicates that negative emotion is widespread in the annual reports of enterprises one year before credit risk occurs.

Table 2. Descriptive statistics of main variables and t-test results of independent samples.
Variable  Default  Minimum      Maximum     Mean        Std. dev.   p value
TONE      0        0.02263      0.04906     0.03484     0.00662     0.031**
          1        0.02733      0.06407     0.03873     0.00857
X1        0        0.27020      4.82480     1.38805     1.17206     0.007***
          1        0.08130      3.30150     0.68852     0.64901
X2        0        12.68890     88.53180    47.17486    21.25192    0.000***
          1        30.96460     175.83540   74.88301    26.74978
X3        0        -18.53640    36.24070    8.94938     10.60676    0.001***
          1        -440.15130   52.32540    -35.86637   92.32086
X4        0        -15.05380    32.30110    7.52735     8.59006     0.001***
          1        -502.99920   40.38250    -43.95637   108.17100
X5        0        -3.67370     16.37470    5.30242     4.43122     0.000***
          1        -109.20290   12.94880    -8.63132    26.38773
X6        0        1.16790      102.52400   8.30737     14.71136    0.762
          1        0.34160      94.96940    9.50070     19.00566
X7        0        0.06230      1.22330     0.60662     0.30374     0.006***
          1        0.05700      1.30740     0.39472     0.32921
X8        0        0.09130      4.86020     1.66720     1.09202     0.375
          1        0.11960      7.39710     1.37499     1.78710
X9        0        30.15580     132.85340   99.72378    18.74031    0.110
          1        65.39690     186.61100   108.65078   29.70843
X10       0        -11.74240    26.17940    5.85995     8.03406     0.093*
          1        -136.35060   257.63410   -10.39964   68.99186
X11       0        -15.97900    34.99220    6.18474     8.52548     0.009***
          1        -75.05330    46.43540    -3.97918    24.77621
X12       0        -59.68250    156.73350   9.72063     26.17668    0.022**
          1        -68.08130    63.29080    -6.46469    33.30012

Variable: explanatory variable. Default: enterprise defaults (0 = no, 1 = yes). *** indicates p < 0.01; ** p < 0.05; * p < 0.1.

3.2. Factor Analysis

Based on theoretical and practical considerations, a correlation test is conducted for the variables that passed the independent-samples t-test. Table 3 shows that TONE has no significant correlation with the financial variables, whereas the correlations among the financial variables are relatively significant, which indicates that these variables share some information and can partly explain one another. If all variables were entered into the logistic model directly, this overlap could lead to wrong conclusions. Therefore, this paper performs factor analysis on the 8 variables other than TONE. This paper uses SPSS 21 software to conduct factor analysis on the 8 financial variables.
Firstly, the data are standardized with the Z-score method to eliminate the influence of the different dimensions (units) of the sample data [25]. Table 4 shows that the KMO test value is 0.549, which is greater than the threshold of 0.5, and that the p value of Bartlett's sphericity test is 0.000. Thus the eight financial variables contain a good deal of shared information and are suitable for factor analysis.

Table 3. Correlation coefficient of each variable.

      TONE       X1         X2         X3         X4         X5         X7         X11        X12
TONE  1.000
X1    -0.070     1.000
X2    0.066      -0.633***  1.000
X3    -0.107     0.154*     -0.406***  1.000
X4    -0.047     0.079      -0.376***  0.971***   1.000
X5    -0.062     0.115      -0.637***  0.480***   0.537***   1.000
X7    0.140      0.103      -0.154*    0.173*     0.184*     -0.023     1.000
X11   -0.070     0.056      -0.385***  0.336***   0.320**    0.677***   -0.098     1.000
X12   -0.113     0.030      -0.166***  0.412***   0.377***   0.282**    0.166*     0.366***   1.000

*** indicates p < 0.01; ** p < 0.05; * p < 0.1.

Table 4. KMO and Bartlett Test.

Kaiser-Meyer-Olkin measure            0.549
Bartlett's sphericity test  χ2        436.809
                            df        28
                            p value   0.000

Table 5. Total variance explained.

            Initial eigenvalues                     Extraction sums of squared loadings
Component   Total   % of variance   Cumulative %   Total   % of variance   Cumulative %
1           3.404   42.550          42.550         3.404   42.550          42.550
2           1.366   17.075          59.626         1.366   17.075          59.626
3           1.242   15.527          75.153         1.242   15.527          75.153
4           0.833   10.413          85.566
5           0.670   8.381           93.947
6           0.326   4.080           98.027
7           0.140   1.750           99.778
8           0.018   0.222           100.000

Table 6. Component matrix.

Variable   Component 1   Component 2   Component 3
X1         0.340         -0.834        0.149
X2         -0.725        0.599         0.056
X3         0.834         0.267         0.282
X4         0.825         0.316         0.267
X5         0.803         -0.013        -0.394
X7         0.187         -0.072        0.773
X11        0.650         0.110         -0.548
X12        0.533         0.352         0.118

It can be seen from Table 5 that three principal components with eigenvalues greater than 1 are extracted by factor analysis, and that the three components explain 75.153% of the total variance.
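The retention rule used here (keep components whose eigenvalue exceeds 1) and the cumulative share of variance can be checked directly from the initial eigenvalues in Table 5. The sketch below is an arithmetic illustration rather than a factor analysis: it takes the published eigenvalues as given and relies on the fact that each of the eight standardized variables contributes one unit of variance, so the total variance equals the number of variables.

```python
# Initial eigenvalues from Table 5 (eight standardized financial variables),
# listed in descending order as in the table.
eigenvalues = [3.404, 1.366, 1.242, 0.833, 0.670, 0.326, 0.140, 0.018]

# Each standardized variable has variance 1, so total variance = number of variables.
n_variables = len(eigenvalues)

# Retention rule: keep components whose eigenvalue exceeds 1.
retained = [ev for ev in eigenvalues if ev > 1.0]

# Percentage of total variance explained by each component.
explained_pct = [100.0 * ev / n_variables for ev in eigenvalues]

# Cumulative percentage explained by the retained (leading) components.
cumulative_pct = sum(explained_pct[:len(retained)])

print(len(retained), round(cumulative_pct, 2))
```

The small gap between this figure and the 75.153% in Table 5 comes from the eigenvalues being rounded to three decimals in the table.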
Through factor analysis, the dimensionality of the original financial variables is reduced by nearly two-thirds (from eight variables to three components), indicating that the result of the factor analysis is good. Let the three extracted components be F1, F2 and F3 (see Table 6). The score coefficients of each variable on F1, F2 and F3 are shown in Table 7, and the following expressions are obtained from them.

$$F_1 = 0.100X_1 - 0.213X_2 + 0.245X_3 + 0.242X_4 + 0.236X_5 + 0.055X_7 + 0.191X_{11} + 0.157X_{12} \tag{3}$$

$$F_2 = -0.610X_1 + 0.439X_2 + 0.196X_3 + 0.231X_4 - 0.010X_5 - 0.052X_7 + 0.081X_{11} + 0.257X_{12} \tag{4}$$

$$F_3 = 0.120X_1 + 0.045X_2 + 0.227X_3 + 0.215X_4 - 0.317X_5 + 0.622X_7 - 0.441X_{11} + 0.095X_{12} \tag{5}$$

Table 7. Component score coefficient matrix.

Variable   Component 1   Component 2   Component 3
X1         0.100         -0.610        0.120
X2         -0.213        0.439         0.045
X3         0.245         0.196         0.227
X4         0.242         0.231         0.215
X5         0.236         -0.010        -0.317
X7         0.055         -0.052        0.622
X11        0.191         0.081         -0.441
X12        0.157         0.257         0.095

3.3. Logistic Regression

In the first logistic regression, only F1, F2 and F3 are used as input variables. The results are shown in Table 8, from which equation (6) is obtained.

$$\ln\left(\frac{P}{1-P}\right) = -2.460F_1 + 0.638F_2 - 0.709F_3 - 0.995 \tag{6}$$

Table 8. Logistic regression without TONE variable.

Explanatory variable   Coefficient   Standard error   Wald     df
F1                     -2.460***     0.683            12.963   1
F2                     0.638         0.533            1.434    1
F3                     -0.709        0.508            1.947    1
Constant               -0.995***     0.373            7.123    1

*** indicates p < 0.01; ** p < 0.05; * p < 0.1.

Table 9. Logistic regression with TONE variables.
Explanatory variable   Coefficient   Standard error   Wald     df
TONE                   0.774**       0.343            5.090    1
F1                     -2.628***     0.705            13.909   1
F2                     0.693         0.584            1.412    1
F3                     -1.015*       0.578            3.080    1
Constant               -1.108***     0.396            7.832    1

*** indicates p < 0.01; ** p < 0.05; * p < 0.1.

According to the logistic regression results with only the principal components F1, F2 and F3 (Table 8), only F1 is significant at the 1% level; its coefficient is -2.460, so F1 is negatively correlated with the probability of default. According to formula (3), the variables with the largest absolute score coefficients in F1 are X2 (asset liability ratio), X3 (operating profit margin), X4 (net profit margin on sales) and X5 (return on assets), of which X3, X4 and X5 reflect profitability and have positive coefficients in F1. It can therefore be inferred that profitability is the main factor affecting a company's default. When profitability is good, credit risk is unlikely to occur; the worse the profitability, the higher the probability of credit risk.

In the second logistic regression, the three components F1, F2 and F3 are entered into the logistic model together with the TONE variable to predict credit risk. From Table 9, the logistic expression (7) with TONE is obtained, where P represents the probability of occurrence of credit risk.

$$\ln\left(\frac{P}{1-P}\right) = 0.774\,\mathrm{TONE} - 2.628F_1 + 0.693F_2 - 1.015F_3 - 1.108 \tag{7}$$

According to the logistic regression results with the TONE variable, the coefficient on TONE is 0.774, which is significant at the 5% level, indicating that TONE is significantly positively correlated with the probability of credit risk.
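To make the size of this effect concrete, expression (7) can be inverted to give the predicted probability for a set of inputs. The sketch below uses the coefficients of Table 9; the input values are hypothetical, and the inputs are assumed to be on whatever scale was used when the regression was fitted (the paper does not state whether TONE was standardized before estimation).

```python
import math

# Coefficients of expression (7), taken from Table 9.
COEF = {"TONE": 0.774, "F1": -2.628, "F2": 0.693, "F3": -1.015}
INTERCEPT = -1.108


def credit_risk_probability(tone, f1, f2, f3):
    """Invert the log-odds of expression (7) to obtain the probability P."""
    z = (INTERCEPT + COEF["TONE"] * tone + COEF["F1"] * f1
         + COEF["F2"] * f2 + COEF["F3"] * f3)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs: every predictor at 0, then TONE raised by one unit.
p_base = credit_risk_probability(0.0, 0.0, 0.0, 0.0)
p_high_tone = credit_risk_probability(1.0, 0.0, 0.0, 0.0)

# Multiplicative change in the odds P/(1-P) per one-unit increase in TONE.
odds_ratio = math.exp(COEF["TONE"])

print(round(p_base, 3), round(p_high_tone, 3), round(odds_ratio, 2))
```

Holding the financial components fixed, a one-unit increase in TONE roughly doubles the odds of credit risk under these coefficients, which is the quantitative content of the positive and significant TONE coefficient.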
According to equation (2), the larger the value of TONE, the more negative the tone of the annual report; that is, the more pessimistic the tone of the enterprise's annual report in the previous year, the more likely the company is to experience credit risk, and conversely, the less pessimistic the tone, the less prone it is to credit risk. It is worth noting that after the TONE variable is added, the F3 component changes from having no significant effect on the credit risk prediction in the first regression to being significant at the 10% level. The variables with the largest absolute score coefficients in F3 are X7 (total asset turnover) and X11 (growth rate of total assets). Because this is not the focus of this paper, this phenomenon is not analyzed in detail.

3.4. ROC Curve Comparison

When using a logistic model to predict credit risk, most scholars adopt an out-of-sample resampling method, take a predicted probability of credit risk of 0.5 as the critical value, and judge the prediction accuracy of the model on that basis [22, 23, 25], but few articles discuss whether this critical value is well founded. Therefore, this paper uses the ROC curve, which does not depend on a single critical value, to study the effectiveness of credit risk prediction before and after adding TONE to the logistic model. The ordinate of the ROC curve represents sensitivity: the higher it is, the higher the diagnostic accuracy. The abscissa represents 1-specificity: the lower it is, the lower the misjudgment rate. Therefore, in general, the closer the curve lies to the upper left corner of the plot, the better the diagnostic effect; that is, the larger the area under the ROC curve, the better the credit risk prediction.

Table 10.
Area under ROC curve

Test result variable                           TONE prediction probability   Non-TONE prediction probability
Area under curve                               0.892                         0.855
Standard error                                 0.036                         0.049
Significance                                   0.000                         0.000
Asymptotic 95% confidence interval, lower      0.821                         0.760
Asymptotic 95% confidence interval, upper      0.963                         0.950

Figure 3 compares the ROC curves of the logistic model for predicting the probability of credit risk before and after the addition of the TONE variable. The blue curve is the ROC curve of the prediction with TONE, and the green curve is the ROC curve of the prediction without TONE. It can be clearly seen that the blue ROC curve lies closer to the upper left corner of the plot than the green ROC curve. According to Table 10, the area under the ROC curve of the prediction with the TONE variable is 0.892, which is greater than the 0.855 obtained without TONE. Both results show that adding intonation to the prediction of enterprise credit risk can improve the effectiveness of credit risk identification.

Figure 3. ROC curve.

4. Robustness Test

This paper uses a BP neural network model to test the robustness of the empirical results. First, credit risk is predicted from the traditional financial variables alone, with the maximum number of iterations set to 10000 and the total sample split into a 70% training set and a 30% test set. After comparing multiple training runs, the model performs best with 12 neuron nodes; the results are shown in Table 11.

Table 11. Identification results of 12 nodes without tone samples in BP model.

Predicted correct number   48
Prediction errors number   30
Correct rate               61.5%
Error rate                 38.5%

Table 12. Identification results of tone samples at 13 nodes in BP model.
Predicted correct number   63
Prediction errors number   15
Correct rate               80.1%
Error rate                 19.9%

Then, the traditional financial variables are combined with TONE to predict credit risk. The maximum number of iterations, the total number of samples, and the split between training set and test set remain unchanged. After comparison, the model performs best with 13 neurons, and the results are shown in Table 12.

Comparing the two sets of results, the number of correct predictions increases markedly after the intonation variable is added, and the prediction accuracy rises from 61.5% to 80.1%, an increase of 18.6 percentage points. This indicates that adding the intonation variable can improve the accuracy of model prediction, which is consistent with the results of the logistic regression.

5. Conclusions

Taking 25 listed companies with credit risk and 53 listed companies without credit risk from 2018 to 2020 as the research object, this paper uses text data mining to capture and quantify the intonation of annual reports, and uses factor analysis to extract three principal components from the traditional financial data. Finally, the accuracy of the logistic model's credit risk prediction is compared before and after adding the intonation variable, and the following conclusions are drawn.

First, when traditional financial data are used to predict credit risk, profitability has a great impact on the prediction. The stronger the profitability, the less prone the enterprise is to credit risk.

Second, the tone of the previous year's annual report of enterprises with credit risk is more pessimistic than that of enterprises without credit risk. Investors can observe the tone of a company's annual report to reduce the impact of information asymmetry.
Third, according to the empirical results, in the logistic regression the probability of credit risk is significantly positively correlated, at the 5% level, with the degree of pessimism of the quantified text tone; that is, the more negative the tone of the enterprise's annual report in the previous year, the greater the probability of credit risk in the following year. This result also shows that the tone of the company's annual report contains information related to credit risk, which can to a certain extent alleviate the information asymmetry between investors and companies.

Fourth, the ROC curve is used to test the prediction results of the two logistic models. The results show that, compared with the logistic model that uses only the traditional financial data as input, the effectiveness of the model prediction improves after the text intonation index is added. This also shows that although enterprises can window-dress financial data to increase investor confidence, the negative emotions revealed in the annual report remain widespread. By mining the text information of the annual report, the credit risk identification indicators can be expanded and the effectiveness of credit risk identification improved.

This paper separates the intonation from the company's annual report, quantifies it, and uses it to supplement the traditional credit risk identification system based on structured data. According to the conclusions of this paper, investors, commercial banks and other financial institutions should strengthen the acquisition of text information when predicting enterprise credit risk, build a risk prediction system from multiple dimensions, improve the efficiency of credit risk identification, and reduce the losses caused by information asymmetry.
Funding: This research was funded by the Regional Project of the National Natural Science Foundation of China, grant number 71861003, and by the special project for urgently needed disciplines of Guizhou University of Finance and Economics (2020ZJXK20).

Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

[1] Chen Yanli, Jiang Qi. Business environment, real earnings management and credit risk identification [J]. Journal of Shanxi University of Finance and Economics, 2021, 43(09): 98-110.
[2] Yang Lian, Shi Baofeng. Credit risk evaluation model and empirical evidence based on Focal Loss modified cross entropy loss function [J/OL]. Chinese management science: 1-12 [2021-10-10]. https://doi.org/10.16381/j.cnki.issn1003-207x.2020.2188.
[3] Wang Chongren, Han Dongmei. Internet credit personal credit evaluation based on hyper-parameter optimization and integrated learning [J]. Statistics and decision-making, 2019, 35(01): 87-91.
[4] Lou Fangke, Chen Xiaohong. Credit Risk Assessment and Application of Individual Microfinance Based on Logistic Regression Model [J]. Financial Theory and Practice, 2017, 38(01): 30-35.
[5] Zhang Tong, Chi Guotai. Default Discrimination Model Based on Optimal Credit Feature Combination - A Case Study of Chinese A-share Listed Companies [J]. System Engineering Theory and Practice, 2020, 40(10): 2546-2562.
[6] Zhou Ying, Su Xiaoting. Enterprise credit risk prediction based on optimal index combination [J]. Journal of System Management, 2021, 30(05): 817-838.
[7] Li Zhe, Chi Guotai.
Research on Credit Risk of Listed Companies Based on Maximum Index Discrimination and Optimal Relative Membership [J]. China Management Science, 2021, 29(04): 1-15. [8] Li Meng, Wang Jin. Internal control quality and corporate debt default risk [J]. International finance research, 2020(08): 77-86. [9] Shen Yan, Chen Yun, Huang Zhuo. The application of text big data analysis in economics and finance: a literature review [J]. Economics (quarterly), 2019, 18(04): 1153-1186. [10] Wu Fei, Hu Huizhi, Lin Huiyan, Ren Xiaoyi. Digital Transformation of Enterprises and Capital Market Performance - Empirical Evidence from Stock Liquidity [J]. Managing the World, 2021, 37(07): 130-144+10. [11] Li Bin, Wang Yinghui, Zhu Xiaoqian, Li Jianping. Identification and evolution analysis of important risk points in insurance industry - Based on text risk information disclosed in financial report [J/OL]. System engineering theory and practice: 1-15 [2021-10-10]. http://kns.cnki.net/kcms/detail/11.2267.n.20210528.0838.002.html. [12] Liang Kun, He Jun. Analyzing credit risk among Chinese P2P-lending businesses by integrating text- related soft information [J]. Electronic Commerce Research and Applications, 40. https://doi.org/10.1016/j.elerap.2020.100947. [13] Cecchini M, Aytug H, Koehler G J, et al. Making words work: Using financial text as a predictor of financial events [J]. Decision support systems, 2010, 50(1): 164-175. [14] Liu Yishuang, Chen Yiyun. Management Tone and Credit Risk Early Warning of Listed Companies - Based on Content Analysis of Annual Report [J]. Financial Economics Research, 2018, 33(04): 46-54. [15] Wang Xiaoyan, Zhang Zhongyan, Ma Shuangge. Credit Risk Assessment Model Based on Text Prior Information [J]. Chinese Management Science, 2021, 29(05): 34-44. [16] Wang Z, Jiang C, Zhao H, et al. Mining Semantic Soft Factors for Credit Risk Evaluation in Peer-to-Peer Lending [J]. Journal of Management Information Systems, 2020, 37(1): 282-308. 
[17] Zhang Yiwei, Gao Weihe. Self-construction, Cultural Differences and Credit Risk - Empirical Evidence from Internet Finance [J]. Financial Studies, 2020, 46(01): 34-48. [18] Wang Shuxia; Qi Yuwei; Fu Bin; Liu Hongzhi. Credit Risk Evaluation Based on Text Analysis [J]. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 2016, 10(1): 1-11. [19] Zhang Jie, Zhang Yuansheng. Research on the measurement of liquidity risk of P2P lending platform based on Logistic - taking Shangjinfu as an example [J]. Friends of Accounting, 2019(21): 124-127. [20] Bian Yuning, Lu Likun, Li Yeli, Zeng Qingtao, Sun Yanxiong. Implementation of financial venture capital scoring card model based on logical regression [J]. Computer science, 2020, 47(S2): 116-118. [21] Liang Weisen, Wen Simei. Study on Loan Default Risk Assessment of Small and Medium-sized Agricultural Enterprises - Based on Data of Agriculture, Forestry, Animal Husbandry and Fishery Enterprises in 'New Third Board' [J]. Rural Economy, 2019(11): 93-100. [22] Wang Qianhong, Zhang Min. Empirical Study on Credit Default Risk Identification of SMEs in China [J]. Shanghai Economy, 2017(01): 91-100. [23] Liu Xiangdong, Wang Weiqing. Multi-model comparative study on credit risk identification of commercial banks in China [J]. Economic latitude and longitude, 2015,32(06): 132-137. Haonan Zhang, Hongmei Zhang and Mu Zhang / Journal of Risk Analysis and Crisis Response, 2021, 11(4), 176-188 DOI: https://doi.org/10.54560/jracr.v11i4.311 188 [24] Zhang Shuhui, Zhou Meiqiong, Wu Xueqin. Annual report text risk information disclosure and stock price synchronicity [J]. Modern Finance and Economics (Journal of Tianjin University of Finance and Economics), 2021, 41(02): 62-78. [25] Zhang Jingui, Hou Yu. Empirical analysis on credit risk of SMEs based on Logit model [J]. Friends of Accounting, 2014(30): 40-45. Copyright © 2021 by the authors. 
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).