International Journal of Informatics, Information System and Computer Engineering 3(1) (2022) 19-44

XBRL Open Information Model for Risk Based Tax Audit using Machine Learning

Bagas Dwi Suryo Wibowo
University of Glasgow, United Kingdom
E-mail: vox_eu@yahoo.com

ABSTRACT
Tax audit is an effective instrument for preserving tax compliance, and risk-based tax audit selection can optimize it. Risk-based selection focuses audits on wealthy, high-financial-risk taxpayers. However, manually selecting among the plethora of taxpayer data is difficult, prone to human error, costly and time-consuming. Fortunately, using the eXtensible Business Reporting Language (XBRL), a well-known financial statement reporting standard, enables automation. This project proposes software named XAFR as a model for extracting, transforming and loading the latest XBRL Open Information Model (OIM) 1.0 standard US-SEC dataset and providing it as a data source for risk classification using rule-based risk scoring and Machine Learning. Thorough testing revealed the Random Forest classifier as the best Machine Learning model for risk classification, with high accuracy, demonstrating how well a rule-based risk scoring approach collaborates with Machine Learning for risk classification, and the importance of XBRL as a transparent yet robust reporting standard that tax authorities can utilize. The system integration makes it possible to expose wealthy high-risk taxpayers and high-risk industries and to predict risk classification from two-year financial statements. Moreover, this report introduces the critical importance of RCA (Risk, Current Ratio, Assets) analysis and SIC (Standard Industry Classification) utilization to generate risk classification, rank and explanation. Because of time and hardware limitations, this project utilizes financial indicators over a limited number of years and leaves semantic analysis for future work. Predicting possible tax debt is a promising direction for future Machine Learning development.

ARTICLE INFO
Article History: Received 5 May 2022; Revised 20 May 2022; Accepted 25 May 2022; Available online 26 June 2022
Keywords: Audit-Selection, Assets, Current-Ratio, Financial-Risk, Machine-Learning, Open-Information-Model, Risk, Risk-Based, Risk-Scoring, Rule-Based, Standard-Industry-Classification, Tax-Audit, XBRL

1. INTRODUCTION
Indonesia relies on taxation. Tax contributed more than three-quarters of total state income in 2017-2021 (Central Bureau of Statistics of the Republic of Indonesia, 2021). In contrast, Indonesia's tax to Gross Domestic Product (GDP) ratio is too low and should be elevated. Thus, it is critical to maximize the essential role of tax audits. Risk management is required to improve compliance by selecting the correct Taxpayer: the high-risk but wealthy Taxpayer (Khwaja et al., 2011). The authorities should identify and quantify the likelihood of choosing high-risk taxpayers while eliminating non-compliance amid massive data and various information systems, which makes manual selection a hard decision.
Fortunately, utilizing the XBRL financial reporting standard enables comparing and analyzing corporate disclosure information over time and across entities (SEC 2021, para. 2). XBRL technology allows automatic risk-based tax audit selection. This report proposes XAFR (XBRL - Artificial Intelligence Financial Risk Detection), a web-based application built on Python, to address the identified problem. XAFR aims to classify taxpayers' financial risk and expose wealthy, high-financial-risk taxpayers based on the financial information they report to the US-SEC in XBRL format, for tax audit selection. The objectives are to extract, transform, validate and load the dataset into the database, and to calculate a risk score using a rule-based approach that utilizes trend and industry-level benchmarks, producing a rule-based risk score, classification and explanation. Furthermore, the risk data and statistics are appraised to extract the best features for Machine Learning and to predict Taxpayer risk classification. Each risk classification is clustered and ranked by risk score, current ratio and total assets (RCA) to reveal the wealthy, high-financial-risk taxpayers. This project contributes by designing a model capable of processing the novel XBRL OIM 1.0 specification datasets to extract essential information for tax authorities.

2. EXISTING PRODUCTS, PRIOR RESEARCH AND GATHERING REQUIREMENTS
The background research started by collecting information on related software products and prior research, categorized into three groups:

2.1. XBRL tools and services software
Well-known XBRL tools and services are Altova, Arelle, Ez-XBRL, DataTracks and Workiva. These products are listed on the XBRL International Certified Software register. They aim to generate and provide valid XBRL reports for single-entity XBRL report creation, review, validation and analysis, complying with the XBRL 2.1 and Inline XBRL 1.1 specification standards (XBRL International, 2021c). XBRL 2.1 and Inline XBRL 1.1 rely on the Extensible Markup Language (XML) for data transmission. The most crucial feature of XBRL is the precise definition of taxonomies that provide the meaning and relationships of all reporting terms (XBRL International 2021a, para. 15). This technology establishes an appropriate context when reading any term in an XBRL report. Hence, as long as the taxonomy is well regulated, any information can be generated using the correct taxonomies and exchanged between entities in a functional, practical and accurate digital format.

However, the XBRL OIM 1.0 specification, released in October 2021, is prepared for extensive data analysis (XBRL International, 2021b). It uses the JavaScript Object Notation for Linked Data (JSON-LD), Tab Separated Value (TSV) and HyperText Markup Language (HTML) formats (SEC, 2021). The all-in-one context of the single XML format is distributed into JSON-LD for the schema, data types and definitions used in machine operation, TSV for the data tuples, and HTML for the schema, data types and definitions used by the human reader.

2.1.1. XBRL-Artificial Intelligence application
Ashtiani and Raahemi's (2021) study comprehensively compares forty-seven articles studying financial statement fraud detection (FSFD) using machine learning and data mining (ML/DM). The study shows that most articles use supervised classification models on structured financial-ratio data sources.
Studies detecting financial statement fraud with hybrid models in China and India found that the Ensemble Method using Random Forest (RF) outperformed other approaches (Hooda, Bawa and Rana, 2020; Yao, Zhang and Wang, 2018). In conclusion, most prior studies have used the Support Vector Machine (SVM) and RF as excellent classifiers for financial statement fraud detection, while Logistic Regression (LR), although used in classification, is better suited to trend prediction. However, there are no prior financial statement fraud detection studies using the recent XBRL OIM 1.0. Only one prior study was found that discusses the reasoning and explanation behind fraud detection: Venters and Mikkilineni's (2020) study used unsupervised deep learning to detect anomalies in financial statements, found a lack of transparency in the reasoning behind the conclusions, and then provided a sophisticated Deep Reasoning model to mimic human reasoning processes.

2.1.2. Risk-Based Tax Audit Selection Application
Khwaja et al.'s (2011) study comprehensively summarises data-mining risk-based tax audit implementations in many countries: the United Kingdom, Sweden, the Netherlands, Bulgaria, India, Ukraine, Kazakhstan, Turkey and some other World Bank studies. The summary implies that the risk-scoring technique is the foundation of risk-based tax audit selection. The risk-scoring approach builds a taxpayer profile based on specific attributes and knowledge acquired during previous audits. Implementing the risk-scoring strategy requires high-quality data, past audit cases and current taxpayer attributes. This approach requires Information Technology (IT) systems to process the data, score it and feed the pointers into audit programming (Khwaja et al. 2011, pp. 20-21).

2.2. Gathering Requirements
Based on the information collected in the background research above, the project should be able to:
1. Design a scalable infrastructure configuration for massive data storage, mining and machine learning processing. Risk-based audit selection data mining requires adequate IT systems (Khwaja et al. 2011, pp. 20-21).
2. Design a complex general framework for the overall process, then break it into manageable smaller frameworks (Jurney, 2017). The smaller frameworks identified are the XBRL, Artificial Intelligence and Web frameworks.
3. Integrate the smaller frameworks back into the general framework. Extensive unit testing and system integration testing are required to ensure excellent integration.
4. Design a reliable rule-based risk scoring approach to produce high-quality financial indicators for machine learning features.
5. Design thorough tests and measurements to ensure data validity and integrity, a correct taxonomy design that increases non-zero value extraction, and selection of the best classifier model with high accuracy.
6. Design an informative user interface that provides usability for the end-user, focusing on data insight, usability and accuracy.

2.2.1. Functional Requirements: User Stories and MoSCoW statements
The User Stories analysis reveals three User Roles with seven User Story cards, allocated using MoSCoW analysis to 4 Must-Have, 2 Should-Have, 1 Could-Have and an additional 3 Will-Not-Have cards. The effort available is 30 ideal days to accomplish 30 User Acceptance Tests. The detail is provided in Appendix A.

2.2.2. Non-Functional Requirements: Hardware and Software Requirements
Hardware and software infrastructure is essential for massive dataset processing.
The hardware and software requirements are gathered to build the Application Infrastructure Framework, as shown in Figure 1.

Fig. 1. Application Infrastructure Framework

The software deliveries are conducted in three stages: development, testing and deployment, using the environments listed in Table 1. The environment is set up in the development stage, and the software is built until it is ready for testing. The testing stage replicates the developed software and environment into a Virtual Machine (VM). The unit and system integration tests are conducted in the development and testing stages. Finally, the software is replicated to the cloud service using Vagrant if it passes all user and system integration tests. This project utilises Amazon Web Services (AWS) and takes advantage of the AWS Free Tier offer for easily building full-stack software and infrastructure for web-based Machine Learning development (Amazon, 2021; Amunategui and Roopaei, 2018; Jurney, 2017).

Table 1. Hardware and software environment in each development stage
- Development: Local Machine (notebook computer), 4 processor cores, 16 GB memory, Windows 10 64-bit (Microsoft, 2021a), chosen for its excellent user interface, rich features, wide usage and extensive support.
- Testing: Virtual Machine, VirtualBox (Oracle, 2021b), 2 processor cores, 12 GB memory, Linux Ubuntu 18 LTS 64-bit (Canonical, 2021), chosen because it is resilient, resource-efficient and more resistant to viruses, malware and trojans, and because the final software OS is Linux Ubuntu, not Windows. Container: Vagrant (HashiCorp, 2021), a lightweight, reproducible and portable VM environment (Jurney 2017, p. 33).
- Deployment: Amazon Web Services Free Tier, Elastic Compute Cloud (EC2) (Jurney, 2017), Linux Ubuntu 18 LTS 64-bit, provisioned via Vagrant.

A Database Management System (DBMS) is an essential requirement to reduce repeated data processing by storing previous results, improving execution time. Figure 2 shows the differences between the normal form in a relational-schema dataset and the denormalised form in a schemaless dataset. The relational-schema dataset is vertically scalable, whereas the schemaless dataset is horizontally scalable; this project's data mining workload needs horizontal scalability.

Fig. 2. Normal Form in Relational Schema Dataset (Left - Blue) and Denormalized Form in Schemaless Dataset (Right - Green)

Relan (2019, pp. 28-29) mentions that relational databases use the Structured Query Language (SQL) for data manipulation and consist of tables of rows and columns, which is excellent for vertically scalable data, while NoSQL stores the data without a structure (schemaless) and denormalises it only when necessary. This capability is a superior option for distributed and horizontally scalable systems. Hence, this project uses a NoSQL database model. MongoDB is preferred as the document-based NoSQL DBMS (MongoDB, 2021). MongoDB uses the BSON (Binary JSON) format, which is faster for data searching and processing, and MongoDB 5 has an Aggregation feature comparable to the Hadoop Distributed File System's (HDFS) Map-Reduce. Apache Hadoop 3.3 was selected for the distributed file system because HDFS has a large block size, which is excellent for vast-capacity data storage (Apache Software Foundation, 2021a). The default 128 Megabyte (MB) HDFS block size is faster for indexing, reading and writing large files than the smaller block sizes in other file systems; for example, NTFS has only a 4 to 8 KB block size (see Figure 3 for comparison).
This project combines the HDFS infrastructure for massive data storage with the Local Machine file system for minor or temporary data storage.

Fig. 3. Block size impact on file indexing, reading and writing, illustrated as puzzles. A larger block size gives faster read/write at the cost of space efficiency

Apache Spark 3.2 (Apache Software Foundation, 2021b) was selected as the Resilient Distributed Dataset (RDD) processor, which runs on top of Apache Hadoop in an RDD-DFS architecture. Flask 2.0 (Pallets, 2021) handles the overall system integration as a lightweight web application server (Relan, 2019). The Anaconda 3 (Anaconda, 2021) package is selected for its extensive Python libraries for Machine Learning development, including the Jupyter Notebook, Scikit-Learn, NumPy, Pandas, Bokeh and Matplotlib libraries. Microsoft Visual Studio Code 1.62 (Microsoft, 2021b) and its extensive extensions, paired with the Jupyter Notebook server and the Microsoft PowerShell 7 terminal, give a robust IDE for developing this Python project on Windows. The last software requirement is Java 11 (Oracle, 2021a), which is required to support dependencies between applications and drivers.

2.2.3. User Interface
XAFR is designed for the tax authority, with government employees as the end-users. Although smartphones are sometimes used, government employees mainly work with desktop computers or notebooks. The Bootstrap grid technology (Bootstrap et al., 2021) can be utilized for its responsive grid layout's capability to display correctly on smartphones, tablets or desktops, with priority on the desktop environment. This project aims to present data insight and essential information. Thus, the interface should be minimalistic to avoid distracting users and obscuring understanding. Likewise, the colour nuance should be minimal but informative and noticeable. A wrapper or section that can dynamically show and hide content should be provided for pages that include a large amount of material.

This project leverages the existing certified XBRL software for its distinctive approach to utilising taxonomies for XBRL 2.1, then adapts it to XBRL OIM 1.0 development. XBRL OIM 1.0 provides the data tuples in TSV files. TSV is more efficient than XML because TSV does not contain context and extra tags, but it is "blind": there is no machine-readable definition of, or relationship between, fields, whether intra-file or inter-file.

From prior studies of financial statement fraud detection using artificial intelligence, this project takes the importance of feature selection for improving Machine Learning prediction accuracy. This project uses Features Correlation Heatmap analysis and the Random Forest feature importance score to select the significant features. Based on prior study, benchmarking at the industry level improves any classifier algorithm's accuracy. Hence, this project utilizes the Standard Industry Classification field to identify the industry cluster of each Taxpayer and compares each financial indicator value with the average of a similar industry. Furthermore, the risk rank is sorted using risk score, current ratio value and total assets (RCA) to reveal the wealthy, high-financial-risk Taxpayer. The prior studies mention that supervised machine learning is the most used, with SVM and RF as the best classifiers in different cases.

This project uses supervised machine learning and an iterative test to select the classifier model that outperforms the overall average accuracy score, as sketched below. If both classifier models consistently show similar performance, both models are used for risk classification in future prediction, but only the highest accuracy score is selected for the final result.
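A minimal sketch of such an iterative SVM-versus-RF comparison using Scikit-Learn (part of the Anaconda distribution named above); the synthetic data and the iteration count are illustrative assumptions, not the XAFR implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative stand-in for the extracted financial-indicator features
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)

scores = {"svm": [], "rf": []}
for i in range(10):  # XAFR iterates many times; 10 rounds here for brevity
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=i)
    scaler = StandardScaler().fit(X_tr)            # z-score normalisation
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    scores["svm"].append(SVC().fit(X_tr, y_tr).score(X_te, y_te))
    scores["rf"].append(RandomForestClassifier().fit(X_tr, y_tr).score(X_te, y_te))

for name, s in scores.items():
    print(f"{name}: mean accuracy {np.mean(s):.3f}")
```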
3. SCOPE AND IMPLEMENTATION
3.1. General Application Framework Design and Scope
This project's scope runs from XBRL OIM 1.0 document extraction until the user visualizes the wealthy, high-financial-risk taxpayers (see the blue area in Figure 4). This project does not implement a complete Tax Audit Case Management System. Furthermore, the General Application Framework is split into smaller frameworks, the XBRL, AI and Web frameworks, which are allocated to the project development roadmap.

Fig. 4. General Application Framework Diagram; the scope of the project is covered by the blue area

3.2. Project Development Roadmap
The project development roadmap divides the project development into stages, sprints and iterations. A Gantt chart maps the roadmap sprints onto a timeline. Both diagrams are attached in Appendix B.

3.3. Constraints
Hardware limitations significantly impact the massive data implementation and make the development processes highly time-consuming, for example, loading the extracted TSV dataset into the database. Pre-optimization loading (HDFS directly to the database) consumed 437 minutes to load a part of the TSV documents containing four million tuples (148 tuples/second). Post-optimization (a file-chunk algorithm and HDFS - Local Storage collaboration) speeds the load up to 7,175 documents/second, or 27 million tuples in 64 minutes (see Figure 5), for the data loading alone, while the total dataset used is around forty million tuples.

Fig. 5. Pre- (left) and post- (right) optimization data loading comparison

Concerning the hardware and time limitations, this project utilizes the US-SEC dataset from 2019-Q1 to 2021-Q3, targeting the 10-K Annual Audited Financial Statement Form. The US-SEC dataset usage in this project complies with the US-SEC aim to analyze and compare corporate disclosure information over time and across entities (SEC 2021, para. 2). Likewise, the US-SEC disclaimer on this dataset also applies to this project.

3.4. Software Implementation
3.4.1. XBRL Framework
The XBRL Framework is the most prioritized since it provides the data sources. It is split into several features:

XBRL Repository feature
The HTMLParser.py module collects the XBRL OIM 1.0 documents from the source. The ZIPParser.py module then extracts the documents into a temporary folder and allocates them to Local Storage and HDFS. Each document is hashed using the MD5 or SHA1 algorithm, and the file name is modified to follow this structure to avoid duplication (see Figure 6):

originalfiletype_hashtype_hashvalue.originalextensionname

MD5 and SHA1 are widely used, fast and relatively secure from collision and brute-force attacks, with 128-bit (MD5) and 160-bit (SHA1) binary value lengths, converted to 32 digits (MD5) and 40 digits (SHA1) in hexadecimal format, a length relatively suitable for a filename (see Figure 6).
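A minimal sketch of this renaming convention using Python's standard library; the helper name and the choice of the file extension as the "originalfiletype" prefix are illustrative assumptions:

```python
import hashlib
from pathlib import Path

def hashed_name(path: Path, hashtype: str = "md5") -> str:
    """Build a name following originalfiletype_hashtype_hashvalue.extension."""
    digest = hashlib.new(hashtype)
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MB blocks
            digest.update(block)
    ext = path.suffix.lstrip(".")
    return f"{ext}_{hashtype}_{digest.hexdigest()}.{ext}"

# e.g. hashed_name(Path("2021q3.zip")) -> "zip_md5_<32 hex digits>.zip"
```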
Fig. 6. Hashed filename structure convention for content anti-duplication

If a ZIP document with identical content already exists in the repository, the extraction process is halted by default and an error log is created, as illustrated in Figure 7. The process can be forced to continue if the forcedContinue parameter is set to True.

Fig. 7. XBRL Repository feature diagram

The complete ZIP, JSON, HTM and TSV files are stored in the Local Repository, while the big TSV files are stored in HDFS. Finally, the XBRLLog.py module logs all activities and information for future analysis.

XBRL Validator feature
The MongoDB validation keys are collected by the HTMLParser.py module in parallel. The JSONParser.py module reads the XBRL JSON file, which contains the schema, table and column keys (see Figure 8), and stores them with the MongoDB validation keys. These reserved-key collections are used as reserved keys for the conversion.

Fig. 8. Captured XBRL JSON-LD schema file sample

The conversion is essential for building document-creation validation in the database. XAFR must validate and clean the XBRL documents at each stage to retain correct data types and formats, eliminate null and infinite values, and balance all categorical and numerical variables (Figure 9).

Fig. 9. XBRL Validator feature diagram

This procedure is vital for avoiding Garbage-In, Garbage-Out and preserving a Great-In, Great-Out dataset subsample (Štěpánek et al., 2021). The MongoDB validation is the primary filter ensuring the loaded data is valid. A multi-step transformation procedure is applied (see the yellow highlighted box in Figure 9), generating multiple MongoDB Validation files, one per collection, stored in Local Storage and supplied to MongoDB as document-creation validators.

XBRL Dataset Loader feature
The TSVParser.py module handles loading the TSV documents into MongoDB. The critical step is to chunk each big file into smaller groups of rows (see Figure 10). The best practice is around 5,000-10,000 rows per chunk file, avoiding counting the total rows of the big file ahead of time.

Fig. 10. XBRL Dataset Loader Feature Diagram

The TSVParser.py module also provides document anti-duplication in the database by hashing each tuple's content with the SHA-256 algorithm and storing the hash value as _id (see Figure 11). The _id field is a unique indexed document identifier in MongoDB. Finally, TSVParser.py stores the dataset using batch-mode document creation, which is faster than single-mode document creation.

Fig. 11. Store the content SHA-256 hash value as _id for anti-duplication
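A minimal sketch of this chunked, hash-keyed batch load using pandas and pymongo; the chunk size follows the 5,000-10,000-row guidance above, while the connection string, file name and collection name are illustrative assumptions:

```python
import hashlib
import pandas as pd
from pymongo import MongoClient
from pymongo.errors import BulkWriteError

collection = MongoClient("mongodb://localhost:27017")["xafr"]["num"]

# Stream the big TSV in chunks instead of counting its rows up front
for chunk in pd.read_csv("num.tsv", sep="\t", chunksize=5000, dtype=str):
    docs = chunk.to_dict("records")
    for doc in docs:
        content = "|".join(str(v) for v in doc.values())
        doc["_id"] = hashlib.sha256(content.encode()).hexdigest()
    try:
        collection.insert_many(docs, ordered=False)  # batch-mode creation
    except BulkWriteError:
        pass  # duplicate _id values are rejected: anti-duplication
```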
XBRL Denormalizer feature
The XBRL dataset consists of eight collections (tables): sub for the XBRL report submission and filer identity, tag for all tags used in the filings, dim for dimensional tags, num for the numeric XBRL facts presented in the financial statements, txt for each non-numeric XBRL plain-text fact, ren for the filing rendering summary, pre for each line-item text assigned by the filer in the financial statements, and cal for the relationships among the tags in a filing. The table relationship diagram in Figure 12 shows that sub, tag, dim or ren are candidates for the pivot point, depending on the requirement. XAFR uses the Taxpayer as the target module. Thus, sub is selected as the pivot point to denormalise the others, because only sub contains the filer entity information.

Fig. 12. Table Relationship and Primary Key diagram based on the readme.htm and JSON schema

Complete denormalisation is a resource-heavy process, while XAFR focuses only on the financial indicators, not semantic analysis. Hence, based on thorough research, the collection-reduction strategy uses only the essential collections: sub, tag, num and pre. From sub, the Central Index Key (CIK) is extracted. The CIK is the SEC filing's unique identifier, used as an entity identifier key to collect and denormalise data from tag, num and pre using the correct primary key (see Figure 13). The XBRLDenormalizer.py module processes the data entity by entity, stores it in memory, and works together with the XBRLDimension.py module to store the denormalised data in the database until all entities are denormalised (see Figure 13). This procedure ensures the system memory is not overwhelmed and all entities can be denormalised successfully in a faster execution time.

Fig. 13. XBRL Denormalizer feature diagram

The XBRLDenormalizer.py module collaborates with the XBRLDimension.py module as one comprehensive unit. XBRLDenormalizer.py gathers all CIKs, creates each entity as a single object, denormalises the other attributes and temporarily stores the data in memory. XBRLDimension.py catches the denormalised data, cleans it of null and duplicate tuples (a consequence of the denormalisation process), calculates the financial ratios and trends, collects the non-financial values, and merges them horizontally and stores them in the database (see Figure 14). The financial indicators used in XAFR are those most used in prior studies, representing the Taxpayer's profitability, liquidity, solvency and efficiency performance. The XBRLDimension.py module calculates the logarithmic trend of financial values and ratios from the specified and previous year. The logarithmic trend is selected for its "honesty" in representing value fluctuations (see Table 2).

Table 2. Arithmetic and Logarithmic Trend Comparison. The logarithmic trend is more "honest" in trend fluctuation magnitude and summation

Year | Value | Arithmetic Trend | Logarithmic Trend
2018 | 10,000 | - | -
2019 | 1,000 | -90.00% | -230.26%
2020 | 500 | -50.00% | -69.31%
2021 | 1,000 | +100.00% | +69.31%
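The logarithmic trend in Table 2 is the natural log of the ratio of consecutive values, so the per-year trends sum to the overall multi-year trend, which is the "honesty" property referred to above. A short sketch using Table 2's values:

```python
import math

values = {2018: 10_000, 2019: 1_000, 2020: 500, 2021: 1_000}

years = sorted(values)
for prev, cur in zip(years, years[1:]):
    arithmetic = (values[cur] - values[prev]) / values[prev]
    logarithmic = math.log(values[cur] / values[prev])
    print(f"{cur}: arithmetic {arithmetic:+.2%}, logarithmic {logarithmic:+.2%}")

# The logarithmic trends sum to the overall 2018 -> 2021 trend:
# -230.26% - 69.31% + 69.31% = -230.26% = ln(1000 / 10000)
```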
XBRL Analyzer feature
The XBRLAnalyzer.py module collaborates with the AI Features-Extraction feature in the AI Framework. The XBRLAnalyzer.py module builds the aggregation pipeline and the view from the stored denormalised entities (see Figure 14). The XBRL Analyzer feature enables financial indicator value and ratio benchmarking by comparing against, and calculating the deviation from, the competitor average statistics of a similar industry.

Fig. 14. XBRL Analyzer and AI Features-Extraction feature

XBRL Visualizer feature
The XBRLVisualizer.py module visualizes the data insight in charts, diagrams, plots and tables presented for thorough analysis, using interactive graphs from Bokeh (Ven, 2021) and rich-featured tables from DataTables (SpryMedia, 2021). Figure 15 illustrates how the interactive RCA analysis can help the Tax Officer reveal the high-risk Taxpayer.

Fig. 15. XBRL Visualizer feature sample: Interactive Risk Status in Current Ratio vs Total Assets (RCA) Indicator Analysis Scatter Plot

3.4.2. Artificial Intelligence (AI) Framework
The AI Framework is the core of XAFR's Machine Learning classification. It consists of two features:

AI Features-Extractor feature
The AIFeatures.py module executes the rule-based risk scoring. Financial indicator values, trends, and deviations from the average of a similar industry or from a known standard are used to calculate the risk score (see Figure 16).

AI Machine-Learning feature
The features extracted by AIFeatures.py are screened to eliminate insignificant features that could reduce Machine Learning performance; this is handled by the AIMachine.py module.

3.4.3. Web Framework
Finally, the calculation and classification results from the XBRL and AI Frameworks are presented in the web application handled by the Web Framework. XAFR implements the Model View Controller (MVC) pattern for the web presentation by utilizing Flask as the web service and MongoDB as the data model provider. WebIndex.py defines all the routes and controller logic functions, while the views are presented as HTML files in the templates folder. XAFR implements a responsive web view that adapts dynamically to any device display.
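A minimal sketch of this Flask-MongoDB MVC wiring; the route, collection, field and template names are illustrative assumptions rather than the actual WebIndex.py contents:

```python
from flask import Flask, render_template
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017")["xafr"]  # data model provider

@app.route("/taxpayers/<int:risk_label>")
def taxpayer_list(risk_label: int):
    # Controller: query the denormalised entities by risk class, ranked by
    # risk score, current ratio and total assets (the RCA ordering)
    taxpayers = (db["entities"]
                 .find({"risk_label": risk_label})
                 .sort([("risk_score", 1),       # riskiest (lowest score) first
                        ("currentRatio", -1),    # most liquid first
                        ("totalAssets", -1)]))   # wealthiest first
    # View: rendered from an HTML file in the templates folder
    return render_template("taxpayers.html", taxpayers=taxpayers)

if __name__ == "__main__":
    app.run()
```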
4. TESTING
4.1. Unit Testing
Unit testing is essential to ensure each method, class, module and package works as expected. This report conducts unit tests using two approaches. First, internal iterative testing in the xbrl, ai and web packages within each module's primary scope function (see Figure 16).

Fig. 16. Unit testing in the AIFeatures.ipynb module's leading scope, example

Second, the pytest 6.2.5 (Krekel, 2021) library is used, with three unit-test modules, test_Ai.py, test_Xbrl.py and test_Web.py (see Figure 17), to develop critical checkpoint tests for each necessary module functionality.

Fig. 17. Unit testing using pytest

4.2. User Acceptance Test
The User Acceptance Test (UAT) is designed with the Story Cards and MoSCoW analysis. Thirty UATs were designed for the seven story cards. This report conducts the UAT in two approaches. First, tickmark element checking ensures the thirty UAT elements exist and work correctly; the detail is in Appendix A. Second, a User Survey: XAFR was demonstrated to the participants for 10 to 15 minutes, and each participant then answered a questionnaire for around 15 minutes. The survey used Google Forms (Google, 2021a) and collected responses from 15 participants. The participants come from various backgrounds and experiences, all related to tax, finance and IT, with a range of ages, genders and experience in using financial analytical tools. The participant demography detail is presented in Figure 18.

Fig. 18. Participants Demography Statistic

From the user perspective, on a Likert scale from 1 (Highly Unuseful) to 7 (Highly Useful), XAFR usability scores 6.24, or helpful, across the overall features; on a scale from 1 (Highly Inaccurate) to 7 (Highly Accurate), XAFR accuracy scores 5.87, or slightly accurate to accurate, in general (see Table 3 for the per-feature scores). For the awareness survey, on a scale of 1 for aware, 0 for not sure and -1 for not familiar, the average value is 0.65: participants are 65.6% aware of the overall XAFR features.

Table 3. Usability and Accuracy Survey Statistic Result. Each row lists the observed response shares in ascending order of the 7-point scale

Usability - Likert scale from Highly Unuseful (1) to Highly Useful (7)
- Ten Companies Risk diagram: 6.7%, 13.3%, 40%, 40%
- Ten Industries Risk diagram: 6.7%, 6.7%, 60%, 26.7%
- Risk Distribution diagram: 7.1%, 7.1%, 7.1%, 50%, 28.6%
- RCA (Risk, Current Ratio and Assets) diagram: 7.1%, 14.3%, 42.9%, 35.7%
- Taxpayer List: 6.7%, 13.3%, 40%, 40%
- Taxpayer Detail: 6.7%, 13.3%, 46.7%, 33.3%
- XBRL Report - EDGAR Archive: 6.7%, 13.3%, 13.3%, 46.7%, 20%
- Benchmark Financial Indicator diagram: 7.1%, 7.1%, 14.3%, 35.7%, 35.7%
- Benchmark Financial Indicator table: 6.7%, 20%, 33.3%, 40%
- Risk Explanation: 6.7%, 6.7%, 6.7%, 46.7%, 33.3%
- Financial Indicator Sorting: 6.7%, 20%, 40%, 33.3%
- Machine Learning Prediction: 13.3%, 33.3%, 20%, 33.3%
- Average score per scale point: 7.1%, 6.8%, 7.9%, 14.6%, 41.8%, 33.3%
Weighted total score for overall features: 6.24

Accuracy - Likert scale from Highly Inaccurate (1) to Highly Accurate (7)
- Risk Classification of the Blue-Chip Taxpayers: 6.7%, 26.7%, 20%, 33.3%, 13.3%
- Risk Explanation: 6.7%, 13.3%, 13.3%, 33.3%, 33.3%
- Industries List: 6.7%, 13.3%, 46.7%, 33.3%
- Industries Risk: 6.7%, 33.3%, 26.7%, 33.3%
- Financial Indicator Sorting: 6.7%, 13.3%, 40%, 40%
- Machine Learning Prediction: 26.7%, 20%, 33.3%, 20%
- Average score per scale point: 6.7%, 14.4%, 18.8%, 35.5%, 28.8%
Weighted total score for overall features: 5.87

4.3. System Integration Test
The system integration test is conducted to ensure data integrity is preserved, the data is extracted completely and correctly, and it is extracted in the correct taxonomy context, resulting in a high non-zero value rate and thus a high-quality risk score, explanation, benchmark and rule-based risk classification.

XBRL ETL Succession Rate Test
This test measures the completeness of the data extracted, transformed and loaded into the database. The integrationTest.py testXbrlEtlSuccessionRate() method handles this test by comparing the total tuples in each collection's TSV with the valid documents stored in the database. The succession rate is 100% for num, pre and sub, but the rate for the tag collection is only 91.01% (see Figure 19).

Fig. 19. XBRL ETL Succession Rate test

The tag collection can contain duplicates because it holds the available XBRL taxonomy tags, and when two or more entities have the same event context, they use an identical tag. Hence, the 91.01% ETL succession rate is acceptable (see Figure 20).

Fig. 20. XBRL Dataset to Database Load Process Log

Non-Zero Value (NZV) Extraction Rate Test
This test measures how many of the extracted financial indicators are non-zero for each Taxpayer. With 31 extracted features, the non-zero value count varies between 0 and 31. A zero value can originate from a genuine zero or from a null value converted to zero; null values were generated where the algorithm failed to read the taxonomy context. Pre-optimization, XAFR had a poor NZV extraction rate (see Figure 21).

Fig. 21. Non-Zero-Value Distribution Comparison, before custom dictionary keyword optimisation (left, red border) and after optimisation (right, green border)

There are three essential fields for reading the XBRL taxonomy context: the stmt, crdr and tag fields. For example, locating operating income starts by querying the stmt value with "IS" for Income Statement, the crdr value with "C" for Credit (the normal balance for operating income), and a tag matching a regex keyword that indicates operating income.
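A minimal sketch of such a taxonomy-context query; the collection name and the assumption that stmt, crdr and tag sit together in one denormalised facts collection, as well as the regex pattern, are illustrative, not the XAFR schema:

```python
import re
from pymongo import MongoClient

# Hypothetical denormalised facts collection holding the three context fields
facts = MongoClient("mongodb://localhost:27017")["xafr"]["facts"]

cursor = facts.find({
    "stmt": "IS",   # Income Statement
    "crdr": "C",    # Credit, the normal balance for operating income
    "tag": re.compile(r"operating.?income", re.IGNORECASE),
})
for fact in cursor:
    print(fact["tag"], fact.get("value"))
```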
An in-depth evaluation was conducted to improve the rate by comparing Taxpayers that use the reserved accounting-standard taxonomies with those using custom ones, and by creating a dictionary to capture the keywords for custom taxonomy context optimisation. The NZV extraction rate increased significantly with the custom taxonomy context optimisation (see Figure 21 and Table 4).

Table 4. NZV Extraction Rate Comparison before and after optimisation

Condition | NZV Above Threshold Entities Count | Total Entities | NZV Extraction Rate
Before Optimization | 39 | 5,178 | 0.6%
After Optimization | 3,217 | 5,178 | 62.1%

Feature Reduction Rate Test
XAFR calculates the correlation matrix and displays the information in a heatmap diagram. The threshold is ninety per cent: if two features correlate at ninety per cent or greater, only one of the pair is preserved (Table 5).

Table 5. High Correlation Features List; the feature marked red in the original is eliminated

Feature | Feature Pair | Correlation
Financial_trend_opm | Financial_trend_gm_opm | 94%
Financial_trend_roa | Financial_trend_assetTurnover | 99%
Financial_benchmark_quickRatio | Financial_deviation_quickRatio | 92%
Financial_benchmark_der | Financial_deviation_der | 94%
Financial_trend_benchmark_ebitMargin | Financial_trend_ebitMargin | 96%
Financial_trend_benchmark_roa | Financial_trend_benchmark_assetTurnover | 97%
Financial_trend_benchmark_opm | Financial_trend_gm_opm | 90%
Financial_trend_benchmark_gm | Financial_trend_gm | 94%
Financial_trend_benchmark_opm | Financial_trend_opm | 95%
Financial_trend_benchmark_ebitMargin | Financial_trend_ebitMargin | 96%
Financial_trend_benchmark_npm | Financial_trend_npm | 96%
Financial_trend_benchmark_basicEPS | Financial_trend_basicEPS | 95%
Financial_trend_benchmark_quickRatio | Financial_trend_quickRatio | 93%
Financial_trend_benchmark_currentRatio | Financial_trend_currentRatio | 94%
Financial_trend_benchmark_roa | Financial_trend_roa | 92%
Financial_trend_benchmark_roe | Financial_trend_roe | 95%
Financial_trend_benchmark_der | Financial_trend_der | 95%
Financial_trend_benchmark_roa | Financial_trend_assetTurnover | 91%
Financial_trend_benchmark_assetTurnover | Financial_trend_roa | 93%
Financial_trend_benchmark_assetTurnover | Financial_trend_assetTurnover | 94%

Table 5 shows that mostly the industry-level benchmark features remain. Besides the prior studies revealing the importance of industry-level benchmarks for machine learning performance, the other rationale is the Random Forest Feature Importance test. While performing 100 iterations to find the best Random Forest model, the Feature Importance test was also executed. Across the 100 iterations, the benchmark features were consistently in the top ten ranks of the most important features (see Figure 22).

Fig. 22. Random Forest Feature Importance Test screenshot

Target Class Distribution Test
XAFR classifies the Taxpayers into three classes: Low Risk labelled as 1, Medium Risk labelled as 2 and High Risk labelled as 3. The class names and labels are defined based on the risk score calculation and thresholds. The risk score thresholds should produce a class distribution that mimics the actual distribution in reality as closely as possible. After conducting several tests, the best threshold found is the 15% Medium Risk threshold, from -0.15 to 0.15 (see Table 6).

Table 6. Rule-based risk scoring technique

Risk Score Range | Risk Status | Risk Label | Explanation Flag
-1 <= x < -0.15 | High Risk | 3 | Red Flag
-0.15 <= x <= 0.15 | Medium Risk | 2 | Yellow Flag
0.15 < x <= 1 | Low Risk | 1 | Green Flag

The 15% Medium Risk threshold classifies an indicator as medium risk if its deviation from the standard or the average industry-level benchmark is between -15% and 15%. For example, if the average Net Profit Margin in a similar industry is 20%, medium risk lies between 5% and 35%; below 5% is classified as High Risk, and above 35% as Low Risk. This formula is the XAFR rule-based risk scoring approach (Figure 23) and is sketched below.

Fig. 23. Target class distribution
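A minimal sketch of this threshold rule for a single indicator; the function name and the deviation convention (deviation above the benchmark counts as lower risk) are illustrative assumptions:

```python
def classify_indicator(value: float, benchmark: float) -> tuple[str, int, str]:
    """Classify one indicator by its deviation from the industry benchmark."""
    deviation = value - benchmark  # e.g. NPM 0.04 vs benchmark 0.20 -> -0.16
    if deviation < -0.15:
        return "High Risk", 3, "Red Flag"
    if deviation <= 0.15:
        return "Medium Risk", 2, "Yellow Flag"
    return "Low Risk", 1, "Green Flag"

# Average industry Net Profit Margin of 20%: medium risk spans 5% to 35%
print(classify_indicator(0.04, 0.20))  # ('High Risk', 3, 'Red Flag')
print(classify_indicator(0.25, 0.20))  # ('Medium Risk', 2, 'Yellow Flag')
print(classify_indicator(0.40, 0.20))  # ('Low Risk', 1, 'Green Flag')
```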
Accuracy Score Test
Finally, the accuracy score test is the ultimate test for measuring Machine Learning performance. XAFR uses z-score normalisation and hyperparameter tuning to select the best SVM and RF model parameters and improve the Machine Learning accuracy score. The results of the hyperparameter test are shown in Figure 24 and applied to each classifier model's parameters.

Fig. 24. Hyperparameter Test Result: Random Forest model (left) and Support Vector Machine (right)

Furthermore, the test to find the best classifier model iterates a hundred times over both the SVM and RF models (Figure 25).

Fig. 25. Find best classifier model test

The model is dumped to a pickle file at the end of each iteration, with the accuracy score appended to the filename using this structure:

classifiertype_ModelPredictor_accuracyscore.pkl

At the end of the test, the best model is obtained by selecting the pickle file with the highest accuracy score in its name and using that model for Machine Learning risk classification prediction (see Figure 26).

Fig. 26. Selecting The Best Classifier Model
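A minimal sketch of this tuning-and-selection loop with Scikit-Learn; the parameter grid, the data and the "rf" filename prefix are illustrative assumptions, while the hundred iterations, z-score normalisation and accuracy-in-filename convention mirror the description above:

```python
import glob
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the extracted financial-indicator features
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)

for i in range(100):  # a hundred iterations, as in Figure 25
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=i)
    scaler = StandardScaler().fit(X_tr)  # z-score normalisation
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    search = GridSearchCV(RandomForestClassifier(),
                          {"n_estimators": [100, 300], "max_depth": [None, 10]})
    search.fit(X_tr, y_tr)
    acc = search.score(X_te, y_te)
    with open(f"rf_ModelPredictor_{acc:.4f}.pkl", "wb") as f:
        pickle.dump(search.best_estimator_, f)  # dump the model each iteration

# Highest accuracy in the filename wins the lexicographic comparison
best = max(glob.glob("rf_ModelPredictor_*.pkl"))
print("best model:", best)
```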
5. EVALUATION
XAFR passes all unit tests under both approaches and completed all User Acceptance Tests; from the survey participants' perspective, XAFR's features are useful, slightly accurate to accurate, and well known, with high feature awareness. The system integration tests show that XAFR has a high XBRL ETL Succession Rate and significantly improves the Non-Zero Value (NZV) Extraction Rate after finding the root cause of the problem. The high Feature Reduction Rate shows that XAFR can recognize and eliminate insignificant features and reduce multicollinearity. The Target Class Distribution Test reveals the best threshold for producing a risk class distribution that mimics reality as closely as possible, and the Accuracy Test shows Random Forest as the best classifier model. XAFR can significantly improve its overall performance by enriching the Taxonomy Context Dictionary to recognize more financial indicators, especially for Taxpayers that use custom taxonomies.

6. CONCLUSION AND FUTURE WORKS
CONCLUSION
XAFR successfully exposes the high-financial-risk taxpayers capable of paying their tax, using RCA analysis and SIC utilization with rule-based risk scoring and Machine Learning risk classification, and can explain the risk classification for Risk-Based Tax Audit Selection. It models the latest XBRL OIM 1.0 specification processing to generate meaningful information for Tax Authorities. XAFR can process forty million tuples of data to produce insightful information in a minimalist but informative, responsive and interactive display.

FUTURE WORKS
The XBRL OIM 1.0 dataset contains rich, structured and semi-structured, meaningful data. It is a treasure for semantic analysis using Natural Language Processing (NLP) models, providing better risk explanation and risk classification performance. For example, the financial indicators might calculate to medium risk; however, if the NLP report analysis finds a significant number of negative-sentiment words such as "fraud" and "loss" located near words such as "exposed", "mitigated", "recognized", "followed" and "year", the report could be classified as high risk semantically. Furthermore, predicting the range of possible tax debt paid from the specified XBRL financial statements using a regression predictor model is significant for the future development of Machine Learning in this area. For example, at the start of every year, based on the Taxpayer's last financial indicators, the Tax Officer could use Machine Learning to predict the tax debt for the coming year if conditions stay similar. Then, at the end of the year, that tax debt prediction is compared with the prediction using the actual Financial Statement indicators and the actual tax paid. If the difference is significant, a warning alarm is sent to the Taxpayer and the Tax Authority. If there is no additional tax payment or clarification from the Taxpayer within the regulated period, the Taxpayer is automatically selected for Tax Audit.

REFERENCES
Abbasi, A., Albrecht, C., Vance, A., & Hansen, J. (2012). Metafraud: A meta-learning framework for detecting financial fraud. MIS Quarterly, 1293-1327.
Amazon (2021). Cloud Services - Amazon Web Services (AWS). Amazon Web Services, Inc.
Amunategui, M., & Roopaei, M. (2018). Displaying Predictions with Google Maps on Azure. In Monetizing Machine Learning (pp. 195-235). Apress, Berkeley, CA.
Anaconda (2021). Anaconda with Python 3 on 64-bit Windows — Anaconda documentation.
Anggia, P. (2019). Achieving of income tax with awareness of taxation in Indonesia's tax law system. Yustisia Jurnal Hukum, 8(2), 292-308.
Apache Software Foundation (2021a). Apache Hadoop. Available at: https://hadoop.apache.org/ (Accessed: 31 October 2021).
Apache Software Foundation (2021b). Apache Spark™ - Unified Engine for large-scale data analytics. Available at: https://spark.apache.org/ (Accessed: 31 October 2021).
Ashtiani, M. N., & Raahemi, B. (2021). Intelligent fraud detection in financial statements using machine learning and data mining: A systematic literature review. IEEE Access.
Bootstrap et al. (2021). Bootstrap. Available at: https://getbootstrap.com/ (Accessed: 31 October 2021).
Canonical (2021). Enterprise Open Source and Linux, Ubuntu. Available at: https://ubuntu.com/ (Accessed: 31 October 2021).
Carroll, J., & Morris, D. (2015). Agile project management in easy steps. In Easy Steps.
Central Bureau of Statistics of the Republic of Indonesia (2021). Economic and Finance Publication. Available at: https://www.bps.go.id/site/resultTab (Accessed: 30 August 2021).
El-Bannany, M., Dehghan, A. H., & Khedr, A. M. (2021, March). Prediction of Financial Statement Fraud using Machine Learning Techniques in UAE. In 2021 18th International Multi-Conference on Systems, Signals & Devices (SSD) (pp. 649-654). IEEE.
GmbH (2021). The QR Code Generator. Available at: https://www.the-qrcode-generator.com/ (Accessed: 17 December 2021).
Gomaa, M. I., Markelevich, A., & Shaw, L. (2011). Introducing XBRL through a financial statement analysis project. Journal of Accounting Education, 29(2-3), 153-173.
Google (2021a). Google Forms. Available at: https://docs.google.com/forms/ (Accessed: 17 December 2021).
Google (2021b). YouTube. Available at: https://www.youtube.com/ (Accessed: 17 December 2021).
HashiCorp (2021). Vagrant by HashiCorp. Available at: https://www.vagrantup.com/ (Accessed: 31 October 2021).
Hidayattullah, S., Surjandari, I., & Laoh, E. (2020, October). Financial Statement Fraud Detection in Indonesia Listed Companies using Machine Learning based on Meta-Heuristic Optimization. In 2020 International Workshop on Big Data and Information Security (IWBIS) (pp. 79-84). IEEE.
Hooda, N., Bawa, S., & Rana, P. S. (2020). Optimizing fraudulent firm prediction using ensemble machine learning: A case study of an external audit. Applied Artificial Intelligence, 34(1), 20-30.
Joblib (2021). Joblib: running Python functions as pipeline jobs — joblib documentation. Available at: https://joblib.readthedocs.io/en/latest/ (Accessed: 31 October 2021).
Jurney, R. (2017). Agile Data Science 2.0: Building full-stack data analytics applications with Spark. O'Reilly Media, Inc.
Khwaja, M. S., Awasthi, R., & Loeprick, J. (Eds.). (2011). Risk-based tax audits: Approaches and country experiences. World Bank Publications.
Kotsiantis, S., & Kanellopoulos, D. (2008, November). Multi-instance learning for predicting fraudulent financial statements. In 2008 Third International Conference on Convergence and Hybrid Information Technology (1), 448-452. IEEE.
Krekel, H. (2021). pytest: helps you write better programs — pytest documentation. Available at: https://docs.pytest.org/en/6.2.x/ (Accessed: 31 October 2021).
Microsoft (2021a). Explore Windows 11 OS, Computers, Apps, & More. Available at: https://www.microsoft.com/en-gb/windows (Accessed: 31 October 2021).
Microsoft (2021b). Visual Studio Code - Code Editing. Redefined. Available at: https://code.visualstudio.com/ (Accessed: 31 October 2021).
MongoDB (2021). MongoDB: the application data platform. Available at: https://www.mongodb.com (Accessed: 31 October 2021).
Oracle (2021a). JDK 11. Available at: https://openjdk.java.net/projects/jdk/11/ (Accessed: 31 October 2021).
Oracle (2021b). Oracle VM VirtualBox. Available at: https://www.virtualbox.org/ (Accessed: 31 October 2021).
Pallets (2021). Welcome to Flask — Flask Documentation (2.0.x). Available at: https://flask.palletsprojects.com/en/2.0.x/ (Accessed: 31 October 2021).
Relan, K. (2019). Beginning with Flask. In Building REST APIs with Flask (pp. 1-26). Apress, Berkeley, CA.
SEC (2021). Financial Statement Data Sets. Available at: https://www.sec.gov/dera/data/financial-statement-data-sets.html (Accessed: 25 November 2021).
Singh, P. (2018). Machine Learning with PySpark: With Natural Language Processing and Recommender Systems. Apress.
SpryMedia (2021). DataTables: Table plug-in for jQuery. Available at: https://datatables.net/ (Accessed: 31 October 2021).
Štěpánek, L., Habarta, F., Malá, I., & Marek, L. (2021, July). "Great in, great out" is the new "garbage in, garbage out": Subsampling from data with no response variable using various approaches, including unsupervised learning. In 2021 International Conference on Computing, Computational Modelling and Applications (ICCMA) (pp. 122-129). IEEE.
Ven, B. V. de (2021). Bokeh. Available at: https://bokeh.org/ (Accessed: 31 October 2021).
Venters, C., & Mikkilineni, R. (2020, September). Representation and Evolution of Knowledge Structures to Detect Anomalies in Financial Statements. In 2020 IEEE 29th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE) (pp. 58-63). IEEE.
XBRL International (2021a). An Introduction to XBRL. Available at: https://www.xbrl.org/the-standard/what/an-introduction-to-xbrl/ (Accessed: 7 December 2021).
XBRL International (2021b). XBRL & Big Data. Available at: https://specifications.xbrl.org/big-data.html (Accessed: 28 November 2021).
XBRL International (2021c). XBRL Certified Software. Available at: https://software.xbrl.org/ (Accessed: 28 November 2021).
Yao, J., Zhang, J., & Wang, L. (2018, May). A financial statement fraud detection model based on hybrid data mining methods. In 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD) (pp. 57-61). IEEE.