Data Science as a Catalyst for Audit Transformation

Authors: Colombo Gardey Julieta and Kugler María Paula, Auditoría General de la Nación Argentina


“In a world deluged by irrelevant information, clarity is power”

Yuval Noah Harari, 2018

In the information society, digital transformation has led to exponential growth in the production and storage of data. Data science emerged in response, addressing the need for new tools that can intelligently process large volumes of data and transform them into actionable information for decision-making in many settings.

This scenario suggests that the old adage attributed to Hobbes (1651), “Information is Power”, will soon be replaced by the newer “Clarity is Power” (Harari, 2018), which more accurately conveys the implications of a new model for managing data, information and knowledge.

This is a unique opportunity for Supreme Audit Institutions (SAIs) to exploit the potential of new technologies and the large amount of data generated within public agencies, and to integrate both into their auditing processes to ensure better management of public funds. Data science works as a catalyst for audit transformation, which strengthens SAIs’ independence while increasing public confidence and accountability.

Data science methodologies can optimize audit processes, leading to audit reports of greater value, accuracy and scope, and to more timely and relevant recommendations. High-quality reports foster a more effective and efficient public administration, with a significant impact on citizens’ quality of life. Integrating data science into auditing processes requires a roadmap that can guide SAIs in the use of these new tools.

Data Science in the Auditing Process

Data science tools and techniques can be integrated into any stage of the auditing process. The analysis below is based on the INTOSAI guidelines for Performance Audits and the CRISP-DM (Cross-Industry Standard Process for Data Mining) model. Both processes constantly feed into each other in an iterative manner. Their stages are comparable and can be combined as data science is introduced into audit activity:


Source: Own, based on GUID 3920 (INTOSAI, 2019); Han, Kamber and Pei (2011).

The CRISP-DM model includes the three core dimensions of data science: 1) database management; 2) creation of machine learning models through algorithms that enable computers to learn a task, such as automatically recognizing complex patterns, and to improve their performance over time through the use of data; and 3) data analytics, to explore, clean and transform data in order to extract and present useful information for smart decision-making.


  • Database Management

In the planning stage, both the selection of the topic and the audit design are vital, because these determine what the audit object will be. 

In this sense, data science is a key tool for ensuring that the topics included in audit planning are selected strategically and efficiently. It also enables a thorough initial assessment of the universe of possible audit objects through statistical models applied to large volumes of data. This allows SAIs to identify critical audit points and risks more accurately, and to select the most relevant auditable objects in line with the SAI’s mandate.

The design of an audit plan begins with a thorough search for relevant information, which makes data access vital. Today it is possible to access open data on multiple public and private platforms (scraping/crawling, APIs, GPT). These tools facilitate and speed up access to the information needed for audits.

Planning begins with an initial assessment of the structure and composition of the database, and this assessment informs how the data will be cleaned and transformed to fit the goals of the audit project. This entails estimating the number of records, the types of variables and summary measures, and checking for outliers, noise (error characters), duplicate data and missing data. Presenting this exploratory assessment visually enables a better interpretation of raw data. Visualization tools are an excellent resource for creating summaries, graphics and reports quickly and in a wide variety of designs.
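The exploratory assessment described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the records, the field names and the `profile` function are hypothetical, the outlier check uses the common 1.5 × interquartile range rule as an assumption, and only the standard library is used so that the example stays self-contained.

```python
from statistics import quantiles

# Hypothetical payment records; field names and values are illustrative only.
records = [
    {"id": 1, "agency": "A", "amount": 120.0},
    {"id": 2, "agency": "B", "amount": 135.5},
    {"id": 3, "agency": "A", "amount": None},    # missing value
    {"id": 4, "agency": "C", "amount": 128.0},
    {"id": 4, "agency": "C", "amount": 128.0},   # exact duplicate
    {"id": 5, "agency": "B", "amount": 9800.0},  # unusually large payment
    {"id": 6, "agency": "A", "amount": 110.0},
    {"id": 7, "agency": "C", "amount": 142.0},
]


def profile(rows, numeric_field):
    """Summarize record count, variable types, missing values, duplicates and outliers."""
    # Variable types observed per field, ignoring missing values.
    types = {}
    for row in rows:
        for field, value in row.items():
            if value is not None:
                types.setdefault(field, set()).add(type(value).__name__)

    # Exact duplicates: rows whose full content has already been seen.
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)

    # Outliers by the 1.5 * IQR rule (an assumed criterion for this sketch).
    values = [row[numeric_field] for row in rows if row[numeric_field] is not None]
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread

    return {
        "records": len(rows),
        "types": types,
        "missing": len(rows) - len(values),
        "duplicates": duplicates,
        "outliers": [v for v in values if v < low or v > high],
    }


summary = profile(records, "amount")
print(summary)
```

In practice the same summary would usually be produced with a data analysis library and shown as a chart or table; the point here is only which quantities the initial assessment collects.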

Data quality affects the results of the models, their analysis and the conclusions drawn. Although organizations have made progress in digitalizing, standardizing and structuring data, agencies usually obtain databases that need cleaning before they can be used. Thus, during this stage, raw data are cleaned and refined through various techniques in order to obtain a dataset suited to the goals of the audit. Data structure and features are vital aspects when it comes to defining the relevant statistical models.

As for sampling, the general rule is to consider the whole universe of data, which the great potential of data science tools makes possible. Samples are then drawn from that universe: one dataset is used to create and train the algorithm, and others are used to assess the predictive ability of the model.
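The split between a training dataset and a validation dataset can be sketched as follows. The function name `split_dataset`, the 70/30 proportion and the fixed seed are assumptions chosen for illustration; the text does not prescribe any particular share.

```python
import random


def split_dataset(rows, validation_share=0.3, seed=42):
    """Shuffle the records reproducibly and split them into training and validation sets."""
    shuffled = list(rows)                  # copy, so the original order is preserved
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = int(round(len(shuffled) * (1 - validation_share)))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the full universe of records (here, just 100 identifiers).
records = list(range(100))
train, validation = split_dataset(records)
print(len(train), len(validation))
```

Fixing the seed matters in an audit setting because it lets a reviewer reproduce exactly which records were used to train the model and which were held back to assess it.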

The software options available for each process in data science are virtually endless. It is advisable to use tools that support computational thinking and do not limit the user. The most comprehensive and widely used options for data processing and analysis are Python and R. Both are free, open-source, high-level languages. They offer toolkits, known as libraries, and functions for every stage of data science, from simple visualization to the construction of more complex algorithms. One of the main advantages of these high-level languages is that users can write their own function, with all its steps and rules, apply it to one database and then reuse the same function on other datasets without duplicating processes manually or rewriting any code.
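The reuse described above can be illustrated with a single cleaning function applied to two unrelated datasets. This is a sketch under stated assumptions: `clean_records`, the cleaning rules it applies and the sample contracts and grants are all hypothetical.

```python
def clean_records(rows, required_fields):
    """Trim text fields, drop rows missing a required field, and remove exact duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        # Rule 1: strip stray whitespace from every text value.
        normalized = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        # Rule 2: skip rows where a required field is empty or missing.
        if any(normalized.get(field) in (None, "") for field in required_fields):
            continue
        # Rule 3: keep only the first occurrence of identical rows.
        key = tuple(sorted(normalized.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


# Two hypothetical datasets with different structures.
contracts = [
    {"supplier": " Acme ", "amount": 100},
    {"supplier": "Acme", "amount": 100},  # duplicate once whitespace is trimmed
    {"supplier": "", "amount": 50},       # missing supplier
]
grants = [
    {"beneficiary": "Norte", "amount": None},  # missing amount
    {"beneficiary": "Sur ", "amount": 75},
]

clean_contracts = clean_records(contracts, ["supplier"])
clean_grants = clean_records(grants, ["beneficiary", "amount"])
print(clean_contracts)
print(clean_grants)
```

The same rules run unchanged on both datasets; only the list of required fields differs from one call to the next.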


  • Creation of machine learning models

During the execution and modelling stage, models are created and assessed to find evidence that will support future findings. The choice of model depends on the audit’s goal, the scale of the available data and the type of issue to be addressed. Models fall into two distinct types, based on how much their variables depend on one another and the particulars of the issue addressed: supervised learning models, used to predict new cases (for example, regression), and unsupervised learning models, used to sort and group cases. The following chart contains examples of these learning models as they relate to the type of task to be performed:


Source: Own, based on Han, Kamber and Pei (2011).
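The difference between the two families of models can be made concrete with a deliberately small sketch. The supervised example fits a least-squares line to labelled pairs and predicts a new case; the unsupervised example splits unlabelled values into two groups with a basic one-dimensional two-means procedure. The function names and data are hypothetical, and real audits would normally rely on an established machine learning library rather than hand-written routines like these.

```python
from statistics import mean


def fit_line(xs, ys):
    """Supervised: learn slope and intercept from labelled (x, y) pairs by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def two_means(values, iterations=10):
    """Unsupervised: split unlabelled values into two groups around two moving centers."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Recompute each center as the mean of its group (keep the old one if empty).
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Supervised: known outcomes (ys) are available for training.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = slope * 5 + intercept  # predicted outcome for a new case, x = 5

# Unsupervised: no outcomes, only the values themselves.
low_group, high_group = two_means([10, 12, 11, 95, 100, 98])
print(prediction, low_group, high_group)
```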

Every model should be evaluated against the validation data to determine its predictive or classifying ability. To this end, there are various techniques to measure variance, bias, errors and the cost of detecting those errors. The model is then applied to the remaining data in order to find relevant and accurate evidence that can support recommendations.
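As one example of such an evaluation, the sketch below compares a model's predictions with the known outcomes in a validation set and counts the two kinds of error. Everything specific here is an assumption made for illustration: the validation data, the threshold rule standing in for a trained model, and the choice of accuracy, precision and recall as measures.

```python
def evaluate(actual, predicted):
    """Compare predictions with known outcomes and summarize the errors."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # irregular, flagged
    fp = sum(1 for a, p in pairs if not a and p)      # regular, flagged (false alarm)
    fn = sum(1 for a, p in pairs if a and not p)      # irregular, missed
    tn = sum(1 for a, p in pairs if not a and not p)  # regular, not flagged
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }


# Hypothetical validation set: (payment amount, was it actually irregular?).
validation = [
    (50, False), (900, True), (120, False), (700, True),
    (650, False), (80, False), (300, True), (1000, True),
]

# Stand-in for a trained model: flag any payment of 500 or more.
predicted = [amount >= 500 for amount, _ in validation]
actual = [label for _, label in validation]

metrics = evaluate(actual, predicted)
print(metrics)
```

False positives and false negatives carry different costs in an audit: the first consume fieldwork on regular transactions, while the second leave irregular ones undetected, so reporting them separately is more informative than accuracy alone.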


  • Data analytics

In the reporting stage, visualization tools play a key role. The variety and number of visualization options offered by data science represent a meaningful improvement, since they convey information clearly through high-quality graphs and videos, allow different aesthetic parameters to be selected, and make reports easy to create. Moreover, several intuitive tools, such as Power BI and Tableau, enable the creation of dashboards to inform decision-making (known as business intelligence).


Data science techniques allow the automation of audit processes. By keeping the criteria unchanged and adding or replacing the input data, the model can detect continuity and/or disruptions (anomalies) in the analysed information.
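The idea of fixed criteria applied to changing input can be sketched as follows. The criteria are set once from a baseline period and then reused, unchanged, on each new batch of data. The function names, the figures and the rule of flagging values more than three standard deviations from the baseline mean are assumptions made for this illustration.

```python
from statistics import mean, stdev


def fit_criteria(baseline, z_threshold=3.0):
    """Fix the detection criteria once, from a baseline period."""
    return {"mean": mean(baseline), "stdev": stdev(baseline), "z": z_threshold}


def detect_anomalies(values, criteria):
    """Apply the unchanged criteria to any new input and return the values that break them."""
    limit = criteria["z"] * criteria["stdev"]
    return [v for v in values if abs(v - criteria["mean"]) > limit]


# Hypothetical monthly expenditure figures for a baseline period.
criteria = fit_criteria([100, 102, 98, 101, 99, 100, 103, 97])

# The same criteria are reused as new data arrive; only the input changes.
first_batch = detect_anomalies([101, 99, 140, 100, 93], criteria)
second_batch = detect_anomalies([100, 104], criteria)
print(first_batch, second_batch)
```

An empty result, as in the second batch, indicates continuity with the baseline; any returned values are candidates for closer review rather than findings in themselves.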


Source: Own.


Source: Own.

Strategic Action Items


Source: Own.


The strategic and progressive deployment of information technologies in auditing activities has the potential to drive significant changes in auditing processes.

The advantages of integrating data science far outweigh the risks. This is why SAIs are strongly advised to begin reimagining their auditing and monitoring activities to include this practice.

A series of strategic guidelines has been outlined to make clear that integrating data science into auditing processes should not be treated as an isolated measure, but rather as part of a set of steps towards gradual scaling.

Special emphasis has been placed on the unique opportunity SAIs have to enhance their role and capitalize on their cross-cutting and multidisciplinary activities to spearhead the culture change that digital transformation requires. It is an enormous challenge, but overcoming it is not only necessary and timely, but also feasible. Rather than considering technology as a limitation or as an end in itself, it is important to understand the positive impact of data science, to avoid “putting the cart before the horse”, or technology before knowledge.

Digital transformation should be approached through specific measures and a strong political will oriented to the fight against corruption. To this end, data governance should be widely and effectively addressed.

A timely, accurate and efficient governmental control based on data science adds value to the administration, optimizing public expenditure. Furthermore, leveraging the potential of technology to implement data science can contribute to the reduction of development gaps and lay the foundation for a more solid and sustainable growth worldwide.
