3.1 Revisiting Data Quality Characteristics for Master Data
Since master data repositories are a particular type of data repository with specific business rules and a unique environment, we grounded our investigation in prior work on the quality of regular data based on ISO/IEC 25012 and ISO/IEC 25024. We comprehensively analyse master data quality, considering the specific features that differentiate master data from other types of data (see Section 2.2), to provide interpretations of the effect of master data quality on business and IT. In the following paragraphs, we introduce some results of this analysis. The analysis begins by recalling the definitions of the data quality characteristics according to ISO/IEC 25012, presented in Table 2.
Accessibility. Concerns about accessibility typically refer to user interfaces and users’ disabilities. Master data is generally accessed through web services, and this type of service does not essentially differ from those used for regular data. Consequently, this characteristic does not need to be specifically tailored for master data.
Accuracy. Master data records must contain valid values (i.e. semantically or syntactically encoded reference data) that represent the entities in real life (Allen and Cervo, 2015). In Master Data Systems, as mentioned in Section 2.1, these reference values are typically stored and made accessible in data dictionaries (e.g. eOTD by ECCMA in the product data domain or ICD9/ICD10 in the healthcare domain), some of which are publicly available or offered for a pre-set fee (ISO, 2009). Thus, inadequate levels of accuracy may be a symptom that the coding processes need to be fixed, regardless of whether the process is manual or automated to the maximum extent possible. The most relevant concerns can be measured on the master data values (through the “Semantic Accuracy” or “Syntactic Accuracy” properties) or on the data model (using the “Master data model accuracy” property). Low levels of this data quality characteristic are symptomatic of an inefficient standardization process or of a preliminary design and implementation of the ETL process. This characteristic is strongly related to others, such as Understandability and Credibility; in fact, they maintain a direct relationship: the higher the Understandability, the higher the Accuracy and the Credibility.
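To make the measurement concrete, syntactic accuracy of coded values can be estimated by checking each master data value against the reference dictionary it is supposed to come from. The following sketch is a minimal illustration, assuming a hypothetical in-memory subset of ICD-10 codes and an illustrative record layout.

```python
# Minimal sketch: estimating "Syntactic Accuracy" as the ratio of master
# data values found in a reference dictionary (hypothetical ICD-10 subset).
# The dictionary contents and record layout are illustrative assumptions.

ICD10_CODES = {"A00", "A01.0", "E11.9", "I10", "J45.909"}  # toy subset

def syntactic_accuracy(records: list[dict], attribute: str,
                       reference: set[str]) -> float:
    """Fraction of non-null values of `attribute` found in `reference`."""
    values = [r[attribute] for r in records if r.get(attribute) is not None]
    if not values:
        return 1.0  # vacuously accurate; no values to check
    valid = sum(1 for v in values if v in reference)
    return valid / len(values)

master_records = [
    {"patient_id": 1, "diagnosis_code": "E11.9"},
    {"patient_id": 2, "diagnosis_code": "XXX"},   # not a valid ICD-10 code
    {"patient_id": 3, "diagnosis_code": "I10"},
]
print(syntactic_accuracy(master_records, "diagnosis_code", ICD10_CODES))  # 0.666...
```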
Availability refers to the fact that master data, as a reference for different business processes, should be available and accessible at all times for stakeholders to use whenever they need it (Li et al., 2013). This data quality characteristic is also related to the ability of systems to provide master data records within an acceptable response time (Allen and Cervo, 2015; Loshin, 2010). It can be measured through properties like the “Data Availability Ratio” or “Architectural Data Element Availability”, which help to detect when master data records or some elements of the architecture are not available (e.g. discontinuities of the web services providing master data). In the context of master data, low measured levels of this characteristic can be symptomatic of an inadequate implementation of the ETL processes.
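One simple way to approximate the “Data Availability Ratio” is to probe the web service that serves master data and count successful responses. The sketch below is illustrative; the endpoint URL, probe count, and timeout are assumptions, and a production measurement would spread probes over a defined observation window.

```python
# Minimal sketch: a "Data Availability Ratio" estimated by probing the web
# service that serves master data. Endpoint and probe count are hypothetical.
import urllib.request
import urllib.error

def availability_ratio(url: str, probes: int = 10, timeout: float = 2.0) -> float:
    """Fraction of probe requests answered with HTTP 200."""
    ok = 0
    for _ in range(probes):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    ok += 1
        except (urllib.error.URLError, TimeoutError):
            pass  # probe failed; counts against availability
    return ok / probes

# Example (hypothetical endpoint):
# print(availability_ratio("https://mdm.example.org/api/customers/42"))
```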
Completeness. Completeness can be measured from the point of view of both the master data model and the master data values. To meet this requirement, the master data design performed on a conceptual master data model should be able to satisfy all the data demands of the organization. Therefore, this master data model should be agreed upon and as complete as possible, considering as many contexts of use of master data within the organization as possible (Allen and Cervo, 2015). As stated in Section 2.1, master data records are built from the integration and consolidation of master data attributes from the different data sources integrated as part of the design and implementation of the master data model (Allen and Cervo, 2015). During master data integration processes (performed through ETL processes), it should be possible to generate values for all attributes defined for the master data entities (i.e. entity resolution, Talburt, 2011). Completeness measurements can be used to diagnose relevant concerns, such as (1) missing values for an entire attribute in the master data repository, (2) missing values in some master data records, or (3) missing records in the master data file. Assuming that the conceptual model is adequate, null values in those data elements may be a symptom that (1) the ETL processes are not adequately extracting values from all data sources for all master attributes specified in the conceptual model, or (2) the data sources themselves are not sufficiently complete. In the first case, an exhaustive testing process of the ETL processes should be applied; additionally, it may happen that, during master data record consolidation, the entity resolution algorithms do not generate any values for specific attributes. In the second case, a complete analysis of the completeness of each source involved in generating the master data values is required.
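The concerns above can be quantified directly. The following sketch, based on an illustrative record layout, computes an attribute-level completeness ratio (concerns 1 and 2) and a record-level ratio over a set of required attributes.

```python
# Minimal sketch: attribute-level and record-level completeness ratios over a
# master data extract. The record layout is an illustrative assumption.

def attribute_completeness(records: list[dict], attribute: str) -> float:
    """Fraction of records with a non-null value for `attribute`."""
    if not records:
        return 1.0
    filled = sum(1 for r in records if r.get(attribute) is not None)
    return filled / len(records)

def record_completeness(record: dict, required: list[str]) -> float:
    """Fraction of required attributes filled in a single master data record."""
    filled = sum(1 for a in required if record.get(a) is not None)
    return filled / len(required)

customers = [
    {"id": 1, "name": "ACME", "vat_number": "B123", "country": "ES"},
    {"id": 2, "name": "Globex", "vat_number": None, "country": "DE"},
]
print(attribute_completeness(customers, "vat_number"))                        # 0.5
print(record_completeness(customers[1], ["name", "vat_number", "country"]))   # 0.666...
```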
Compliance. Compliance can be measured from an inherent and a system-dependent point of view. For example, the master data values of a disease registry in the healthcare domain must comply with ICD10 codes (Alonso et al., 2020); in the case of product data management, values must comply with the GTIN standard, which describes how such codes are constructed. The level of compliance with a normative value or format can be measured by the “Regulatory compliance of value and/or format” property. It is common to use specific technologies (e.g. JSON, XML) to support master data exchange; the “Regulatory compliance due to technology” property can determine the level of compliance of a given technology. Low levels of compliance can be symptomatic of insufficient observance of required standards or even regulations.
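Compliance of a value with a normative format can often be checked mechanically. The sketch below validates the GS1 check digit of a GTIN as a minimal instance of the “Regulatory compliance of value and/or format” idea; it is not a full implementation of the GS1 specification.

```python
def gtin_check_digit_ok(gtin: str) -> bool:
    """Validate the GS1 check digit of a GTIN-8/12/13/14 string."""
    if not gtin.isdigit() or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in gtin]
    body, check = digits[:-1], digits[-1]
    # From the rightmost body digit leftwards, weights alternate 3, 1, 3, ...
    total = sum(d * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check

print(gtin_check_digit_ok("4006381333931"))  # True: valid GTIN-13
print(gtin_check_digit_ok("4006381333932"))  # False: corrupted check digit
```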
Confidentiality. It is essential to distinguish between availability (being able to access the master data record) and confidentiality (being able to read, interpret, and use the master data record only if the user’s permissions are adequate to the sensitivity of the data). This characteristic can be measured using the “Encryption usage” and “Non-vulnerability” properties. The foundations of confidentiality measurements for master data do not differ much from those established for regular data; it is mainly a matter of the user access setup of the master data repository.
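One straightforward reading of the “Encryption usage” property is the fraction of attributes classified as sensitive that are actually stored encrypted. The sketch below assumes a hypothetical metadata catalogue recording, per attribute, its sensitivity classification and an encryption flag.

```python
# Minimal sketch: "Encryption usage" as the share of sensitive attributes
# stored encrypted. The metadata catalogue structure is hypothetical.

catalogue = [
    {"attribute": "customer.name",    "sensitive": True,  "encrypted": True},
    {"attribute": "customer.tax_id",  "sensitive": True,  "encrypted": False},
    {"attribute": "customer.country", "sensitive": False, "encrypted": False},
]

def encryption_usage(catalogue: list[dict]) -> float:
    sensitive = [a for a in catalogue if a["sensitive"]]
    if not sensitive:
        return 1.0  # nothing sensitive to protect
    return sum(1 for a in sensitive if a["encrypted"]) / len(sensitive)

print(encryption_usage(catalogue))  # 0.5
```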
Consistency can be measured between attributes of the same target entity or between comparable target master data entities, and it can be measured for a single system or for several systems in the same environment. Considering the master data repository as an authoritative source containing a single source of truth, inconsistencies should be avoided by using reference data, preventing the master data repository from becoming cluttered with inconsistent values. The quality level of this characteristic can be determined using several properties: “Master data format consistency”, “Master data semantic consistency”, “Referential Integrity”, or “Risk of master data inconsistency”. Standardization and propagation of reference data values across data sources are good practices. In this sense, the appearance of values in master data records for attributes that are not calculated and do not exist in the source tables can be a symptom of two possible problems: (1) referential integrity mechanisms have not been conveniently enabled in the design of the master tables, or (2) the ETL processes are not well designed and do not consider all the sources that can minimally provide values to create a minimum master data record. It is interesting to note that reference data can collaterally increase accuracy and consistency levels. On the other hand, it is possible to include additional data quality characteristics under the consistency umbrella of ISO/IEC 25012, such as completeness, uniqueness, conformity, and validity, as Allen and Cervo (2015) did.
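“Referential Integrity” in particular lends itself to a direct check: every master data value drawn from reference data must have a counterpart in the reference table. The following sketch, with illustrative table contents, reports the orphan values.

```python
# Minimal sketch: a "Referential Integrity" check that reports master data
# values not backed by the reference table they should be drawn from.
# Table contents are illustrative assumptions.

reference_countries = {"ES", "DE", "FR"}          # reference data table
master_customers = [
    {"id": 1, "country": "ES"},
    {"id": 2, "country": "UK"},                   # orphan: not in reference
]

def referential_integrity_violations(records, attribute, reference):
    """Return records whose `attribute` value has no reference counterpart."""
    return [r for r in records
            if r.get(attribute) is not None and r[attribute] not in reference]

for bad in referential_integrity_violations(master_customers, "country",
                                            reference_countries):
    print(f"record {bad['id']}: unreferenced value {bad['country']!r}")
```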
Credibility. Organizations that consider their master data repositories as the sole source of truth establish internally that the master data must be inherently credible. This statement can be made implicitly through agreements or explicitly by creating specific data policies that must be implemented and enforced. In this way, top management is responsible for any shortcomings that any stakeholder may cause when using the data in its specific context. To some extent, it could be said that the credibility of the master data repository, if such a policy is in place, is as high as possible. The “Master data repository credibility” property can measure this level of credibility. On the other hand, most master data records contain reference data generated through the integration and entity resolution of data values from various sources using specific algorithms (Talburt, 2011). These algorithms are supposed to observe rules of thumb, as well as possible exceptions, to create a master data record that is credible and believable by users in a specific context (Otto et al., 2010). This aspect can be measured by the “Credibility of the master data value” property. Low levels of Credibility, combined with low levels of Consistency, are typically symptomatic that the credibility of the data sources is inadequate, and some inspection work must be done to isolate and fix the offending sources. Compared with regular data, the most crucial challenge in measuring this characteristic for master data is the need for solid integration and cohesion of the master data records.
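One common way entity resolution operationalizes source credibility is survivorship: when sources disagree on an attribute value, the value from the most trusted source wins. The sketch below is a minimal illustration with hypothetical trust scores; real consolidation algorithms combine many more rules (Talburt, 2011).

```python
# Minimal sketch: survivorship by source trust score during consolidation.
# Trust scores and candidate values are hypothetical.

source_trust = {"CRM": 0.9, "ERP": 0.7, "legacy": 0.4}

candidates = [  # conflicting values for one attribute of one entity
    {"source": "legacy", "value": "ACME Inc"},
    {"source": "CRM",    "value": "ACME Incorporated"},
    {"source": "ERP",    "value": "ACME Inc."},
]

def survive(candidates: list[dict], trust: dict[str, float]) -> dict:
    """Keep the candidate coming from the most credible source."""
    return max(candidates, key=lambda c: trust.get(c["source"], 0.0))

winner = survive(candidates, source_trust)
print(winner["value"], "from", winner["source"])  # ACME Incorporated from CRM
```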
Currentness. By their nature, master data repositories, once master data records are consolidated, tend to have a minimal rate of change, so they are typically expected to change little over time (Dahlberg et al., 2011; Hüner et al., 2011). However, they are not exempt from changes; some of the reasons for such changes are planned update operations, the need to execute a standardization process of the master data, or an update of the reference data (Loshin, 2010), as in the case of the update from ICD9 to ICD10 in clinical data coding (Santos et al., 2021). This implies that, to maintain adequate levels of Currentness in master data records, changes to the values of master data attributes should be made and propagated as soon as possible (Fan et al., 2013). Currentness can be measured using two properties: “Timeliness of update” and “Update frequency”. Low levels of Currentness are symptomatic of inefficient update processes. For this reason, Currentness is typically included as a relevant characteristic in master data quality management (Allen and Cervo, 2015).
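“Timeliness of update” can be read as the fraction of master data records whose changes were propagated within an agreed freshness window. The sketch below uses hypothetical timestamps and an illustrative 30-day policy.

```python
# Minimal sketch: "Timeliness of update" as the fraction of records refreshed
# within a freshness window after their reference data changed. Timestamps
# and the window are illustrative assumptions.
from datetime import datetime, timedelta

FRESHNESS_WINDOW = timedelta(days=30)  # hypothetical policy

records = [
    {"id": 1, "reference_changed": datetime(2023, 1, 1),
              "last_propagated":  datetime(2023, 1, 10)},
    {"id": 2, "reference_changed": datetime(2023, 1, 1),
              "last_propagated":  datetime(2023, 4, 1)},   # propagated too late
]

def timeliness_of_update(records: list[dict], window: timedelta) -> float:
    on_time = sum(1 for r in records
                  if r["last_propagated"] - r["reference_changed"] <= window)
    return on_time / len(records) if records else 1.0

print(timeliness_of_update(records, FRESHNESS_WINDOW))  # 0.5
```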
Efficiency. One of the most common concerns in the literature on master data quality is the duplication of master data records (Berson and Dubov, 2011; Fan et al., 2013; Haneem et al., 2017; Loshin, 2010), caused by inabilities or inefficiencies in entity resolution (Talburt, 2011) or by the choice of a format for master attribute values that does not optimize the space occupied or the performance of operations (Allen and Cervo, 2015). Optimizing Efficiency requires avoiding duplicate master data entries to resolve redundancy issues, avoiding wasted storage space, and ensuring each master attribute has a usable format. Efficiency can be measured through three properties: “Efficient master data item format”, to ensure that the data format of master data records is correct; “Risk of wasted space”, to ensure that the size of master data records is correct and no unnecessary memory space is wasted; and “Space occupied by master data records duplication”, to ensure that there are no duplicate master data records that may affect the efficiency of the master data repository. It is important to note that one of the worst consequences of low levels of Efficiency, in the case of potentially duplicated records, is the probability of propagating inaccurate values to other transactional systems.
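Potential duplicates can be surfaced cheaply with a normalized blocking key before any heavier entity resolution is applied. The sketch below groups records by a lowercased, punctuation-free name plus country; the key choice is an illustrative assumption, not a complete deduplication method.

```python
# Minimal sketch: flagging potential duplicate master data records by a
# normalized blocking key (lowercased name + country). Real deduplication
# would use proper entity resolution (Talburt, 2011); this is illustrative.
from collections import defaultdict

customers = [
    {"id": 1, "name": "ACME Inc.", "country": "ES"},
    {"id": 2, "name": "acme inc",  "country": "ES"},   # likely duplicate of 1
    {"id": 3, "name": "Globex",    "country": "DE"},
]

def normalize(name: str) -> str:
    return "".join(ch for ch in name.lower() if ch.isalnum())

groups: dict[tuple, list[int]] = defaultdict(list)
for c in customers:
    groups[(normalize(c["name"]), c["country"])].append(c["id"])

duplicates = {k: ids for k, ids in groups.items() if len(ids) > 1}
print(duplicates)  # {('acmeinc', 'ES'): [1, 2]}
```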
Portability. Organizations need to exchange data with their partners during the execution of their business processes (Hüner et al., 2011; ISO, 2009; Rivas et al., 2017). Different systems can have different implementations of the same master data, making it less portable; therefore, the more portable the data, the lower the costs of exchange and integration (Silvola et al., 2011). For example, when backing up data for recovery purposes, it would be necessary to allow maximum portability of the master data so that no information is lost. Portability can be measured with the properties “Master Data Portability ratio”, “Prospective data portability”, or “Architecture element portability”. Low levels of portability thus anticipate low chances of success when installing or moving master data from one system to another, with the significant risk of not being able to preserve the existing quality levels of the master data.
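One illustrative reading of the “Master Data Portability ratio” is the fraction of records that survive a round trip through a neutral exchange format without loss. The sketch below uses JSON as that format; the approach is a simplification of a real migration or backup test.

```python
# Minimal sketch: a "Master Data Portability ratio" estimated by round-tripping
# records through a neutral exchange format (JSON here) and counting the
# records that survive without loss. The approach is illustrative.
import json

records = [
    {"id": 1, "name": "ACME", "founded": 1999},
    {"id": 2, "name": "Globex", "founded": None},
]

def portability_ratio(records: list[dict]) -> float:
    """Fraction of records surviving a serialize/deserialize round trip intact."""
    ok = 0
    for r in records:
        try:
            if json.loads(json.dumps(r)) == r:
                ok += 1
        except (TypeError, ValueError):
            pass  # value not representable in the exchange format
    return ok / len(records) if records else 1.0

print(portability_ratio(records))  # 1.0
```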
Precision. This data quality characteristic should be differentiated from Accuracy: Precision concerns the ability to distinguish, for example, that 5.0001 and 5.0002 are different values. It can be measured by the “Precision of data values” or “Precision of data format” properties. Low levels of Precision are symptomatic of not adequately propagating the corresponding precision policies to the data source repositories. The study of this characteristic does not differ much from that of regular data, although it remains important for studying the quality of master data.
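A precision policy can be checked directly on the stored representation of a value. The sketch below tests whether a textual value carries at least a required number of decimal places; the four-decimal rule is a hypothetical policy.

```python
# Minimal sketch: checking that stored values honour a precision policy
# (a hypothetical "at least 4 decimal places" rule for a measurement field).
from decimal import Decimal

REQUIRED_DECIMALS = 4  # illustrative policy

def meets_precision(value: str, required: int) -> bool:
    """True if the textual value carries at least `required` decimal places."""
    exponent = Decimal(value).as_tuple().exponent
    return isinstance(exponent, int) and -exponent >= required

print(meets_precision("5.0001", REQUIRED_DECIMALS))  # True
print(meets_precision("5.01",   REQUIRED_DECIMALS))  # False: precision was lost
```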
Traceability measures the extent to which it is possible to create and follow audit evidence about access to, and any other changes made to, the master data record values or master data models. In the specific case of master data, it is desirable to have high levels of traceability to validate the master data life cycle. This characteristic can be measured through the properties “Traceability of data values”, “Users access traceability”, and “Data Value traceability”. Low levels of traceability are symptomatic of a poorly documented master data life cycle.
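The evidence these properties rely on is typically an append-only audit trail of changes. The sketch below shows a minimal event layout for value-level changes; the fields are illustrative assumptions.

```python
# Minimal sketch: an append-only audit trail for changes to master data values,
# the kind of evidence value-level traceability properties rely on.
# The event layout is a hypothetical example.
from datetime import datetime, timezone

audit_log: list[dict] = []

def record_change(record_id: int, attribute: str, old, new, user: str) -> None:
    """Append one immutable audit event describing a master data change."""
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "record_id": record_id,
        "attribute": attribute,
        "old_value": old,
        "new_value": new,
        "user": user,
    })

record_change(42, "diagnosis_code", "E11", "E11.9", user="data.steward")
print(audit_log[-1]["attribute"], audit_log[-1]["new_value"])  # diagnosis_code E11.9
```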
Understandability. To conveniently exploit master data, users must understand the meaning of the metadata in dictionaries and other metadata sources that characterize the data (ISO, 2009). Understandability covers several concerns that can be measured through properties like “Symbol understandability”, “Semantic understandability”, “Master data understandability”, and “Data model understandability”, to cite a few. Low levels of Understandability are symptomatic of an inefficient design of the master data system. The most challenging consequence of low levels of Understandability is the risk of producing inadequate master data values that can undermine that design.