4.1 Methodologies
Table 2
Overview of methodologies for assessing data, metadata and schema quality, as identified in the reviewed literature.
| Reference | Method | Scope | Data quality dimensions | Use case examples |
| --- | --- | --- | --- | --- |
| Wentzel et al. (2023) | assess | metadata | 5 | EU portal data.europa.eu |
| Lämmel et al. (2020) | harvest, assess | metadata | 3 | OGD (open government data) in Germany |
| Šlibar and Mu (2022) | assess | metadata | 2 | OGD in Canada, USA, New Zealand |
| Nogueras-Iso et al. (2021) | assess | metadata | 6 | OGD in Spain |
| Hafidz et al. (2023) | assess | metadata | 3 | OGD of Indonesian local governments |
| Krasikov and Legner (2023) | screen, assess, prepare | metadata, schema, data | 7 | / |
| Fadlallah et al. (2023) | prepare, assess | data | 8 | Radiation dataset from Lebanese Atomic Energy Commission |
| Yan et al. (2023) | assess | data, monitoring, processed data | 15 | / |
| Wang et al. (2020) | assess | data | 6 | OGD in China and USA |
| Álvarez Sánchez et al. (2019) | assess | data | 8 | Health data from Northern Ireland |
| Kusnirakova et al. (2022) | assess | data, schema | 5 | Czech open data |
| Alogaiel and Alrwais (2023) | assess | data | 9 | OGD in Saudi Arabia |
| Bouchelouche et al. (2022) | assess | data | 1 | OGD in USA |
| Raca et al. (2021) | prepare, assess | data | 5 | 6 OGD portals in Western Balkans |
| Ferradji and Benchikha (2022) | assess | data | 2 | Wikidata |
| Molodtsov and Nikiforova (2024) | assess | portal | 9 | 33 national portals in EU and GCC |
The 16 selected papers, listed in Table
2, propose different methodologies, frameworks and/or metrics. While most of these methods focus exclusively on data assessment, some go beyond this by introducing complementary phases that contribute to data preparation and monitoring. For example, Krasikov and Legner (
2023) not only presented a methodology for data assessment, but also introduced a method for screening and preparing the assessed data for later use, which adds significant value to the process. In addition, Fadlallah
et al. (
2023) and Raca
et al. (
2021) included a preparation phase to ensure that the data are adequately prepared for the assessment.
The scope of these methods primarily revolves around the assessment of metadata and data content. This focus ensures that both the structure and content of the datasets are considered in the assessment. However, one paper, Molodtsov and Nikiforova (
2024), stands out for directing attention to the design of portals. The design of data portals can have a significant impact on accessibility, usability and the overall experience of data users. Ensuring that the portal itself meets high standards of design and accessibility can have a far-reaching impact on how users interact with the data. In addition, Krasikov and Legner (
2023) and Kusnirakova
et al. (
2022) have proposed specific metrics for evaluating the schema of datasets, which adds an important layer to the structural integrity of the data.
A recurring theme in many methodologies is the use of three levels of abstraction: categories, dimensions and metrics. According to Debattista
et al. (
2016), a category represents a group of qualitative dimensions where a common type of information serves as an indicator. This categorization enables organizing and simplifying the review of all aspects of data quality, especially when dealing with a large number of dimensions. Grouping dimensions into categories makes the review process more manageable and ensures that no aspect of data quality is overlooked. Metrics, on the other hand, serve as concrete measures of quality. Each metric is usually linked to a specific measurement method that provides a value—either numeric or Boolean—that can be used to assess the quality of an individual indicator. It is important to note that a single dimension can encompass multiple metrics, allowing for a more nuanced and detailed assessment of data quality.
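To make this three-level structure concrete, the following sketch models categories, dimensions and metrics in Python; the class layout and the two example metrics are illustrative assumptions rather than constructs taken from any of the reviewed frameworks.

```python
from dataclasses import dataclass, field
from typing import Callable, Union

# A metric wraps a measurement function that yields a numeric or Boolean value.
@dataclass
class Metric:
    name: str
    measure: Callable[[dict], Union[float, bool]]

# A dimension groups one or more metrics; a category groups related dimensions.
@dataclass
class Dimension:
    name: str
    metrics: list[Metric] = field(default_factory=list)

@dataclass
class Category:
    name: str
    dimensions: list[Dimension] = field(default_factory=list)

# Illustrative metrics for a "completeness" dimension of dataset metadata.
has_license = Metric("license present", lambda md: bool(md.get("license")))
filled_ratio = Metric("filled fields ratio",
                      lambda md: sum(v is not None for v in md.values()) / len(md))

completeness = Dimension("completeness", [has_license, filled_ratio])
intrinsic = Category("intrinsic quality", [completeness])

metadata = {"title": "Air quality 2023", "license": None, "publisher": "City of X"}
for dim in intrinsic.dimensions:
    for m in dim.metrics:
        print(dim.name, "-", m.name, "=", m.measure(metadata))
```

A single dimension such as completeness can thus carry several metrics, and a portal-level score is obtained by aggregating metric values upward through dimensions and categories.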
Despite the common goal of providing tangible information about the state of data portals, usually in the form of a numerical or descriptive score, methodologies differ significantly in the way they define and name the dimensions they use for assessment. This lack of standardization poses a challenge for comparing the results of different methods and makes it difficult to integrate the findings from multiple studies. Nonetheless, some works attempt to overcome these challenges by refining existing approaches. For example, Ferradji and Benchikha (
2022) proposed an improved version of two formulas introduced in earlier methods.
Five papers dealt exclusively with metadata and emphasized its crucial role in ensuring the usability of data. Wentzel
et al. (
2023) assessed the quality of metadata within the DCAT-AP standard. They defined their assessment dimensions based on the four FAIR principles and added a fifth dimension, contextuality, to better capture the impact of metadata. These dimensions were used together with the adapted FAIR and 5-star principles to define metrics. They also developed a scalable metrics pipeline and implemented their methodology in the form of the Piveau Metrics service. In their comparison with other tools (Sem-Quire, Open Data Portal Watch, FAIR Evaluator, FAIR Checker and F-UJI), Piveau Metrics was identified as the only tool that combines data validation with support for FAIR principles, the 5-star model and DCAT-AP, while also offering a user interface, API access, export functionality, notifications and score comparison (Wentzel
et al.,
2023). Lämmel
et al. (
2020) described their process for collecting metadata and evaluating the collected metadata. Their assessment focused on the completeness of the required information, the availability of the URL and the overall conformance to the schema. They emphasized that findability, an important FAIR principle, can be significantly improved by the completeness and quality of metadata. While their quality assurance approach is presented in technical detail, it has not been released as a finalized or publicly available tool. Šlibar and Mu (
2022) examined the compliance of metadata with the publication guidelines for OGD portals and calculated scores for the completeness and consistency of metadata to assess conformity. Nogueras-Iso
et al. (
2021) evaluated the quality of geographic metadata using six dimensions from the ISO 19157 standard: completeness, logical consistency, temporal accuracy, thematic accuracy, positional accuracy and the quality of free text. They used the Data Quality RDF Vocabulary for the representation of evaluation results. Hafidz
et al. (
2023) assessed the Open Data portal quality by relying on the Open Data Portal Quality (ODPQ) framework (Kubler
et al.,
2018), focusing on two main quality categories: data openness and transparency. These categories were broken down into three key measurement dimensions: existence, conformance and open data, each divided into relevant sub-dimensions based on the DCAT standards. The existence dimension covers aspects of access, discovery and preservation; conformance includes accessURL, license and file format elements; and open data focuses on open formats and machine readability.
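As a rough illustration of how the existence, conformance and open data dimensions could be operationalized over a single DCAT-like metadata record, consider the sketch below; the field names, checks and the set of open formats are assumptions for the example and do not reproduce the ODPQ implementation.

```python
OPEN_FORMATS = {"csv", "json", "xml", "rdf"}  # assumed set of machine-readable open formats

record = {
    "accessURL": "https://data.example.org/air-quality.csv",
    "license": "CC-BY-4.0",
    "format": "CSV",
    "issued": "2023-05-01",
}

def existence_score(md):
    # Existence: are access-, discovery- and preservation-related fields present at all?
    keys = ["accessURL", "license", "format", "issued"]
    return sum(bool(md.get(k)) for k in keys) / len(keys)

def conformance_score(md):
    # Conformance: do accessURL, license and format values look syntactically valid?
    checks = [
        str(md.get("accessURL", "")).startswith(("http://", "https://")),
        bool(md.get("license")),
        bool(md.get("format")),
    ]
    return sum(checks) / len(checks)

def open_data_score(md):
    # Open data: is the distribution published in an open, machine-readable format?
    return 1.0 if str(md.get("format", "")).lower() in OPEN_FORMATS else 0.0

print(existence_score(record), conformance_score(record), open_data_score(record))
```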
A particularly valuable contribution in the larger scope came from Krasikov and Legner (
2023), who proposed a set of metrics based on a comprehensive research review. Their methodology included steps such as use case ideation, identification of relevant open data, high-level metadata assessment, schema-level assessment, content analysis of datasets, semantic documentation and integration of open datasets with internal data. Their three-tiered approach included both traditional data quality dimensions and context-aware assessments to ensure that open data can be used effectively for predefined purposes. Their proposed dimensions also overlap with the ISO characteristics in the completeness and compliance dimensions.
Fadlallah
et al. (
2023) presented the BIGQA model, which was developed for quality assessment of large datasets based on 8 ISO characteristics (accuracy, completeness, consistency, credibility, currentness, compliance, precision and understandability). BIGQA uses parallel processing to efficiently process large data files and the model was demonstrated with custom data quality reports in an application. In the area of multi-source open data, Yan
et al. (
2023) went beyond data quality and proposed evaluation indicators for data monitoring and processing methods to ensure that the value of the data is preserved during processing. They assessed data quality based on five key dimensions: integrity, relevance, availability, timeliness and legitimacy. Wang
et al. (
2020) examined methods for assessing the security, openness, comprehensiveness, sustainability and availability of data and emphasized that high-quality metadata contributes to the discoverability and usability of open government data. Álvarez Sánchez
et al. (
2019) developed the TAQIH tool for assessing the quality of tabular data, which focuses on the dimensions of accuracy, completeness, accessibility, consistency, redundancy, readability, usefulness and trust, the first four of which are also in line with ISO standards. Kusnirakova
et al. (
2022) adapted an existing method and applied it to a specific use case, using five quality categories: file format, accuracy (of the schema), completeness (of data and schema) and consistency (of data types). These categories overlapped with ISO standards, as well as the formal requirements of the Czech government’s open formal standards. Alogaiel and Alrwais (
2023) introduced nine dimensions for assessing OGD in Saudi Arabia: completeness, granularity, timeliness, machine-readability, reusability, consistency, accuracy, understandability, usage and redundancy. Each dimension is thoroughly explained, including its sub-dimensions, and accompanied by equations for score calculation. The final dataset score is presented on a 0–100 scale, with weighted values assigned to each dimension. RapidMiner software was used to process algorithms and compute scores for certain dimensions. Although Raca
et al. (
2021) indicate a focus on two dimensions, the term “dimensions” is used to denote specific components of the dataset—namely, the format and the data content itself. In fact, the study assesses five distinct quality dimensions related to the openness of OGD. They evaluate availability, accessibility, discoverability and timeliness, as well as the openness of formats according to the Berners-Lee scale.
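The weighted 0–100 aggregation reported for Alogaiel and Alrwais (2023) boils down to a weighted average of per-dimension scores; the sketch below shows the arithmetic with invented scores and weights that do not come from the study.

```python
# Hypothetical per-dimension scores (0-100) and weights for one dataset.
scores = {"completeness": 90, "timeliness": 60, "machine-readability": 100, "consistency": 75}
weights = {"completeness": 0.35, "timeliness": 0.25, "machine-readability": 0.20, "consistency": 0.20}

final_score = sum(scores[d] * weights[d] for d in scores)  # weighted average on a 0-100 scale
print(round(final_score, 1))  # 81.5
```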
On the other hand, Bouchelouche
et al. (
2022) focused on the evaluation of only one characteristic. They proposed a percentage grading scale for metrics to assess the accessibility of OGD portals, examining access options, available formats, licensing and timeliness. Similarly, Ferradji and Benchikha (
2022) focused only on an upgrade for time-related metrics, namely currency and volatility, to make them more efficient and suitable for evaluating linked data.
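For context, the baseline time-related metrics most often cited in the data quality literature combine currency (the age of a value at the moment of use) with volatility (its typical shelf life); the sketch below shows this common formulation under assumed inputs, not the improved variants proposed by Ferradji and Benchikha (2022).

```python
# Baseline time-related metrics commonly cited in the data quality literature
# (not the improved formulas of Ferradji and Benchikha (2022)).
def currency(age_at_delivery_days: float, days_since_delivery: float) -> float:
    # How old the value is at the moment of use.
    return age_at_delivery_days + days_since_delivery

def timeliness(currency_days: float, volatility_days: float, s: float = 1.0) -> float:
    # Volatility is the typical shelf life of the value; s controls sensitivity.
    return max(0.0, 1.0 - currency_days / volatility_days) ** s

print(timeliness(currency(2, 10), volatility_days=30))  # 0.6
```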
In contrast to the other approaches, which focus on data or metadata, Molodtsov and Nikiforova (
2024) proposed a unique approach by emphasizing portal evaluation. This is important, as the design and functionality of the portal can be equally crucial for enabling effective data reuse. The proposed framework comprises 72 sub-dimensions organized into nine key dimensions: multilingualism, navigation, general performance, data understandability, data quality, data findability, public engagement, feedback mechanisms and service quality, and portal sustainability and collaboration. Most sub-dimensions are scored using a binary method, while accessibility is assessed using the Accessibility Checker web tool. Sixteen sub-dimensions are evaluated on a sample basis, with a 70% threshold required to score 1 point. If sorting by both relevance and modification date is supported, the sample includes the first 4 and last 3 records; otherwise, it includes the first 8 and last 6 records. For update accuracy, at least 70% of the records must match the declared update frequency (e.g. monthly). Because of overlaps between dimensions, a priority-based weighting system is applied: each dimension is assigned a weight of 1, 2 or 3 for low, medium or high priority, respectively, according to its importance in relation to the central concepts of the framework.
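A minimal sketch of the sample-based scoring and priority weighting described above might look as follows; the sampled results, dimension scores and weight assignments are invented for illustration.

```python
def sample_score(records_ok: list[bool], threshold: float = 0.7) -> int:
    # A sampled sub-dimension scores 1 point if at least `threshold` of records pass.
    return 1 if records_ok and sum(records_ok) / len(records_ok) >= threshold else 0

# e.g. first 4 and last 3 records checked when sorting by relevance and date is supported
checked = [True, True, False, True, True, True, True]

# Hypothetical dimension scores: sums of binary sub-dimension scores per dimension.
dimension_scores = {"data quality": sample_score(checked) + 1, "navigation": 2, "multilingualism": 0}

# Priority-based weights (low=1, medium=2, high=3); the assignment here is illustrative.
weights = {"data quality": 3, "navigation": 2, "multilingualism": 1}
weighted_total = sum(dimension_scores[d] * weights[d] for d in dimension_scores)
print(weighted_total)  # (1+1)*3 + 2*2 + 0*1 = 10
```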
Across these methodologies, completeness proves to be the most frequently used dimension, followed by consistency, accuracy, timeliness and findability. Accessibility, compliance and understandability are also frequently included. The five most frequently used dimensions shared across the analysed studies are summarized and defined in Table
3.
Table 3
Definitions of the most frequently used dimensions, based on ISO/IEC (2008) and Fadlallah et al. (2023).
| Dimension | Definition |
| --- | --- |
| Completeness | Data completeness refers to the degree to which an entity has values for all expected attributes and related entity instances in a specific context of use. |
| Consistency | Refers to the degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use. It can be either or both among data regarding one entity and across similar data for comparable entities. |
| Accuracy | Refers to the degree to which data correctly reflects the true value of an intended attribute in a specific context. It has two main aspects: syntactic and semantic accuracy. Syntactic accuracy refers to the syntactical correctness of the values themselves. Semantic accuracy refers to the closeness of the data values to a set of values defined in a domain considered semantically correct. |
| Timeliness (also Currentness) | Refers to the degree to which data has attributes that are of the right age in a specific context of use. |
| Accessibility | Refers to the degree to which data can be accessed in a specific context of use, particularly by people who need supporting technology or special configuration because of some disability. |
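To relate the definitions in Table 3 to executable checks, the sketch below implements naive completeness, consistency and timeliness measures over a tiny record set; the concrete rules (such as the 365-day freshness window) are assumptions made only for the example.

```python
from datetime import date

rows = [
    {"station": "A1", "pm10": 21.0, "updated": date(2024, 3, 1)},
    {"station": "A2", "pm10": None, "updated": date(2022, 1, 15)},
]

def completeness(rows):
    # Share of attribute values that are actually filled in.
    cells = [v for r in rows for v in r.values()]
    return sum(v is not None for v in cells) / len(cells)

def consistency(rows):
    # Free-from-contradiction check: here, pm10 must be non-negative when present.
    checked = [r for r in rows if r["pm10"] is not None]
    return sum(r["pm10"] >= 0 for r in checked) / len(checked) if checked else 1.0

def timeliness(rows, max_age_days=365, today=date(2024, 6, 1)):
    # Share of records updated within an assumed freshness window.
    return sum((today - r["updated"]).days <= max_age_days for r in rows) / len(rows)

print(completeness(rows), consistency(rows), timeliness(rows))
```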
4.2 Use Cases
In the selected studies, numerous methods were applied to specific datasets, which are listed in Table
2. In the process, the authors identified several challenges and proposed practical solutions.
A common challenge in many studies is insufficient data quality, which can lead to increased costs and loss of time in scientific projects. Wentzel
et al. (
2023) found that despite efforts to improve the FAIRness of the portals on data.europa.eu by integrating their Piveau Metrics tool, progress over the course of a year was minimal. One of the biggest challenges is the need for better engagement from data providers. Problems with the completeness of metadata were also identified as a major challenge. Šlibar and Mu (
2022) also pointed out problems with the American OGD portal, where almost 20% of the required fields were missing from the published datasets, while the New Zealand portal performed poorly in terms of consistency of required and optional fields. A major obstacle to improving metadata quality is the time-consuming assessment of compliance with publication guidelines.
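Much of this compliance checking lends itself to scripting; the sketch below scores required and optional field completeness for a handful of dataset descriptions, with an assumed field list that is not the one used by Šlibar and Mu (2022).

```python
REQUIRED = ["title", "description", "license", "publisher", "modified"]
OPTIONAL = ["keyword", "theme", "contactPoint"]

def field_completeness(dataset: dict, fields: list[str]) -> float:
    # Share of the given fields that are present and non-empty in a dataset description.
    return sum(bool(dataset.get(f)) for f in fields) / len(fields)

datasets = [
    {"title": "Budget 2024", "license": "CC0", "publisher": "Ministry of Finance"},
    {"title": "Schools", "description": "School register", "license": "CC-BY",
     "publisher": "Ministry of Education", "modified": "2024-02-01", "keyword": "education"},
]

for ds in datasets:
    print(ds["title"],
          "required:", round(field_completeness(ds, REQUIRED), 2),
          "optional:", round(field_completeness(ds, OPTIONAL), 2))
```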
Addressing challenges of working with different metadata models, Hafidz
et al. (
2023) developed a mapping between CKAN and DCAT so that their methodology works across formats in the context of Indonesian local government open data. They used the Open Data Portal Quality (ODPQ) framework (Kubler
et al.,
2018), which is based on the Analytic Hierarchy Process and integrates multiple quality dimensions and user preferences. This approach combines metadata collection (via Open Data Portal Watch (Neumaier
et al.,
2016)) with a web-based dashboard and RESTful APIs to create quality rankings. Extending practical quality assessment techniques, Nogueras-Iso
et al. (
2021) implemented a two-part assessment approach, automated and manual, of a national open data portal. The manual assessment by two metadata experts revealed problems with thematic classification and low-quality dataset titles, while the access URLs and descriptions showed better results. Automated checks, on the other hand, highlighted structural problems, including missing mandatory properties (e.g. dcat:mediaType), incorrect or misused URIs and poor readability of free text fields. In addition to these approaches, Álvarez Sánchez
et al. (
2019) applied their TAQIH tool to two use cases—Northern Ireland General Practitioners Prescription Dataset and a glucose monitoring system dataset. TAQIH detected issues such as missing values, unnamed columns, redundant variables and outliers. While automated profiling and visualization improved the completeness of the dataset and reduced noise, some steps (e.g. editing combined date and time fields) required manual intervention. Bouchelouche
et al. (
2022), who also focused on evaluation, validated their Assessment Scale of Marks on a selection of datasets from the American OGD portal. Their scoring system, based on the presence of specific accessibility criteria, showed the strengths of the portal in terms of structural openness, but also pointed to the need for improved updating practices to ensure that the data remains current and usable. Finally, Ferradji and Benchikha (
2022) used the DBpedia Live Extraction Framework, specifically the Infobox Extractor, to retrieve semi-structured data from Wikipedia using SPARQL queries and APIs, focusing on frequently updated facts. Temporal metadata such as the start time and last modified time were used to analyse the update patterns.
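The kinds of issues TAQIH surfaced in these use cases (missing values, unnamed columns, redundant variables and outliers) can be approximated with standard pandas profiling; the sketch below is a generic illustration, not the TAQIH implementation, and the 1.5 × IQR outlier rule is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "practice_id": [1, 2, 3, 4],
    "Unnamed: 3": [None, None, None, None],          # typical artefact of exported spreadsheets
    "items_prescribed": [120, 125, 118, 900],        # 900 is a suspicious outlier
    "items_prescribed_copy": [120, 125, 118, 900],   # redundant duplicate of another column
})

missing_share = df.isna().mean()                     # missing values per column
unnamed_cols = [c for c in df.columns if c.startswith("Unnamed")]
redundant = [(a, b) for i, a in enumerate(df.columns) for b in df.columns[i + 1:]
             if df[a].equals(df[b])]                 # exact duplicate columns

# Simple IQR rule for numeric outliers (assumed threshold of 1.5 * IQR).
col = df["items_prescribed"]
q1, q3 = col.quantile([0.25, 0.75])
outliers = col[(col < q1 - 1.5 * (q3 - q1)) | (col > q3 + 1.5 * (q3 - q1))]

print(missing_share, unnamed_cols, redundant, list(outliers), sep="\n")
```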
In the field of big data, Fadlallah
et al. (
2023) faced challenges related to the volume, processing speed and variety of data. The architecture of their solution is divided into two main modules: a data preparation module, which is responsible for profiling, storing metadata and creating a quality assessment plan, and a data quality assessment module, which executes these plans in a configurable way. The core of the system includes a design layer (logical quality assessment plan), a mapping layer (validation) and an execution layer. Such a modular structure enables a scalable and automated quality assessment. The authors tested their methodology with both Stack Overflow data on a single machine and a large-scale radiation dataset from the Lebanese Atomic Energy Commission, running the methodology on multiple distributed machines. One of the problems they identified is the generalization of contexts and workflows by leading technology providers in quality assessment, which is often insufficient to meet the unique requirements of big data.
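A heavily simplified sketch of this plan-then-execute pattern is shown below, using Python's multiprocessing to run chunk-level checks in parallel; the chunking scheme, check functions and report format are assumptions and do not reflect the BIGQA architecture.

```python
from multiprocessing import Pool

# "Design layer": a quality assessment plan is represented here as a list of named checks.
def completeness(chunk):
    # Share of non-missing values in a chunk of rows.
    cells = [v for row in chunk for v in row]
    return sum(v is not None for v in cells) / len(cells)

def non_negative(chunk):
    # All numeric readings are expected to be >= 0.
    vals = [v for row in chunk for v in row if isinstance(v, (int, float))]
    return sum(v >= 0 for v in vals) / len(vals)

PLAN = [("completeness", completeness), ("non_negative", non_negative)]

def run_plan(chunk):
    # "Execution layer": apply every check of the plan to one data chunk.
    return {name: check(chunk) for name, check in PLAN}

if __name__ == "__main__":
    data = [[(1.2, 0.4), (None, 0.9)], [(2.1, 3.3), (0.0, -1.0)]]  # two chunks of rows
    with Pool(2) as pool:
        reports = pool.map(run_plan, data)  # one quality report per chunk, computed in parallel
    print(reports)
```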
The need for improved metadata standards and policy frameworks was highlighted in several studies. For example, Wang
et al. (
2020) identified numerous problems with US forest data, including unclear data authorization agreements and inadequate data security measures. The authors recommended the development of open source platforms, improved machine-readable data and compliance with international metadata standards. In this way, data quality could be significantly improved, according to the authors.
Furthermore, Kusnirakova
et al. (
2022) applied an evaluation framework that assigns points from 0 to 100 for five quality dimensions: file format, schema accuracy, schema completeness, data type consistency and data completeness. When applied to datasets from six Czech municipalities, the assessment revealed large deficiencies in schema accuracy. For example, while data completeness was generally high, schema accuracy scored poorly due to inconsistent feature naming that often deviated from national standards.
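Scoring schema accuracy and completeness against a reference naming standard can be sketched as follows; the reference column names and the exact-match rule are assumptions for illustration, not the scoring used in the study.

```python
# Hypothetical reference schema from a national open formal standard.
REFERENCE_COLUMNS = {"ico", "nazev", "datum_zapisu", "sidlo_adresa"}

published_columns = {"ICO", "name", "datum_zapisu", "adresa"}

def schema_accuracy(published: set[str], reference: set[str]) -> float:
    # Share of published columns whose names exactly match the standard (case-sensitive).
    return len(published & reference) / len(published)

def schema_completeness(published: set[str], reference: set[str]) -> float:
    # Share of standard columns that appear at all in the published schema.
    return len(published & reference) / len(reference)

print(schema_accuracy(published_columns, REFERENCE_COLUMNS))      # 0.25
print(schema_completeness(published_columns, REFERENCE_COLUMNS))  # 0.25
```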
Raca
et al. (
2021) describe their implementation process as a web service that continuously collected and stored metadata from six national OGD portals, which formed the basis for the quality assessment. Due to the structural differences between the portals, a separate data preparation phase was carried out using custom SQL queries to remove duplicates, incomplete records and inconsistent values. This was followed by a validation step to standardize elements such as date formats, license descriptions and character encoding. Their study showed that differences in metadata practices and file formats between portals significantly influenced the comparative quality assessment.
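The preparation steps described here (deduplication, removal of incomplete records and standardization of values) can be expressed as SQL of the kind the authors mention; the table layout, column names and GLOB-based date filter below are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE harvested (portal TEXT, dataset_id TEXT, title TEXT, license TEXT, modified TEXT);
INSERT INTO harvested VALUES
 ('portal_a', 'd1', 'Budget', 'cc-by', '2023-05-01'),
 ('portal_a', 'd1', 'Budget', 'cc-by', '2023-05-01'),   -- exact duplicate
 ('portal_b', 'd2', 'Schools', NULL, '01/06/2023');     -- missing license, non-ISO date format
""")

-- Remove exact duplicates, drop incomplete records and keep ISO-formatted dates only.
conn.executescript("""
CREATE TABLE prepared AS
SELECT DISTINCT portal, dataset_id, title, license, modified
FROM harvested
WHERE license IS NOT NULL
  AND modified GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]';
""")

print(conn.execute("SELECT * FROM prepared").fetchall())
```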
Finally, Alogaiel and Alrwais (
2023) examined the effectiveness of existing methods for Saudi Arabia’s OGD portal. They found that the current framework is ineffective because there are no clear indicators of what constitutes high-quality data. The authors suggested that continuous monitoring and evaluation, improved search capabilities of the portal, the inclusion of visual representations (such as maps and charts) and better timeliness and comprehensiveness of data could improve the portal’s performance.