Disclosing Architecture: 18 Stories of Heritage and Innovation
The accessibility of digital information depends on its discoverability and therefore on the quality of the (meta)data. Better registration makes it easier to make connections within the collection, but also to connect different collections. This is one of the aims of the new collection platform being developed as part of Disclosing Architecture. Improving data quality is therefore also an important part of the programme.
Text Nora Abdelmageed and Inge van Stokkom
A collection needs to be registered in order to be managed and used. The systems in which this is done have changed over time, from paper to digital and from private to public and networked. The type of data and the way it is registered are also changing, due to new knowledge and more (technical) possibilities. The trend is to register more and more data, such as information about copyright holders and detailed descriptions of the physical condition of objects. There is also experimentation with ways of recording knowledge about the collection from outside the organisation.
These changing insights, desires and opportunities, combined with different registration methods, input from legacy systems and simple typos, can result in corrupted, inconsistent, missing or impractical registered data. As a result, archives and objects are difficult to find and search results are incomplete or unreliable.
The usability of the collection information found depends on the accuracy, completeness, clarity and unambiguity of the information recorded. Better registration makes it easier to make connections within the collection, but also to connect different collections with each other, which is one of the goals of the new collection platform being developed as part of Disclosing Architecture.
Our approach
We take a two-pronged approach to improving data quality. The first is manual review and correction. The aim of this method is to improve the current data quality, especially in Axiell Collections, and to formulate guidelines that help specialists fill in data fields in the future. This is directly related to the internal use of the data within the Nieuwe Instituut. At the same time, we want to identify data quality problems in our linked open data (LOD). Here we focus only on the publicly available data, which complements the first task. We have proposed and followed a systematic approach to detect data quality problems in our LOD and to provide semi-automated solutions for them. In this article, we explain both strategies for improving data quality.
Improving internal data
Modifying data requires a lot of coordination with the people involved (collection managers, data importers, domain experts) to arrive at a clear and consistent plan. What information do we want to collect and how? What guidelines do we follow so that information is always collected in the same way? What will be experienced as disruptive in daily use?
This leads to different types of requests, such as moving data to a different field, reducing the options in drop-down lists, or removing fields. This applies both to the collection database itself and to the ‘help lists’ – the authority data sources – such as the Thesaurus and Persons & Institutions. While the collection database is at the heart of collection management and therefore receives a lot of attention, these lists are often invisibly polluted.
The authority data sources contain additional information about the terms (for example, personal data) and have been manually compiled and checked. They are sources of information to be used in collection descriptions; they increase the consistency of the recorded data and thus its searchability.
Problems in the collection database usually arise from information ending up in the wrong fields, for example due to different understandings, errors, or changing standards. There are many possible problems with the authority data sources:
- Noise from legacy data: data that was once transferred into the registration system from an old system, from a time when different manuals applied and different options were available.
- Terms/names that are not used anywhere.
- Data errors: values that the software flags as invalid.
- Missing information that makes it impossible to clearly determine which term is meant. For example, persons with a generic surname and a single initial, without birth/death dates.
- Terms used only once or a few times. Searching with thesauri is only useful when terms are used more frequently. Such terms are often written in several ways. They should therefore be merged.
- Developments such as the availability of a geographical thesaurus. The geographical data originally included in the regular thesaurus will need to be moved or deleted.
This inventory results in many small sub-projects, some of which can be tackled using tools such as search/replace or OpenRefine, but some of which require a lot of manual work. Some steps are quick and easy, such as deleting thesaurus records that are not linked to any other record. Others require more work, such as transferring information from several fields in the collection database before a thesaurus term can be deleted. Obviously, a lot of manual work is involved in adjusting all the non-standardised data in order to reach a standardised situation.
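To give an impression of the ‘quick’ category, the sketch below flags thesaurus records that no other record links to. It is a minimal illustration assuming two CSV exports with hypothetical file and column names; the real export format from Axiell Collections will differ.

```python
# A minimal sketch: find thesaurus terms never used by any object record.
# File names and column names ("term_id", "term_ids") are assumptions.
import csv

# Load all term identifiers from a thesaurus export (assumed CSV layout).
with open("thesaurus_export.csv", newline="", encoding="utf-8") as f:
    all_terms = {row["term_id"] for row in csv.DictReader(f)}

# Collect every term identifier actually referenced by object records.
used_terms = set()
with open("object_records_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Assumed: linked terms are stored as a semicolon-separated list.
        used_terms.update(t for t in row["term_ids"].split(";") if t)

# Terms that no record points to are candidates for deletion.
unused = sorted(all_terms - used_terms)
print(f"{len(unused)} of {len(all_terms)} terms are unused")
```

A list like this is only a starting point: each candidate still needs a human decision before deletion.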
Enrichment – Linked Open Data
External sources sometimes have more information about the terms in our collection database. By linking to these sources, we can add information to our internal collection registration system. For example, places of birth, dates of death and family relationships from RKDartists can be added to person records, and scope notes from the Art & Architecture Thesaurus can be added to keywords. Other potential external sources to link to are the TGN (Getty Thesaurus of Geographic Names) or GeoNames for places, and Wikidata. The additional information reduces the number of input errors, as the term becomes less ambiguous. Information from an external source that is not in our internal system can be linked directly on the collection platform.
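As an illustration of what such a link makes possible, the sketch below retrieves a place of birth and a date of death from Wikidata with a SPARQL query, using Q160422 (Theo van Doesburg), the identifier that also appears later in this article. The script is our own illustration, not part of the collection platform.

```python
# A sketch of pulling enrichment candidates from Wikidata with SPARQL.
# P19 is Wikidata's "place of birth" property, P570 "date of death".
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?birthPlaceLabel ?deathDate WHERE {
  wd:Q160422 wdt:P19 ?birthPlace .
  wd:Q160422 wdt:P570 ?deathDate .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,nl". }
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "NieuweInstituut-enrichment-sketch/0.1"},
)
for row in response.json()["results"]["bindings"]:
    print(row["birthPlaceLabel"]["value"], row["deathDate"]["value"])
```

Once a term is linked to a URI, this kind of lookup can be scripted for a whole set of records at once.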
Enrichment through OpenRefine
Linking to external sources has become much easier in recent years thanks to the improved functionality of OpenRefine and the RCE Network of Terms. We load the dataset into OpenRefine, and the ‘reconcile’ function can access all the sources in the Network of Terms and match them automatically. In our experience, not all matches are correct. We therefore reconcile promising subsets, for example the set of persons for whom a date of birth is already available. Some information about the person is needed to judge whether a match is correct; the name alone is not enough, except for very distinctive names. Each match is checked against the URI, and any additional information is imported into our collection registration system. If a large sample of a set shows that the automatically assigned matches are (almost) all correct, then importing without a full check is certainly an option. So far, this has not been the case.
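Under the hood, OpenRefine's reconcile function speaks the W3C Reconciliation Service API. The sketch below shows the shape of that exchange against a placeholder endpoint; the actual Network of Terms endpoints and response details may differ.

```python
# A sketch of the Reconciliation Service API exchange that OpenRefine uses.
# SERVICE_URL is a placeholder, not a real Network of Terms endpoint.
import json
import requests

SERVICE_URL = "https://example.org/reconcile"  # placeholder endpoint

# One reconciliation query, keyed "q0" as the protocol prescribes.
queries = {"q0": {"query": "Doesburg, Theo van"}}
response = requests.post(SERVICE_URL, data={"queries": json.dumps(queries)})

# Each candidate comes back with an id (URI), a label and a match score.
# As noted above, a name alone rarely proves a match, so candidates
# still need checking against extra information such as dates.
for candidate in response.json()["q0"]["result"]:
    print(candidate["id"], candidate["name"], candidate.get("score"))
```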
Enrichment through AI (Closed Beta)
The Closed Beta AI is an initiative of the Axiell Group (our collection management system supplier), together with five museums and institutes, to enrich heritage data. Nieuwe Instituut is the only Dutch institution to have participated in the Closed Beta AI. The project ran from October 2023 to July 2024 and consisted of four phases. The goal of the Closed Beta AI is to automatically extract named entities from collections and link them to entities in Wikidata. The enrichment was based on the authority data sources; for the archival database, entities were extracted from the ‘title’ field of archival records and linked to Wikidata. Axiell Group performed the linking task using various AI techniques, including named entity detection and automatic linking.
In order to validate the results of the developed tool, Axiell Group prepared a validation set that was manually verified by the Nieuwe Instituut team (domain experts). Only records that have been manually verified as ‘correctly linked’ are allowed to be written back to our collection management system. This strategy ensures that no noise is introduced into the existing records. In addition, the records that are written back are marked as ‘AI-generated’ in the notes field. This allows all AI-generated fields to be deleted if necessary, and also makes the distinction between manually entered fields and automatically generated fields clear.
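The write-back rule can be expressed very compactly. The sketch below is illustrative only: the record structure, verdict values and field names are our assumptions, not Axiell's schema.

```python
# A sketch of the write-back rule: keep only manually verified links and
# tag each written-back record so AI-generated fields stay distinguishable
# (and deletable). Field names here are illustrative assumptions.
def select_for_writeback(validated_links):
    approved = []
    for link in validated_links:
        # Domain experts mark each proposed link during validation.
        if link["verdict"] != "correctly linked":
            continue  # rejected or uncertain links would introduce noise
        approved.append({
            "record_id": link["record_id"],
            "wikidata_uri": link["wikidata_uri"],
            "notes": "AI-generated",  # provenance marker in the notes field
        })
    return approved

sample = [
    {"record_id": "ARCH-001", "wikidata_uri": "wd:Q160422", "verdict": "correctly linked"},
    {"record_id": "ARCH-002", "wikidata_uri": "wd:Q999999", "verdict": "rejected"},
]
print(select_for_writeback(sample))  # only ARCH-001 survives
```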
The Data Cleaning Initiative (DCI)
As improving data quality involves different tasks with different scopes, we launched the Data Cleaning Initiative (DCI) project in late 2023. This project aims to standardise and manage all data quality tasks through a unified framework with clear activities, scope, and approach.
The DCI is a systematic approach that aims to discover and resolve data quality problems at scale (all at once) by exploiting patterns in the occurrence of a particular problem. This approach not only repairs broken data, but also uses semantic web technologies to extend and enrich the current data catalogues. The DCI explores data and provides solutions on both sides: Axiell Collections and LOD.
The DCI currently covers four types of tasks, ranging from data cleansing to entity linking, but can be extended in the future. It involves collaboration between team members with different backgrounds (domain experts and technical staff). We set a limit of 12 sprints to be executed in 2024, but the concept itself has no time limit; the limit was set to facilitate measurement, such as tracking productivity.
The DCI has two specific objectives:
- A proposal to split the catalogues containing more widely related data into smaller, closely related data sets. This allows the original catalogues to be semantically grouped in the collection management system. A semantic group is a set of records that have the same meaning, such as persons, books and articles.
- Cleansing and enrichment strategies that we apply to the resulting semantic categories to improve the data quality of the heritage collections.
We propose four groups of data cleansing and enrichment tasks in the context of the DCI. We define these main categories as:
- Data cleaning: This category includes all tasks related to basic data cleaning. For example, dealing with inconsistencies such as the use of different formats or dealing with missing values. The latter may influence the guidelines for the completion of this metadata by domain experts, for example by discovering a potentially mandatory field.
- Entity resolution: This category focuses on the discovery and grouping of similar entities. As all data is manually entered by domain experts, they may use different representations to describe the same entity, for example ‘Doesburg, Theo van’ and ‘Doesburg, Th. Van’ (a sketch of this follows the list).
- Entity linking: This group links internal records of our heritage collections to external sources or knowledge graphs (KGs), such as Wikidata. For example, ‘Doesburg, Theo van’ would be linked to 'wd:Q160422'.
- Entity enrichment: This category aims to retrieve external properties and pieces of information that exist in external sources but not in the local collection management system. For example, adding the image of Van Doesburg from Wikidata to our Axiell Collections.
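The entity-resolution sketch referenced in the list above groups person names on a simple blocking key (surname plus initials), so that spelling variants end up in one candidate group. This is a heuristic for finding candidates, not a merge decision; domain experts still confirm each group.

```python
# A sketch of a blocking key for entity resolution: surname plus initials.
# Variants such as 'Doesburg, Theo van' and 'Doesburg, Th. Van' share a key.
import re
from collections import defaultdict

def blocking_key(name: str) -> str:
    surname, _, rest = name.partition(",")
    # Reduce every remaining name part to its initial letter.
    initials = [t[0].lower() for t in re.findall(r"[A-Za-z]+", rest)]
    return f"{surname.strip().lower()}|{'.'.join(initials)}"

names = ["Doesburg, Theo van", "Doesburg, Th. Van", "Doesburg, Petra van"]
groups = defaultdict(list)
for name in names:
    groups[blocking_key(name)].append(name)

for key, members in groups.items():
    if len(members) > 1:
        print(f"candidate duplicates under '{key}': {members}")
```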
Results and conclusions
By the end of Q2 2024, the DCI had helped us identify 18 data issues, each falling into one of the predefined DCI categories of data cleansing or data enrichment. Approximately three issues were deferred after consultation with domain experts, four have already been resolved, and the rest are in progress. We are also using the DCI to explore automatic linking to external sources such as the National Archives, the Amsterdam City Archives and the Rijksmuseum.
In addition, as a spin-off of the DCI, we submitted a poster paper to the SEMANTiCS conference to present the DCI approach at the Nieuwe Instituut. We believe this will benefit the wider heritage community: the generalised, unified framework can be adopted by any institution, and the topic of data cleansing, while not yet widely discussed, is emerging and attracting growing interest in the community. Our submission has been accepted and we will present the paper in September 2024.