Publisher normalization for OpenAlex data
Introduction
Even in times of alternative publishing platforms, publishers of journals, series, and books are known for their contribution to science: they publish renowned journals, give disciplines a voice, provide access, and thus structure a constantly growing range of offerings. High-quality selection processes for contributions are a further advantage of many publishers, from the perspective of both readers and authors.
OpenAlex claims to provide comprehensive coverage of scientific publications that goes beyond that of comparable databases with stricter selection criteria, like Web of Science and Scopus. This also includes publications from disciplines that have received little or no attention in traditional citation databases. Reliable knowledge of the publishers in OpenAlex is therefore an important prerequisite for assessing the possibilities and limitations of specific research questions.
Starting point
As part of the database comparison of OpenAlex, Web of Science, and Scopus1, an analysis of publishers was conducted using a core data set of around 26 million publications from 2010 to 2022 that are included in all three databases2. Initially, it was noticeable that the number of publishers identified in OpenAlex appeared to be lower than that in Web of Science and Scopus (Table 1). However, the reason for this distorted result is a larger number of spelling variations of some publisher names in Web of Science and Scopus. OpenAlex does a much better job of consolidating different variants into standardized names, but its consolidation is also incomplete. The number of publications outside the core data set examined is highest in OpenAlex, at around 44 million3. The analysis of the publishers in this subset shows a completely different result than in the core data set (Table 1): the spellings are very varied, although only 69 percent of these publications contain any information about a publisher. By comparison, 97 percent of the records for core publications in OpenAlex include a publisher. The reason for the significant difference in spelling variants between the two data sets is that the subset of non-core publications contains significantly more varied data from sources that are not included in the core data set, such as repositories or non-English sources.
Figure 1 illustrates that adjustments based on different criteria can sometimes lead to very different results: although the three databases OpenAlex, Web of Science, and Scopus compare the same publications, almost all figures for the largest publishers differ from one another.
A publication analysis of publishers therefore always requires the adjustment of non-standardized spellings in order to obtain correct results. A further reason for the results shown in Figure 1 is that the databases handle publishing structures differently: imprints or sub-publishers are listed either separately or under the main publisher, and changes in ownership are not always traceable. The latter two points are relevant for licensing and acquisition decisions by libraries, among other things: Can the publishers and imprints of a large publishing company be distinguished and recognized individually or as part of the main publisher? Are publishers listed exclusively under their current name and main publisher, or can changes be recorded retrospectively? The answers to these questions, combined with, for example, usage figures from your own institution and price developments over several years, can provide important arguments in contract negotiations.
The further development of publisher information in OpenAlex has been reviewed in more recent database extractions (August 2024 and February 2025), but without cross-database comparison and thus without distinction between core data sets and exclusive data sets. Across the period 2010-2023 as a whole, the proportion of records with publisher information remains stable at around 60 percent in both extractions (Table 2). Within each extraction, however, it rises over the years from a good 52 percent in the early years to well over 70 percent from 2021 onwards. In the most recent years, there is even a tendency for over 80 percent of publications to have a publisher in the data set.
Taking into account the heterogeneous data sources and publication types, the data picture is significantly more differentiated (Table 3): Source types with the highest proportion of publisher information (90-100 percent) are book series and ebook platforms. However, these account for only 10.5 percent of all publications in OpenAlex since 2010. Journals have the highest share of the total stock at around 57 percent, of which just under 80 percent contain publisher information. For publications of the repository type (just under 10 percent share) and conference type (1.2 percent), publisher information is available in around 51 and 19 percent of cases, respectively.
Objective
The most important prerequisite for analyzing publishers is their clear identification through standardized spellings. In many applications, it is also important to be able to differentiate between relationships (publishing hierarchy) and changes (changes in ownership, renaming). These requirements were already met in 2017 with an internal project by the competence network, which at the time focused on cleaning up and disambiguating publishers from Web of Science and Scopus4. However, it has not been updated since then or applied to other data sources. The current project to standardize publishers is related to the OpenBib project.
The limiting factor for a project that aims to meet these requirements is the source data: as explained above, OpenAlex has already cleaned and modified imported raw data using its own method, so that not all necessary information is available.
Procedure
For the project presented here, all publisher names were taken from OpenAlex5 and entered into a table with the respective number of publications since 2014, supplemented by information helpful for identification (if available in OpenAlex). This includes links to the reference data in Wikidata or ROR and the URL of the publisher’s website. The table was reduced to publishers in Latin script.
Starting from the publishers with the highest number of publications, further spelling variants were determined in a semi-automatic process and assigned to a standard publisher name. In doing so, different spellings of a publisher name were primarily merged (e.g., Multidisciplinary Digital Publishing Institute or MDPI AG to MDPI), taking into account the publisher standard data already created by OpenAlex. In most cases, the standard names were determined in accordance with the English Wikidata names. Before merging the existing publisher names with the standard names, manual checks were carried out to ensure correctness, and the names were also checked using partial string matching. The result is the publishers table, which is now available as a tool for analyzing publishers in OpenAlex (Table 4).
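The merging step described above can be illustrated with a minimal sketch. It is not the project's actual procedure: the lookup dictionary, the helper names, and the normalization rules (lowercasing, dropping punctuation) are assumptions made for illustration. Only the MDPI variants are taken from the text.

```python
import re
from typing import List, Optional

# Hypothetical lookup from cleaned variant spellings to a standard name.
# Only the MDPI variants come from the text; the structure is an assumption.
STANDARD_NAMES = {
    "multidisciplinary digital publishing institute": "MDPI",
    "mdpi ag": "MDPI",
    "mdpi": "MDPI",
}


def normalize_key(name: str) -> str:
    """Lowercase the name, replace punctuation by spaces, collapse whitespace."""
    key = re.sub(r"[^\w\s]", " ", name.lower())
    return re.sub(r"\s+", " ", key).strip()


def assign_standard_name(name: str) -> Optional[str]:
    """Return the standard name if the cleaned spelling is a known variant."""
    return STANDARD_NAMES.get(normalize_key(name))


def partial_match_candidates(name: str) -> List[str]:
    """Suggest standard names via whole-word partial string matching.

    The result is only a candidate list for manual review,
    not an automatic assignment.
    """
    key = normalize_key(name)
    if not key:
        return []
    padded_key = f" {key} "
    candidates = set()
    for variant, standard in STANDARD_NAMES.items():
        padded_variant = f" {variant} "
        # Match if either spelling contains the other as a word sequence.
        if padded_variant in padded_key or padded_key in padded_variant:
            candidates.add(standard)
    return sorted(candidates)
```

In this sketch, `assign_standard_name("MDPI AG")` resolves directly, while a spelling such as "MDPI AG (Basel)" is not in the lookup and would only surface through `partial_match_candidates`, where a person decides whether the assignment is correct.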
Due to the dynamic nature of the data, the table does not claim to be exhaustive. Rather, it offers a refined selection of the most prolific publishers, chosen to balance effort against benefit and with future development plans in mind.
The table provides original data from OpenAlex (publisher_id_orig, publisher_name), information added by the KB schema (publisher_id), and new fields containing the normalized name and associated key (standard_name, unit_pk). Identifiers from Wikidata6 or ROR7 and a link to the publisher’s website (url) either also originate from OpenAlex or were added manually. The table can be used in the context of publisher analysis to identify all variants of a publisher’s name or to check questionable publisher names or assignments using the IDs.
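As a rough illustration of the first use case, the following sketch looks up all name variants that share a standard name. The column names follow the table description; the rows, the ID values, and the "Example Press" entry are invented placeholders, and the helper functions are hypothetical.

```python
from collections import defaultdict

# Toy rows using the column names of the publishers table.
# All IDs and the "Example Press" entry are invented placeholders.
rows = [
    {"publisher_id": 1, "publisher_name": "MDPI AG",
     "standard_name": "MDPI", "unit_pk": 10},
    {"publisher_id": 2,
     "publisher_name": "Multidisciplinary Digital Publishing Institute",
     "standard_name": "MDPI", "unit_pk": 10},
    {"publisher_id": 3, "publisher_name": "Example Press",
     "standard_name": "Example Press", "unit_pk": 20},
]


def variants_of(table, standard_name):
    """Return all original spellings assigned to one standard name."""
    return sorted(
        row["publisher_name"]
        for row in table
        if row["standard_name"] == standard_name
    )


def variants_by_unit(table):
    """Group original spellings by the key of the normalized publisher."""
    grouped = defaultdict(list)
    for row in table:
        grouped[row["unit_pk"]].append(row["publisher_name"])
    return dict(grouped)


print(variants_of(rows, "MDPI"))
```

Grouping by unit_pk rather than by standard_name assumes that the key identifies the normalized publisher one-to-one, which is how the table description reads but is not verified here.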
Another table shows publisher relationships and is still under construction. However, the current status has also been published under the name publishers_relations. Publishers can be recorded multiple times as child or parent publishers, depending on the publisher structure or changes. The example (Table 5) shows the relationship between Nature Portfolio as a sub-publisher of Macmillan Publishers for the period from 1999 to the end of April 2015. Since May 2015, Nature Portfolio has been part of Springer Nature: there is a new entry for this relationship. Data records with the same child publisher can be distinguished by different identification numbers in the p_relation_id field. Depending on the period under review, it makes sense to take the first_date and last_date columns into account in order to include only affiliations that are valid for the period under review.
The connection between the two tables is ensured by the keys child_unit and parent_unit. Updates to the tables with newer data and further changes are in preparation and will also be made freely available in the future.
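A minimal sketch of the date filtering on publishers_relations, using the Nature Portfolio example from the text. Several details are assumptions: publisher names stand in for the child_unit and parent_unit keys to keep the example readable, January 1, 1999 is a placeholder for the unspecified start date, an open-ended relationship is represented by None, and parent_on is a hypothetical helper.

```python
from datetime import date
from typing import Optional

# Toy version of publishers_relations. Names replace the unit keys for
# readability; date(1999, 1, 1) is a placeholder start date, and None
# marks a relationship without an end date (both are assumptions).
relations = [
    {"p_relation_id": 1, "child_unit": "Nature Portfolio",
     "parent_unit": "Macmillan Publishers",
     "first_date": date(1999, 1, 1), "last_date": date(2015, 4, 30)},
    {"p_relation_id": 2, "child_unit": "Nature Portfolio",
     "parent_unit": "Springer Nature",
     "first_date": date(2015, 5, 1), "last_date": None},
]


def parent_on(table, child_unit, day) -> Optional[str]:
    """Return the parent publisher valid for a child on a given day."""
    for rel in table:
        if rel["child_unit"] != child_unit:
            continue
        # A missing boundary is treated as open-ended on that side.
        started = rel["first_date"] is None or rel["first_date"] <= day
        not_ended = rel["last_date"] is None or day <= rel["last_date"]
        if started and not_ended:
            return rel["parent_unit"]
    return None


print(parent_on(relations, "Nature Portfolio", date(2012, 6, 1)))
print(parent_on(relations, "Nature Portfolio", date(2020, 1, 1)))
```

For an analysis spanning several years, the same check would be applied per publication date, so that each publication is attributed to the parent publisher valid at that time.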
For testing purposes, the data can be downloaded from Zenodo8.
Footnotes
1. https://bibliometrie.info/forschung/#interne_entwicklungsprojekte; https://open-bibliometrics.de/
2. The data was collected in 2023 and the comparison was made using the DOI.
3. A further 1,931,868 publications are also included in WoS, and 10,902,754 in Scopus.
4. Bruns, A., Lenke, C., & Rimmert, C. (2019). Publisher. Disambiguierung und Historisierung. Projektbericht. doi:10.4119/unibi/2935159
5. A data extraction from the KB was used, data status: August 31, 2024.
6. https://www.wikidata.org/wiki/Wikidata:Main_Page
7. https://ror.org/
8. https://zenodo.org/records/15308680
Citation
@online{scheidt2025,
author = {Scheidt, Barbara},
title = {Publisher Normalization for {OpenAlex} Data},
date = {2025-09-03},
url = {http://www.open-bibliometrics.de/posts/20250903-PublisherNormalization/},
langid = {en}
}