Towards open citation data from research project reports
Open science and open bibliographic metadata
Scientific research and its results, verified objective knowledge about the world around us and our societies, are widely seen as common goods that deserve public financial support. It is becoming more and more accepted that to facilitate the greatest possible use and benefit for society, the knowledge and data generated from publicly supported research should be as publicly, widely and easily accessible as possible. This open science premise includes open access to scientific publications, data, and code and can also extend to open bibliographic metadata about publications and data. In this context, recent initiatives have advocated for and worked towards open abstracts, open cited reference data, and open institutional affiliation and personal identifiers for scientific research publications.
The availability of open cited reference data is a crucial condition for more transparent and reproducible bibliometric research and services (Peroni and Shotton 2020). Large corpora of open bibliographic metadata and citation data are now being curated, for instance, by CrossRef, OpenCitations, and OpenAlex. These include predominantly regular scientific publications. For some research questions within science studies, types of documents which are not themselves research publications but more or less closely related to research are also relevant. For example, policy and think tank documents, trade literature (Dobre, Herbert, and Hicks 2024), technical standards (Blind and Fenton 2022), patents, and research project reports all frequently cite scientific literature, embedding them into the citation network and knowledge space of global research. Such para-scientific documents of the ‘bibliometric hinterlands’1 have significant potential for documenting impact beyond the science system and informing policy, including impact analysis of science policy.
A large corpus of research project completion reports is available from German federal ministries that have funded research, but the scientific literature cited in these reports has not been made openly available in a machine-readable format (Gabrys-Deutscher and Lütjen 2022). We wanted to test how well open source out-of-the-box reference extraction would perform on this source of reference data, which might be quite challenging, as we discuss below. We report some first findings here.
Open source reference extraction
Open source tools have been developed to identify, extract, and parse bibliographic data, including cited references, from PDFs of scientific texts. We decided to look only at GROBID (Lopez 2009) since it is probably the most widely used option. Its users include OpenAlex, scite.ai, and ResearchGate. GROBID can extract numerous bibliographic elements from the full text of documents and is specialized for typical journal articles with reference lists at the end. Its good performance in this domain has been verified (Backes et al. 2024), but its performance on less standardized documents is not well known and is worth looking into.
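One common way to run GROBID is through its REST service. The following is a minimal sketch, not necessarily the setup used for the results below. It assumes a GROBID server running locally on the default port 8070 and uses the `processFulltextDocument` endpoint with the `includeRawCitations` and `consolidateCitations` form parameters. The helper names `build_multipart` and `pdf_to_tei` are our own and purely illustrative.

```python
import urllib.request
import uuid
from pathlib import Path

# Assumption: a GROBID server is running locally on its default port.
GROBID_URL = "http://localhost:8070"


def build_multipart(fields, file_name, file_bytes):
    """Encode text fields plus one PDF (form field 'input') as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="input"; filename="{file_name}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def pdf_to_tei(pdf_path, server=GROBID_URL):
    """Send one PDF to GROBID and return the TEI XML response as a string."""
    path = Path(pdf_path)
    body, content_type = build_multipart(
        # Keep the raw reference strings; skip lookups against external services.
        {"includeRawCitations": "1", "consolidateCitations": "0"},
        path.name,
        path.read_bytes(),
    )
    request = urllib.request.Request(
        f"{server}/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.read().decode("utf-8")
```

With a server running, `pdf_to_tei("report.pdf")` (a hypothetical file name) would return TEI XML like the excerpts shown in the Results section.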
The challenge of heterogeneous non-standard research texts
The research project reports corpus that we are interested in is characterized by extensive heterogeneity in formatting and content. While there are some guidelines on what should be contained in such reports, the level of detail with which the required sections are populated varies and sections are frequently left empty. This is particularly notable for the sections on the scientific results. Sometimes these are limited to just a few general paragraphs, sometimes these comprise hundreds of pages and include other published or unpublished material such as scientific papers and university theses. We noticed in preliminary spot checks that there is little uniformity in formatting, including citation style and in-text referencing. Concerning reference lists we observed the following points:
There is no standard reference list format. References are listed in many different styles.
Frequently, there are neither cited references nor a reference list.
These two points should not be difficult for GROBID to handle since this is also the case for more typical scientific papers. The following issues may be more challenging:
There are in general two different types of references, both as cited references in the text and as reference lists: there are, first, standard scientific sources and, second, references to literature that resulted specifically from the project. These may appear as distinct lists or mixed in a single list.
There can also be more than one reference list for other reasons. This often happens when multiple project partners are responsible for individual project modules and each partner reports on their respective part in a combined report. We also often saw papers and theses with their own reference lists attached to the main report.
Many references concern outputs other than scientific publications, such as presentations for practice partners or clients, and patents.
Unlike in scientific papers, reference lists often appear before the end of a document.
Given these peculiarities and GROBID’s specialization on typical journal articles, we anticipated that GROBID in its out-of-the-box configuration, without any additional training for this type of document, might perform considerably worse than on classic English-language journal articles. Yet the proof is in the pudding, which in scientific papers, and in this blog post, is the Results section.
Results
At this preliminary stage it is not possible to do a statistical assessment of the reference extraction quality. But a qualitative look at a small number of examples may already give a rough first impression. It only needs to be enough material to tell us whether this approach does not work at all or whether we already have some useful results on which we can build further.
We will look at five examples selected to show the variety of results – they are not representative but illustrative. We will show some excerpts of the PDF reference lists and GROBID’s generated TEI XML annotation of these parts, in which we have occasionally elided some lengthy parts.
Example 1
This report cites a good number of references. The main reference list, with a length of about three pages, has an informative heading and lists cited literature alphabetically. GROBID recognizes all 48 cited references and seems to do a good job of parsing the reference string elements.

<biblStruct xml:id="b0">
<analytic>
<title/>
<author>
<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Bissett</surname></persName>
</author>
<author>
<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Hannon</surname></persName>
</author>
<author>
<persName><forename type="first">T</forename><forename type="middle">V</forename><surname>Orr</surname></persName>
</author>
</analytic>
<monogr>
<title level="j">Photochem Photobiol</title>
<imprint>
<biblScope unit="volume">50</biblScope>
<biblScope unit="page" from="763" to="769" />
<date type="published" when="1990">1990</date>
</imprint>
</monogr>
<note type="raw_reference">Bissett DL, Hannon DP, Orr TV: Photochem Photobiol 50: 763-769, 1990</note>
</biblStruct>
Unfortunately, the end of the reference list was not correctly determined, and two ‘phantom’ references were created from the following section on cooperation with other researchers:
<biblStruct xml:id="b48">
<monogr>
<title level="m" type="main">Kooperation mit anderen Wissenschaftlern</title>
<imprint/>
</monogr>
<note type="raw_reference">Kooperation mit anderen Wissenschaftlern</note>
</biblStruct>
<biblStruct xml:id="b49">
<monogr>
<title level="m" type="main">Institut für Physiologische Chemie I der Heinrich-Heine Universität Düsseldorf Herr Professor Sies ist ein international anerkannter Experte auf dem Gebiet reaktiver Sauerstoffmetaboliten und deren Rolle bei verschiedenen physiologischen</title>
<author>
<persName><forename type="first">Herr</forename><surname>Professor</surname></persName>
</author>
<author>
<persName><forename type="first">Helmut</forename><surname>Sies</surname></persName>
</author>
<imprint/>
</monogr>
<note type="raw_reference">Herr Professor Helmut Sies, Institut für Physiologische Chemie I der Heinrich- Heine Universität Düsseldorf Herr Professor Sies ist ein international anerkannter Experte auf dem Gebiet reaktiver Sauerstoffmetaboliten und deren Rolle bei verschiedenen physiologischen</note>
</biblStruct>
Later in the document there is another reference list, which GROBID overlooked. The larger part of the document, including this list, was identified as a figure description instead.
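To inspect output like the excerpts above programmatically, the `biblStruct` elements can be read with Python's standard library. This is a minimal sketch: the function names and the "no publication year" rule are our own illustrative assumptions, not part of GROBID. The rule is a crude hint for spotting phantom entries such as b48, and it would also flag genuine references that simply lack a date.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # expanded name of xml:id


def parse_references(tei_xml):
    """Return one dict per <biblStruct> found in a TEI document or fragment."""
    root = ET.fromstring(tei_xml)
    for element in root.iter():
        # Drop the TEI namespace, if present, so plain tag names work below.
        element.tag = element.tag.split("}")[-1]
    references = []
    for bibl in root.iter("biblStruct"):
        date = bibl.find(".//date[@type='published']")
        raw = bibl.find("note[@type='raw_reference']")
        references.append(
            {
                "id": bibl.get(XML_ID),
                "surnames": [s.text for s in bibl.iter("surname")],
                "titles": [t.text for t in bibl.iter("title") if t.text],
                "year": date.get("when") if date is not None else None,
                "raw": raw.text if raw is not None else None,
            }
        )
    return references


def lacks_year(reference):
    """Hypothetical heuristic: entries without a publication year deserve a second look."""
    return reference["year"] is None


# Shortened versions of entries b0 and b48 from Example 1.
SAMPLE = """
<listBibl>
  <biblStruct xml:id="b0">
    <analytic>
      <title/>
      <author><persName><surname>Bissett</surname></persName></author>
      <author><persName><surname>Hannon</surname></persName></author>
      <author><persName><surname>Orr</surname></persName></author>
    </analytic>
    <monogr>
      <title level="j">Photochem Photobiol</title>
      <imprint><date type="published" when="1990">1990</date></imprint>
    </monogr>
    <note type="raw_reference">Bissett DL, Hannon DP, Orr TV: Photochem Photobiol 50: 763-769, 1990</note>
  </biblStruct>
  <biblStruct xml:id="b48">
    <monogr>
      <title level="m" type="main">Kooperation mit anderen Wissenschaftlern</title>
      <imprint/>
    </monogr>
    <note type="raw_reference">Kooperation mit anderen Wissenschaftlern</note>
  </biblStruct>
</listBibl>
"""

references = parse_references(SAMPLE)
flagged = [ref["id"] for ref in references if lacks_year(ref)]
```

On this sample, only the phantom entry ends up in `flagged`; the Bissett reference passes because it carries a `when` value.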
Example 2
The next case is a report that cites no literature and has no reference list. GROBID misidentifies an ordinary text paragraph summarizing the project finances as a reference list with two entries:
<div type="references">
<listBibl>
<biblStruct xml:id="b0">
<analytic>
<title level="a" type="main">die im Projekt für Personalmittel [...]</title>
</analytic>
<monogr>
<title level="j">Workstations und den WAP-Cluster der Fachrichtung Chemie sowie die Nutzung diverser Software-Campuslizenzen im Netz</title>
<imprint>
<biblScope unit="volume">52</biblScope>
<date type="published" when="0782">1997 10.782</date>
</imprint>
</monogr>
<note>insgesamt also 38.742,98 DM ausgegeben wurden gegenüber 39.000,00 DM räumlich in den Gebäuden des Instituts für Physikalische Chemie</note>
<note type="raw_reference">und in 1997 10.782,52, [...]</note>
</biblStruct>
<biblStruct xml:id="b1">
<monogr>
<title level="m" type="main">Haberland Projektleiter Anlagen: Statistischer Bericht zum Zwischennachweis mit Anlage Stundennachweise interner und externer Lehrkräfte Formblatt Teilnahmenachweis "Endnutzerförderung) Kopien Personal-und Vorlesungsverzeichnis der Ernst-Moritz</title>
<author>
<persName><surname>Prof</surname></persName>
</author>
<author>
<persName><forename type="middle">D</forename><surname>Dr</surname></persName>
</author>
<imprint/>
<respStmt>
<orgName>Arndt-Universität</orgName>
</respStmt>
</monogr>
<note type="raw_reference">Prof. Dr. D. Haberland Projektleiter Anlagen: Statistischer Bericht zum Zwischennachweis mit Anlage Stundennachweise interner und externer Lehrkräfte Formblatt Teilnahmenachweis "Endnutzerförderung) Kopien Personal-und Vorlesungsverzeichnis der Ernst- Moritz-Arndt-Universität</note>
</biblStruct>
</listBibl>
</div>
Example 3
This example has a well-formatted alphabetical reference list of three pages at the very end of the document. GROBID quite successfully identifies all 23 entries and the parsing looks good:

<biblStruct xml:id="b22">
<monogr>
<author>
<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Walters</surname></persName>
</author>
<author>
<persName><forename type="first">B</forename><forename type="middle">N</forename><surname>Powell</surname></persName>
</author>
<title level="m">Influence of Feldspar chemistry on Relative Rates of Dissolution. Fourth Int. Symposium on Water-Rock Interaction</title>
<imprint>
<date type="published" when="1983">1983</date>
<biblScope unit="page" from="582" to="584" />
</imprint>
</monogr>
<note type="raw_reference">WALTERS, J.P. & POWELL, B.N. (1983): Influence of Feldspar chemistry on Relative Rates of Dissolution. Fourth Int. Symposium on Water-Rock Interaction, 582-584</note>
</biblStruct>
Example 4
The referencing in the next example is much more challenging. There is no reference list but a few scattered footnotes which cite literature.

GROBID was not able to recognize any of the three such footnotes but mistakenly annotated a list of project participants as a reference list.
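Misidentified passages sometimes betray themselves through implausible field values, such as the publication date `when="0782"` that GROBID derived from a sum of money in Example 2. A simple plausibility check on the year can catch such cases. This is only a sketch: the function name and the bounds (1500 as the earliest year, next year as the latest) are arbitrary assumptions of ours.

```python
from datetime import date


def plausible_year(when, earliest=1500, latest=None):
    """Return True if a TEI 'when' value starts with a believable publication year.

    Assumption: 'when' is an ISO-style string such as "1990" or "2017-08-15".
    The default bounds are illustrative, not taken from any standard.
    """
    if latest is None:
        latest = date.today().year + 1  # leave room for in-press items
    if not when or not when[:4].isdigit():
        return False
    return earliest <= int(when[:4]) <= latest
```

A check like this cannot confirm that an entry is a real reference; it can only flag entries that are worth a manual look.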
Example 5
This document has two consecutive reference lists, one for the publications that resulted from the project and one for the other cited literature. GROBID does catch both of these lists. However, it does not accurately segment the first list into individual entries. Moreover, it incorrectly recognizes other parts of the document after the end of the reference list as belonging to that list.

<biblStruct xml:id="b0">
<analytic>
<title level="a" type="main">Publikationen Tolerances of Deinococcus geothermalis Biofilms and Planktonic Cells Exposed to Space and Simulated Martian Conditions in Low Earth Orbit for Almost Two Years</title>
<author>
<persName><forename type="first">C</forename><surname>Panitz</surname></persName>
</author>
[...]
<author>
<persName><forename type="first">R</forename><surname>Willnecker</surname></persName>
</author>
<idno type="DOI">10.1089/ast.2018.1913</idno>
<idno>doi: 10.3389/fmicb.2017.01533</idno>
</analytic>
<monogr>
<title level="m">Frontiers of Microbiology</title>
<imprint>
<date type="published" when="2017">2019. 2017. 2017. 2017</date>
<biblScope unit="volume">19</biblScope>
<biblScope unit="page">1533</biblScope>
</imprint>
</monogr>
<note>Survival of Deinococcus geothermalis in biofilms under desiccation and simulated space and Martian conditions Astrobiol. Astrobiology METHODS published: 15 August</note>
<note type="raw_reference">II.6.1 Publikationen Tolerances of Deinococcus geothermalis Biofilms and Planktonic Cells Exposed to Space and Simulated Martian Conditions in Low Earth Orbit for Almost Two Years. Panitz, C., Frösler, J., Wingender, J., Flemming, H.-C., & Rettberg, P. (2019). Astrobiol. 19 (7), 1-15, DOI: 10.1089/ast.2018.1913. Survival of Deinococcus geothermalis in biofilms under desiccation and simulated space and Martian conditions. Frösler J., Panitz C., Wingender J., Flemming H.-C., Rettberg P. (2017): Astrobiology, 17(5): 431-447 EXPOSE-R2: The Astrobiological ESA Mission on Board of the International Space Station. Rabbow E., Rettberg P., Parpart A., Panitz C., Schulte W., Molter F., Jaramillo E., Demets R., Weiß P., Willnecker R. (2017): Frontiers of Microbiology, METHODS, published: 15 August 2017doi: 10.3389/fmicb.2017.01533, Volume 8 | Article 1533</note>
</biblStruct>
Summary
Considering the complexity and diversity of the structure and content of these project reports, the preliminary results achieved by GROBID’s reference extraction are quite encouraging. GROBID, without any additional training data for such reports, did not usually outright fail. It identified both more typical and more unusual reference lists quite well. However, it also overlooked reference lists and misrecognized parts of documents that are not reference lists as such. As one might have expected, the more conventional the referencing, the more successful the extraction. The most difficult part seems to be the correct identification of reference lists. When these were correctly identified, the segmentation into individual references and the parsing into reference elements worked reasonably well. Overall, these results indicate potential to extend the use of open source cited reference extraction to para-scientific texts with non-standard formatting. However, there is much room for improvement in extraction quality by supplying custom training data.
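Segmentation failures like the merged entry in Example 5 can sometimes be spotted after the fact, because a single raw reference string then contains more than one DOI. The following sketch illustrates this; the regular expression is a rough approximation of DOI syntax and the function names are hypothetical. It says nothing about merged entries that carry no DOIs.

```python
import re

# Rough approximation of a DOI: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s,;]+")


def distinct_dois(raw_reference):
    """Return the sorted set of DOI-like strings in a raw reference."""
    # Strip a trailing full stop that belongs to the sentence, not the DOI.
    return sorted({match.rstrip(".") for match in DOI_PATTERN.findall(raw_reference)})


def probably_merged(raw_reference):
    """Hypothetical heuristic: more than one DOI suggests several references in one entry."""
    return len(distinct_dois(raw_reference)) > 1


# Shortened raw reference strings from Example 5 and Example 1.
merged = (
    "Panitz, C., Frösler, J. (2019). Astrobiol. 19 (7), 1-15, "
    "DOI: 10.1089/ast.2018.1913. Survival of Deinococcus geothermalis in biofilms. "
    "published: 15 August 2017doi: 10.3389/fmicb.2017.01533, Volume 8 | Article 1533"
)
single = "Bissett DL, Hannon DP, Orr TV: Photochem Photobiol 50: 763-769, 1990"
```

Here `probably_merged(merged)` is true and `probably_merged(single)` is false, so the check would single out the Example 5 entry for manual re-segmentation.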
References
Footnotes
A concept coined by Alex Fenton. This metaphor attempts to bring across the asymmetry of citation relationships between the ‘busy center’ of scientific research publication activity with its dense inter-citation network and those peripheral hinterland areas of published output which cite scientific research but are rarely cited by scientific research. It also evokes a certain notion of these areas being bibliometrically neglected.
Citation
@online{donner2025,
author = {Donner, Paul},
title = {Towards Open Citation Data from Research Project Reports},
date = {2025-10-30},
url = {http://www.open-bibliometrics.de/posts/20251030-TowardsOpenCitationData/},
langid = {en}
}