Automating XML-TEI Encoding of Unpublished Correspondence

Marco De Cristofaro
2025-01-01

Abstract

Encoding texts in XML-TEI format is critical for research projects that must deal with copyright and open access issues. The PubCiNET project reconstructs the social network of Italian intellectuals in publishing and film between the 1950s and 1980s. Over this period, exchanges and collaborations between professionals in these creative industries increased, shaping the convergence of literature and film, film professionals’ engagement in publishing, and the perception of publishing as a prestigious field for filmmakers. The project uses an XML-TEI encoded corpus of archival correspondence to map this intellectual network. Three key challenges arise: the complex retrieval of data from vertically structured archives, copyright restrictions due to the contemporary timeframe, and the sustainability of handling vast volumes of documents. This study makes a first attempt to address these challenges by applying automated text encoding with large language models (LLMs). It explores automated encoding using ChatGPT-4 and Claude 3.5 Sonnet, analyzing their capacity to enhance access to archives and to automate the labor-intensive encoding process. Initial findings indicate varying success rates: both LLMs extract metadata efficiently, but they differ in their ability to recognize information in the text of the letters. Improving their information recognition and the reliability of reference materials could make encoding faster and more efficient, and research more sustainable.
Files in this item:
AIUCD2025_De Cristofaro.pdf
Access: open access
Type: Publisher's version
License: Creative Commons
Size: 2.19 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11587/570051