the oupoco database of french sonnets from the 19th century data paper corresponding author: thierry poibeau lattice, cnrs & ecole normale supérieure/psl and université sorbonne nouvelle, paris, fr thierry.poibeau@ens.psl.eu keywords: french poetry; sonnets; poetry generation; oulipo; oupoco to cite this article: mélanie-becquet, f., grunspan, c., maignant, m., plancq, c., & poibeau, t. (2022). the oupoco database of french sonnets from the 19th century. journal of open humanities data, 8: 25, pp. 1–5. doi: https://doi. org/10.5334/johd.89 the oupoco database of french sonnets from the 19th century frédérique mélanie-becquet claude grunspan mylène maignant clément plancq thierry poibeau abstract the oupoco database is a collection of 4,872 french sonnets developed in the framework of the oupoco project. it is mainly composed of poems from the 19th and early 20th century. the sonnets come from different sources from the internet and from a collaboration with the bibliothèque nationale de france. every sonnet has a specific license (depending on the source it comes from), but the whole collection can be reused for free (under the creative commons attribution 4.0 international license). *author affiliations can be found in the back matter of this article mailto:thierry.poibeau@ens.psl.eu https://doi.org/10.5334/johd.89 https://doi.org/10.5334/johd.89 https://orcid.org/0000-0003-3669-4051 2mélanie-becquet et al. journal of open humanities data doi: 10.5334/johd.89 (1) overview repository location https://doi.org/10.5281/zenodo.5646939. context the oupoco database is a collection of 4,872 french sonnets developed in the framework of the oupoco project (poibeau et al., 2020). the database is mainly composed of poems from the 19th and early 20th century. we have identified 760 authors: 4,414 sonnets written by men (660 authors), 439 sonnets written by women (107 authors), which leaves 19 sonnets to which we have not been able to assign an author. the sonnets come from different sources from the internet, or not: we especially want to thank the bibliothèque nationale de france (bnf) (french national library) that gave us access to a large corpus, from which we were able to extract an invaluable number of french poems. to all the sonnets is attached a specific license related to the source they come from, but all are freely available and can be reused for free (under the creative commons attribution 4.0 international license). this database has initially been developed for the oupoco project (l’ouvroir de littérature combinatoire),1 which consists in producing new sonnets by recombining verses from existing ones from the french literature, following the idea put forward by queneau in his famous conceptual book cent mille milliards de poèmes (queneau, 1961). different scripts have been developed for the oupoco project (to analyse the rhymes and recombine the verses), which are available on the github repository linked to the project.2 beyond oupoco, this database can be used for various purposes, for teaching and for research, especially in the following domains: literature studies, corpus linguistics, digital humanities, arts and technology. (2) collection method sources the sonnets were collected from five different sources: bnf directly provided us a collection of books containing poetry, in the xml alto format.3 we selected the books with an ocr quality score above 98% and used the bnf api to collect metadata about the books. 
We then identified and automatically retrieved the sonnets from these books, mainly through the pattern of two quatrains followed by two tercets (a minimal sketch of this kind of stanza-pattern check is given at the end of this section). There was no exhaustive verification of the extraction process, so this corpus may contain poems that are not sonnets [4]. Metadata were then collected from the BnF, except the title of the sonnet and the page number, which were retrieved automatically. Here again, errors can probably be found, as the process was automatic with no comprehensive quality check. Texts from Wikisource were taken as is, with no added information. Sonnets coming from the web were manually collected at the start of the project. We made sure these sonnets are not subject to copyright.

Notes:
1. https://oupoco.org/ (last accessed: 14/10/2022).
2. https://github.com/lattice-8094/oupoco-api (last accessed: 14/10/2022).
3. https://www.loc.gov/standards/alto/ (last accessed: 14/10/2022).
4. Errors can be reported to info@oupoco.org.

Source / number of sonnets / license or comments:
– BnF: 3,979 (CC-BY-SA-NC)
– Wikisource: 772 (CC BY-SA 3.0)
– Web: 67 (source blog cited in the database)
– Books (anthologies): 37 (manual collections of sonnets from different anthologies)
– Malherbe project: 7 (no explicit license; https://git.unicaen.fr/malherbe/corpus)

Anthologies have been used more recently to augment the corpus, with a specific focus on female authors (as Wikisource, for example, is highly unbalanced towards male authors). The identified sonnets were all copyright free. The anthologies used are:
• H. Blanvalet (1856). Femmes poëtes de la France – anthologie. Paris: J. Kessmann éd.
• Rachilde (1908). Le Missel de Notre-Dame des Solitudes. E. Sansot, Paris.
• Le Comte de Saint-Jean (Mme Eugène Riom) (1892). Les femmes poètes bretonnes. Nantes: Société des Bibliophiles bretons et de l'Histoire de Bretagne.
• Alphonse Séché (1908). Les Muses françaises – anthologie des femmes poètes (1200 à 1891). Paris: Louis-Michaud éd. (two volumes).

The last source is the Malherbe corpus. Its contribution is marginal, as only seven sonnets come from this database (which covers other kinds of French poems, most of them not sonnets). The Malherbe database has been developed in the framework of the Malherbe project by Éliane Delente and Richard Renault. The goal of the project was to provide an overview of the diversity of French versification, as well as automatic tools related to this issue. The corpus and related information can be found at https://git.unicaen.fr/malherbe/corpus. The poems in this collection are also copyright free and, although no license is attached to the project at the time of writing, the authors confirmed that the database can be used free of charge as long as the original repository is mentioned.

Sampling strategy
The collection is provided as is. No sampling strategy has been used. The corpus is thus highly unbalanced, especially between authors (a few authors provided many sonnets, while many others provided only a few).

Quality control
The process used to collect the sonnets and the metadata was highly automated, so errors can be found (poems that are in fact not sonnets, errors in the metadata, or errors due to the OCR).
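To make the extraction heuristic mentioned above concrete, here is a minimal sketch of a stanza-pattern filter that flags a poem as a sonnet candidate when it consists of two quatrains followed by two tercets. This is only an illustration: the function name and the assumption that stanzas are separated by blank lines are ours, and the actual Oupoco extraction scripts (available in the project's GitHub repository) may work differently.

```python
def is_sonnet_candidate(poem_text: str) -> bool:
    """Return True if the poem shows the 4-4-3-3 stanza pattern of a sonnet.

    Assumes stanzas are separated by blank lines; this is an illustrative
    heuristic, not the project's actual extraction code.
    """
    stanzas = [s for s in poem_text.split("\n\n") if s.strip()]
    line_counts = [len([l for l in s.splitlines() if l.strip()]) for s in stanzas]
    return line_counts == [4, 4, 3, 3]


example = (
    "Verse 1\nVerse 2\nVerse 3\nVerse 4\n\n"   # first quatrain
    "Verse 5\nVerse 6\nVerse 7\nVerse 8\n\n"   # second quatrain
    "Verse 9\nVerse 10\nVerse 11\n\n"          # first tercet
    "Verse 12\nVerse 13\nVerse 14"             # second tercet
)
print(is_sonnet_candidate(example))  # True
```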
we are however confident that these errors remain marginal, from an extensive quality check performed randomly on a sample of the data. (3) dataset description object name oupoco.dtd and sonnets_oupoco_tei.xml format names and versions xml unicode files. creation dates the project began in 2018. the current version has been published 2022-06-29. dataset creators frédérique mélanie-becquet: conceptualization, data curation and supervision; claude grunspan: data curation; mylène maignant: conceptualization and data curation; clément plancq: conceptualization; thierry poibeau: funding acquisition, supervision and writing. language french. license creative commons attribution 4.0 international. https://git.unicaen.fr/malherbe/corpus 4mélanie-becquet et al. journal of open humanities data doi: 10.5334/johd.89 repository name zenodo (https://doi.org/10.5281/zenodo.5646939). publication date 2022-06-29. (4) reuse potential this database can be used by anyone interested in french poetry, for teaching and for research, especially in the following domains: literature studies, corpus linguistics, digital humanities, arts and technology (for this purpose, a short video has been released to explain the oupoco project, see lattice and riva illustrations, 2021). the corpus is large enough to be analysed with machine learning methods for stylometric studies, for example. one direct perspective would be to apply authorship attribution methods to the anonymous sonnets, so as to propose potential authors to these poems. acknowledgements we want to thank the bibliothèque nationale de france that provided us with a corpus related to french poetry, from which we were able to extract an invaluable number of french poems. funding statement this work has received support of translitteræ (ecole universitaire de recherche, program “investissements d’avenir” anr-10-idex-0001-02 psl* and anr-17-eure-0025) and of cnrs through the irn (international research network) cyclades. this work was also supported in part by the french government under management of agence nationale de la recherche as part of the “investissements d’avenir” program, reference anr19-p3ia-0001 (prairie 3ia institute). competing interests the authors have no competing interests to declare. author contributions frédérique mélanie-becquet: conceptualization, data curation and supervision; claude grunspan: data curation; mylène maignant: conceptualization and data curation; clément plancq: conceptualization; thierry poibeau: conceptualization, funding acquisition, supervision and writing. publisher’s note the link to the repository location of this dataset has been updated to include the doi. author affiliations frédérique mélanie-becquet lattice, cnrs & ecole normale supérieure/psl and université sorbonne nouvelle, paris, fr claude grunspan lattice, cnrs & ecole normale supérieure/psl and université sorbonne nouvelle, paris, fr mylène maignant lattice, cnrs & ecole normale supérieure/psl and université sorbonne nouvelle, paris, fr clément plancq lattice, cnrs & ecole normale supérieure/psl and université sorbonne nouvelle, paris, fr thierry poibeau orcid.org/0000-0003-3669-4051 lattice, cnrs & ecole normale supérieure/psl and université sorbonne nouvelle, paris, fr https://doi.org/10.5281/zenodo.5646939 https://orcid.org/0000-0003-3669-4051 5mélanie-becquet et al. journal of open humanities data doi: 10.5334/johd.89 to cite this article: mélanie-becquet, f., grunspan, c., maignant, m., plancq, c., & poibeau, t. (2022). 
the oupoco database of french sonnets from the 19th century. journal of open humanities data, 8: 25, pp. 1–5. doi: https://doi. org/10.5334/johd.89 published: 15 november 2022 copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. references lattice and riva illustrations. (2021). oupoco, la boîte à poésie (video). retrieved from https://odhn.ens. psl.eu/newsroom/oupoco-la-boite-poesie, on the observatoire des humanités numériques (odhn) de l’ens-psl web site (last accessed: 16 october 2022). poibeau, t., maignant, m., mélanie-becquet, f., plancq, c., raffard, m., & roussel, m. (2020). sonnet combinatorics with oupoco. proceedings of the the 4th joint sighum workshop on computational linguistics for cultural heritage, social sciences, humanities and literature, 133–137. queneau, r. (1961). cent mille milliards de poèmes. paris: gallimard. https://doi.org/10.5334/johd.89 https://doi.org/10.5334/johd.89 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://odhn.ens.psl.eu/newsroom/oupoco-la-boite-poesie https://odhn.ens.psl.eu/newsroom/oupoco-la-boite-poesie text recognition for nepalese manuscripts in pracalit script data paper corresponding author: alexander james o’neill department of east asian languages and cultures, soas university of london, london, uk ao34@soas.ac.uk keywords: handwritten text recognition; pylaia; transkribus; sanskrit; newar; manuscripts to cite this article: o’neill, a. j., & hill, n. (2022). text recognition for nepalese manuscripts in pracalit script. journal of open humanities data, 8: 26, pp. 1–6. doi: https://doi.org/10.5334/ johd.90 text recognition for nepalese manuscripts in pracalit script alexander james o’neill nathan hill abstract this dataset is a model for handwritten text recognition (htr) of sanskrit and newar nepalese manuscripts in pracalit script. this paper introduces the state of the field in newar literature, newar manuscripts, and htr engines. it explains our methodology for developing the requisite ground truth consisting of manuscript images and corresponding transcriptions, training our model with a pylaia engine, and this model’s limitations. this dataset shared on zenodo can be used by anyone working with manuscripts in pracalit script, which will benefit the fields of indology and newar studies, as well as historical and linguistic analysis. *author affiliations can be found in the back matter of this article mailto:ao34@soas.ac.uk https://doi.org/10.5334/johd.90 https://doi.org/10.5334/johd.90 https://orcid.org/0000-0001-9982-2589 https://orcid.org/0000-0001-6423-017x 2o’neill and hill journal of open humanities data doi: 10.5334/johd.90 (1) overview repository location https://doi.org/10.5281/zenodo.6967421. context newar (also referred to as nepāl bhāṣā) is the indigenous language of the kathmandu valley. in its pre-print phase, this highly literate and creative culture produced thousands of works that have remained mainly unstudied in either western or nepalese scholarship. much of newar literature is a mixture of newar, sanskrit, and maithili (malla, 1981, 6–9). 
while newar literature is written in various scripts, the most common by far is the pracalit script, which has thus also come to be known as newar lipi (newar script) (pandey, 2012). thus, for both indological interest in nepalese manuscripts written in sanskrit and for students of newar language and culture, a means to compile a digital corpus more quickly through optical character recognition (ocr) becomes apparent. ocr engines have gradually become more effective in recent decades. handwritten text recognition (htr) has proven to be far more problematic. deep learning neural networks have made it possible to build htr models based on images of handwritten text linked with corresponding transcriptions (called “ground truth”). a character error rate (cer) under 10% allows for effective automatic transcription (muehlberger et al., 2019). advances in computing power and storage made by the transkribus platform developed by read-coop have enabled the training of large data sets involving multiple hands, allowing for generalised htr models for particular writing styles (hodel et al., 2021). transkribus hosts two htr engines: citilab-htr+ (michael et al., 2018) and pylaia, a pytorch-based model (mocholí calvo et al., 2018). in principle, models for htr of indic texts can be developed similarly to those in roman scripts. transkribus already has two publicly available htr+ models for printed 19th and 20th century devanagari developed by nicole merkel-hilf (2022). this project focused on expanding the abilities of htr models to indic texts in pre-print and non-devanagari sources, focusing on sanskrit and newar (nepāl bhāṣā) manuscripts in pracalit script from the 16th to 19th centuries. (2) method an htr trainer requires diplomatic transcriptions of pracalit manuscripts to line up with text in manuscript photographs. critically edited editions can speed up transcription and ground truth generation through de-correction. databases like gretil, from which we sourced the published transcriptions, make it possible to bootstrap a non-existent htr model by using texts from other scripts (georg-august-universität göttingen, 2020). to this end, transcriptions were prepared based on the following four nepalese manuscripts, each with different varieties of pracalit script. for each entry in the list below, in order, the manuscript title is given in italics followed by call numbers in parentheses, deposit location, manuscript languages and date, and sources of the corresponding transcriptions: 1. hitopadeśa (mik i 4851) staatsbibliothek zu berlin mixed newar and sanskrit, 1561 ce original transcription by alexander james o’neill 2. vetālapañcaviṃśati (hs. or. 6414) staatsbibliothek zu berlin newar, 1675 ce adapted transcription based on unpublished materials by felix otter (otter, n.d.a) 3. avalokiteśvaraguṇakāraṇḍavyūha (ms add. 1322) cambridge digital library sanskrit, 18th century adapted transcription based on an edition by lokesh chandra (chandra, 1999) 4. madhyamasvayaṃbhūpurāṇa (ras hodgson ms 23) royal asiatic society online collection mixed newar and sanskrit, c. 
1800
Adapted transcription based on unpublished materials by Felix Otter (Otter, n.d.b) and the published Nagarjuna Institute transcription (Shakya & Bajracharya, 2001).

While the HTR+ engine appeared to have difficulty working with the lack of word division, PyLaia produced better results, and we used it for the rest of the training. We trained the model on 441 pages of manual transcriptions of the above four manuscripts, with validation performed on 242 pages that were not part of the training set. It was further tested and continues to be used on pages that were not part of the training or validation sets. We decided it would be most appropriate and culturally sensitive to transcribe into Unicode Pracalit (Unicode, Inc., 2021), see Figure 1. Using 250 epochs, Transkribus trained a model with a CER of 2.6% on the training set and 0.1% on the validation set (a minimal sketch of how such a CER can be computed is included in the code example below). This discrepancy may signify little more than that the latter had fewer complex characters to recognise. Therefore, the model produces accurate results when transcribing the same or similar hands to those responsible for these four manuscripts, see Figure 2.

[Figure 1: Screenshot of a completed transcription of a folio of Hitopadeśa (MIK I 4851) in Transkribus.]
[Figure 2: Screenshot of the model's learning curve on Transkribus.]

Quality control
The model has a higher CER when applied to irregular forms of Pracalit script, including more ornate or rougher hands (Figure 3). However, with a trained base model, new hands require significantly fewer pages, ranging from ten to thirty pages of new ground truth. We will update and refine the model with new ground truth as we encounter variant hands. The main limitation of this model's initial and continued training is the lack of transcriptions. However, bootstrapping existing editions and transcriptions and feeding corrected machine-generated transcriptions back into the model are workable solutions. In transcription, the model encounters difficulties with damaged or soiled manuscripts, irregular spacing, punctuation, and illustrations interrupting the text. It is worth noting that, while the vast majority of Pracalit manuscripts are written in scriptio continua, occasional spacing and irregular punctuation conventions produce mixed results for the model. While mistakes in ground truth produce incorrect transcriptions, a larger mass of correct ground truth reduces the impact of any one mistake.

(3) Dataset description
Object name – OCR Model for Pracalit for Sanskrit and Newar MSS 16th to 19th c., Ground Truth
Format names and versions – PNG and XML
Creation dates – 2022-04-01 – 2022-08-04
Dataset creators – Alexander James O'Neill, SOAS University of London: data curation, formal analysis, investigation, methodology, validation, visualization
Language – Sanskrit and Newar
License – Creative Commons Attribution 4.0 International
Repository name – Zenodo
Publication date – 2022-08-05

(4) Reuse potential
While it is possible to share models within Transkribus, this has limited potential for the shared creation of ground truth. As modelled by the GitHub collection "HTR United," which combines ground truth for French documents (Chagué & Clérice, 2021), it is possible to make ground truth datasets available in ways that others can use within platforms such as Transkribus and elsewhere.
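Since ground truth of this kind is distributed as paired page images and XML transcriptions, a short sketch of how such files might be consumed outside Transkribus, together with the character error rate referred to above, may be useful. The XML parsing below assumes a PAGE-style export in which each text line carries a Unicode element; the file paths are illustrative and the exact element layout should be checked against the Zenodo release.

```python
import xml.etree.ElementTree as ET


def read_page_lines(xml_path: str) -> list:
    """Extract line transcriptions from a PAGE-style XML export.

    Matches any element whose tag ends in 'Unicode', so it does not depend
    on the exact PAGE namespace version used by the export.
    """
    root = ET.parse(xml_path).getroot()
    return [el.text for el in root.iter()
            if el.tag.endswith("Unicode") and el.text and el.text.strip()]


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


# Illustrative paths: a ground-truth page and an automatic transcription of it.
gt_lines = read_page_lines("ground_truth/page_0001.xml")
pred_lines = read_page_lines("predictions/page_0001.xml")
print(f"CER: {cer(chr(10).join(gt_lines), chr(10).join(pred_lines)):.2%}")
```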
we have therefore made our dataset publicly available on zenodo in the form of png and xml files that can be used on htr platforms (o’neill, 2022). for the future, in collaboration with the centre of asian and transcultural studies (cats) bibliothek at the university of heidelberg, we are participating in the development of a south asian studiesspecific ground truth database in a fid4sa (fachinformationsdienst für südasien: specialised information service for south asia) dataverse, called “ground truth data for htr on south asian scripts,” as part of the university of heidelberg’s research data archive heidata (universität heidelberg, 2022). as the most labour-intensive part of philological practice, the ability to quickly produce machinereadable transcriptions of various witnesses of an indic text is of great value to indology and other disciplines. this enables high-speed searches and comparisons of corpora, as well as linguistic analysis through machine-learning methods (meelen et al., 2021). in disciplines such as newar studies, where there is both a paucity of trained scholars and a profusion of manuscripts, this tool can contribute to easing the burden of compiling and editing a digital corpus, which will benefit linguistic, literary, and historical analysis of the newar language by easing the burden of work with primary manuscript sources. figure 3 an example of a cruder form of pracalit, from vetālapañcaviṃśati (hs. or. 6414), transcribed on transkribus. 5o’neill and hill journal of open humanities data doi: 10.5334/johd.90 acknowledgements we would like to extend our thanks to felix otter (philipps-universität marburg) for providing us with transcriptions. funding information this work was funded by the arts and humanities research council (ahrc), ukri, as part of the project “the emergence of egophoricity: a diachronic investigation into the marking of the conscious self.” project reference: ah/v011235/1. principal investigator: nathan hill, soas university of london. competing interests the authors have no competing interests to declare. author contributions alexander james o’neill: data curation, formal analysis investigation, methodology, validation, visualization, writing – original draft, writing – review & editing. nathan hill: conceptualization, funding acquisition, methodology, project administration, supervision, writing – review & editing. author affiliations alexander james o’neill orcid.org/0000-0001-9982-2589 department of east asian languages and cultures, soas university of london, london, uk nathan hill orcid.org/0000-0001-6423-017x department of east asian languages and cultures, soas university of london, london, uk; trinity centre for asian studies, trinity college dublin, dublin, ireland references chandra, l. (ed.) (1999). guṇakāraṇḍavyūhasūtram. international academy of indian culture. chaqué, a., & clérice, t. (2021). htr-united. github. https://github.com/htr-united/htr-united (last accessed: 9 november 2022). georg-august-universität göttingen. (2020). gretil: göttingen register of electronic texts in indian languages and related indological materials from central and southeast asia. gretil. retrieved from http://gretil.sub.uni-goettingen.de/gretil.html (last accessed: 22 august 2022). hodel, t., schoch, d., schneider, c., & purcell, j. (2021). general models for handwritten text recognition: feasibility and state-of-the-art. german kurrent as an example. journal of open humanities data, 7(13), 1–10. doi: https://doi.org/10.5334/johd.46 malla, k. p. 
(1981). classical newari literature. nepal study centre. meelen, m., roux, e., & hill, n. (2021). “optimisation of the largest annotated tibetan corpus combining rule-based, memory-based, and deep-learning methods.” acm transactions on asian and lowresource language information processing, 20(1), 1–11. doi: https://doi.org/10.1145/3409488 merkel-hilf, n. (2022). ground truth data for printed devanagari [dataset]. in fid4sa@heidata. doi: https://doi.org/10.11588/data/egokei michael, j., weidemann, m., & labahn, r. (2018). htr engine based on nns p3: optimizing speed and performance htr+ [deliverable 7.9 for read project funded by eu horizon 2020 project 674943]. read-coop. retrieved from https://readcoop.eu/wp-content/uploads/2018/12/del_d7_9.pdf (last accessed: 8 november 2022). mocholí calvo, c., vidal ruiz, e., & puigcerver i pérez, j. (2018). development and experimentation of a deep learning system for convolutional and recurrent neural networks [degree final work]. universitat politècnica de valència. retrieved from https://riunet.upv.es/bitstream/ handle/10251/107062/mochol%c3%8d%20-%20desarrollo%20y%20experimentaci%c3%b3n%20 de%20un%20sistema%20de%20aprendizaje%20profundo%20para%20redes%20neuronale.... pdf?sequence=1&isallowed=y (last accessed: 8 november 2022). muehlberger, g., seawrd, l., terras, m., ares oliveira, s., bosch, v., bryan, m., culluto, s., déjean, h., diem, m., fiel, s., gatos, b., greinoecker, a., grüning, t., hackl, g., haukkovaara, v., heyer, g., hirvonen, l., hodel, t., jokinen, m., … zagoris, k. (2019). transforming scholarship in the archives https://orcid.org/0000-0001-9982-2589 https://orcid.org/0000-0001-9982-2589 https://orcid.org/0000-0001-6423-017x https://orcid.org/0000-0001-6423-017x https://github.com/htr-united/htr-united http://gretil.sub.uni-goettingen.de/gretil.html https://doi.org/10.5334/johd.46 https://doi.org/10.1145/3409488 https://doi.org/10.11588/data/egokei https://readcoop.eu/wp-content/uploads/2018/12/del_d7_9.pdf https://riunet.upv.es/bitstream/handle/10251/107062/mochol%c3%8d%20-%20desarrollo%20y%20experimentaci%c3%b3n%20de%20un%20sistema%20de%20aprendizaje%20profundo%20para%20redes%20neuronale....pdf?sequence=1&isallowed=y https://riunet.upv.es/bitstream/handle/10251/107062/mochol%c3%8d%20-%20desarrollo%20y%20experimentaci%c3%b3n%20de%20un%20sistema%20de%20aprendizaje%20profundo%20para%20redes%20neuronale....pdf?sequence=1&isallowed=y https://riunet.upv.es/bitstream/handle/10251/107062/mochol%c3%8d%20-%20desarrollo%20y%20experimentaci%c3%b3n%20de%20un%20sistema%20de%20aprendizaje%20profundo%20para%20redes%20neuronale....pdf?sequence=1&isallowed=y https://riunet.upv.es/bitstream/handle/10251/107062/mochol%c3%8d%20-%20desarrollo%20y%20experimentaci%c3%b3n%20de%20un%20sistema%20de%20aprendizaje%20profundo%20para%20redes%20neuronale....pdf?sequence=1&isallowed=y 6o’neill and hill journal of open humanities data doi: 10.5334/johd.90 to cite this article: o’neill, a. j., & hill, n. (2022). text recognition for nepalese manuscripts in pracalit script. journal of open humanities data, 8: 26, pp. 1–6. doi: https://doi.org/10.5334/ johd.90 published: 30 november 2022 copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. 
journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. through handwritten text recognition: transkribus as a case study. journal of documentation, 75(5), 954–976. doi: https://doi.org/10.1108/jd-07-2018-0114 o’neill, a. (2022). ocr model for pracalit for sanskrit and newar mss 16th to 19th c., ground truth [dataset]. in zenodo. doi: https://doi.org/10.5281/zenodo.6967421 otter, f. (n.d.a). vetālapañcaviṃśati [unpublished transcription]. otter, f. (n.d.b). madhyamasvayaṃbhūpurāṇa [unpublished transcription]. pandey, a. (2012). proposal to encode the newar script in iso/iec 10646 [proposal from the script encoding initiative]. escholarship. https://escholarship.org/uc/item/50c8w93x shakya, m. b., & bajracharya, s. h. (eds.) (2001). svayambhū purāṇa. nagarjuna institute of exact methods. unicode, inc. (2021). newa range: 11400–1147f [excepted character code tables for the unicode standard, version 14.0]. unicode. retrieved from https://www.unicode.org/charts/pdf/u11400.pdf (last accessed: 8 november 2022). universität heidelberg. (2022). ground truth data for htr on south asian scripts. fid4sa@heidata. retrieved from https://heidata.uni-heidelberg.de/dataverse/fid4sa-gt (last accessed: 9 november 2022). https://doi.org/10.5334/johd.90 https://doi.org/10.5334/johd.90 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://doi.org/10.1108/jd-07-2018-0114 https://doi.org/10.5281/zenodo.6967421 https://escholarship.org/uc/item/50c8w93x https://www.unicode.org/charts/pdf/u11400.pdf https://heidata.uni-heidelberg.de/dataverse/fid4sa-gt assessing attitudes towards wood in the context of family habits: potential for reuse of raw data from a questionnaire survey data paper corresponding author: istván fekete faculty of culture, media and psychology, macromedia university, cologne, de istvfekete@gmail.com keywords: environmental awareness; wood products; trees; surveys to cite this article: fekete, i., & kendöl, j. n. (2022). assessing attitudes towards wood in the context of family habits: potential for reuse of raw data from a questionnaire survey. journal of open humanities data, 8: 17, pp. 1–5. doi: https://doi. org/10.5334/johd.82 assessing attitudes towards wood in the context of family habits: potential for reuse of raw data from a questionnaire survey istván fekete jutka nmarné kendöl abstract the dataset covers a statistically representative quantitative questionnaire survey conducted with 430 children with the aim of assessing their knowledge, habits and attitudes towards wood and trees. the goal of the study was to examine the variables that affect environmental awareness and the use of wood in the future. the dataset contains 430 rows (participants) and 49 columns (variables). data has a reuse potential in the fields of childhood pedagogy and environmental education research but also opens new ways for demonstrating the formula of statistical representativeness or machine-learning analyses such as clustering techniques, associative learning, or decision trees. *author affiliations can be found in the back matter of this article mailto:istvfekete@gmail.com https://doi.org/10.5334/johd.82 https://doi.org/10.5334/johd.82 https://orcid.org/0000-0002-6734-3660 https://orcid.org/0000-0003-1478-4312 2fekete and kendöl journal of open humanities data doi: 10.5334/johd.82 (1) overview repository location link to the dataset: https://doi.org/10.6084/m9.figshare.20055500. 
Context
The dataset was produced at the University of Sopron, as part of the doctoral project of Jutka Nmarné Kendöl at the Roth Gyula Doctoral School of Forestry and Wildlife Management, and has not been used in another paper yet. Findings of the survey study are highly relevant in the field of environmental education research, as they investigate the primary predictors of future use of wood. Such predictors included knowledge about trees and wood, attitude towards trees and wood, knowledge about traditions related to trees, extracurricular activities, and habits related to trees and wood in both school and family contexts.

(2) Method
Steps
1. Pilot phase: the questions of the survey were first piloted on five respondents. The questions and the length of the questionnaire were tailored to the level of the age groups under investigation.
2. Software: questionnaires were administered online using Google Docs during normal classes.
3. Variables: after the sociodemographic variables in the questionnaire, questions followed on the respondents' habits in school and family, traditions, feelings, and willingness to use wood in the future.
4. Randomisation of items: the questions were not (pseudo-)randomised across respondents because we did not expect any order effects (e.g., some questions affecting response behaviour on questions appearing later in the questionnaire).
5. Debriefing: participants did not report any inconsistencies in the questionnaire in debriefings after completing it.
6. Statistical analysis: Google Docs generated an Excel file which was submitted to statistical analyses employing the R software (R Core Team, 2021).

Sampling strategy
Our questionnaire was completed by 230 male and 200 female students using non-random sampling, to ensure that the relevant sociodemographic variables (gender, age, school type, and size of settlement) were counterbalanced using stratification weighting. Given the high number of participants and the counterbalancing, the county-wide questionnaire survey is statistically representative of Győr-Moson-Sopron county. Based on data from the Central Statistical Office in Hungary, in Győr-Moson-Sopron county there were 33,996 pupils enrolled at primary schools and 7,507 at secondary schools in 2014 (Központi Statisztikai Hivatal, 2015). We relied on the data from 2014 and assumed a target population of around 41,500, a margin of error of 5%, and a confidence level of 95%. Given these conditions, and relying on the standard sample-size formula for statistical representativeness, 381 participants are needed for the study to be statistically representative (Daniel, 1999).

Quality control
Two raters performed a post-hoc plausibility check. An implausible questionnaire sample would be, for example, one that contains considerable missing data, or one completed by a participant without serious interest, a sign of which could be, for instance, the same response pattern across all questions. No such respondent was identified. Implausible values (e.g., an age of 36) were removed, with negligible data loss due to implausibility.

(3) Dataset description
Object name
The dataset is called "Survey data of children's attitudes towards trees and the use of wood". The dataset can be cited as follows: Fekete, István; Nmarné Kendöl, Jutka (2022): Survey data of children's attitudes towards trees and the use of wood.
figshare. dataset. https://doi. org/10.6084/m9.figshare.20055500. format names and versions the dataset is available in .csv creation dates the survey was carried out in april, may, and june in 2021. dataset creators the doctoral candidate jutka nmarné kendöl created the questionnaire and collected the data. colleagues of the roth gyula doctoral school were consulted in methodological questions. language the names of the variables as well as the levels of the factor variables have been translated into english. given that the variable names are abbreviated, there is a list of the variable names on figshare with the descriptive statistics under the name “descriptive statistics”. license data has been deposited under cc by 4.0 license. repository name the data that support the findings of this study are openly available in the figshare repository at https://figshare.com/projects/assessing_attitudes_towards_wood_in_the_context_of_family_ habits_a_large-scale_quantitative_study_in_hungary/132230. the owner of the dataset is jutka nmarné kendöl. publication date the dataset was published in the figshare repository on 10 february 2022. (4) reuse potential given the high number of 49 variables, the sample size of 430 participants, and the statistically representativeness of our survey, the data allows further advanced statistical analysis: 1. data can be analysed using data-mining approaches such as associative learning algorithms (e.g., market-basket analysis) to reveal participant groupings based on the variables (patwary, eshan, debnath & sattar 2021). association chains can be extracted, for example, between sociodemographic variables, habits, behaviour, and attitudes (for the variables, see appendix). such an analytical framework could reveal hidden associations between variables and could help generate further research questions (e.g., what is the intimate relationship between family, school, knowledge, habits, and various aspects of attitude towards wood and trees?). 2. data can be used as an example dataset in statistics to demonstrate multiple cluster analysis techniques. cluster analyses can be run either on the variables or on the participants to explore how (i) the variables and (ii) the participants group together, respectively. for instance, via clustering, groups of outliers (participants) can be identified as well as the reasons of being an outlier. thereby, interventions in environmental education can be proposed. further, it can be asked which variables group together (e.g., do variables related to family and school cluster together?). https://doi.org/10.6084/m9.figshare.20055500 https://doi.org/10.6084/m9.figshare.20055500 https://figshare.com/projects/assessing_attitudes_towards_wood_in_the_context_of_family_habits_a_large-scale_quantitative_study_in_hungary/132230 https://figshare.com/projects/assessing_attitudes_towards_wood_in_the_context_of_family_habits_a_large-scale_quantitative_study_in_hungary/132230 4fekete and kendöl journal of open humanities data doi: 10.5334/johd.82 3. given the high number of variables, data can be submitted to decision tree models such as conditional inference trees or random forests (e.g., hothorn, hornik & zeileis 2006; katuwal, suganthan & zhang 2020). the idea is to select an outcome variable (e.g., “the importance of wood and trees”) and use a high number of independent variables to explain or predict the outcome variable. 4. dataset is suitable for teaching statistical representativeness. 5. 
dataset allows for further analyses mainly in the fields of childhood pedagogy and environmental education. specifically, new aspects of environmental pedagogy, environmental education, sustainable development, climate protection, sylviculture, environmental awareness of families, adult environmental education, and education policies can also be investigated from the perspective of environmental awareness. for instance, research into climate protection can benefit from further analyses on the dataset to gain insights about the amount of wood and trees used in families. this issue is highly relevant, as the use of wood as a raw material can bind coal for decades and even centuries. second, adult environmental education can profit from further analysis by examining the level of environmental awareness about wood and trees in adults. in light of the findings, new steps in environmental education can be implemented via advertisement, media, and environmental programs to raise the environmental awareness of adults, and to spread or increase the use of wood. third, information about the amount of wood and trees used in families could be useful for sylviculture, as the amount of trees to be planted has to be planned in advance. one of the limitations can be that socioeconomic status was not verified. however, we claim that given the relative high number of participants, the possible effect of this confounder has been partialled out. additional files the additional files for this article can be found as follows: • appendix. the appendix contains descriptive tables illustrating the factor and the numeric variables in the survey. numbers indicate frequencies per level of the factors. doi: https://doi.org/10.5334/johd.82.s1 • a summary analysis of all the ordinal variables in the dataset. the number of responses are represented by “n”. sd designates standard deviation of the mean. doi: https://doi.org/10.5334/johd.82.s2 acknowledgements we are indebted to the participants of the study and to katalin molnár, associate professor at the benedek elek faculty of pedagogy of the university of sopron for her advice. competing interests the authors have no competing interests to declare. author contributions istván fekete: formal analysis, software, visualization, data curation, methodology, validation, writing – original draft, writing – review & editing jutka nmarné kendöl: conceptualization, investigation, methodology, project administration, resources, supervision, writing – original draft, writing – review & editing author affiliations istván fekete orcid.org/0000-0002-6734-3660 faculty of culture, media and psychology, macromedia university, cologne, de jutka nmarné kendöl orcid.org/0000-0003-1478-4312 roth gyula doctoral school of forestry and wildlife management, faculty of forestry, university of sopron, sopron, hu https://doi.org/10.5334/johd.82.s1 https://doi.org/10.5334/johd.82.s2 https://orcid.org/0000-0002-6734-3660 https://orcid.org/0000-0002-6734-3660 https://orcid.org/0000-0003-1478-4312 https://orcid.org/0000-0003-1478-4312 5fekete and kendöl journal of open humanities data doi: 10.5334/johd.82 to cite this article: fekete, i., & kendöl, j. n. (2022). assessing attitudes towards wood in the context of family habits: potential for reuse of raw data from a questionnaire survey. journal of open humanities data, 8: 17, pp. 1–5. doi: https://doi. org/10.5334/johd.82 published: 18 july 2022 copyright: © 2022 the author(s). 
this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. references daniel, w. w. (1999). biostatistics: a foundation for analysis in the health sciences. john wiley & sons: hoboken. doi: https://doi.org/10.1002/1097-0258(20010130)20:2<324::aid-sim635>3.0.co;2-o hothorn, t., hornik, k., & zeileis, a. (2006). unbiased recursive partitioning: a conditional inference framework. journal of computational and graphical statistics, 15(3), 651–674. doi: https://doi. org/10.1198/106186006x133933 katuwal, r., suganthan, p. n., & zhang, l. (2020). heterogeneous oblique random forest. pattern recognition, 99. doi: https://doi.org/10.1016/j.patcog.2019.107078 patwary, a. h., eshan, p., debnath, t., & sattar, a. (2021). market basket analysis approach to machine learning, 12th international conference on computing communication and networking technologies (icccnt). pp. 1–9. doi: https://doi.org/10.1109/icccnt51525.2021.9580175 r core team. (2021). r: a language and environment for statistical computing. r foundation for statistical computing. available from: https://www.r-project.org/ statisztikai hivatal [hungarian central statistical office]. (2015). győr-moson-sopron megye számokban. available from: https://www.ksh.hu/docs/hun/xftp/idoszaki/regiok/mesz/08_gy_14.pdf (last accessed: 31 march 2022). https://doi.org/10.5334/johd.82 https://doi.org/10.5334/johd.82 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://doi.org/10.1002/1097-0258(20010130)20:2<324::aid-sim635>3.0.co;2-o https://doi.org/10.1198/106186006x133933 https://doi.org/10.1198/106186006x133933 https://doi.org/10.1016/j.patcog.2019.107078 https://doi.org/10.1109/icccnt51525.2021.9580175 https://www.r-project.org/ https://www.ksh.hu/docs/hun/xftp/idoszaki/regiok/mesz/08_gy_14.pdf corpus of the epigraphy of the italian peninsula in the 1st millennium bce (ceipom) data paper keywords: corresponding author: reuben j. pitts faculty of arts, ku leuven, leuven, be reuben.pitts@kuleuven.be corpus linguistics; language contact; linguistic area; italic; epigraphy to cite this article: pitts, r. j. (2022). corpus of the epigraphy of the italian peninsula in the 1st millennium bce (ceipom). journal of open humanities data, 8: 1, pp. 1–4. doi: https://doi.org/10.5334/ johd.65 corpus of the epigraphy of the italian peninsula in the 1st millennium bce (ceipom) reuben j. pitts abstract the corpus of the epigraphy of the italian peninsula in the 1st millennium bce (ceipom) is a linguistic database which covers the oscan, umbrian, old sabellic, messapic and venetic languages, as well as epigraphic latin up to 100 bce. the database is hosted on github and zenodo, and provides manually annotated linguistic information on all levels of language structure, ranging from phonology to syntax. in providing a high-resolution digital dataset for language varieties that have until now been largely restricted to printed reference works, this corpus opens up new avenues for research into this unique ancient linguistic area. 
mailto:reuben.pitts@kuleuven.be https://doi.org/10.5334/johd.65 https://doi.org/10.5334/johd.65 https://orcid.org/0000-0002-3960-1490 2pitts journal of open humanities data doi: 10.5334/johd.65 (1) overview repository location https://reubenjpitts.github.io/corpus-of-the-epigraphy-of-the-italian-peninsula-in-the-1st-millennium-bce/ current version (1.2): https://zenodo.org/record/5602978#.yxkw8z5bw2w doi: https://doi.org/10.5281/zenodo.4759134 context this database was created in the context of a phd project on language contact in ancient italy, entitled the interplay between language contact and language change in a fragmentary linguistic area: the italic peninsula in the first millennium bce. https://www.kuleuven.be/onderzoek/portaal/#/projecten/3h190594?lang=en&hl=en (2) methodology steps most of the data was entered manually by the author, based on standard reference works for the languages in question. in some cases, basic forms of automation were used to create an initial dataset which was then corrected. for instance, an initial morphological analysis for venetic was created by linking the attested tokens to a digitised version of lejeune’s (1974: 315–341) venetic word list, and the result was then systematically checked and corrected by the author. the method used for any given field is described in the accompanying documentation on github. a few fields were generated automatically using python modules. these include, for instance, the field “token_clean”, which uses the unidecode package to generate a version of the token stripped of special characters, intended for ease of searching. once again, the documentation on github describes in detail which fields are automatic and how they are generated. sampling strategy the aim of the database is to include all texts in oscan, umbrian, old sabellic, messapic and venetic, as well as epigraphic latin texts before 100 bce. the corpus does not include etruscan, due to the additional complexities of incorporating a non-indo-european language into the structure of the database. within the languages encompassed by the database, however, the primary aim is exhaustivity, and the corpus currently contains over 36,000 tokens. quality control data was entered manually and checked multiple times by the author. (3) dataset description object name corpus of the epigraphy of the italian peninsula in the 1st millennium bce (ceipom) format names and versions csv creation dates 2017–2021 dataset creators reuben j. pitts language metadata are provided in english. https://doi.org/10.5334/johd.65 https://reubenjpitts.github.io/corpus-of-the-epigraphy-of-the-italian-peninsula-in-the-1st-millennium-bce/ https://zenodo.org/record/5602978#.yxkw8z5bw2w https://doi.org/10.5281/zenodo.4759134 https://doi.org/ 10.5281/zenodo.4759134 https://www.kuleuven.be/onderzoek/portaal/#/projecten/3h190594?lang=en&hl=en 3pitts journal of open humanities data doi: 10.5334/johd.65 license creative commons attribution-sharealike 4.0 international license repository name a continually updated version of the corpus is hosted on github. each old version of the corpus is permanently stored at zenodo. in traditional publications ceipom should be cited as this paper, where relevant also specifying the version of the corpus used to achieve any given research result. publication date 2021–05–13 (4) reuse potential this database has a wide range of applications in linguistic research on the languages of ancient italy. 
currently, such research is hampered by the absence of searchable digital information, as the description of these languages is mostly spread over disparate written reference works (e.g. bakkum, 2009; lejeune, 1974; santoro, 1982; untermann, 2000; wachter, 1987). this database aims to address that research need head-on. the salience of digital and corpus-based approaches to ancient languages has increased in recent years (e.g. adamik, 2016; eckhoff et al., 2018; mambrini et al., 2020; qiu et al., 2018), and these methods have proven their effectiveness even in relatively poorly attested languages. it goes without saying that a digital dataset is more easily and more efficiently queried than a written corpus, facilitating research results that would otherwise be difficult or impossible to achieve. moreover, the use of a digital dataset means any research results thus obtained can be replicated by other researchers, conferring a key advantage in terms of academic transparency. these advantages hold true in fragmentary languages such as venetic or messapic as much as in large corpus languages such as classical latin or greek. since annotation is provided on multiple levels of description, this corpus can serve as a tool for linguistic research of various kinds, including research on the syntax, word order, morphology, lexicon, semantics, phonology and orthography of the ancient languages in question. to give an example of a simple linguistic query in ceipom, if one is researching the usage of syntactic objects in these languages, one can simply use spreadsheet software to search for instances of obj in the field relation, and thus obtain a list of all tokens in the corpus with a syntactic analysis containing this value. the github documentation offers considerable detail on how each of these features are annotated, and how the different levels of linguistic description can be related to one another to formulate more complex queries. in addition to the strictly linguistic annotation, chronological and geographical information (including longitude and latitude) is integrated into the data throughout, allowing the evolution and distribution of these linguistic features to be tracked through time and space. although the focus of the corpus does not lie on epigraphical metadata, the texts in the corpus are linked to their id in the trismegistos database (depauw & gheldof, 2014), which means they can easily be linked to further metadata and bibliography, as well as to other epigraphic databases (such as edr or edcs). in addition to its linguistic uses, therefore, the database also holds promise for related fields such as history, epigraphy and onomastics. the corpus focuses strongly on ensuring that the information provided for the languages of ancient italy is intercomparable. this makes it particularly well adapted for the study of convergence, language contact and other cross-linguistic typological trends in ancient italy. this region has sometimes been described as a linguistic area (zair, 2016: 311–312), a geographic region where prolonged language contact is responsible for grammatical similarities across distantly related languages (friedman & joseph, 2017: 55). since the data is (with a few clearly signalled exceptions) annotated in the same way for all six languages currently in the corpus, this makes it possible to track the evolving differences and similarities between these languages, and to test hypotheses on contact-based change in this region. 
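To make the simple query described above concrete, here is a minimal pandas sketch of filtering the released CSV. The file name and the column names ('relation', 'language', 'token_clean') are assumptions based on the description in this paper and should be checked against the GitHub documentation; in particular, the relation labels may be cased differently in the data.

```python
import pandas as pd

# Load the corpus; the file name here is illustrative.
df = pd.read_csv("ceipom.csv")

# All tokens whose syntactic analysis is a direct object (relation value "obj"),
# matched case-insensitively in case the field uses upper-case labels.
objects = df[df["relation"].str.lower() == "obj"]

# Count object tokens per language to compare usage across the corpus.
print(objects.groupby("language").size().sort_values(ascending=False))

# The token_clean field (generated with unidecode) allows searching without
# special characters; the search string below is only a placeholder.
print(df[df["token_clean"].str.contains("example_form", case=False, na=False)])
```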
The main current limitation of the database lies in the fact that, inevitably, its data is not fully complete. In particular, the emphasis until now has been on providing a single plausible linguistic analysis for each token, even when the scholarly literature offers multiple possible interpretations. Since this is frequently the case in disputed fragmentary texts, this may cause queries to miss potentially relevant and interesting forms. However, since the state of the data in each field is described in detail in the documentation on GitHub, researchers can take these limitations into account and adjust their use of this research tool in line with their research aims. Future updates to the corpus will continue to improve and fine-tune the quality of the data offered, as well as expanding the coverage of alternative analyses for individual tokens.

Acknowledgements
I would like to thank Toon Van Hal, Freek Van de Velde, Mark Depauw and Tom Gheldof for their help and advice in making this corpus.

Funding information
This research was carried out with a grant from the Fonds Wetenschappelijk Onderzoek (FWO) – Vlaanderen (Research Foundation – Flanders) (grant no. 1150720N).

Competing interests
The author has no competing interests to declare.

Author affiliation
Reuben J. Pitts (orcid.org/0000-0002-3960-1490), Faculty of Arts, KU Leuven, Leuven, BE

References
Adamik, B. (2016). Computerized historical linguistic database of the Latin inscriptions of the imperial age: Search and charting modules. In Á. Szabó (Ed.), From polites to magos: Studia György Németh sexagenario dedicata (pp. 13–27). Budapest: Debrecen.
Bakkum, G. C. L. M. (2009). The Latin dialect of the Ager Faliscus: 150 years of scholarship. Amsterdam: Amsterdam University Press.
Depauw, M., & Gheldof, T. (2014). Trismegistos: An interdisciplinary platform for ancient world texts and related information. In P. Goodale & N. Houssos (Eds.), Theory and practice of digital libraries – TPDL 2013 selected workshops (pp. 40–52). Cham: Springer.
Eckhoff, H., Bech, K., Bouma, G., Eide, K., Haug, D., Haugen, O. E., & Jøhndal, M. (2018). The PROIEL treebank family: A standard for early attestations of Indo-European languages. Language Resources and Evaluation, 52(1), 29–65. doi: https://doi.org/10.1007/s10579-017-9388-5
Friedman, V. A., & Joseph, B. D. (2017). Reassessing sprachbunds: A view from the Balkans. In R. Hickey (Ed.), The Cambridge handbook of areal linguistics (pp. 55–87). Cambridge, UK: Cambridge University Press. doi: https://doi.org/10.1017/9781107279872.005
Lejeune, M. (1974). Manuel de la langue vénète. Heidelberg: Carl Winter Universitätsverlag.
Mambrini, F., Cecchini, F. M., Franzini, G., Litta, E., Passarotti, M. C., & Ruffolo, P. (2020).
lila: linking latin: risorse linguistiche per il latino nel semantic web. umanistica digitale, 8, 63–78. doi: https:// doi.org/10.6092/issn.2532-8816/9975 qiu, f., stifter, d., bauer, b., lash, e., & ji, t. (2018). chronologicon hibernicum: a probabilistic chronological framework for dating early irish language developments and literature. in m. ioannides, e. fink, r. brumana, p. patias, a. doulamis, j. martins, & m. wallace (eds.), digital heritage. progress in cultural heritage: documentation, preservation, and protection (pp. 731–740). cham: springer international publishing. doi: https://doi.org/10.1007/978-3-030-01762-0_65 santoro, c. (1982). nuovi studi messapici. galatina: congedo editore. untermann, j. (2000). wörterbuch des oskisch-umbrischen. heidelberg: winter. wachter, r. (1987). altlateinische inschriften: sprachliche und epigraphische untersuchungen zu den dokumenten bis etwa 150 v. chr. lausanne: lang. zair, n. (2016). vowel weakening in the sabellic languages as language contact. indogermanische forschungen, 121(1), 295–315. doi: https://doi.org/10.1515/if-2016-0016 https://doi.org/10.5334/johd.65 https://doi.org/10.5334/johd.65 https://doi.org/10.5334/johd.65 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://orcid.org/0000-0002-3960-1490 https://doi.org/10.5117/9789056295622 https://doi.org/10.1007/s10579-017-9388-5 https://doi.org/10.1017/9781107279872.005 https://doi.org/10.6092/issn.2532-8816/9975 https://doi.org/10.6092/issn.2532-8816/9975 https://doi.org/10.1007/978-3-030-01762-0_65 https://doi.org/10.1515/if-2016-0016 reddit dataset on meme stock: gamestop data paper corresponding author: jing j. han klein college of media and communication, temple university, philadelphia, us jing.han@temple.edu keywords: gamestop; reddit; online social movement; meme to cite this article: han, j. j. (2022). reddit dataset on meme stock: gamestop. journal of open humanities data, 8: 20, pp. 1–5. doi: https://doi. org/10.5334/johd.85 reddit dataset on meme stock: gamestop jing j. han abstract this dataset includes one-year reddit posts, post metadata, post sentiments, 57 post meta-features and post comments threads from several subreddits around a meme stock: gamestop. these subreddits are dedicated to the discussion of gamestop stock and the social movement of global wealth transfer that the event gamestop stock short squeeze initiated in january 2021. the subreddits included are r/gme, r/ superstonk, r/ddintogme, and r/gme jungle. the whole dataset is stored in eight csv files and four html files. mailto:jing.han@temple.edu https://doi.org/10.5334/johd.85 https://doi.org/10.5334/johd.85 https://orcid.org/0000-0003-3251-6549 2han journal of open humanities data doi: 10.5334/johd.85 (1) overview the collection of this dataset was inspired by the short squeeze event on gamestop stock initiated by retail investors in january 2021. a short squeeze is an unusual condition that triggers rapidly rising prices in a stock or other tradeable financial instruments. for a short squeeze to occur, the financial instrument must have an unusual degree of short sellers holding positions in it. the short squeeze is triggered when short sellers coincidentally cut losses and exit their positions (mitchell, 2021). at its height, the pre-market value for gamestop stock was more than $500 per share (wikipedia, 2022). gamestop stock is a meme stock that refers to the shares a company has gained online following through social media platforms. 
these online communities can build hype around a meme stock through narratives and conversations, which reflect public opinions of the stock (hayes, 2022). reddit has been the primary platform retail investors use to communicate with each other, including sharing and discussing news from social media and mainstream media, personal trading histories, memes, technical analyses, and strategies to facilitate global wealth transfer. the goal of the reddit community online movement was colloquially named “mother of all short squeezes” (moass) (anand & pathak, 2021; betzer & harries, 2022). moass exemplifies the populist intent of an online social movement observed in the dataset. the realization of moass’s goals requires participation from every community member. however, differing opinions about how their goal should be achieved and what kind of community culture should be constructed have split the community into four subreddits. specifically, the creation of r/ superstonk was born of frustration about r/wallstreetbets, which received mainstream media attention at the beginning of the short squeeze. the subreddit r/superstonk was created, driven by the lack of focus on achieving the common goal and the concern on the intention and conduct of moderators on r/wallstreetbets. its profile banner, “power to the shareholders” distils its populist belief in achieving the common goal. however, the integration with meme culture on r/superstonk has distanced community members who are motivated to achieve the common goal with a more serious and less memetic attitude. the community migration into r/gme, r/ddintogme, r/gmejungle was the result of this cultural disagreement. furthermore, community migration does not follow a linear progression nor does it suggest that the community is conflicted and divisive. instead, it reflects the influence of reddit’s features on the organization of the community: individuals’ content curation on reddit is structured by topics. users on reddit curate their content by following different subreddit communities. thus, this dataset will help study online social movements and its relationship with online culture. the collection of data was motivated by the continuous actions of community members pursuing the realization of moass. during the data collection period, several changes in communication patterns and communication tactics occurred, driven by both internal and external events, such as community disagreement on ways of realizing common moass goals, and episodic mainstream media attention. the dataset on r/superstonk has 560,125 posts with an average word count of 15 and a standard deviation of 13 rounded to the nearest integer. the dataset on r/gme has 1,033,236 posts with an average word count of 14 and a standard deviation of 13 rounded to the nearest integer. the dataset on r/gmejungle has 39,634 posts with an average word count of 15 and a standard deviation of 12 rounded to the nearest integer. the dataset on r/ddintogme has 5,498 posts with an average word count of 16 and a standard deviation of 13 rounded to the nearest integer. 
the four html files on explorative data analyses demonstrate the first 12 variables (id, title, url, score, author, number of comments, date, flair, negative sentiment, positive sentiment, neutral sentiment, and compound sentiment), their interactions, and correlations from the dataset files ending with “features.” repository location context this dataset was produced as part of an ongoing research project1 that studies the communication patterns of subreddit communities around meme stocks and their belief in using meme stocks to facilitate a global wealth transfer movement. it has not been used in any publication yet. 1 coding notebooks from this project will be shared publicly in the future. 3han journal of open humanities data doi: 10.5334/johd.85 (2) method the post id, title, url, score, author, number of comments, date, and flair (community-defined content filter) were collected by using pushshift reddit api (baumgartner, 2018). the post comments were collected by using the python reddit api wrapper, praw (boe, 2021). each post’s sentiment scores were calculated using vader (hutto & gilbert, 2014) with a customized dictionary that reflects the common emojis used in these subreddits. 57 meta-features on post titles were produced by using the spacy large english model (honnibal et al., 2020). the explorative data analyses are generated by pandas profiling (brugman, 2019) and sweetviz (bertrand, 2022). steps i used pushshift to collect post titles and post metadata. next, i used praw to collect post comments. the customized vader dictionary assigned the “gem stone”, “gorilla”, different skin tones of “raising hands”, “rocket”, different versions of “moon”, and different skin tones of “open hands” emojis to score four, which is the highest score in vader, signifying high positive sentiment. the emoji “crayon” was assigned a score of one, reflecting a moderately positive sentiment. the distinctive emoji uses reflect the communication and language patterns in these subreddits. for example, the “gem stone” emoji means “diamond hands”, which describes an investor who refrains from selling an investment despite downturns or losses. the combination of “rocket” emoji and “moon” emoji means “going to the moon”, which describes when the price of a financial instrument is rising off the charts. quality control the values collected from pushshift, such as scores and number of comments, only reflect the values when the data was collected. there might be a discrepancy between the values collected and the real-time values. the customized update on vader dictionary only includes commonly agreed-on emoji used by the gamestop retail investors. these particular emoji uses are also shared by the larger communities associated with the mentality of the meme culture. the pre-processing results on post titles are included in 57 meta-features, which are viable for future analyses, such as creating further features. (3) dataset description object name reddit dataset on meme stock: gamestop https://doi.org/10.7910/dvn/tumipc format names and versions csv; html; version 2.0 creation dates 2022–02–15 — 2022–04–26 dataset creators jing han language english license cc0 https://doi.org/10.7910/dvn/tumipc 4han journal of open humanities data doi: 10.5334/johd.85 repository name dataverse publication date 2022–07–09 (4) reuse potential the 74 variables in this dataset provide opportunities for future analyses, such as creating further features during exploratory analysis and future studies. 
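one such reuse is reproducing or extending the sentiment annotation described in the method section. the sketch below is a minimal python illustration, assuming the vaderSentiment package; the emoji-description tokens and scores shown are illustrative stand-ins, not the project's actual customized dictionary.

```python
# minimal sketch, assuming the vaderSentiment package (pip install vaderSentiment);
# the tokens and scores below are illustrative, not the project's actual custom dictionary.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# recent versions of vader replace emoji characters with textual descriptions (e.g. "rocket"),
# so custom scores can be attached to those description tokens in the lexicon.
analyzer.lexicon.update({
    "rocket": 4.0,    # hypothetical score for the "rocket" emoji
    "gorilla": 4.0,   # hypothetical score for the "gorilla" emoji
    "crayon": 1.0,    # hypothetical score for the "crayon" emoji
})

title = "apes together strong 🚀🚀 to the moon"
print(analyzer.polarity_scores(title))  # dict with 'neg', 'neu', 'pos', 'compound'
```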
for example, the variable post flairs can be used as post labels for text classification research. researchers who are interested in understanding online communication patterns could use this labeled dataset to train a classifier and apply multiclass or multilabel inference on the comment threads. the results of text classification research could also be used to understand the communication processes of these subreddits. the relationship between communication processes and the effects of the online social movement (moass) could be studied by performing a time series analysis on the dataset and analyzing mainstream media’s attention on the movement. furthermore, word count, stop word count, word count after cleaning, and speech tagging would be useful for named-entity recognition and online language studies. the results of studying online language use contained in the dataset would be helpful understanding the community culture of these subreddits, which could contribute to the studies on meme culture and broadly, online culture. using public sentiment to harness the power of public opinion, research has outlined methods for analyzing commercial interests. for example, researchers have studied the relationship between public sentiment on social media platforms and market impact (nguyen & shirai, 2015; audrino et al., 2020). s & p dow jones indices includes a social media sentiment factor (s&p global, n.d.). sentiment annotation on this reddit dataset using vader with a customized dictionary could provide a baseline comparison for researchers interested in using sentiment as a variable to study the processes and effects of public sentiment. specifically, the sentiment annotation could assist studies on the relationship between public sentiment and price fluctuations of stock, between public sentiment and public opinion. the reddit dataset generated during the gamestop short squeeze stands out from other reddit corpus because of its socio-economic relevance. the social movement following the event demonstrates the power of people and the long-term economic impact their actions had. additionally, reddit allows access to data via its api terms of use, which is more generously than other social media platforms (reddit, 2016). reddit’s data structure and limited restrictions on posting content provide opportunities to study online language use, communication processes, public opinions, online culture, online communities, and online social movements. acknowledgements i gratefully acknowledge dr. ryan omizo’s guidance and encouragement in creating this dataset. publication of this article was funded in part by the temple university libraries open access publishing fund. competing interests the author has no competing interests to declare. author contributions jing han is responsible for conceptualization, data curation, methodology, and writing. author affiliation jing j. han orcid.org/0000-0003-3251-6549 klein college of media and communication, temple university, philadelphia, us https://orcid.org/0000-0003-3251-6549 5han journal of open humanities data doi: 10.5334/johd.85 to cite this article: han, j. j. (2022). reddit dataset on meme stock: gamestop. journal of open humanities data, 8: 20, pp. 1–5. doi: https://doi. org/10.5334/johd.85 published: 24 august 2022 copyright: © 2022 the author(s). 
this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. references anand, a., & pathak, j. (2021). wallstreetbets against wall street: the role of reddit in the gamestop short squeeze. iim bangalore research paper, 644. https://repository.iimb.ac.in/handle/2074/20101. doi: https://doi.org/10.2139/ssrn.3873099 audrino, f., sigrist, f., & ballinari, d. (2020). the impact of sentiment and attention measures on stock market volatility. capital markets: asset pricing & valuation ejournal. doi: https://doi.org/10.2139/ ssrn.3188941 baumgartner, j. m. (2018). pushshift api. retrieved from https://github.com/pushshift/api (last accessed: 9 may, 2022). betzer, a., & harries, j. p. (2022). how online discussion board activity affects stock trading: the case of gamestop. financial markets and portfolio management, 1–30. doi: https://doi.org/10.1007/s11408022-00407-w bertrand, f. (2022). sweetviz. retrieved from https://pypi.org/project/sweetviz/ (last accessed: 22 july, 2022). brugman, s. (2019). pandas-profiling: exploratory data analysis for python. retrieved from https://github. com/pandas-profiling/pandas-profiling (last accessed: 22 july, 2022). boe, b. (2021). praw: the python reddit api wrapper. retrieved from https://praw.readthedocs.io/en/ v7.5.0/ (last accessed: 9 may, 2022). hayes, a. (2022). meme stock. retrieved from https://www.investopedia.com/meme-stock-5206762 (last accessed: 9 may, 2022). hutto, c., & gilbert, e. (2014). vader: a parsimonious rule-based model for sentiment analysis of social media text. proceedings of the international aaai conference on web and social media, 8(1), 216– 225). url: https://ojs.aaai.org/index.php/icwsm/article/view/14550 honnibal, m., montani, i., van landeghem, s., boyd, a., et al. (2020). spacy: industrial-strength natural language processing in python. zenodo. doi: https://doi.org/10.5281/zenodo.1212303 mitchell, c. (2021). short squeeze. retrieved from https://www.investopedia.com/terms/s/shortsqueeze. asp (last accessed: 26 april, 2021). nguyen, t. h., & shirai, k. (2015). topic modelling based sentiment analysis on social media for stock market prediction. proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing, 1, 1354–1364. doi: https://doi.org/10.3115/v1/p15-1131 reddit. (2016). api terms. retrieved from https://www.reddit.com/wiki/api-terms (last accessed: 22 july, 2022). s&pglobal. (n.d.). social media sentiment – indices. retrieved from https://www.spglobal.com/spdji/en/ index-family/strategy/factors/social-media-sentiment/#overview (last accessed: 22 july, 2022). wikipedia. (2022). gamestop short squeeze. retrieved from https://en.wikipedia.org/wiki/gamestop_ short_squeeze (last accessed: 22 july, 2022). 
https://doi.org/10.5334/johd.85 https://doi.org/10.5334/johd.85 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://repository.iimb.ac.in/handle/2074/20101 https://doi.org/10.2139/ssrn.3873099 https://doi.org/10.2139/ssrn.3188941 https://doi.org/10.2139/ssrn.3188941 https://github.com/pushshift/api https://doi.org/10.1007/s11408-022-00407-w https://doi.org/10.1007/s11408-022-00407-w https://pypi.org/project/sweetviz/ https://github.com/pandas-profiling/pandas-profiling https://github.com/pandas-profiling/pandas-profiling https://praw.readthedocs.io/en/v7.5.0/ https://praw.readthedocs.io/en/v7.5.0/ https://www.investopedia.com/meme-stock-5206762 https://ojs.aaai.org/index.php/icwsm/article/view/14550 https://doi.org/10.5281/zenodo.1212303 https://www.investopedia.com/terms/s/shortsqueeze.asp https://www.investopedia.com/terms/s/shortsqueeze.asp https://doi.org/10.3115/v1/p15-1131 https://www.reddit.com/wiki/api-terms https://www.spglobal.com/spdji/en/index-family/strategy/factors/social-media-sentiment/#overview https://www.spglobal.com/spdji/en/index-family/strategy/factors/social-media-sentiment/#overview https://en.wikipedia.org/wiki/gamestop_short_squeeze https://en.wikipedia.org/wiki/gamestop_short_squeeze the transcomp dataset of literary translations from 120 languages and a parallel collection of english-language originals data paper corresponding author: matt erlin germanic languages and literatures, washington university, st. louis, us merlin@wustl.edu keywords: translation studies; computational literary studies; world literature; natural language processing; text corpus; text collection to cite this article: erlin, m., piper, a., knox, d., pentecost, s., & blank, a. (2022). the transcomp dataset of literary translations from 120 languages and a parallel collection of englishlanguage originals. journal of open humanities data, 8: 29, pp. 1–6. doi: https://doi. org/10.5334/johd.94 the transcomp dataset of literary translations from 120 languages and a parallel collection of english-language originals matt erlin andrew piper douglas knox stephen pentecost allie blank abstract the transcomp dataset of literary translations is a collection of document-level word frequencies sampled from 10,631 translations into english of global literary fiction published since 1950, together with a historically matched parallel corpus of 10,682 fictional works originally published in english. we provide csv files with word frequency counts for 10,000-word samples taken from each text. the associated metadata is available in a separate csv. these data will be useful to literary scholars and linguists working in translation studies, and those interested in the linguistic, stylistic, and thematic specificity of translations from particular regions. *author affiliations can be found in the back matter of this article mailto:merlin@wustl.edu https://doi.org/10.5334/johd.94 https://doi.org/10.5334/johd.94 https://orcid.org/0000-0002-0536-7499 https://orcid.org/0000-0001-9663-5999 https://orcid.org/0000-0002-7168-7271 https://orcid.org/0000-0002-2093-6151 2erlin et al. 
journal of open humanities data doi: 10.5334/johd.94 (1) overview repository location doi.org/10.7910/dvn/itlgqv context this dataset consists of document-level word frequency samples drawn from a parallel corpus containing 10,631 translations of literary fiction into english from 120 different languages published since 1950 along with a comparable set of 10,682 works of fiction written originally in english during the same time period. all texts are contained in the hathi trust digital library and are derived from the ca. 176,000-volume noveltm collection created by underwood et al. (2020). the dataset was compiled as part of an ongoing research project into the unique linguistic, stylistic, and thematic features of translated fiction as compared to fiction written originally in english. following the precedent established by toury’s (1980) and baker’s (1993) pioneering work on translation universals, our aim has been to create two independent corpora that enable researchers to evaluate translated texts as they relate to target language texts in general, rather than to compile a corpus of translations and their corresponding source texts. while corpora designed for comparative translation studies do exist, including a number of parallel corpora, they are often focused on single pairs of languages and/or non-literary texts; moreover, they are not constructed to facilitate the kind of historical comparisons that interest computational literary scholars. to our knowledge, no existing collection of historically matched translated and original-language fictional texts even approaches the size or linguistic diversity of our corpus, and we hope that it will serve as a resource for additional research. (2) method steps on the basis of the metadata provided by underwood et al. (2020) regarding the noveltm dataset of english-language fiction, we first used a set of regular expressions such as “translated from the swedish,” “from the [language],” “tr. from,” “rendered into english,” etc. to identify an initial list of translated texts. next, if an author was included in this initial list, we included all titles by that author. for example, if one volume by leo tolstoy had “translated from” in one of its metadata fields, we included all works by leo tolstoy in our set of translations. original english-language works were identified by fuzzy matching against a large set of author names derived from wikipedia and the virtual international authority file (viaf), which consists of millions author names derived from 68 library catalogues from around the world. any names identified as english-language authors from this list were then removed from the translation data. we similarly used non-english-language author data to match with our translation data and reviewed all non-matching works by hand. information on a translation’s original language was taken from two primary sources: explicit references included in the titles of the works in hathi (e.g., translated from the swedish) and from the hathitrust extracted features metadata. these results were supplemented using fuzzy matching of author lists from wikipedia and viaf. the remaining missing data was manually retrieved using worldcat and other internet sources. to identify date of publication, we used underwood et al.’s “inferred date” (2020). 
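as a rough illustration of this pattern-matching step, the sketch below (python; the patterns and the metadata column name are simplified assumptions, not the project's exact list) flags candidate translations in a table of noveltm-style title metadata.

```python
# minimal sketch, assuming pandas and a metadata table with a "title" column;
# the patterns below are examples, not the project's complete set of regular expressions.
import re
import pandas as pd

translation_patterns = re.compile(
    r"translated from the \w+|from the (french|german|russian|swedish|italian|spanish)"
    r"|tr\. from|rendered into english",
    flags=re.IGNORECASE,
)

metadata = pd.DataFrame({
    "author": ["tolstoy, leo", "austen, jane"],
    "title": ["anna karenina, translated from the russian", "pride and prejudice"],
})

# flag titles whose metadata matches one of the translation patterns
metadata["candidate_translation"] = metadata["title"].str.contains(translation_patterns)
print(metadata[["author", "candidate_translation"]])
```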
because the holdings of translations in hathi are heavily skewed toward a rather small set of european authors and languages in the first part of the twentieth century, we subsetted our data down to the date range 1950–2008, which aligns with the period construct of "post-war" fiction used in literary studies (mcgurl, 2009). finally, we also removed all volumes where underwood et al.'s predicted probability of being non-fiction was greater than 85% (2020). given that the set of original language works was larger than the set of translations, we also randomly downsampled each year of our original publications to match the number of translations. we then processed the files to be extracted as word frequency data. working within the htrc capsule (plale et al., 2019), we first downloaded individual page files using the preloaded functions in the htrc workset toolkit to remove running page headers and footers. for each volume, we concatenated individual page files into a single document. after tokenizing with regular expressions, we next represented each document as ten randomly selected continuous 1,000-word samples drawn from the middle 60% of the document to avoid paratextual content in the front and back matter. this sampling enables us to control for effects that might arise from the different lengths of the source texts. to mitigate problems related to low ocr quality, foreign-language passages, or the presence of other non-standard characters, only samples that had 90% of words in an english dictionary were kept. if a work did not have ten samples that met this criterion, it was removed. all of this work was completed in the hathi capsule. these samples were then converted into bags of words, which we are able to make accessible to the scholarly community in the form of two csv files, one for originals and one for translations, listing raw frequency counts by document for each of the words in each of the original document samples. while the final corpus of translated texts remains skewed towards european languages, it does include a significant number of works originally published in east asian and south asian languages and a smaller number of works originally published in middle eastern and african languages. figures 1–4 provide an overview of the dataset.
figure 1 count of works by decade, originals and translations.
figure 2 count of translated works by decade, non-european and european.
figure 3 count of translations from the top 20 languages represented in the corpus.
figure 4 total translations by subregion (classical literature as separate category).
quality control to test the accuracy of our identification of translations in the noveltm dataset, we created a random sample of 100 works identified as translations and 100 works identified as originals from our data. we then manually checked each title to see whether our classification had been correct. we found that 99 were correctly labeled for an estimated precision of .99. we did not evaluate the accuracy of recall (i.e., translations in hathi that we missed). in addition to its impracticality, given the size of the original dataset, the results would simply have told us whether our sample was representative of translations in the hathi corpus.
for the comparative work we envision, the key question is whether we have a randomly sampled set of translations that mirrors our english original corpus, not whether it accurately represents the distribution of texts in hathi. limitations one key limitation is the date range of our data (1950–2000). expanding this date range, however, leads to an overwhelming predominance of a few european languages, which runs counter to our goal of having a diverse set of source languages represented. as figure 2 reveals, even after 1950, translations in the hathi library skew european. whether this is true of the english-language market for fiction more generally or is an artifact of hathi we leave for future work. we note that the period 1950–2000 is considered a distinct period within literary history and thus our data aligns with this historical construct (mcgurl, 2009). an additional potential limitation is the presence of works in the dataset that were originally published prior to 1950 but which were translated or re-translated at a later date. on the basis of the (incomplete) information that we have on author birth and death dates, we estimate that such works constitute between 15–20% of the total (see python notebook in the repository). finally, our data is limited due to intellectual property restrictions that only allow us to export word frequencies and not the full text from hathi. we provide all hathi ids such that researchers can recreate our data inside of the hathi capsule system. (3) dataset description object name the dataset consists of three csv files: translation_samples.csv, original_samples.csv, and transcomp_metadata.csv. we have also included a python notebook addressing the question of original publication dates: 1950_boundary_question.ipynb format names and versions csv, ipynb creation dates 2021–04–28 – 2022–04–18 dataset creators allie blank douglas knox stephen pentecost language english license cc0 repository name dataverse 5erlin et al. journal of open humanities data doi: 10.5334/johd.94 publication date 2022–10–07 (4) reuse potential two primary areas of research will likely benefit from access to this data. recent scholarship in the sociology of translation (heilbron, 1999; bachleitner and wolf, 2004; sapiro, 2016; 2020) has helped reveal the structural asymmetries in the global flow of translations, often adopting a core—semi-periphery—periphery model to clarify the dominant role played by a small subset of european languages in this regard. to date, however, there has been virtually no effort to link these asymmetries to differences in the linguistic, stylistical, or thematic features of translations (piper and erlin 2022). to what extent, in other words, do translations from “peripheral” languages or language regions exhibit common features that might reinforce or challenge existing cultural biases or reflect the pressures imposed on “peripheral” authors in what pascale casanova (2004) has referred to as the “world republic of letters”? we believe that our data set will greatly facilitate the investigation of such questions. while only having bags of word frequencies places some limitations on what is possible in this regard, prior research has generated important cultural insights using such word distribution approaches (erlin, 2017; jockers and mimno, 2013; piper, 2016; underwood, 2016). 
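as a starting point for such word-distribution analyses, the frequency files can be reshaped into a document-term matrix. the sketch below is a minimal python illustration assuming a long csv layout with columns for a document identifier, a word, and a raw count; the actual column names are documented in the repository and may differ.

```python
# minimal sketch, assuming pandas and a long-format csv with columns
# "doc_id", "word", "count"; the real column names may differ (see the dataset readme).
import pandas as pd

freqs = pd.read_csv("translation_samples.csv")

# pivot to a document-term matrix: one row per sampled document, one column per word
dtm = freqs.pivot_table(index="doc_id", columns="word", values="count",
                        aggfunc="sum", fill_value=0)

# relative frequencies, a common normalization before comparing translations and originals
rel = dtm.div(dtm.sum(axis=1), axis=0)
print(rel.shape)
```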
in addition, the csv files include information on the page count for each work sampled as well as the mean sentence length for the samples, the latter of which we calculated in the hathi capsule. finally, we include metadata so that researchers can work on the full texts within the hathi data capsule system. with regard to translation studies more broadly, we believe that this historically matched collection of translations and originals can shed new light on questions of “translationese” (i.e. translation universals). corpus and computational linguists have long been identifying ways in which translation can be thought of as a distinct linguistic practice that consists of quasi-universal behaviors conditioned by the nature of moving between languages and the cognitive demands of doing so (volansky, orden, and winter, 2015). only a few studies, however, have focused on the specific qualities of literary translations, and certainly not at the scale made possible by this dataset. we think the collection is particularly well suited to investigations into the question of whether translations can be understood as a literary genre (piper and erlin, 2022). while the concept of genre is famously multivalent in literary studies (cohen, 2017, 86), we use the term in the most elementary sense as a set of works that exhibit “shared features” (reichert, 1978, 57) – translations in this case — that can be algorithmically classified on the basis of its relational distinctiveness visa-vis non-translated works as well as the ways it coheres as a category over time. additional file the additional file for this article can be found as follows: • supplementary material. about the transcomp dataset. doi: https://doi.org/10.5334/ johd.94.s1 acknowledgements we are grateful to hathitrust for the permission to release this data. competing interests the authors have no competing interests to declare. author contributions • matt erlin: conceptualization, methodology, writing, visualization • andrew piper: conceptualization, methodology, writing, visualization https://doi.org/10.5334/johd.94.s1 https://doi.org/10.5334/johd.94.s1 6erlin et al. journal of open humanities data doi: 10.5334/johd.94 to cite this article: erlin, m., piper, a., knox, d., pentecost, s., & blank, a. (2022). the transcomp dataset of literary translations from 120 languages and a parallel collection of englishlanguage originals. journal of open humanities data, 8: 29, pp. 1–6. doi: https://doi. org/10.5334/johd.94 published: 26 december 2022 copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. • douglas knox: conceptualization, methodology, writing, data curation, visualization • stephen pentecost: conceptualization, methodology, data curation • allie blank: conceptualization, methodology, data curation author affiliations matt erlin orcid.org/0000-0002-0536-7499 germanic languages and literatures, washington university, st. louis, us andrew piper orcid.org/0000-0001-9663-5999 languages, literatures, and cultures, mcgill university, montreal, canada douglas knox orcid.org/0000-0002-7168-7271 humanities digital workshop, washington university, st. 
louis, us stephen pentecost orcid.org/0000-0002-2093-6151 humanities digital workshop, washington university, st. louis, us allie blank humanities digital workshop, washington university, st. louis, us references baker, m. (1993). corpus linguistics and translation studies: implications and applications. in baker, m., francis, g., tognini-bonelli, e. (eds.), text and technology: in honour of john sinclair (pp. 233–250). amsterdam/philadelphia: benjamins. doi: https://doi.org/10.1075/z.64.15bak bachleitner, n., & wolf, m. (2004). auf dem weg zu einer soziologie der literarischen übersetzung im deutschsprachigen raum. internationales archiv für sozialgeschichte der deutschen literatur, 29(2), 1–25. doi: https://doi.org/10.1515/iasl.2004.2.1 casanova, p. (2004). the world republic of letters. cambridge: harvard university press. cohen, r. (2017). genre theory and historical change: theoretical essays of ralph cohen. charlottesville: university of virginia press. erlin, m. (2017). topic modeling, epistemology, and the english and german novel. journal of cultural analytics, 2(2), 11070. doi: https://doi.org/10.22148/16.014 heilbron, j. (1999). towards a sociology of translation: book translations as a cultural world-system. european journal of social theory, 2(4), 429–444. doi: https://doi.org/10.1177/136843199002004002 jockers, m. l., & mimno, d. (2013). significant themes in 19th-century literature. poetics, 41(6), 750–769. doi: https://doi.org/10.1016/j.poetic.2013.08.005 mcgurl, m. (2009). the program era: postwar fiction and the rise of creative writing. cambridge: harvard university press. doi: https://doi.org/10.2307/j.ctvjsf59f piper, a. (2016). fictionality. journal of cultural analytics, 2(2). doi: https://doi.org/10.22148/16.011 piper, a., & erlin, m. (2022). the predictability of literary translation. in proceedings of the 2nd international workshop on natural language processing for digital humanities, pp. 155–160. plale, b., dickson, e., kouper, i., liyanage, s. h., ma, y., mcdonald, r. h., walsh, j. a., & withana, s. (2019). safe open science for restricted data. data and information management, 3(1). doi: https:// doi.org/10.2478/dim-2019-0005 reichert, j. (1978). more than kin and less than kind: the limits of genre theory. in j. p. strelka (ed.), theories of literary genre, pp. 57–79. university park: pennsylvania state university press. sapiro, g. (2016). how do literary works cross borders (or not)?: a sociological approach to world literature. journal of world literature, 1(1), 81–96. doi: https://doi.org/10.1163/24056480-00101009 sapiro, g. (2020). the transnational literary field between (inter)-nationalism and cosmopolitanism. journal of world literature, 5(4), 481–504. doi: https://doi.org/10.1163/24056480-00504002 toury, g. (1980). in search of a theory of translation. tel aviv: porter institute for poetics and semiotics, tel aviv university. underwood, t., kimutis, p., & witte, j. (2020). noveltm datasets for english-language fiction, 1700– 2009. journal of cultural analytics, 5(2), 13147. doi: https://doi.org/10.22148/001c.13147 underwood, t. (2016). the life cycles of genres. journal of cultural analytics, 2(2). doi: https://doi. org/10.22148/16.005 volansky, v., ordan, n., & wintner, s. (2015). on the features of translationese. digital scholarship in the humanities, 30(1), 98–118. 
doi: https://doi.org/10.1093/llc/fqt031 event annotations of prose data paper corresponding author: michael vauth institut für sprach- und literaturwissenschaft, technical university darmstadt, germany michael.vauth@gmx.de keywords: annotation; narratology; computational literary studies; event; prose to cite this article: vauth, m., & gius, e. (2022). event annotations of prose. journal of open humanities data, 8: 19, pp. 1–6. doi: https://doi.org/10.5334/johd.83 event annotations of prose michael vauth evelyn gius abstract this dataset covers 41,341 manual event annotations of six german prose texts from the 19th and early 20th century comprising 290,997 tokens. for each text, the dataset includes annotations by two annotators and gold standard annotations. these annotations were used for the automation of narratological event annotations (vauth, hatzel, gius, & biemann, 2021), a reflection of inter annotator agreements in literary studies (gius & vauth, 2022) and the development of an event based plot model (gius & vauth, accepted). *author affiliations can be found in the back matter of this article (1) overview repository location our dataset is located in a github repository within the fortext organisation: https://github.com/fortext/event_dataset. additionally, this repository is published as a zenodo dataset (vauth & gius, 2022). context the annotations were produced as part of the research project event, located at the technical university darmstadt and the university of hamburg. the event project is part of the priority programme computational literary studies (cls), funded by the german research foundation (dfg). for further information see the programme website: https://dfg-spp-cls.github.io/. we developed an event annotation tagset that is based on narrative theory, where events are considered the smallest units of narratives (vauth & gius, 2022). the event tagset has been used for annotating the texts, assigning to each subclause one of the four categories (non-event, stative event, process event and change of state). depending on the event types, additional properties have been assigned.
(2) method the dataset is created by manual annotation using the catma tool (gius et al., 2022) for the manual annotations and the gitma package (vauth et al., 2022) for annotation data processing. steps the annotation procedure includes the following steps: • corpus collection: the six texts are collected from the textgrid corpus (textgrid, 2021) and the d-prose corpus (gius, guhr, & adelmann, 2021). we selected narratives representing the literary developments between 1800 and 1920. in order to represent the most common narrative genres of this time period, we included short stories, novellas and novels. the corpus consists of: ludwig tieck (1797): der blonde eckbert heinrich von kleist (1807): das erdbeben in chili annette von droste huelshoff (1842): die judenbuche theodor fontane (1894): effi briest marie von ebner-eschenbach (1896): krambambuli franz kafka (1915): die verwandlung • annotation guidelines: we developed guidelines for the annotation of narratological event types (vauth & gius, 2021). • manual annotation process: pilot annotations: the annotation guidelines were developed and improved by extensive pilot annotations. annotator training: annotators were first trained by annotating and discussing a training text. systematic annotations: every text has been annotated by two independent annotators (see table 1). the annotation process was accompanied by regular meetings to discuss cases of doubt. for the documentation of these cases, the annotators used a dedicated tag. gold standard annotations: based on the double annotations of every text, gold standard annotations were created by one annotator who resolved inconsistent annotations (table 3). here again, cases of doubt were discussed. in this process, the gitma package (vauth et al., 2022) was developed for supporting the extraction, comparison and integration of annotations in catma. https://github.com/fortext/event_dataset https://github.com/fortext/event_dataset https://dfg-spp-cls.github.io/ 3vauth and gius journal of open humanities data doi: 10.5334/johd.83 quality control the multi annotator approach with comprehensive training of annotators and the feedback loops described above were designed for controlling the quality of manual annotations. the main annotation task was the classification of the event types based on four categories • non_event • stative_event • process • change_of_state. here, we accomplished an agreement greater than 0.55 krippendorff’s α for the six texts. the evaluation results of inter annotator agreement (iaa) for the final annotations is documented in table 1. table 2 shows additional event classifications that are also grounded in narrative theory and depend on the event type classification. these categories are implemented as properties for defined event types. for instance, only process events and changes of state can be iterative. as the lower iaa values for some categories indicate, some of these categories are highly interpretative. 
the strongly varying agreement values are also due to the fact that different classification systems are provided for these event properties:
• unpredictable: 0, 1, 2, 3, 4
• mental: yes, no
• representation_type: (any combination of) narrator_speech, character_speech, thought_representation
• persistent: 0, 1, 2, 3, 4
• iterative: yes, no
• intentional: yes, no
• non_event_type: conditional_sentence, subjunctive_sentence, modalised_statement, negation, generic_sentence, ellipsis, imperative_sentence, question, request
table 1 inter annotator agreement (krippendorff's α) for event types:
event type: eckbert 0.73; effi briest 0.57; erdbeben 0.75; judenbuche 0.61; krambambuli 0.66; verwandlung 0.73
table 2 inter annotator agreement (krippendorff's α) for additional event properties. for a detailed description and examples see vauth and gius (2021):
unpredictable: eckbert –0.25; effi briest –0.30; erdbeben –0.08; judenbuche –0.35; krambambuli –0.21; verwandlung –0.55
mental: eckbert 0.79; effi briest 0.33; erdbeben 0.58; judenbuche 0.39; krambambuli 0.46; verwandlung 0.79
representation_type: eckbert 0.94; effi briest 0.87; erdbeben 0.86; judenbuche 0.91; krambambuli 0.86; verwandlung 0.67
persistent: eckbert 0.09; effi briest 0.13; erdbeben 0.28; judenbuche –0.14; krambambuli 0.25; verwandlung –0.89
iterative: eckbert 0.62; effi briest 0.20; erdbeben –0.29; judenbuche 0.35; krambambuli 0.07; verwandlung 0.70
intentional: eckbert 0.75; effi briest 0.24; erdbeben 0.45; judenbuche 0.43; krambambuli 0.32; verwandlung 0.70
non_event_type: eckbert 0.66; effi briest 0.68; erdbeben 0.80; judenbuche 0.71; krambambuli 0.80; verwandlung 0.69
table 3 number and extension (in tokens) of gold standard annotations per text, given as count / tokens. for tokenization we used the german tokenizer in the nltk toolkit version 3.7 (bird et al., 2009):
non_event: erdbeben 167 / 1,086; verwandlung 757 / 5,938; eckbert 212 / 1,488; krambambuli 116 / 712; judenbuche 856 / 4,732; effi briest 2,887 / 16,655; all texts 4,995 / 30,611
stative_event: erdbeben 136 / 1,046; verwandlung 455 / 3,830; eckbert 243 / 1,667; krambambuli 82 / 637; judenbuche 476 / 3,502; effi briest 1,675 / 11,656; all texts 3,067 / 22,338
process: erdbeben 400 / 3,459; verwandlung 1,126 / 9,748; eckbert 450 / 3,225; krambambuli 268 / 1,990; judenbuche 1,120 / 8,146; effi briest 2,061 / 15,180; all texts 5,425 / 41,748
change_of_state: erdbeben 9 / 63; verwandlung 26 / 216; eckbert 25 / 163; krambambuli 4 / 39; judenbuche 39 / 324; effi briest 43 / 362; all texts 146 / 1,167
(3) dataset description object name annotations_event.json format names and versions json creation dates 2020-12-01 – 2022-03-31 dataset creators evelyn gius, michael vauth, michael weiland (student assistant), gina maria sachse (student assistant), angela nöll (student assistant) (all contributors are affiliated to technical university darmstadt). language german (texts) and english (annotation categories) license gpl-3.0 license. repository name event_dataset publication date 2022-04-01 (4) reuse potential the dataset is reusable for several natural language processing (nlp) tasks focused on the detection of events. based on the manual annotations in the dataset we accomplished the automation of narratological event type recognition (vauth et al., 2021). in general, the event annotations can be used as features for the detection of phenomena related to narrative text structures. furthermore, based on the event annotations we developed and evaluated an approach to model the narrativeness/eventfulness and to identify the most 'tellable' parts in a narrative (gius & vauth, accepted). in a next step, the modelling of narrativity will be used in text comparisons.
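for reuse scenarios that build on the agreement figures in tables 1 and 2, inter annotator agreement can be recomputed from paired annotations. the sketch below is a minimal python illustration using nltk's AnnotationTask; the label sequences are invented examples, not values taken from the dataset, and mapping the json annotation format to (coder, item, label) triples is left to the reader.

```python
# minimal sketch: krippendorff's alpha for nominal event-type labels with nltk.
# the two label lists below are invented examples, not data from annotations_event.json.
from nltk.metrics.agreement import AnnotationTask

annotator_1 = ["process", "stative_event", "non_event", "change_of_state", "process"]
annotator_2 = ["process", "stative_event", "stative_event", "change_of_state", "process"]

# AnnotationTask expects (coder, item, label) triples; items here are subclause indices
triples = [("anno1", i, label) for i, label in enumerate(annotator_1)]
triples += [("anno2", i, label) for i, label in enumerate(annotator_2)]

task = AnnotationTask(data=triples)
print(round(task.alpha(), 2))  # krippendorff's alpha with the default nominal distance
```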
funding statement the event project is funded by the german research foundation (dfg) within the priority programme spp 2207 computational literary studies (cls). competing interests the authors have no competing interests to declare. author contributions evelyn gius: conceptualization, project administration, funding acquisition, supervision, writing – review and editing; michael vauth: conceptualization, data curation, project administration, writing – original draft. author affiliations michael vauth orcid.org/0000-0002-3668-6273 institut für sprachund literaturwissenschaft, technical university darmstadt, germany evelyn gius orcid.org/0000-0001-8888-8419 institut für sprachund literaturwissenschaft, technical university darmstadt, germany https://orcid.org/0000-0002-3668-6273 https://orcid.org/0000-0002-3668-6273 https://orcid.org/0000-0001-8888-8419 https://orcid.org/0000-0001-8888-8419 6vauth and gius journal of open humanities data doi: 10.5334/johd.83 to cite this article: vauth, m., & gius, e. (2022). event annotations of prose. journal of open humanities data, 8: 19, pp. 1–6. doi: https://doi.org/10.5334/ johd.83 published: 12 august 2022 copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. references bird, s., klein, e., & loper, e. (2009). natural language processing with python: analyzing text with the natural language toolkit. o’reilly media, inc. gius, e., guhr, s., & adelmann, b. (2021, june). d-prose 1870–1920. zenodo. doi: https://doi.org/10.5281/ zenodo.5015008 gius, e., meister, j. c., meister, m., petris, m., bruck, c., jacke, j., schumacher, m., gerstorfer, d., flüh, m., & horstmann, j. (2022, january). catma. zenodo. doi: https://doi.org/10.5281/zenodo.6046763 gius, e., & vauth, m. (2022). inter annotator agreement und intersubjektivität – ein vorschlag zur messbarkeit der qualität literaturwissenschaftlicher annotationen. dhd 2022 kulturen des digitalen gedächtnisses. 8. tagung des verbands “digital humanities im deutschsprachigen raum” (dhd 2022), potsdam. doi: https://doi.org/10.5281/zenodo.6328209 gius, e., & vauth, m. (accepted). towards an event based plot model. a computational narratology approach. journal of computational literary studies. textgrid. (2021). die digitale bibliothek bei textgrid. retrieved 2021-11-10, from https://textgrid.de/de/ digitale-bibliothek vauth, m., & gius, e. (2021, july). richtlinien für die annotation narratologischer ereigniskonzepte. zenodo. doi: https://doi.org/10.5281/zenodo.5078175 vauth, m., & gius, e. (2022, april). fortext/event dataset: v.1.0. zenodo. doi: https://doi.org/10.5281/ zenodo.6406569 vauth, m., hatzel, h. o., gius, e., & biemann, c. (2021). automated event annotation in literary texts. computational humanities research, 333–345. retrieved from http://ceur-ws.org/vol-2989/ shortpaper18.pdf vauth, m., meister, m., hatzel, h. o., gerstorfer, d., & gius, e. (2022, march). gitma. zenodo. 
doi: https:// doi.org/10.5281/zenodo.6330464 https://doi.org/10.5334/johd.83 https://doi.org/10.5334/johd.83 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://doi.org/10.5281/zenodo.5015008 https://doi.org/10.5281/zenodo.5015008 https://doi.org/10.5281/zenodo.6046763 https://doi.org/10.5281/zenodo.6328209 https://textgrid.de/de/digitale-bibliothek https://textgrid.de/de/digitale-bibliothek https://doi.org/10.5281/zenodo.5078175 https://doi.org/10.5281/zenodo.6406569 https://doi.org/10.5281/zenodo.6406569 http://ceur-ws.org/vol-2989/shortpaper18.pdf http://ceur-ws.org/vol-2989/shortpaper18.pdf https://doi.org/10.5281/zenodo.6330464 https://doi.org/10.5281/zenodo.6330464 oregon theater project: a dataset of oregon cinemas from the silent era data paper corresponding author: dr. michael aronson cinema studies, university of oregon, eugene, or, usa aronson@uoregon.edu keywords: film exhibition; movie theaters; film history; new cinema history; oregon film history to cite this article: aronson, m., peterson, e., & hayden, g. (2022). oregon theater project: a dataset of oregon cinemas from the silent era. journal of open humanities data, 8: 27, pp. 1–7. doi: https://doi. org/10.5334/johd.92 oregon theater project: a dataset of oregon cinemas from the silent era michael aronson elizabeth peterson gabriele hayden abstract the oregon theater project (otp) dataset is part of an ongoing collaborative research project by undergraduate students enrolled in successive iterations of “exhibition & audiences,” a cinema studies course at the university of oregon. it will be updated with additional data each time the course is taught. the data set comprises geo/historical data about movie theaters (cinemas) and exhibition in the state from approximately 1894 to 1929. the data is presented on a public website (https://oregontheaterproject. uoregon.edu/) which includes maps and individual theater profiles produced by the students. all profiles, and the underlying data, are reviewed by the course instructors and edited as needed for clarification or accuracy. profiles include, where available, the theater name, address, city, state, latitude, longitude, number of seats, owner/ manager names, and a narrative description. the underlying data, shared as excel documents and tab-delimited spreadsheets, invites historical comparative analysis of film exhibition practices across time and locale, both local and global. *author affiliations can be found in the back matter of this article mailto:aronson@uoregon.edu https://doi.org/10.5334/johd.92 https://doi.org/10.5334/johd.92 https://oregontheaterproject.uoregon.edu/ https://oregontheaterproject.uoregon.edu/ https://orcid.org/0000-0003-1790-7816 https://orcid.org/0000-0003-1258-4122 https://orcid.org/0000-0003-4740-4187 2aronson et al. journal of open humanities data doi: 10.5334/johd.92 (1) overview repository location harvard dataverse: https://doi.org/10.7910/dvn/fgouz3 front end interface: https://oregontheaterproject.uoregon.edu/ context the oregon theater project (otp) is one of an increasing number of digital projects documenting and sharing the history of movie theaters (cinemas), film programming, and film reception. most of these projects do not make their data publicly available in a usable format, even though the value of these data projects is greatly increased if they allow data to be aggregated (aronson et al., 2022a). 
this data paper contributes to building open data in regional cinema history; it describes the preliminary version of a data set that will be updated regularly. the oregon theater project (otp) is a collaboration between faculty in cinema studies and the university of oregon libraries, with a goal of integrating information literacy skills and concepts, as well as digital humanities tools, into the historical research course “exhibition & audiences”. students, guided by faculty mentors, come away from this course with a broad knowledge of film exhibition theory and history, along with a firm grasp of research methods. students learn how to identify appropriate sources for their information need; to select appropriate research tools from a variety of options; to search efficiently within online databases and digital collections, as well as traditional print-based media; to evaluate sources for credibility and authority; to analyse and interpret primary sources; to use information ethically; to cite their sources appropriately; and to publish their finished work online using a selection of digital humanities presentation tools. each time the course is taught, students build on and improve the research conducted by students in previous years. a new, improved data set based on this work will be published following each course iteration. (2) method in the otp, undergraduate students learn cinema studies research methods within a context of film exhibition history and audiences course content. students conduct original research in primary sources to compile data and to compose short narratives about oregon movie theaters during the period of study (1894–1929). primary sources include newspapers, industry trade journals, city and county directories, business directories, maps, and photographs. students in the course use a shared google drive with a hierarchical folder and file system to manage their research materials. steps students enter data directly into a structured website platform built on a drupal content management system. figure 1 shows a screenshot of part of the page students use to enter information about a new theatre. data is updated directly in the platform every time a class is taught. the drupal database includes images taken from newspapers that are the source of most of the information contained in the database. these images are taken informally as screen shots and published on our website under “fair use” terms. because we do not have copyright documentation or permissions for each image, we are not including the images as part of this data set. however, we include several data columns that reference these files to create more contextual information. first, we include a column, ‘works_cited’, that offers unstructured text citations to sources. second, we include both plain text and full html versions of text from the website (column names are ‘body’, ‘body_ html’; ‘additional_facts’, ‘additional_facts_html’; ‘works_cited’; ‘works_cited_html’). the html versions include relative links to images as they are embedded in the text. finally, we include a variable that lists image file names for images highlighted in a special section on the page (‘gallery_images’). in theory, this should allow users to create links back to the images for the lifetime of the website. https://doi.org/10.7910/dvn/fgouz3 https://oregontheaterproject.uoregon.edu/ 3aronson et al. 
journal of open humanities data doi: 10.5334/johd.92 quality control the course instructors serve as editors for the course data and content. they review every entry for accuracy, citations, and correct formatting. students follow a file-naming convention that embeds source citation information within file names to ensure proper attribution during data entry and writing. this method also allows the course instructors to easily consult the research materials to verify facts as presented in the theater data and narratives. after the class is finished, the course instructors remediate any data entry errors that affect data completeness (such as missing geospatial coordinates) in the drupal database. however, because when we began this project proofreading was focused on the human-readable website and not on creating machine-readable data, we have not systematically corrected differences in formatting in string variables such as addresses. missing data may be blank or listed as ‘unknown’ or ‘unknown’ and there may be extra spaces, periods, or other irregularities. we hope in future versions to remediate these issues. data is exported as a csv file from several sql views in the drupal database, cleaned using an r script, and saved as new spreadsheets. as documented in the readme file and the r script included with the data set, we trim white space from some columns, split out some variables, and join several spreadsheets to create final versions we think may be most useful to future users. blanks have been left as they are rather than converted to nas. to make this data widely accessible, we share results in tab-delimited form and as excel files; we also share the original files downloaded from drupal and the r script used to process them. in future versions of this data set, we hope to also include links to theater urls in the front-end database and shapefiles corresponding to theater locations. data structure while the data readme will include complete, up-to-date documentation of data variables as the data set grows and evolves, here we highlight import elements of the processed data that we expect will remain stable over time. the tabular data contained in the files ‘theaters_[date]. tab’ and ‘theaters_excel_[date].xlsx’ includes the following important variables: id (integer) – unique id assigned to each theater “entry” in drupal. a theater with the same name will sometimes be listed more than once (and thus will have more than one theater id). sometimes this means that the theater has moved, and sometimes it means that two unrelated theaters with the same name appear in two locations. theater_name (character) – theater refers to a physical building, sometimes called a “cinema” or “cineplex.” we are defining a theater as anywhere where a film was displayed to a public audience. theater names are not unique. address (character) – full address (if known) or intersection. we hope in future to standardize entries in this column. figure 1 a partial screenshot of the drupal form for entering information about theaters in the oregon theater project website. 4aronson et al. journal of open humanities data doi: 10.5334/johd.92 city, state, city_state (character) – city in oregon, state (or), or “city, or”. latitude, longitude (double/float) – in degrees. start_date_of_operation, end_date_of_operation (date) – in “yyyy-mm-dd” format. 
theaters for which no closing date was entered were coded by the drupal database as "ongoing" or "still open." this may mean they are in fact still open, or it may mean that the closing date is unknown. in either case, the data export records their closing date as the date the data was last downloaded. these theaters will have the most recent "end_date" entries and are recognizable because many will "end" on the same recent day.
start_year, end_year (integer) – in "yyyy" format.
number_of_seats (character) – venue capacity. this is sometimes an integer, but sometimes it includes more extensive notes or estimates.
owner_and_manager_names (character) – if individual names were created as separate entries in the drupal database, then each name is separated by a semicolon in this column. however, some entries were created as just one entry separated by commas or have complex annotations. we hope in future to standardize this field to allow exploration of who owned more than one theater.
body, additional_facts, body_html, additional_facts_html (character) – descriptions of the movie theater written by a student or group of students. "html" versions include all html formatting that creates the page, including links to embedded images. important note: in the 'theaters_excel_[date].xlsx' version of the data set, 'body_html' is replaced by 'body_html_length', an integer listing the number of characters in the 'body_html' column. because some cells exceed the maximum cell length in excel, 'body_html' is omitted from the excel files.
gallery_images (character) – list of 0 to many relative links to images used in the "gallery" section of a blog post, separated by semicolons.
the 'owners_[date].tab' and 'owners_excel_[date].xlsx' files repeat information found in the theaters spreadsheets but create a new row for each owner/manager of a particular theater that was broken out (separated by a semicolon) in the original data. 'owner_and_manager_names' (character) is the only column containing unique values in this spreadsheet.
the 'articles_[date].tab' and 'articles_excel_[date].xlsx' spreadsheets include a list of articles (blog posts) that are not entries for a specific theater. the articles data have a unique integer id assigned by drupal, 'gallery_images', 'body', and either 'body_html' or 'body_html_length' columns with the same specifications as the theaters data sets. columns unique to this data set include 'authored_by' (character), which is the name of the drupal user who uploaded the article (sometimes but not always the article author), and 'categories' (character), a list of 0 to many topic tags assigned in drupal and separated by semicolons. data users could link articles to theaters spreadsheets via the 'related_cities_and_theaters' column in the articles data, which sometimes indicates that the article describes a theater located in a particular city. any such join would be incomplete, since the column takes between 0 and many cities or theaters, separated by a semicolon. the column would need to be divided into multiple columns and parsed to identify cities vs theaters. in future we plan to parse this column for users. cities are listed in the format "city, or" and could be joined via the 'city_state' column in the theaters spreadsheet. theaters should be listed using the same name used in the 'theater_name' column in the 'theaters' spreadsheet, but there may be errors.
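several of the columns just described ('gallery_images', 'owner_and_manager_names', 'categories', 'related_cities_and_theaters') pack zero to many values into a single cell separated by semicolons. a minimal pandas sketch of how a reuser might explode one of these columns into a long format, similar to what the published owners file already provides; the file name below is hypothetical and simply follows the naming pattern described above, and blanks are dropped rather than converted to nas, in keeping with the export conventions described earlier.

```python
import pandas as pd

# hypothetical file name following the 'theaters_[date].tab' pattern described above
theaters = pd.read_csv("theaters_2022-08.tab", sep="\t")

# split the semicolon-separated owner/manager column into one row per name,
# trimming stray whitespace introduced during manual data entry
owners = (
    theaters
    .assign(owner_and_manager_names=theaters["owner_and_manager_names"]
            .fillna("")
            .str.split(";"))
    .explode("owner_and_manager_names")
)
owners["owner_and_manager_names"] = owners["owner_and_manager_names"].str.strip()
owners = owners[owners["owner_and_manager_names"] != ""]

print(owners[["id", "theater_name", "owner_and_manager_names"]].head())
```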
since the combination of ‘theater_name’ and ‘city_state’ is likely to be unique, articles could be imperfectly joined to theaters using both columns as keys. (3) dataset description object name – oregon theater project database. see ‘or_theater_project_readme_2022-08. txt’ for complete list of filenames. format names and versions – tab, txt, xlsx, r, pdf creation dates – 2020-01-01 to 2022-08-26 5aronson et al. journal of open humanities data doi: 10.5334/johd.92 dataset creators michael aronson and elizabeth peterson (university of oregon) were responsible for conceptualization, funding acquisition, project administration, supervision, dataset creation and editing. john zhao and gabriele hayden (university of oregon) designed the data export views, and gabriele hayden cleaned and curated the dataset. the following university of oregon students contributed research and writing to create this dataset: lauren adzima, khalil afariogun, andrew arachikavitz, malia balzer, jacob beeson, sylas bosman, kyra brennan, ezra brothers, christian cancilla, katy cannon, eliza castillosalazar, jourdan cerillo, tom chamberlain, shelby chapman, cody churchill, jude corwin, heath cotter, julian d’ambra, megan deck, patrick dunham, chloe duryea, leah durkee, morgan egbert, maggie elias, jack elliot, joseph endler, emily fine, kyle fleming, alex fox, javier fregoso, sammie garcia, hayden garrett, ireland gill, austin griggs, tayte hansen, isabella harrington, kara hilton, ashli horrell, amanda james, zach jones, ethan laarmanhughes, addie lacewell, abby lewis, jimmy lieu, kaden lipkin, joie littleton, wanfang long, peter lovejoy, shelby marthaller, cassie mccready, carly mcdaniel, brittany mcdowell, brendan mcmahon, eric mcmichael, maddie miner, maryam moghaddami, jack moran, parker morgan, nicholas mundorff, alexis neal, michael o’ryan, kelsey parker, dre parkinson, reese patanjo, katherine pelch, ben pettis, sienna pigg, shelby platt, ellie reis, bailey rierden, manuel rios, jayna rogers, anthoni rosas, emily ruthruff, payton schiffer, becca schomer, huntley sims, bella smith, megan snyder, britnee spelce-will, malley stanovsek, connor templeman, weston tengan, jess thompson, sarah tidwell, evan vacek, dylan wakelin, jalon watts, joe weber, makaal williams, veronica wilson, charlie winn, david young, and sam zepeda. language – english license – cc-by repository name – harvard dataverse publication date – 2022-10-31 (4) reuse potential this data is likely to be of interest to scholars in the humanities and social sciences. it could be used to create new visualizations or digital exhibitions; re-creating a map of these venues, for example, could be a project for an advanced digital humanities course. it could be aggregated with other regional, national, or international cinema history projects, such as that shared on the mapping movies site, or could be modified to fit the data model used by cinema context or the european cinema audiences project1 to allow for the comparative study of cinema venues (klenotic, 2022; create, 2022). however, this would require standardizing many of the freeform columns in our data. the information contained in this data set would map onto the venue, address, person, company, publication, and archive tables in the original cinema context sql database (van oort & noordegraaf, 2020). this data could also be used in social science research, for example to track the relationship between the opening and closing of theaters and larger socioeconomic trends across oregon. 
one of our anonymous reviewers offered several specific, inspiring suggestions for how our data set, aggregated with others, could be useful in tracking historical questions. for example, the data on theater owners and managers could be cleaned and aggregated with other data sets to map female business ownership during the years leading up to the passage of the 19th amendment granting women’s suffrage in the us in 1920. theater openings and closings might offer insights—particularly when aggregated with other historical business data in oregon or data on other theaters across the us—into how businesses adapted to economic shocks such as world war i, the 1918 flu pandemic, or the white supremacist terrorism of the red summer of 1919. scholars seeking to pursue the kinds of data aggregation that would allow for such work must do a great deal of sophisticated data processing to normalize data across differences of data definition and structure. we have done our best to document how our data is defined and 1 https://www.europeancinemaaudiences.org/research/, last accessed date: 8 november 2022. https://www.europeancinemaaudiences.org/research/ 6aronson et al. journal of open humanities data doi: 10.5334/johd.92 structured to allow for others to build on our work. however, as we discuss in aronson et al. (2022a), the first challenge scholars face is simply gaining access to the data itself. the data set from that paper includes links to the minority of projects surveyed that do share data as of 2022 and may form a starting point for scholars seeking to do comparative work (aronson et al., 2022b). we are inspired to share our own small, imperfect data set to model for colleagues what we hope they will do as well: share data early and often, updating as the extent and quality of the data improves over time. acknowledgements the otp platform was created in collaboration with shirley galloway, loring hummel, daniel mundra, caden williams and john zhao, programmers and web designers in the college of arts and sciences at the university of oregon. thank you to our reviewers, whose suggestions have greatly improved the quality of this data paper and given us several ideas for how to improve our data going forward. funding information funding for the oregon theater project was, in part, provided by a 2019 instructional grant (approximately $15,000) from the tom and carol williams fund for undergraduate education at the university of oregon. competing interests the authors have no competing interests to declare. author contributions michael aronson: conceptualization, funding acquisition, project administration, supervision, writing elizabeth peterson: conceptualization, funding acquisition, project administration, supervision, writing gabriele hayden: data curation, writing author affiliations dr. michael aronson orcid.org/0000-0003-1790-7816 cinema studies, university of oregon, eugene, or, usa elizabeth peterson orcid.org/0000-0003-1258-4122 digital scholarship services, university of oregon libraries, eugene, or, usa dr. gabriele hayden orcid.org/0000-0003-4740-4187 data services, university of oregon libraries, eugene, or, usa references aronson, a., peterson, e., & hayden, g. (2022a). local cinema history at scale: data and methods for comparative exhibition studies. (forthcoming). iluminace: journal for film theory, history, and aesthetics, 34(2). preprint. doi: https://doi.org/10.7264/t0ky-0q37 aronson, a., peterson, e., & hayden, g. (2022b). 
“replication data for: local cinema history at scale: data and methods for comparative exhibition studies”. harvard dataverse, v1. unf:6:/qdv535cscvkd2odc/ dakq== [fileunf]. doi: https://doi.org/10.7910/dvn/6woqpo create. (2022). cinema context rdf documentation. retrieved from https://uvacreate.gitlab.io/cinemacontext/cinema-context-rdf/ (last accessed date: 8 november 2022). klenotic, j. (2022). mapping movies. retrieved from http://mappingmovies.unh.edu/ (last accessed date: 8 november 2022). van oort, t., & noordegraaf, j. (2020). the cinema context database on film exhibition and distribution in the netherlands: a critical guide: arts and media. research data journal for the humanities and social sciences, 5(2), 91–108. doi: https://doi.org/10.1163/24523666-00502008 https://orcid.org/0000-0003-1790-7816 https://orcid.org/0000-0003-1790-7816 https://orcid.org/0000-0003-1258-4122 https://orcid.org/0000-0003-1258-4122 https://orcid.org/0000-0003-4740-4187 https://orcid.org/0000-0003-4740-4187 https://doi.org/10.7264/t0ky-0q37 https://doi.org/10.7910/dvn/6woqpo https://uvacreate.gitlab.io/cinema-context/cinema-context-rdf/ https://uvacreate.gitlab.io/cinema-context/cinema-context-rdf/ http://mappingmovies.unh.edu/ https://doi.org/10.1163/24523666-00502008 7aronson et al. journal of open humanities data doi: 10.5334/johd.92 to cite this article: aronson, m., peterson, e., & hayden, g. (2022). oregon theater project: a dataset of oregon cinemas from the silent era. journal of open humanities data, 8: 27, pp. 1–7. doi: https://doi. org/10.5334/johd.92 published: 12 december 2022 copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. https://doi.org/10.5334/johd.92 https://doi.org/10.5334/johd.92 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ forming the corpus criticorum (1450–1640): bibliography, title pages, dataset data paper corresponding author: goran gaber école des hautes études en sciences sociales (lier-fyt), paris, france; maison française d’oxford, oxford, uk goran.gaber@ehess.fr keywords: critique; title pages; union catalogues; dataset; book history; history of concepts to cite this article: gaber, g. (2022). forming the corpus criticorum (1450–1640): bibliography, title pages, dataset. journal of open humanities data, 8: 28, pp. 1–7. doi: https://doi. org/10.5334/johd.93 forming the corpus criticorum (1450–1640): bibliography, title pages, dataset goran gaber abstract the corpus criticorum (1450–1650) (cc) is a pioneering comprehensive bibliography of early modern publications that feature the notion of critique on their title pages. it was constituted by collecting, validating, and curating information from pan-european, language-based, and national union catalogues. a complementary and interconnected “data package” was deposited on zenodo, comprising: (1) a classical text-based bibliography, supplemented by (2) a csv dataset of information contained therein, (3) the images of title pages not readily available online, and (4) a comprehensive bibtex dataset. 
the cc can be reused in further internal research on the history of critique and as a model for research on the history of other concepts and ideas.
(1) overview
repository location
bibliography: https://doi.org/10.5281/zenodo.7233745
dataset: https://doi.org/10.5281/zenodo.7231721
images: https://doi.org/10.5281/zenodo.6984843
bibtex: https://doi.org/10.5281/zenodo.7231197
context
the corpus criticorum (1450–1650) is a comprehensive bibliography of early modern publications that feature the notion of critique on their title pages. it aims to answer questions about the chronology, the geography, and the discursive domains of critique's conceptual institutionalisation. an exploratory version of the cc was formed in 2018 during a phd thesis at the ehess in paris (gaber 2019). the methodology was updated, and the study was repeated and finalised during two postdoctoral fellowships in 2021, first at the leibniz-zentrum für literatur- und kulturforschung in berlin and then at the oxford centre for european history.
(2) method
the corpus criticorum (1450–1650) (cc) builds upon and improves existing research methods on the history of critique. while previous surveys relied either on terminological reference works or the writings of authoritative intellectual figures, the cc was able to exploit recent advances in the digitisation of library catalogues worldwide. by manually querying available union catalogues, an unprecedentedly exhaustive corpus of texts was formed, one that includes 218 publications relevant to the history of critique and thus significantly expands the spectrum of its historical canon. the early modern period was chosen for the following theoretical and methodological reasons. firstly, it is an ideal testing ground for georges canguilhem's insight on the process of conceptual institutionalisation, namely, that "when a word appears in the title of a book or a paper, it has been recognised as more than a mere metaphor by the competent scientific community" (canguilhem 1988: 100). the comprehensiveness of early modern title pages, as well as their symbolic importance and commercial function in that period (bertram, büttner & zittel 2021; maclean 2012), suggest that this era is, in fact, particularly appropriate for tracing the process of critique's conceptual institutionalisation according to canguilhem's criterion. secondly, this period is also the site of a favourable overlap of "bibliographic offer" and "historiographic demand". on the one hand, critique's history before 1650 is particularly crepuscular, with only a handful of great intellectual figures lighting the way. on the other hand, 1650 also marks the current end-point of the most comprehensive european union catalogue to date, the universal short title catalogue (ustc, https://www.ustc.ac.uk/). a final methodological decision must be mentioned before describing the research procedure itself, namely, that the cc comprises only the first editions of particular texts; republications and translations are not included.
step 1: identifying the sources
for the reasons just stated, the ustc was identified as the primary source for the constitution of the cc and complemented by the british library's incunabula short title catalogue (istc, https://data.cerl.org/istc/_search) for publications prior to 1500.
since the ustc is an ongoing endeavour, other union, language-based, and national library catalogues from the following countries and areas were manually queried, both as means of verifying the information in the ustc and compensating for potential omissions therein: belgium (unicat, https://www. unicat.be/), croatia (hrvatski nacionalni skupni katalog, http://skupnikatalog.nsk.hr/), czech republic (skc, https://aleph.nkp.cz/f/?func=file&file_name=find-b&local_base=skc&con_ https://doi.org/10.5281/zenodo.7233745 https://doi.org/10.5281/zenodo.7231721 https://doi.org/10.5281/zenodo.6984843 https://doi.org/10.5281/zenodo.7231197 https://www.ustc.ac.uk/ https://data.cerl.org/istc/_search https://www.unicat.be/ https://www.unicat.be/ http://skupnikatalog.nsk.hr/ https://aleph.nkp.cz/f/?func=file&file_name=find-b&local_base=skc&con_lng=eng 3gaber journal of open humanities data doi: 10.5334/johd.93 lng=eng), denmark (bibliotekdk, https://bibliotek.dk/eng), england and the british isles (estc, http://estc.bl.uk/), estonia (ester, https://www.ester.ee/), finland (melinda, https://melinda. kansalliskirjasto.fi/), france (ccfr, https://ccfr.bnf.fr/), germany’s vd 16 (https://www.gatewaybayern.de/) & vd 17 (http://www.vd17.de/en), hungary (mokka, http://mokka.hu/en/), italy (opac sbn, https://opac.sbn.it/), latvia (kopkatalogs, https://kopkatalogs.lv/), lithuania (libis, https://www.lnb.lt/en/?view=article&id=1387:lithuanian-integral-information-system-oflibraries-libis&catid=26), netherlands (ncc, https://picarta.oclc.org/), poland (nukat, https:// centrum.nukat.edu.pl/en/), portugal (porbase, https://porbase.bnportugal.gov.pt/), russia (nlr online catalogues, https://nlr.ru/eng/ra2117/online-catalogues), slovakia (slovak library union catalogue, https://www.snk.sk/en/home/28-catalogues/416-slovak-library-unioncatalogue.html), spain (rebiun for university libraries, https://rebiun.baratz.es/rebiun/; rebeca opac for public libraries, http://catalogos.mecd.es/rebeca/rebecaopac/; bne, https://www.bne. es/en/catalogues), and sweden (libris, http://libris.kb.se/). step 2: search string the following search strings were used to manually query the catalogues: *criti* (covering latin, romance languages, and english) and *kriti* (covering both german and slavic languages, as well as latinised transcriptions of titles composed in ancient greek). as european languages introduced the notion of critique in their vocabularies from ancient greek or latin, the working hypothesis was that these search strings could capture the quasitotality of relevant items. step 3: data validation after harvesting the catalogues, the following procedure was applied to validate the acquired information. first, impertinent items were discarded from the cc. these included conceptually irrelevant results (e.g., “vita theocriti”, “descritione”, etc.), so-called “bibliographical ghosts”, blatant authorial misattributions, and evidently erroneous dating (figure 1). second, bibliographic metadata was manually verified by consulting a digitised image of the relevant title page readily available online or by acquiring such an image from an institutional or private owner of the book in question (figure 2). third, for all the items included in the cc, information concerning an available copy was searched for when the publication could not be readily accessed online (figure 3). 
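although the catalogues were queried manually through their own interfaces, the matching logic behind the *criti* and *kriti* search strings, and the first pass of weeding out impertinent hits, can be expressed compactly. a minimal sketch, with illustrative title strings only:

```python
import re

# substrings corresponding to the *criti* / *kriti* search strings described above
PATTERN = re.compile(r"criti|kriti", re.IGNORECASE)

titles = [
    "dissertation critique sur l'art poetique d'horace",  # relevant hit
    "vita theocriti",                                      # impertinent hit (matches by accident)
    "descritione dell'isola",                              # impertinent hit (illustrative)
    "kritik der reinen vernunft",                          # relevant hit (illustrative)
]

candidates = [t for t in titles if PATTERN.search(t)]
# every string above matches; separating conceptually relevant items from
# accidental matches such as "vita theocriti" remains a manual, expert decision
for t in candidates:
    print(t)
```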
step 4: data curation
different data curation procedures were applied to each item of the cc "data package", that is, (1) the classical, text-based bibliography, (2) the csv dataset, (3) the images of title pages, and (4) the bibtex dataset.
figure 1: the ustc entry for andré dacier's (1651–1722) dissertation critique sur l'art poétique d'horace with the publication date given as 1618.
figure 2: the erroneously dated title page of the dissertation critique sur l'art poétique d'horace (1618).
figure 3: the correctly dated title page of the dissertation critique sur l'art poétique d'horace (1698).
the classical bibliography was standardised according to the guidelines of the association of college and research libraries' rare books and manuscripts section (2011). this process included correctly identifying the various areas of the title page (statement of responsibility, edition, title proper) and following its guidelines on transcribing ancient greek, early modern gothic, and renaissance latin script. furthermore, the bibliography provides standardised names of authors, editors, and translators of the publications and indicates the modernised form of places of publication. finally, information that could not be verified is signalled with an *.
the comprehensive csv dataset was constructed in a way that facilitates cross-referencing with the classical, text-based bibliography. thus, the title page transcription was kept "as-is", meaning that its text was not romanised, and interpolations remained in square brackets. however, the names of authors, editors, translators, and publishers or publishing entities of the texts are presented separately in tabular form, as are the year and place of publication. moreover, an internal identification number (e.g., cc_book_001) was attributed to every item in the cc, and url links were provided for its source catalogue and the online site of the title page (or the physical location of the consulted book).
the supplement presenting jpeg images of title pages not readily available online includes the cc identification number, the location of the physical copy, its shelf mark, and other standard bibliographic metadata. it also provides relevant information concerning the copyright of the presented images.
the bibtex dataset have been elaborated concomitantly with the classical, text-bibliography, and shares its data curation procedures. step 5: data publication the classical bibliography was formed with zotero and by choosing the 12th edition of harvard’s “cite them right” format. the comprehensive dataset was constructed in microsoft excel and exported in csv file format. the supplement presenting the images of title pages not readily available online was formed in microsoft powerpoint and published in pdf form. the bibtex dataset was likewise constructed in and exported from zotero. quality control the author manually examined the quality of the data gathered from the abovementioned catalogues. (3) dataset description object name gaber, g. (2022a). corpus criticorum (1450–1650)—a comprehensive bibliography of early modern publications featuring the notion of critique on their title pages. zenodo. https://doi. org/10.5281/zenodo.7233745 gaber, g. (2022b). corpus criticorum (1450–1650)—supplement 1—a comprehensive dataset of early modern publications featuring the notion of critique on their title pages. zenodo. https://doi.org/10.5281/zenodo.7231721 gaber, g. (2022c). corpus criticorum (1450–1650)—supplement 2—reproduction of title pages not accessible in the public domain. zenodo. https://doi.org/10.5281/zenodo.6984843 gaber, g. (2022d). corpus criticorum (1450–1650)—supplement 3—a bibtex dataset of early modern publications featuring the notion of critique on their title pages. zenodo. https://doi. org/10.5281/zenodo.7231197 format names and versions pdf, csv, bib creation dates – 2018-10-01 to 2021-12-06 dataset creators – goran gaber (ehess, lier-fyt & mfo) was responsible for gathering, curating and publishing the data. language – ancient greek, english, french, italian, latin, portuguese, spanish https://doi.org/10.5281/zenodo.7233745 https://doi.org/10.5281/zenodo.7233745 https://doi.org/10.5281/zenodo.7231721 https://doi.org/10.5281/zenodo.6984843 https://doi.org/10.5281/zenodo.7231197 https://doi.org/10.5281/zenodo.7231197 6gaber journal of open humanities data doi: 10.5334/johd.93 license – cc attribution 4.0 international repository name zenodo publication date – 2022/08/12 and 2022/10/20 (4) reuse potential the cc can be reused in further internal research on the history of critique and as a model for research in the history of other ideas or concepts. with regards to further internal research on the history of critique, two articles exploiting the cc’s potential are already in preparation. the first will delineate the chronology and the geography of critique’s conceptual institutionalisation in the early modern period by exploiting information contained in the “year of publication” and the “place of publication” columns in supplement 1. the second will build upon and enrich the existing dataset by identifying the religious affiliations of the authors, editors, and translators of the publications covered by the cc. the publication of a third, “people-focused”, supplement is planned for 2023. an additional pathway for reusing the cc is situated in the field of social network analysis. doing so, however, would imply further historical research on the identities of publishers and publishing entities of the books in the cc. likewise, the existing dataset would need to be enriched with unique (viaf) identifiers for authors editors, translators and publishers. finally, cc’s method and workflow can also be used as a model for research on the history of other ideas and concepts. 
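to make the first of these reuse scenarios concrete, a hedged pandas sketch of aggregating supplement 1 by decade and by place of publication; the column labels follow the prose above, and the released csv may name them differently.

```python
import pandas as pd

cc = pd.read_csv("corpus_criticorum_supplement1.csv")  # assumed file name

# column labels below follow the prose description; check the actual header row
year = pd.to_numeric(cc["year of publication"], errors="coerce")
cc["decade"] = (year // 10 * 10).astype("Int64")

print(cc.groupby("decade").size())                          # chronology of title-page appearances
print(cc["place of publication"].value_counts().head(10))   # most frequent places of publication
```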
geographical and chronological limitations of the current research framework can easily be surmounted by including other relevant union catalogues. acknowledgements i would like to thank the following libraries and institutions for providing the images of title pages not readily available online: tallin university library, tallin, estonia bibliothèque mazarine, paris, france biblothèque nationale de france, paris, france bibliothèque nationale et universitaire, strasbourg, france niedersächsische staatsund universitätsbibliothek göttingen, germany ratsschulbibliothek zwickau, zwickau, germay stadtbibliothek braunschweig, braunschweig, germany thüringer universitätsund landesbibliothek jena, germany universitätsbibliothek erlangen-nürnberg, nürnberg, germany biblioteca dell’archiginnasio, bologna, italy biblioteca del seminario vescovile di padova, italy biblioteca nazionale braidense, milano, italy biblioteca sormani, milano, italy biblioteka uniwersytecka, uniwersytet wrocławski, wroclaw, poland sir duncan rice library, university of aberdeen, aberdeen, scotland, uk the bodleian libraries, oxford, uk the british library, london, uk hanna holborn gray special collections research center, university of chicago library, chicago, usa biblioteca nacional de españa, madrid, spain 7gaber journal of open humanities data doi: 10.5334/johd.93 to cite this article: gaber, g. (2022). forming the corpus criticorum (1450–1640): bibliography, title pages, dataset. journal of open humanities data, 8: 28, pp. 1–7. doi: https://doi. org/10.5334/johd.93 published: 14 december 2022 copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. funding information école des hautes études en sciences sociales, phd fellowship. daad postdoctoral fellowship. competing interests the author has no competing interests to declare. author affiliation goran gaber orcid.org/0000-0002-0139-8220 école des hautes études en sciences sociales (lier-fyt), paris, france; maison française d’oxford, oxford, uk references association of college and research libraries, rare books and manuscripts section, & bibliographic standards committee. (2011). descriptive cataloging of rare materials (books). washington, dc: library of congress. bertram, g., büttner, n., & zittel, c. (eds.) (2021). gateways to the book: frontispieces and title pages in early modern europe. leiden: brill. doi: https://doi.org/10.1163/9789004464520 canguilhem, g. (1988). ideology and rationality in the history of the life sciences. cambridge, ma: mit press. gaber, g. (2019). histoire et logique de la raison critique. de la philologie classique à la réflexion transcendantale (xvie–xviiie siècle). paris: école des hautes études en sciences sociales. http://www. theses.fr/2019ehes0009 maclean, i. (2012). scholarship, commerce, religion. the learned book in the age of confessions. cambridge, ma: harvard university press. 
doi: https://doi.org/10.4159/harvard.9780674065321 https://doi.org/10.5334/johd.93 https://doi.org/10.5334/johd.93 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://orcid.org/0000-0002-0139-8220 https://orcid.org/0000-0002-0139-8220 https://doi.org/10.1163/9789004464520 http://www.theses.fr/2019ehes0009 http://www.theses.fr/2019ehes0009 https://doi.org/10.4159/harvard.9780674065321 the conlit dataset of contemporary literature data paper corresponding author: andrew piper department of languages, literatures, and cultures, mcgill university, montréal, ca andrew.piper@mcgill.ca keywords: literature; fiction; english (language); readership to cite this article: piper, a. (2022). the conlit dataset of contemporary literature. journal of open humanities data, 8: 24, pp. 1–7. doi: https://doi. org/10.5334/johd.88 the conlit dataset of contemporary literature andrew piper abstract this dataset includes derived data on a collection of ca. 2,700 books in english published between 2001–2021 and spanning 12 different genres. the data was manually collected to capture popular writing aimed at a range of different readerships across fiction (1,934) and non-fiction (820). genres include forms of cultural capital (bestsellers, prizewinners, elite book reviews), stylistic affinity (mysteries, science fiction, biography, etc.), and age-level (middle-grade and young adult). the dataset allows researchers to explore the effects of audience, genre, and instrumentality (i.e., fictionality) on the stylistic behavior of authors within the recent past across different classes of professionally published writing. mailto:andrew.piper@mcgill.ca https://doi.org/10.5334/johd.88 https://doi.org/10.5334/johd.88 https://orcid.org/0000-0001-9663-5999 2piper journal of open humanities data doi: 10.5334/johd.88 (1) overview repository location https://doi.org/10.6084/m9.figshare.21166171.v1 context access to well-defined collections of contemporary writing is extremely limited today due to intellectual property restrictions, corporate control of data, and the absence of clear consensus surrounding literary categorization. our dataset is designed to provide researchers with freely accessible derived data of a robust collection of professionally published writing in english produced since 2001, which spans 12 different genre categories. while the term “genre” has been understood in multiple ways within the research community over the years (cohen, 1986; underwood, 2016a), we define genre for our purposes as a form of institutionally framed classification (castellano, 2018). according to this definition, genre is what a given institution labels a book using a distinct category of writing. as we show with the overview of our data (table 1), our institutional frameworks can include bestseller lists, prize committee shortlists, book review lists, user-generated “choice awards”, or corporate forms of categorization. taken together, they allow research on three different types of institutional framing: cultural capital, stylistic affinity, and reading level. rather than rely on a single “best” framework, we choose to include multiple forms of selection to allow researchers to explore the effects of different institutional frameworks on stylistic behavior. in addition to our manually curated selection of books, we also provide researchers with a set of derived features that can be used for further research on the style and content of books (described in table 2). 
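as a quick orientation for reusers, a sketch of loading the released csv and tabulating the collection by genre code and by fiction/non-fiction; the file and column names here are assumptions for illustration, so consult the header row of the file distributed on figshare for the exact labels.

```python
import pandas as pd

conlit = pd.read_csv("conlit.csv")          # assumed file name

print(conlit["genre"].value_counts())       # books per genre code (bio, bs, hist, ...)
print(conlit["category"].value_counts())    # fiction vs non-fiction
```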
(2) method
steps
the steps for our dataset construction were the following. books were manually selected according to the sampling strategies described in table 1; digitized and manually cleaned; processed using the "large model" of booknlp (bamman, 2022); and manually and computationally annotated for features indicated in table 2.
table 1: list of genres, their selection criteria, and the total number of documents per category.
code | genre | instrumentality | platform | selection criteria | # docs
bio | biography | non-fiction | goodreads | "best memoir/biography/autobiography" list | 193
bs | bestseller | fiction | new york times | fiction published since 2001 with the longest aggregate time on the new york times bestseller list | 249
hist | history | non-fiction | amazon | books listed under "history" under the "bestsellers" tag | 205
mem | memoir | non-fiction | amazon | books listed under "memoir" under the "bestsellers" tag | 229
mid | middle school | fiction | goodreads | goodreads choice awards for "middle grade" books | 166
mix | assorted nonfiction | non-fiction | amazon | books listed under assorted non-fiction tags such as "health", "politics", and "business", under the "bestsellers" tag | 193
my | mystery | fiction | amazon | books listed under "mystery, thriller, suspense" under the "bestsellers" tag | 234
nyt | new york times reviewed | fiction | new york times | fiction reviewed in the new york times book review | 419
pw | prizelists | fiction | 5 prizelists (us, uk, canada) | works shortlisted for the national book award (us), pen/faulkner award (us), governor general's award (canada), giller prize (canada), and the man booker prize (uk) | 258
rom | romance | fiction | amazon | books listed under "romance" under the "bestsellers" tag | 208
sf | science-fiction | fiction | amazon | books listed under "science fiction & fantasy" under the "bestsellers" tag | 223
ya | young adult | fiction | goodreads | goodreads choice awards for young adult fiction | 177
sampling strategy
all books were chosen to represent "popular" writing across 12 different genres of contemporary publishing spanning a 20-year timeframe dating from 2001 through 2021. we define "popular" through multiple criteria that include user-generated awards or lists, elite prize committee lists or book reviews, or bestseller tags on platforms like amazon or the new york times. as a further way to validate popularity, we provide two measures drawn from the platform goodreads. we define genre through three different kinds of institutional framing: cultural capital (bestsellers, prizewinners, elite book reviews), stylistic affinity (mysteries, science fiction, biography, etc.), and age-level (middle-grade and young adult (ya)). this allows researchers a high degree of flexibility to better understand stylistic behavior of professionally published books targeting different kinds of readerships. we also segment our genres by the "instrumentality" of the information contained ("fiction" or "non-fiction"). while our genre categories are not mutually exclusive (mysteries may appear in bestsellers and vice versa), no books appear in two separate categories. it is important to note that our larger genre categories (cultural capital, style, age) are not necessarily commensurate with one another and thus researchers should use caution when comparing across these categories. experimentation with alternative genre labeling systems can be a further affordance of this dataset. finally, we aimed to select ca.
200 works per category, which we have found is sufficient for training robust text classification algorithms. due to text availability, list sizes, and cleaning, some categories have more or less than this number. in the case of those books reviewed in the new york times, we iterated twice on this process. in total, we assemble 2,754 books representing 2,234 unique authors across 12 genres. to further understand our data, we provide figures of the distribution of publication dates (figure 1), the average user rating on goodreads (figure 2), and the log-transformed number of ratings on goodreads (figure 3) to capture book popularity. finally, while no attention was given to the selection of books based on author gender, our gender distribution across all books table 2 list of 20 features included in our data. feature description annotation type category fiction or non-fiction manual genre twelve categories manual publication date date of first publication manual author gender perceived authorial gender manual pos part-of-speech uniand bigrams computational supersense frequency of 41-word supersenses computational word frequencies word frequencies for every book/1,000-word passage computational token count work length measure computational total characters estimated total number of named characters computational protagonist concentration percentage of all character mentions by main character computational avg. sentence length average length of all sentences per book computational avg. word length average length of all words per book computational tuldava score reading difficulty measure computational event count estimated number of diegetic events computational goodreads avg. rating average user rating on goodreads computational goodreads total ratings total number of ratings on goodreads as of june 2022 computational average speed measure of narrative pace computational minimum speed measure of narrative distance computational volume measure of topical heterogeneity computational circuitousness measure of narrative non-linearity computational 4piper journal of open humanities data doi: 10.5334/johd.88 is 49.76% women and 49.94% men with only eight books written by self-identified non-binary authors. we note, however, that there are meaningful within-genre differences (figure 4) as predicted by prior research (argamon et al., 2003). figure 1 distribution of publication dates of books in our sample. figure 2 distribution of the average user rating on goodreads for books in our sample. only includes books with > 9 ratings. figure 3 distribution of the log-transformed number of ratings on goodreads for books in our sample. only includes books with > 9 ratings. 5piper journal of open humanities data doi: 10.5334/johd.88 quality control all texts were manually cleaned of front and end matter. metadata such as publication date, authorial gender, author name and title were all manually entered. the dataset was manually reviewed for the appropriateness of genre labels for every book. finally, duplicates were removed and any books that were not at least 15,000 tokens in length were also removed. no maximum length was set. limitations our data is limited by intellectual property restrictions that do not allow access to full text data. to overcome this limitation, we provide a robust set of derived data that has served in prior research as a reliable foundation for the stylistic understanding of creative writing. our data is also limited by focusing on a single language. 
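the computationally derived features in table 2 are produced from booknlp output; for orientation, a minimal sketch of a typical booknlp invocation follows (the pipeline string and model label are illustrative, as the paper states only that the "large model" was used, not its exact configuration).

```python
from booknlp.booknlp import BookNLP

# pipeline components and model size are illustrative choices
model_params = {"pipeline": "entity,quote,supersense,event,coref", "model": "big"}
nlp = BookNLP("en", model_params)

# process one cleaned plain-text book; booknlp writes token, entity, supersense
# and event tables to the output directory, from which per-book features such as
# token counts, supersense frequencies, and event counts can be derived
nlp.process("cleaned_books/example_book.txt", "booknlp_out/example_book", "example_book")
```

booknlp is, at the time of writing, primarily an english-language pipeline, which is one practical reason the collection is currently limited to english.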
future work will want to emphasize multilingual data construction to facilitate our understanding of cross-cultural stylistic behavior. finally, for both manually and computationally derived features, we expect there to be some level of error. for the manual features, we have undertaken two-levels of review. for the computational features, the booknlp documentation provides estimates on the expected error rates of different predictive models. nevertheless, it is important for researchers to be aware that our derived features are always estimates. we would flag “character count” and “event counts” as two features that are worth further research due to the challenging nature of their prediction. (3) dataset description object name conlit format names and versions .csv creation dates start date: 2015-03-10; end date: 2022-06-22. dataset creators andrew piper (mcgill university) was responsible for the overall design of the dataset. eve kraicer (mcgill university) and joey love (mcgill university) assisted with cleaning and processing the data. figure 4 distribution of author gender by genre. 6piper journal of open humanities data doi: 10.5334/johd.88 language english license attribution-noncommercial 4.0 international (cc by-nc 4.0). repository name figshare publication date 2022-09-22 (4) reuse potential prior work on the computationally driven study of genre has focused on using different selection mechanisms to better understand the role that genre plays in organizing literary communities and reader responses, ranging from studies of historical text data (sharma et al., 2020; underwood, 2016b; wilkens, 2016) to contemporary reader response data (bourrier et al., 2020; pianzola et al., 2020; walsh et al., 2021). summarizing this work, one could say that research on the content or stylistic aspects of genre has largely focused on historical data while research into contemporary genre formations has largely focused on metadata or nonprofessionally published writing. our dataset is thus designed to give researchers access to stylistic data of contemporary, professionally published writing that spans a range of genre definitions and institutional frameworks. doing so can help further research into understanding the role genre plays in constraining authorial behavior. it can also facilitate further understanding that the role of differentiation plays in genre classification (sharma et al., 2022). as genre-theorist ralph cohen argued some time ago, “a genre, therefore, is to be understood in relation to other genres, so that its aims and purposes at a particular time are defined by its interrelation with and differentiation from others” (cohen, 1986, p. 89). our data will facilitate the empirical exploration of such theories. by providing goodreads user response data, our dataset also allows further research into the relationship between style and success (toubia et al., 2021). the links provided to the goodreads versions of our books also allow our data to be combined with reader-based response data. an exciting new avenue of literary study aims to better understand the causes and conditions of readers’ responses to texts (mendelman et al., 2021; pianzola et al., 2020; walsh et al., 2021) and our data provides the infrastructure to undertake such a research program across a large, diverse set of professionally published contemporary writing. funding information the creation of this dataset was funded by the social sciences and humanities research council of canada grant no. 895-2013-1011. 
competing interests the author has no competing interests to declare. author affiliation andrew piper orcid.org/0000-0001-9663-5999 department of languages, literatures, and cultures, mcgill university, montréal, ca https://orcid.org/0000-0001-9663-5999 https://orcid.org/0000-0001-9663-5999 7piper journal of open humanities data doi: 10.5334/johd.88 to cite this article: piper, a. (2022). the conlit dataset of contemporary literature. journal of open humanities data, 8: 24, pp. 1–7. doi: https://doi. org/10.5334/johd.88 published: 11 october 2022 copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. references argamon, s., koppel, m., fine, j., & shimoni, a. r. (2003). gender, genre, and writing style in formal written texts. text & talk, 23(3), 321–346. doi: https://doi.org/10.1515/text.2003.014 bamman, d. (2022). booknlp. retrieved from: https://github.com/booknlp/booknlp (last accessed: june 2022). bourrier, k., & thelwall, m. (2020). the social lives of books: reading victorian literature on goodreads. journal of cultural analytics, 5(1), 1–34. doi: https://doi.org/10.22148/001c.12049 castellano, c. g. (2018). the institution of institutionalism: difference, universalism and the legacies of institutional critique. culture, theory and critique, 59(1), 59–73. doi: https://doi.org/10.1080/147357 84.2017.1410438 cohen, r. (1986). history and genre. new literary history, 17(2), 203–218. doi: https://doi. org/10.2307/468885 mendelman, l., & mukamal, a. (2021). the generative dissensus of reading the feminist novel, 19952020: a computational analysis of interpretive communities. journal of cultural analytics, 6(3), 31–73. doi: https://doi.org/10.22148/001c.30009 pianzola, f., rebora, s., & lauer, g. (2020). wattpad as a resource for literary studies: quantitative and qualitative examples of the importance of digital social reading and readers’ comments in the margins. plos one, 15(1), e0226708. sharma, a., hu, y., wu, p., shang, w., singhal, s., & underwood, t. (2020). the rise and fall of genre differentiation in english-language fiction. chr 2020: workshop on computational humanities research. toubia, o., berger, j., & eliashberg, j. (2021). how quantifying the shape of stories predicts their success. proceedings of the national academy of sciences, 118(26), e2011695118. doi: https://doi. org/10.1371/journal.pone.0226708 underwood, t. (2016a). genre theory and historicism. journal of cultural analytics, 2(2), 1–6. doi: https:// doi.org/10.22148/16.008 underwood, t. (2016b). the life cycles of genres. journal of cultural analytics, 2(2), 1–25. doi: https://doi. org/10.22148/16.005 walsh, m., & antoniak, m. (2021). the goodreads ‘classics’: a computational study of readers, amazon, and crowdsourced amateur criticism. journal of cultural analytics, 6(2), 243–287. doi: https://doi. org/10.22148/001c.22221 wilkens, m. (2016). genre, computation, and the varieties of twentieth-century u.s. fiction. journal of cultural analytics, 2(2), 1–24. 
doi: https://doi.org/10.22148/16.009 https://doi.org/10.5334/johd.88 https://doi.org/10.5334/johd.88 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://doi.org/10.1515/text.2003.014 https://github.com/booknlp/booknlp https://doi.org/10.22148/001c.12049 https://doi.org/10.1080/14735784.2017.1410438 https://doi.org/10.1080/14735784.2017.1410438 https://doi.org/10.2307/468885 https://doi.org/10.2307/468885 https://doi.org/10.22148/001c.30009 https://doi.org/10.1371/journal.pone.0226708 https://doi.org/10.1371/journal.pone.0226708 https://doi.org/10.22148/16.008 https://doi.org/10.22148/16.008 https://doi.org/10.22148/16.005 https://doi.org/10.22148/16.005 https://doi.org/10.22148/001c.22221 https://doi.org/10.22148/001c.22221 https://doi.org/10.22148/16.009 china biographical database (cbdb): a relational database for prosopographical research of pre-modern china data paper corresponding author: song chen department of east asian studies, bucknell university, lewisburg, us song.chen@bucknell.edu keywords: chinese history; relational database; prosopography; geographical information system; social network analysis to cite this article: chen, s., & wang, h. (2022). china biographical database (cbdb): a relational database for prosopographical research of pre-modern china. journal of open humanities data, 8(1); 4, pp. 1–6. doi: https://doi. org/10.5334/johd.68 china biographical database (cbdb): a relational database for prosopographical research of pre-modern china song chen hongsu wang abstract the china biographical database (cbdb) is the largest prosopographical database for the study of chinese history. we use regular expressions and neural network models to systematically harvest data from primary and secondary sources and employ an entity-relationship model to organize our data. as a relational database with both online and offline versions, cbdb provides freely accessible, structured data for macroscopic, quantitative studies of premodern china. the data in cbdb is continuously disambiguated and readily formatted for statistical, social network, and spatial analyses, and also has value for tagging named entities in historical texts and contextualizing other data collections. *author affiliations can be found in the back matter of this article mailto:song.chen@bucknell.edu https://doi.org/10.5334/johd.68 https://doi.org/10.5334/johd.68 https://orcid.org/0000-0003-3922-4792 https://orcid.org/0000-0002-1840-2046 2chen and wang journal of open humanities data doi: 10.5334/johd.68 (1) overview repository location the database is available in both microsoft access and sqlite versions on dataverse at https:// doi.org/10.7910/dvn/paggqs and on github at https://github.com/cbdb-project/cbdb_sqlite. they are regularly updated with new contents and functions. context the china biographical database (cbdb) amasses biographical information from disparate historical sources to facilitate quantitative, prosopographical research of premodern china. the project originated with the dataset that robert m. hartwell (1932–1996) created between the mid-1970s and 1995, as part of his research on the social and political history of middleperiod china (ca. 7th–13th century), and willed to the harvard-yenching institute. in 2004–05, michael a. fuller restructured and converted the data from dbase first into foxpro and then into microsoft access format. 
it has since been transferred to the fairbank center for chinese studies at harvard university, which, together with the center for research on ancient chinese history at peking university and the institute of history and philology at academia sinica, continued to add new contents under the direction of an international committee chaired by peter k. bol. over the past sixteen years, cbdb has grown from a database of about 25,000 individuals to include approximately 491,000 individuals (as of may 2021) whose lives spanned from the seventh through nineteenth centuries and is available for scholarly use in several online and offline (microsoft access, microsoft sql server, mysql, and sqlite) versions.1 the contents of cbdb benefit from, and are inevitably shaped by, china’s historiographical tradition which provides rich data on family relations, literary exchanges, intellectual interactions, and careers in government, among others, but is often reticent about issues like gender relations and economic transactions. because of this, cbdb has 275,945 records on bureaucratic appointments, 482,953 records on kinship relations, 160,219 records of non-kin social connections, but hardly any on economic activities as of may 2021. (2) method steps there are two core tasks in our data collection: data mining and disambiguation. cbdb is a relational database that uses the entity-relationship model to organize biographical information. persons are a type of entity. so are places, texts, offices, and so forth. each entity has its own set of attributes (e.g., each person has a birth year and a death year, and each place has a longitude and a latitude), and every life event is conceptualized as an instance of a relationship between multiple entities (e.g., a bureaucratic appointment is an instance of relationship, from the beginning to the end year of that appointment, between a person, the office he held, and the jurisdiction of that office). data collection is, in substance, a matter of identifying named entities and their relationships in historical sources that are described in narrative forms. for this purpose, we have experimented with several data mining approaches and found value in algorithms based on regular expressions and neural network models, such as bidirectional encoder representations from transformers (bert) and bidirectional long short-term memory (bi-lstm). we use bert, for example, to create a vector representation of each chinese character (an approach known as “word embedding”), which allows us to capture semantic and syntactic relations between characters through mathematical operations. we also use bi-lstm to tag the characters and predict whether a character is part of a string that signifies a specific person, place, or bureaucratic office. outputs from these automated data mining algorithms are reviewed by an editorial team before they are prepared for inclusion into our database. 1 the microsoft access and sqlite versions of cbdb are updated on a regular basis. to download the most recent version of cbdb in the microsoft access format, see https://projects.iq.harvard.edu/cbdb/download-cbdbstandalone-database. the up-to-date sqlite version is downloadable from https://github.com/cbdb-project/cbdb_ sqlite. our microsoft sql server is currently undergoing alpha testing. the mysql version of cbdb provides data dump for development teams and other experienced users upon request. 
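a minimal sketch of the embedding step described above, using the hugging face transformers library; the pretrained model named here is a stand-in, since the project does not specify which chinese bert variant it uses, and the downstream bi-lstm tagger is only indicated in a comment.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# "bert-base-chinese" is an assumed stand-in for whatever pretrained model is used
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

# an illustrative biographical sentence: "wang anshi, courtesy name jiefu,
# was a native of linchuan in fuzhou"
text = "王安石，字介甫，撫州臨川人。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one contextual vector per character/token; in the pipeline described above,
# vectors like these would feed a bi-lstm tagger that labels spans as person,
# place, or office names (the tagger itself is omitted here)
char_vectors = outputs.last_hidden_state[0]
print(char_vectors.shape)
```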
the cbdb online querying and data visualization interface for general use is developed by our commercial collaborator and accessible via http://www. inindex.cn/. with collaboration from academia sinica and the cbdb open-source community, we have also been developing various backend apis (cc by-nc-sa 4.0) that support the future design of alternative online interfaces (https://github.com/cbdb-project/cbdb-online-main-server/blob/develop/api.md). https://doi.org/10.5334/johd.68 https://doi.org/10.7910/dvn/paggqs https://doi.org/10.7910/dvn/paggqs https://github.com/cbdb-project/cbdb_sqlite https://projects.iq.harvard.edu/cbdb/download-cbdb-standalone-database https://projects.iq.harvard.edu/cbdb/download-cbdb-standalone-database https://github.com/cbdb-project/cbdb_sqlite https://github.com/cbdb-project/cbdb_sqlite http://www.inindex.cn/ http://www.inindex.cn/ https://github.com/cbdb-project/cbdb-online-main-server/blob/develop/api.md 3chen and wang journal of open humanities data doi: 10.5334/johd.68 in merging newly harvested data into cbdb, the chief challenge comes from the complex relationship in natural language between a name and the entity it signifies. cbdb assigns a unique identifier (“id” or “code”) to each named entity regardless of how it is referenced in the sources, and our development team makes every effort to disambiguate all newly harvested data before incorporating them into the database. take persons for example. while we are blessed by the fact that most people of all walks of life in chinese society, unlike the europeans, had possessed both a family name and a given name since the han dynasty (202 bce–220 ce) and had the flexibility of composing given names from almost any chinese character, it is not rare for two persons to have exactly the same name. on the other hand, members of the elite in imperial china were typically known by a wide variety of names and could be referred to by their office titles and other honorific appellations. therefore, it is often necessary to disambiguate personal names and appellations in historical sources. in practice, we make use of a variety of biographical information such as alternative names, birth and death year, native place, examination degree, and data on kinship and social connections to distinguish a person from his namesake and consolidate data points about the same person whom the sources reference in various ways. we do not only disambiguate and code entities, but also disambiguate kinship relations. we have designed a set of symbols to describe kinship relations with greater precision than they are expressed in the natural language (e.g., we use fbs and mbs [father’s or mother’s brother’s son], among others, to distinguish different kinds of paternal and maternal cousins). we also normalize social relations by aggregating varied expressions found in historical sources into coded categories. natural language has numerous ways of describing social relations. while the nuances in these descriptions (e.g., to censure someone vs. to criticize someone) merit attention and may, at least in some cases, reflect subtle differences in the nature of actual social relationships or the perceptions thereof, the strength of cbdb lies in facilitating the analysis of a large amount of historical data in the aggregate. to achieve this goal, we classify social relations into coded categories. 
as of may 2021, we have 470 pairs of coded relations that are further organized into larger classes and subclasses, which include literary exchanges, teacher-disciple ties, supportive or oppositional political relations, and so forth. after fully disambiguating and normalizing (“coding”) named entities and their relations, we partition the data into separate tables, which are subsequently uploaded to the database. the primary key in each data table eliminates duplicate records, and the foreign key ensures proper linkage between tables. disambiguation and normalization are time-consuming tasks that require domain knowledge in specific historical periods and topics. to expedite the process, we launched a crowdsourcing platform in 2021 to encourage contributions from historians of premodern china. sampling strategy our ultimate goal is to collect all biographical information in the extant historical record of premodern china. resource constraints, however, require us to set priorities. to produce a large collection of data for scholarly use within a reasonable timeframe, we have worked mainly with digitized, searchable texts, especially those that were written and formatted in a style particularly suitable for automated data extraction, and prioritized data sources that can systematically expand the coverage of our database. these include both modern scholarly works, such as biographical sketches and rosters of officeholders compiled by twentieth-century historians, and primary historical documents, such as biographies in official histories and local gazetteers, tomb epitaphs, records of imperial examination graduates, and the lists of letters and other writings in literary collections. several biographical dictionaries, compiled in the 1960s and 1970s, provide a large assemblage of material on the lives of approximately 70,000 persons between the tenth and seventeenth centuries (chang & wang, 1974; wang, li, & pan, 1979; national central library, 1965). by systematically harvesting the data in these dictionaries, the cbdb team managed to create basic profiles for a large number of historical figures during an early phase of our project. since then, we have expanded coverage by concentrating data collection in three areas: bureaucratic appointments, family relations, and literary exchanges. we have collected data from two multi-volume compendia, which contributed more than 35,000 records on prefectural appointments from the seventh to thirteenth centuries (yu, 2000; li, 2001). these were recently supplemented by another 107,000 entries on local appointments taken from 158 local gazetteers compiled in ming-qing times (1368–1912). using fifty-two examination records from the ming dynasty (1368–1644), we have added 14,116 metropolitan examination graduates and some 130,000 of their relatives to the database. we are now expanding data coverage in this area with a new dataset containing 19,576 song-dynasty (960–1279) examination graduates based on a recent publication (fu, gong, & zu, 2009). with the help of tang historians (yao ping and nicolas tackett), we have added some 100,000 instances of kinship relations from tomb epitaphs between the seventh and tenth centuries (zhou, 1992; zhou & zhao, 2001), and we are currently preparing a massive collection of officeholding data from song-dynasty administrative documents (xu, 2014).
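the role of primary and foreign keys described above can be sketched with a deliberately simplified schema. this is a hypothetical illustration in sqlite, not the real cbdb table structure; table and column names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,   -- primary key eliminates duplicate persons
    name       TEXT,
    birth_year INTEGER,
    death_year INTEGER
);
CREATE TABLE office (
    office_id   INTEGER PRIMARY KEY,
    office_name TEXT
);
CREATE TABLE posting (
    person_id  INTEGER REFERENCES person(person_id),  -- foreign keys ensure linkage
    office_id  INTEGER REFERENCES office(office_id),
    first_year INTEGER,
    last_year  INTEGER,
    PRIMARY KEY (person_id, office_id, first_year)    -- rejects duplicate records
);
""")
con.execute("INSERT INTO person VALUES (1, '王安石', 1021, 1086)")
con.execute("INSERT INTO office VALUES (10, '宰相')")
con.execute("INSERT INTO posting VALUES (1, 10, 1070, 1074)")
con.commit()
```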
at present, the majority of our data on social relations are based on records of literary exchanges. we collected 18,124 instances of poetic exchange between the seventh and tenth centuries, based on the work of a modern scholar (wu, 1993), and some 8,800 instances of epistolary exchange between the tenth and thirteenth centuries based on complete song-dynasty prose (zeng & liu, 2006). we will soon add another 40,000 instances of epistolary exchange from ming-dynasty (1368–1644) literary collections. for a full list of our data sources, see https://projects.iq.harvard.edu/cbdb/cbdb-sources. in addition, we have also coded and incorporated data from existing databases that focus on specific social groups and historical periods. these include, for example, a massive collection of data on family relations and officeholding for more than 46,000 persons from the database of names and biographies (institute of history and philology, academia sinica, n.d.) and some 5,000 female writers from the ming-qing women’s writings project (mcgill university, n.d.). cbdb is a work in progress and has no end date planned. its current contents reflect its history, which began with hartwell’s dataset of song-dynasty officials and gradually extended back into the tang dynasty and forward into the yuan, ming, and qing dynasties. as more historical texts from premodern china become available in searchable digital formats and the technology of data mining improves, the contents of cbdb will continue to grow. quality control our editorial group, composed of doctoral students in chinese history who specialize in various topics and periods, reviews the output from data mining algorithms and, when necessary, manually inputs data into our database. additionally, when new data are prepared for uploading to cbdb, the primary and foreign keys in data tables also function as a line of defense for data integrity. (3) dataset description object name sqlite version: cbdb_20210525.7z; microsoft access version: cbdb_bc_20210525.7z format names and versions cbdb is available for downloading in sqlite and microsoft access versions. both its content and interface are constantly evolving. data contents are dated by the most recent update in the format of yyyy-mm-dd, and the interface is versioned using two lowercase english letters (the latest release is the bc version). creation dates – 1970s to 2021–05–25 dataset creators – current executive committee members include peter k. bol (harvard university, chair), xiaonan deng (center for research on ancient chinese history, peking university), michael a. fuller (university of california at irvine), song chen (bucknell university), hsi-yuan chen (institute of history and philology, academia sinica), wenyi chen (institute of history and philology, academia sinica), xin luo (center for research on ancient chinese history, peking university). current project managers are hongsu wang (harvard university) and yang xu (peking university). for a list of past and present committee members, editors, and other contributors, see https://projects.iq.harvard.edu/cbdb/core-institutions-and-editors. for a list of crowdsourcing contributors, see https://projects.iq.harvard.edu/cbdb/cbdb-crowdsourcing-projects. language – variable names are in english. data are bilingual (english and chinese).
license – cc by-nc-sa 4.0 repository name – dataverse and github publication date – 2021–05–25 (4) reuse potential cbdb assembles biographical information from disparate sources and is particularly suited for data-driven, social scientific research that aims to discover macroscopic patterns in chinese history and complements the qualitative, humanistic approach of close reading. the current coverage of cbdb makes it particularly powerful for prosopographical studies of the chinese elite from the seventh through nineteenth centuries. the data in cbdb are continuously disambiguated and readily formatted for statistical, social network, and spatial analyses. a growing number of articles are published every year that use cbdb data to explore topics ranging from the career trajectories, regional composition, and family connections of civil officials to the intellectual and social networks of neo-confucian moral philosophers, antiquities collectors, and members of political factions. for a full list of publications that use cbdb data, see https://projects.iq.harvard.edu/cbdb/publications-use-cbdb-data. cbdb also has immense value for developing new digital projects. online text markup platforms, like markus (ho & de weerdt, n.d.), use cbdb code tables to tag persons, bureaucratic offices, places, and temporal references in user-uploaded historical texts. specialized databases (e.g., the database of names and biographies) access cbdb, through our api, to provide more context to their data collections. the chinese text project integrates data from cbdb and other sources to produce a knowledge graph in its data wiki (sturgeon, n.d.), and the shanghai library uses our data for its linked open data project (shanghai library, n.d.). universities, such as tsinghua, use cbdb to teach digital methods for chinese studies and incorporate cbdb into their pedagogical platforms (tsinghua university, n.d.) that train the next generation of digital humanists. funding information col digital publishing group co., ltd. (2018–) the tang research foundation (2015–17) the henry luce foundation (2012–15) institute of history and philology, academia sinica (2006–) center for research on ancient chinese history, peking university (2010–) harvard university and harvard university asia center (2008, 2009–2011) the national endowment for the humanities (2009–2012; pw-50438-09) chiang ching-kuo foundation for international scholarly exchange (2011–2018) the social sciences and humanities research council of canada (2011–2015) the american council of learned societies (2008) bequest from the estate of robert hartwell to harvard-yenching institute (2005–2010) competing interests the authors have no competing interests to declare. author contributions song chen: conceptualization, methodology, writing – original draft. hongsu wang: data curation, project administration, software, writing – review & editing. to cite this article: chen, s., & wang, h. (2022). china biographical database (cbdb): a relational database for prosopographical research of pre-modern china. journal of open humanities data, 8(1): 4, pp. 1–6. doi: https://doi.
org/10.5334/johd.68 published: 27 january 2022 copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. author affiliations song chen orcid.org/0000-0003-3922-4792 department of east asian studies, bucknell university, lewisburg, us hongsu wang orcid.org/0000-0002-1840-2046 institute for quantitative social science, harvard university, cambridge, us references chang, b., & wang, d. (1974). song ren zhuanji ziliao suoyin 宋人傳記資料索引 [index to biographical materials of song figures]. taipei: dingwen shuju. fu, x., gong, y., & zu, h. (2009). song dengke ji kao 宋登科記考 [research on examination graduates of the song dynasty]. nanjing: jiangsu jiaoyu chubanshe. harvard university, academia sinica, and peking university. china biographical database. https:// projects.iq.harvard.edu/cbdb ho, h. l. b., & de weerdt, h. markus: text analysis and reading platform. https://dh.chinese-empires.eu/ markus/beta/ institute of history and philology, academia sinica. database of names and biographies 人名權威人物 傳記資料庫. https://newarchive.ihp.sinica.edu.tw/sncaccgi/sncacftp li, z. (2001). songdai junshou tongkao 宋代郡守通考 [comprehensive studies on song-dynasty prefects]. chengdu: ba shu shushe. mcgill library. ming-qing women’s writings project. directed by grace s. fong and song shi. https:// digital.library.mcgill.ca/mingqing/english/index.php national central library. (1965). ming ren zhuanji ziliao suoyin 明人傳記資料索引 [index to biographical materials of ming figures]. taipei: guoli zhongyang tushuguan. shanghai library. cbdb linked open data. https://cbdb.library.sh.cn sturgeon, d. chinese text project data wiki. https://ctext.org/tools/linked-open-data tsinghua university. tsinghua digital humanities teaching and research platform 清華大學數字人文教學 與研究平臺. http://qh.nqcx.net wang, d., li, r., & pan, b. (1979). yuan ren zhuanji ziliao suoyin 元人傳記資料索引 [index to biographical materials of yuan figures]. taipei: xinwenfeng chuban gongsi. wu, r. (1993). tang wudai ren jiaowangshi suoyin 唐五代人交往詩索引 [indexes to the exchange poems of tang and five dynasties]. shanghai: shanghai guji chubanshe. xu, s. (2014). song huiyao jigao 宋會要輯稿 [collected administrative documents from the song dynasty]. shanghai: shanghai guji chubanshe. yu, x. (2000). tang cishi kao quanbian 唐刺史考全編 [complete collection of studies on tang-dynasty prefects]. hefei: anhui daxue chubanshe. zeng, z., & liu, l. (2006). quan song wen 全宋文 [complete song-dynasty prose]. shanghai: shanghai cishu chubanshe. zhou, s. (1992). tangdai muzhi huibian 唐代墓誌彙編 [collection of tang-dynasty tomb epitaphs]. shanghai: shanghai guji chubanshe. zhou, s., & zhao, c. (2001). tangdai muzhi huibian xuji 唐代墓誌彙編續集 [sequel to the collection of tang-dynasty tomb epitaphs]. shanghai: shanghai guji chubanshe. 
https://doi.org/10.5334/johd.68 https://doi.org/10.5334/johd.68 https://doi.org/10.5334/johd.68 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://orcid.org/0000-0003-3922-4792 https://orcid.org/0000-0002-1840-2046 https://projects.iq.harvard.edu/cbdb https://projects.iq.harvard.edu/cbdb https://dh.chinese-empires.eu/markus/beta/ https://dh.chinese-empires.eu/markus/beta/ https://newarchive.ihp.sinica.edu.tw/sncaccgi/sncacftp https://digital.library.mcgill.ca/mingqing/english/index.php https://digital.library.mcgill.ca/mingqing/english/index.php https://cbdb.library.sh.cn https://ctext.org/tools/linked-open-data http://qh.nqcx.net a dataset for toponym resolution in nineteenth-century english newspapers data paper corresponding author: mariona coll ardanuy the alan turing institute, london, uk; queen mary university of london, london, uk mcollardanuy@turing.ac.uk keywords: benchmark; dataset; geographic information retrieval; newspapers; nineteenth-century english; toponym resolution to cite this article: coll ardanuy, m., beavan, d., beelen, k., hosseini, k., lawrence, j., mcdonough, k., nanni, f., van strien, d., & wilson, d. c. s. (2022). a dataset for toponym resolution in nineteenthcentury english newspapers. journal of open humanities data, 8(1), 3, pp. 1–7. doi: https://doi.org/10.5334/johd.56 a dataset for toponym resolution in nineteenthcentury english newspapers mariona coll ardanuy david beavan kaspar beelen kasra hosseini jon lawrence katherine mcdonough federico nanni daniel van strien daniel c. s. wilson abstract we present a new dataset for the task of toponym resolution in digitized historical newspapers in english. it consists of 343 annotated articles from newspapers based in four different locations in england (manchester, ashton-under-lyne, poole and dorchester), published between 1780 and 1870. the articles have been manually annotated with mentions of places, which are linked—whenever possible—to their corresponding entry on wikipedia. the dataset consists of 3,364 annotated toponyms, of which 2,784 have been provided with a link to wikipedia. the dataset is published in the british library shared research repository, and is especially of interest to researchers working on improving semantic access to historical newspaper content. *author affiliations can be found in the back matter of this article mailto:mcollardanuy@turing.ac.uk https://doi.org/10.5334/johd.56 https://orcid.org/0000-0001-8455-7196 https://orcid.org/0000-0002-0347-6659 https://orcid.org/0000-0001-7331-1174 https://orcid.org/0000-0003-4396-6019 https://orcid.org/0000-0001-6561-6381 https://orcid.org/0000-0001-7506-1025 https://orcid.org/0000-0003-2484-4331 https://orcid.org/0000-0003-1684-6556 https://orcid.org/0000-0001-6886-775x 2coll ardanuy et al. journal of open humanities data doi: 10.5334/johd.56 1 overview in this paper, we present a new dataset for the task of toponym resolution in digitized historical newspapers in english. toponym resolution is a subtask of entity linking, focused on detecting and resolving mentions of places (i.e., toponyms) to their corresponding referent in a gazetteer or other knowledge base. resolving toponyms in texts enables new forms of large-scale semantic and geographic analyses. 
however, most approaches to entity linking and toponym resolution are optimized to perform well with clean texts originally intended for a global audience and they do not generalize well to noisy, historical, or regional texts (ehrmann, romanello, flückiger, & clematide, 2020; gritta, pilehvar, limsopatham, & collier, 2018; wang & hu, 2019). some entity linking datasets have been created to address this issue, such as ehrmann et al. (2020) and hamdi et al. (2021), both built from digitized historical newspaper collections. our dataset differs from others in its emphasis on the geographical aspect of newspaper data. the british provincial press—from which we sampled our articles—was strongly anchored in place: articles and advertisements were selected and edited with a local audience in mind. after the repeal on the ‘taxes on knowledge’ in the 1850s and 1860s, the provincial press proliferated; its readership expanded as did the number of titles, trumping the london-based press in size. despite this plethora of available materials, to date historians have mostly favored the metropolitan papers at the expense of the local press, which remains largely understudied (beelen, lawrence, wilson, & beavan, under submission; hobbs, 2018). as shown in lieberman, samet, and sankaranarayanan (2010) and coll ardanuy et al. (2019), the distribution of places mentioned in newspapers varies considerably depending on their intended audience (grounded in a certain place and time), hindering the resolution of ambiguous place names. our dataset has been created to assess the robustness of entity linking and toponym resolution methods in this particularly challenging but common scenario. we hope that improved toponym resolution for these newspapers will translate into greater interest in them as research materials. this dataset is comprised of 343 articles carefully sampled from a variety of provincial nineteenth-century newspapers based in four different locations in england. the articles have been manually annotated with mentions of places, which are linked—whenever possible— to their corresponding entry on wikipedia. a total of 3,364 toponyms have been annotated, of which 2,784 have been linked to wikipedia. the text of the articles is ocr-generated and has not been manually corrected. the dataset has been created with the aim of becoming a benchmark for several tasks: fuzzy string matching and toponym recognition and resolution, among others, all of which contribute to the challenging pursuit of improving semantic access to ocred historical texts in english. this dataset has been produced as part of living with machines,1 a multidisciplinary research project focused on the lived experience of industrialization in britain during the long nineteenth century and, in particular, on the social and cultural impact of mechanization as reported in newspapers and other sources. living with machines is one of many projects that harness the growing volume of digitized newspaper collections for humanities research.2 a fraction of the annotated data has been used in previous studies from living with machines, in particular coll ardanuy et al. (2019), and for fuzzy string matching in hosseini, nanni, and coll ardanuy (2020) and coll ardanuy et al. (2020). 
2 method data processing the initial source of the data was formatted as metadata encoding and transmission standard/analyzed layout and text object (mets/alto) files3 and consisted of 72 newspaper titles (including subsequent variant titles) from the english counties of lancashire and dorset. these were obtained from the genealogy company find my past, custodians of 1 https://livingwithmachines.ac.uk (last access: 2021-07-19). 2 other notable projects are: impresso (https://impresso.github.io/), newseye (https://www.newseye.eu/), oceanic exchanges (https://oceanicexchanges.org/), and viraltexts (https://viraltexts.org/) (last access: 2021-08-20). 3 https://www.loc.gov/standards/mets/ (last access: 2021-08-12). the british newspaper archive, the most extensive corpus of digitised british newspapers.4 this mets/alto file format contains both logical and physical layout information, along with document textual contents, expressed as extensible markup language (xml).5 it is verbose and does not lend itself directly to manipulation in natural language processing pipelines and tools. instead, we used extensible stylesheet language transformations (xslt)6 to extract the plain text of each article: each article is explicitly segmented and identified in the mets logical structure map, and the extracted plain text consists of all physical alto textblocks attributed to that article. this plain text is supplemented by minimal metadata extracted into a companion file. this step is performed by alto2txt, which is a python wrapper for those xslt transformations, and is being prepared for public release via github. this corpus consisted of 11,761,898 articles (as defined above). the metadata was ingested into a postgresql7 relational database for ease of querying and filtering; its relational schema directly mirrors the hierarchy of the metadata xml files. sampling we created a subsample that consists of 343 articles published between 1780 and 1870 in local newspapers based in four different locations: manchester and ashton-under-lyne (a large town and a medium-sized market town, broadly representing the industrial north of england), and poole and dorchester (respectively medium-sized port and market towns, representing the rural south).8 figure 1 gives an overview of the number of annotated articles per decade and place of publication. we biased our sample toward articles that have a length between 150 and 550 words and an ocr quality confidence score greater than 0.7 (calculated as the mean of the per-word ocr confidence scores as reported in the source metadata). most of the text is legible, even though it contains many ocr errors. see table 1 for a more detailed overview of the sample. 4 https://www.britishnewspaperarchive.co.uk/ (last access: 2021-08-12). 5 https://www.w3.org/standards/xml/ (last access: 2021-08-12). 6 https://www.w3.org/style/xsl/ (last access: 2021-08-12). 7 https://www.postgresql.org/ (last access: 2021-08-12).
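the xslt-based extraction described above can be sketched as follows. this is a minimal illustration, assuming a placeholder stylesheet and file names, and is not the alto2txt tool itself.

```python
from lxml import etree

# placeholder stylesheet that selects the alto textblocks of one article
xslt = etree.XSLT(etree.parse("article_text.xslt"))

# parse a mets/alto issue file and apply the transformation
alto_doc = etree.parse("newspaper_issue_alto.xml")
plain_text = str(xslt(alto_doc))

# write the article plain text to a companion file
with open("article_0001.txt", "w", encoding="utf-8") as fh:
    fh.write(plain_text)
```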
8 the newspapers from which we obtained the articles to annotate are the dorset county chronicle, the poole and south-western herald, the manchester mercury and harrops general advertiser, the manchester mercury; and tuesdays general advertiser, the manchester courier and lancashire general advertiser, the ashton reporter, and the ashton and stalybridge reporter. figure 1 number of annotated articles per decade and place of publication. we did not perform any manual post-processing to correct the errors produced in the ocr or layout recognition steps. therefore, the toponyms in this dataset often contain ocr errors (e.g., ‘ihancfjrcter’ for ‘manchester’). additionally, our dataset is rich with name variations that are characteristic of historical data, such as spelling variations (e.g., ‘leipsic’ for ‘leipzig’) and other forms of name change (e.g., ‘kingstown’ for ‘dún laoghaire’). annotation six annotators from different disciplinary backgrounds (history, literature, data science, and linguistics) manually annotated the toponyms present in the subsample. we used the inception annotation platform9 (klie, bugert, boullosa, de castilho, & gurevych, 2018). a toponym is a mention of a location in a text. we defined a location as any entity that is static and can be represented by its geographical coordinates. toponyms were classified into the following categories: building (names of buildings, such as the ‘british museum’), street (streets, roads, and other odonyms, such as ‘great russell st’), loc (any other real world places regardless of type or scale, such as ‘bloomsbury’, ‘london’, or ‘great britain’), alien (extraterrestrial locations, such as ‘venus’), fiction (fictional or mythical places, such as ‘hell’), and other (other types of entities with coordinates, such as events, like the ‘battle of waterloo’). where possible, toponyms were linked to the corresponding wikipedia entries (from which geographic coordinates can be derived) by their url. this would be left empty if the location had no wikipedia entry or the annotators were uncertain as to the correct disambiguation, either because the ocr made it impossible to correctly determine the referent or due to insufficient context.10 while the annotations were made on the ocred text, it was possible for the annotator to consult the original page image online on the british newspaper archive. annotators were encouraged to discuss difficult choices with each other, and to document their decisions in a shared document. table 2 gives an overview of the annotations for each class. 9 https://inception-project.github.io/ (last access: 2021-08-18). 10 for reference, we provide the original annotation guidelines together with the dataset. however, note that the final annotations have been refined in version 2 of the dataset. these changes are described in the accompanying readme file.
class: annotations / unique toponyms / unique wikipedia links / unlinked toponyms
loc: 2764 / 1348 / 827 / 133
building: 354 / 294 / 83 / 248
street: 240 / 194 / 32 / 198
other: 5 / 5 / 5 / 0
fiction: 1 / 1 / 0 / 1
alien: 0 / 0 / 0 / 0
table 2 total number of annotations, unique toponyms, unique wikipedia links and toponyms with no link to wikipedia, per class.
place and decade: ashton 1860 / dorchester 1820 / dorchester 1830 / dorchester 1860 / manchester 1780 / manchester 1800 / manchester 1820 / manchester 1830 / manchester 1860 / poole 1860
number of articles: 36 / 36 / 36 / 36 / 21 / 34 / 36 / 36 / 36 / 36
avg word count: 300 / 323 / 313 / 325 / 311 / 368 / 378 / 354 / 312 / 288
avg ocr quality (mean): 0.89 / 0.86 / 0.88 / 0.89 / 0.75 / 0.77 / 0.87 / 0.88 / 0.84 / 0.9
avg ocr quality (sd): 0.18 / 0.21 / 0.19 / 0.18 / 0.27 / 0.27 / 0.21 / 0.19 / 0.23 / 0.14
table 1 number of annotated articles, average article word count, and average article ocr quality mean and standard deviation per decade and place of publication. quality control to assess the quality of the annotations, we had 77 newspaper articles annotated by two people, for a total of 740 annotation pairs. we used the inception agreement functionality to assess the inter-annotator agreement between the two sets of annotations. using krippendorff’s alpha (nominal), we obtained an agreement of 0.87 for place name detection and classification and 0.89 for linking to wikipedia. to further ensure the quality of our resource, after the annotation process, a curator went through all the annotations and made final decisions on which annotations to keep and which to discard, making sure the annotations were consistent throughout the dataset. 3 dataset description object name topres19th_v2. format names and versions we are sharing the annotated files in the webanno tsv (tab-separated values) file format, version 3.2.11 there are 343 files, one for each newspaper article. accompanying the dataset is an additional tsv file that contains the metadata associated with each article: word count, ocr quality mean and standard deviation, date (and decade) of publication, place of publication, newspaper publication code and publication title, and an additional field (annotation_batch) in which the article is assigned to one of three batches that are similarly distributed in terms of place and decade of publication (this field was used during the sampling process, and may be useful for researchers wishing to split the dataset for experimental purposes). we have also prepared a readme file and the original annotation guidelines in markdown markup. the present paper describes version 2 of the dataset. creation dates 2019-01-01 to 2021-07-27. language nineteenth-century english. license the dataset is released under open license cc-by-nc-sa, available at https://creativecommons.org/licenses/by-nc-sa/4.0/. repository name the dataset is stored in the british library shared research repository at https://doi.org/10.23636/r7d4-kw08. publication date 2021-12-03. 4 reuse potential the vast archive of the british newspaper archive and other british historical newspaper corpora will be re-used by hundreds of scholars in the coming years. establishing benchmark datasets like this provides a foundation for others to assess the performance of methods related to the identification and location of places in historical newspapers. although toponym density was 11 the webanno tsv format is a conll-based file format, a format that is widely used in natural language processing (especially in shared tasks of the conference on computational natural language learning). it lists one token per line, the different layers of annotation are tab-separated, and it uses blank lines to separate sentences. see https://webanno.github.io/webanno/releases/3.4.5/docs/user-guide.html#sect_webannotsv (last access: 2021-07-27).
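the krippendorff’s alpha figures reported above can be reproduced in principle with a small, self-contained implementation of the nominal-level coefficient for the two-annotator case. this is a sketch for illustration, not the inception functionality the authors actually used; the example label pairs are invented.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(pairs):
    """pairs: one (label_a, label_b) tuple per doubly annotated unit."""
    coincidences = Counter()
    for a, b in pairs:
        for c, k in permutations((a, b), 2):      # both ordered pairs per unit
            coincidences[(c, k)] += 1
    n = sum(coincidences.values())
    marginals = Counter()
    for (c, _), v in coincidences.items():
        marginals[c] += v
    observed = sum(v for (c, k), v in coincidences.items() if c != k) / n
    expected = sum(marginals[c] * marginals[k]
                   for c in marginals for k in marginals if c != k) / (n * (n - 1))
    return 1 - observed / expected

# toy example: three units where the annotators agree, one where they disagree
print(krippendorff_alpha_nominal([("LOC", "LOC"), ("LOC", "LOC"),
                                  ("BUILDING", "BUILDING"), ("LOC", "STREET")]))
```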
https://doi.org/10.5334/johd.56 https://creativecommons.org/licenses/by-nc-sa/4.0/ https://creativecommons.org/licenses/by-nc-sa/4.0/ https://doi.org/10.23636/r7d4-kw08 https://doi.org/10.23636/r7d4-kw08 https://webanno.github.io/webanno/releases/3.4.5/docs/user-guide.html#sect_webannotsv https://webanno.github.io/webanno/releases/3.4.5/docs/user-guide.html#sect_webannotsv 6coll ardanuy et al. journal of open humanities data doi: 10.5334/johd.56 always greatest for newspapers’ immediate locality, all newspapers included a rich diversity of national and international place names linked to reports of trade, war, conquest and state politics. our annotations cover the different scales of places that make up the locations of the political, economic, and everyday life reported in nineteenth-century provincial newspapers. we hope that this dataset contributes to improving methods for finding difficult-to-recognize toponyms in digitized texts and linking them to context-appropriate knowledge base records. acknowledgements newspaper data has been provided by findmypast limited from the british newspaper archive, a partnership between the british library and findmypast (https://www.britishnewspaperarchive. co.uk/). we thank the anonymous reviewers for their careful and constructive reviews. we are grateful to giovanni colavizza (university of amsterdam) and james hetherington (university college london) for helping with the research infrastructure, to claire austin (british library) for helping with data access, and to the members of living with machines who helped with the annotations. funding statement this work was supported by living with machines (ahrc grant ah/s01179x/1) and the alan turing institute (epsrc grant ep/n510129/1). the living with machines project, funded by the uk research and innovation (ukri) strategic priority fund, is a multidisciplinary collaboration delivered by the arts and humanities research council (ahrc), with the alan turing institute, the british library and the universities of cambridge, east anglia, exeter, and queen mary university of london. competing interests the authors have no competing interests to declare. author contribution mariona coll ardanuy (conceptualization, data curation, formal analysis, project management, writing), david beavan (resources, software, writing), kaspar beelen (resources, data curation, writing), kasra hosseini (resources, software), jon lawrence (conceptualization, data curation, project management), katherine mcdonough (conceptualization, data curation, writing), federico nanni (validation, writing), daniel van strien (resources, software), daniel c.s. wilson (conceptualization, data curation, writing). author affiliations mariona coll ardanuy orcid.org/0000-0001-8455-7196 the alan turing institute, london, uk; queen mary university of london, london, uk david beavan orcid.org/0000-0002-0347-6659 the alan turing institute, london, uk kaspar beelen orcid.org/0000-0001-7331-1174 the alan turing institute, london, uk; queen mary university of london, london, uk kasra hosseini orcid.org/0000-0003-4396-6019 the alan turing institute, london, uk jon lawrence orcid.org/0000-0001-6561-6381 the university of exeter, exeter, uk katherine mcdonough orcid.org/0000-0001-7506-1025 the alan turing institute, london, uk; queen mary university of london, london, uk federico nanni orcid.org/0000-0003-2484-4331 the alan turing institute, london, uk daniel van strien orcid.org/0000-0003-1684-6556 the british library, london, uk daniel c. s. 
wilson orcid.org/0000-0001-6886-775x the alan turing institute, london, uk; queen mary university of london, london, uk https://doi.org/10.5334/johd.56 https://www.britishnewspaperarchive.co.uk/ https://www.britishnewspaperarchive.co.uk/ https://orcid.org/0000-0001-8455-7196 https://orcid.org/0000-0001-8455-7196 https://orcid.org/0000-0002-0347-6659 https://orcid.org/0000-0002-0347-6659 https://orcid.org/0000-0001-7331-1174 https://orcid.org/0000-0001-7331-1174 https://orcid.org/0000-0003-4396-6019 https://orcid.org/0000-0003-4396-6019 https://orcid.org/0000-0001-6561-6381 https://orcid.org/0000-0001-6561-6381 https://orcid.org/0000-0001-7506-1025 https://orcid.org/0000-0001-7506-1025 https://orcid.org/0000-0003-2484-4331 https://orcid.org/0000-0003-2484-4331 https://orcid.org/0000-0003-1684-6556 https://orcid.org/0000-0003-1684-6556 https://orcid.org/0000-0001-6886-775x https://orcid.org/0000-0001-6886-775x 7coll ardanuy et al. journal of open humanities data doi: 10.5334/johd.56 to cite this article: coll ardanuy, m., beavan, d., beelen, k., hosseini, k., lawrence, j., mcdonough, k., nanni, f., van strien, d., & wilson, d. c. s. (2022). a dataset for toponym resolution in nineteenthcentury english newspapers. journal of open humanities data, 8(1), 3, pp. 1–7. doi: https://doi.org/10.5334/johd.56 published: 24 january 2022 copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. references beelen, k., lawrence, j., wilson, d. c., & beavan, d. (under submission). victorian perspectives on digital newspapers: addressing bias and representativeness in digital heritage collections. coll ardanuy, m., hosseini, k., mcdonough, k., krause, a., van strien, d., & nanni, f. (2020). a deep learning approach to geographical candidate selection through toponym matching. in proceedings of the 28th international conference on advances in geographic information systems (sigspatial): poster papers (pp. 385–388). doi: https://doi.org/10.1145/3397536.3422236 coll ardanuy, m., mcdonough, k., krause, a., wilson, d. c., hosseini, k., & van strien, d. (2019). resolving places, past and present: toponym resolution in historical british newspapers using multiple resources. in proceedings of the 13th workshop on geographic information retrieval (pp. 1–6). doi: https://doi.org/10.1145/3371140.3371143 ehrmann, m., romanello, m., flückiger, a., & clematide, s. (2020). extended overview of clef hipe 2020: named entity processing on historical newspapers. in ceur workshop proceedings. doi: https://doi. org/10.1007/978-3-030-58219-7_21 gritta, m., pilehvar, m. t., limsopatham, n., & collier, n. (2018). what’s missing in geographical parsing? language resources and evaluation, 52(2), 603–623. doi: https://doi.org/10.1007/s10579-017-9385-8 hamdi, a., linhares pontes, e., boros, e., nguyen, t. t. h., hackl, g., moreno, j. g., & doucet, a. (2021). a multilingual dataset for named entity recognition, entity linking and stance detection in historical newspapers. in proceedings of the 44th international acm sigir conference on research and development in information retrieval (pp. 2328–2334). doi: https://doi. 
org/10.1145/3404835.3463255 hobbs, a. (2018). a fleet street in every town: the provincial press in england, 1855–1900. cambridge: open book publishers. doi: https://doi.org/10.11647/obp.0152 hosseini, k., nanni, f., & coll ardanuy, m. (2020). deezymatch: a flexible deep learning approach to fuzzy string matching. in proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations (pp. 62–69). doi: https://doi.org/10.18653/v1/2020.emnlpdemos.9 klie, j.-c., bugert, m., boullosa, b., de castilho, r. e., & gurevych, i. (2018,-). the inception platform: machine-assisted and knowledge-oriented interactive annotation. in proceedings of the 27th international conference on computational linguistics: system demonstrations (pp. 5–9). association for computational linguistics. retrieved from https://aclanthology.org/c18-2002/ lieberman, m. d., samet, h., & sankaranarayanan, j. (2010). geotagging with local lexicons to build indexes for textually-specified spatial data. in 2010 ieee 26th international conference on data engineering (icde 2010) (pp. 201–212). doi: https://doi.org/10.1109/icde.2010.5447903 wang, j., & hu, y. (2019). are we there yet? evaluating state-of-the-art neural network based geoparsers using eupeg as a benchmarking platform. in proceedings of the 3rd acm sigspatial international workshop on geospatial humanities (pp. 1–6). doi: https://doi.org/10.1145/3356991.3365470 https://doi.org/10.5334/johd.56 https://doi.org/10.5334/johd.56 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://doi.org/10.1145/3397536.3422236 https://doi.org/10.1145/3371140.3371143 https://doi.org/10.1007/978-3-030-58219-7_21 https://doi.org/10.1007/978-3-030-58219-7_21 https://doi.org/10.1007/s10579-017-9385-8 https://doi.org/10.1145/3404835.3463255 https://doi.org/10.1145/3404835.3463255 https://doi.org/10.11647/obp.0152 https://doi.org/10.18653/v1/2020.emnlp-demos.9 https://doi.org/10.18653/v1/2020.emnlp-demos.9 https://aclanthology.org/c18-2002/ https://doi.org/10.1109/icde.2010.5447903 https://doi.org/10.1145/3356991.3365470 a social network of the prosopography of the neo-assyrian empire data paper corresponding author: heidi jauhiainen department of digital humanities, university of helsinki, helsinki, finland heidi.jauhiainen@helsinki.fi keywords: social network analysis; prosopography; social structure; assyria; assyriology; cuneiform to cite this article: jauhiainen, h., & alstola, t. (2022). a social network of the prosopography of the neo-assyrian empire. journal of open humanities data, 8: 8, pp. 1–8. doi: https://doi. org/10.5334/johd.74 a social network of the prosopography of the neo-assyrian empire heidi jauhiainen tero alstola abstract the dataset is a social network of over 17,000 individuals who lived during the so-called neo-assyrian period of mesopotamian history, primarily in the eighth and seventh centuries bce. the undirected network of individuals connected by co-occurrences in cuneiform documents was semi-automatically extracted from the prosopography of the neo-assyrian empire. in addition to two weighted versions of the one-mode co-occurrence network, the dataset also contains a two-mode person-text network and rich metadata for each individual. for the first time, the dataset allows largescale computational analysis of social structures in the assyrian empire. the data is primarily stored as plain text and csv files, inviting scholars to further expand and enrich it. 
the scripts and files used for creating and standardizing the data are also available in the zenodo repository. *author affiliations can be found in the back matter of this article mailto:heidi.jauhiainen@helsinki.fi https://doi.org/10.5334/johd.74 https://doi.org/10.5334/johd.74 https://orcid.org/0000-0002-8227-5627 https://orcid.org/0000-0003-4382-1733 2jauhiainen and alstola journal of open humanities data doi: 10.5334/johd.74 (1) overview repository location https://doi.org/10.5281/zenodo.5862904 context this dataset is a social network of over 17,000 individuals attested in cuneiform documents from the neo-assyrian period, primarily in the eighth and seventh centuries bce. the data originates from the prosopography of the neo-assyrian empire (pna), which has been available only as a printed edition (radner & baker, 1998–2011). heather d. baker’s prosopography of the neo-assyrian empire online project1 has the long-term goal of making the pna data digitally available, but it currently provides only additions and corrections to the printed pna volumes. we extracted the data from the text and pdf file versions of the pna that we received from simo parpola, the editor-in-chief of the series. the dataset published here was produced for the research purposes of the centre of excellence in ancient near eastern empires (university of helsinki). (2) method steps the earlier pna volumes (1/i–3/i) were available to us as plain text files that were used to typeset the printed publications. as the last volume (3/ii) was laid out using different software, it was available to us only as a pdf file. we wrote a number of scripts in java to extract and process the data. their source codes are available in the repository. entries in the pna consist of two parts (see figure 1): 1) an entry starts with a personal name and its linguistic analysis; and 2) then follows a list of the individuals who used this name. in some cases, there is only a single person who used a name, whereas some names were borne by dozens of people. it is often not entirely clear if two or more cuneiform texts refer to the same person or several homonymous individuals. in the pna, the decision to connect an attestation of a name to a historical individual is made by a trained assyriologist. we followed these identifications but could not take their level of uncertainty into account, although this is sometimes expressed in the pna. for each individual, the pna provides a short description (e.g., “individual from assur (reign of sennacherib)” or “tammaritu ii, king of elam c. 652–649 (reign of assurbanipal)”), followed by the attestations of this person in texts and a short description of the person’s role in each text. the dataset was produced in four steps. in steps one and two, we worked with the text files, because their structure was ideal for automated processing. first, we extracted the names of the neo-assyrian cuneiform documents in which persons are attested. the document names were indicated by “@@” at the beginning of a line in the text files (see figure 1). since the pna was published over a period of thirteen years and the entries were written by numerous scholars, there are inconsistencies in how the documents are referred to. several typing errors were also detected. furthermore, only some of the document names indicate where the name of the document ends and a possible line number starts. 
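the first step described above can be illustrated with a short script that pulls the “@@” document lines out of a pna text file. this is a minimal sketch in python (the authors wrote their scripts in java), and the file name and exact pattern are assumptions; standardization against reference lists is a separate step.

```python
import re

# a document reference is whatever follows "@@" at the beginning of a line
doc_line = re.compile(r"^@@\s*(?P<reference>.+?)\s*$")

references = []
with open("pna_volume_1i.txt", encoding="utf-8") as fh:   # placeholder file name
    for line in fh:
        match = doc_line.match(line)
        if match:
            references.append(match.group("reference"))

print(len(references), "document references found")
```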
using rules and reference lists, we compiled a list that connects standardized document names to the original “@@” lines in the text files. in step two, individual persons were extracted from the text files with the name of each document they were attested in. it was not possible to collect information about the individual’s role in a document, because this is given in unstructured running text. homonymous individuals were distinguished by consecutive numbers added after their names (e.g., aššūr-iddin_1), following the numbering of individuals in the pna. the general description and dating of each individual as well as the language and gender of the name were also extracted. it can be very difficult to assign a language and gender to an ancient name (pna 1/i, p. xxii; földi, 2019), and some analyses given in the pna could be contested. as with all other data, we simply extracted the information as it is in the pna. 1 http://oracc.museum.upenn.edu/pnao/. in the third step, we extracted information from the pdf version of the last pna volume. as preprocessing, we opened the file in a safari browser and copied the text into microsoft word, an operation that kept the formatting of the font style. entries for personal names and individuals could be identified and were marked before turning the file into plain text. copying the text from two columns of the pdf file resulted in many inconsistencies, and the new text file was manually curated before information about documents and individuals was extracted. entries for document names could not be identified in the structure of the pdf file, and the identification and extraction of documents is thus based on concordance lists and document names attested in other pna volumes. finally, we created a two-mode network of persons and texts and a one-mode network of all individuals. both networks are undirected. in the two-mode network, persons were connected to the texts in which they are attested. the one-mode network connects two persons if they are attested in the same document (figures 2 and 3). isolates (persons without any connections) were not included in the network. there are two versions of the one-mode network: one uses the number of co-occurrences as edge weights, and another one calculates the edge weights in relation to the total number of persons appearing in the same document (see below). as the pna entries for assyrian kings do not include all their attestations in the extant texts, we collected their attestations from the state archives of assyria book series and the corresponding online editions from the open richly annotated cuneiform corpus (oracc). we also supplied some metadata for individuals in the one-mode network. in addition to the unique id number, each
[figure 1: an example pna entry, given in the project’s ascii transliteration, for the name aššūr-rēmūtī (“aššur is my mercy”): the name header with its linguistic analysis, followed by the description of an individual from assur and his attestations, each introduced by an “@@” document reference.]
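the final network-building step described above can be reconstructed in outline with networkx. this is an illustration with invented toy attestations, not the authors’ java code; only the simple co-occurrence weighting is shown.

```python
import networkx as nx
from networkx.algorithms import bipartite

attestations = [                       # toy (person, document) pairs
    ("Aššūr-iddin_1", "SAA 1 1"),
    ("Aššūr-rēmūtī_1", "SAA 1 1"),
    ("Aššūr-rēmūtī_1", "SAAB 5 40"),
    ("Nabû-šarru-uṣur_2", "SAAB 5 40"),
]

# two-mode network: persons connected to the texts in which they are attested
two_mode = nx.Graph()
persons = {p for p, _ in attestations}
texts = {t for _, t in attestations}
two_mode.add_nodes_from(persons, kind="person")
two_mode.add_nodes_from(texts, kind="text")
two_mode.add_edges_from(attestations)

# one-mode projection: individuals linked by co-occurrence,
# with edge weights counting the documents they share
one_mode = bipartite.weighted_projected_graph(two_mode, persons)
for u, v, data in one_mode.edges(data=True):
    print(u, "--", v, "weight:", data["weight"])
```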
編集委員会 [postwar reparation handbook edition committee], 1992; 戦後補償問題研究会 [committee of postwar reparation study], 1990). individual activist groups focusing on specific issues also meticulously record their courses of action, including litigation. for example, the support group for the lawsuit by victims of the biological weapons trials from southern china, mainly managed by lawyer ichinose keiichirō, who spearheaded the litigation, details the course of the lawsuit on its website, which is a typical practice among reparation activists (731部隊細菌戦国家賠償請求訴訟 [the lawsuit for compensation for unit 731 germ warfare], n.d.). these acts of recording the movement focus more on the production of testimonies by the victims.8 for example, support group websites always privilege testimonies by the victims over other legal documents, even in the lawsuits sections. in contrast, the “overview of postwar reparation lawsuits” (nihon sengo hoshō saiban sōran), compiled by yamamoto seita, a seasoned lawyer who fought multiple lawsuits in the movement, takes the form of a chart that provides the bare-bones information on all the lawsuits in the movement (yamamoto, n.d.). the information provided in the “overview” for each lawsuit includes the title, summary, dates of filings and rulings, plaintiffs, defendants, plaintiffs’ representatives (lawyers), and rulings. while expansive, the “overview,” which includes 101 lawsuits, is still not complete or exhaustive. for example, many entries are missing ruling or filing dates, and the lists of lawyers for the lawsuits are often incomplete. despite these limitations, this project uses the “overview” as the foundation for the graph database9 since it is the most comprehensive compilation of lawsuits in the movement and is based both on the personal experiences and connections of movement participants and on existing publications documenting the movement. the project then consults the websites of each lawsuit or cause, mass and social media reportage on the lawsuits, and personal interviews to supplement missing or inaccurate information. however, such supplementations are also 7 the data used in this paper can be accessed on github at https://github.com/azurebamboo/graph-database-for-reparation-lawsuits-against-japan-for-colonial-and-wartime-atrocities-. it is also deposited in the harvard dataverse at https://doi.org/10.7910/dvn/cz4pbo. 8 again, i use the term “victim” not to minimize the agency of these people in participating in the reparation movement (the prime goal of which is precisely to reclaim their agency vis-à-vis the japanese state), but to refer to their role in the legal battles (higaisha by japanese lawyers and activists) as the actors who produce testimonies that carry both legal and historical authority. 9 a graph database is a type of database that uses graph structures for semantic queries. compared to a relational database (in network analysis, a simple example would be a set of node and edge tables), a graph database represents relationships (or edges) between nodes explicitly (instead of implicitly).
see section 3 for details. https://github.com/azurebamboo/graph-database-for-reparation-lawsuits-against-japan-for-colonial-and-wartime-atrocitieshttps://github.com/azurebamboo/graph-database-for-reparation-lawsuits-against-japan-for-colonial-and-wartime-atrocitieshttps://doi.org/10.7910/dvn/cz4pbo 5pan journal of open humanities data doi: 10.5334/johd.76 often incomplete, which is why the project is an ongoing one that calls for collaboration. one example is the lawsuit launched by the bereaved families of liu lianren, who was a chinese victim of forced labor by japanese military and mining corporations. his family sued the japanese government in 1996. because lawsuits in the movement usually drag on for years and are fought all the way to the supreme court, lawyers attached to the suits are often subject to change over the years, and the “lawyers” section of the “overview” often comes with the entry “others.” liu’s case continued until 2005, when it was dismissed by the supreme court of japan. according to the “overview,” apart from lawyer takahashi toru, there have been 120 “other” lawyers on the case over the years (yamamoto, n.d., case 43). accurate and exhaustive information can sometimes be found in the official petitions or rulings for lawsuits, which support groups often publish on their websites. such documents can be found on the website for liu’s case, but even these official public documents do not contain the exhaustive list of the 120 lawyers, which means that only personal interviews would enable us to complete the list. however, the names of two of the other lawyers, onodera toshitaka and oyama hiroshi, who also worked on numerous other cases and are thus key nodes in the graph database, can be recovered from the documents (すおぺいネット[suopei.net], n.d.). as a result, currently in the database, the node of liu’s lawsuit is connected to these three lawyers and the stand-in node of “others.” as a result, while the database can represent all the lawsuits in the movement, data for the lawyers, activists, and plaintiffs for some lawsuits are still missing. for this ongoing project, the next steps would thus be more personal interviews with participants of the movement and archival research of both court and other government materials and personal documents from the participants to further enrich the data. (2.2) privacy, personal safety, and other ethical concerns lawyer yamamoto seita has a page called “anonymous treatment of court records and other materials” on his website that houses an “overview” (訴訟記録などの資料の匿名処理について [about the anonymity in our records], n.d.). “in light of the nature of the medium of the internet and the recent degradation of japanese society,” yamamoto has decided to only include the family name of the plaintiffs in the lawsuits unless they have already willingly publicized their identities when giving testimonies. although technically, the ruling and filing papers that carry the names of the lawyers are in the public domain (accessible from the websites of japanese court systems if one is willing to look), i share the same concern with lawyer yamamoto that including the full names of lawyers and activists in this graph database may jeopardize their privacy and even their safety given the widespread use of harassment and threats, both online and offline by ultra-right wing groups and individuals, against the plaintiffs, activists, lawyers, and scholars who participate in the movement. 
i thus consulted with lawyer ichinose keiichirō about the potential legal and political ramifications of creating such a database. lawyer ichinose (half-jokingly) concurred with yamamoto’s assessment that japanese society has become “degraded” but acknowledged the “concern about deliberate attacks on this type of information.”10 ichinose then advised me to follow yamamoto’s convention and include the full names of only the lawyers, and not publicize the full names of other movement participants unless explicit consent was acquired. as a result, for now, the database only contains the full names of the lawyers. to further expand the database, this project plans to formally reach out to activists and groups that have participated in the movement to obtain consent (or explicit declination) for the inclusion of their information in the database. (3) result and discussion: the application of graph database management systems to network analysis research and teaching in the field of digital humanities, network analysis and visualization (pan, 2020) are often achieved with open-source software designed for and geared towards academic research. the most beginner-friendly tool is palladio, a product of an neh implementation grant and that 10 again, i use the term “victim” not to minimize the agency of these people in participating in the reparation movement (the prime goal of which is precisely to reclaim their agency vis-à-vis the japanese state), but to refer to their role in the legal battles (higaisha by japanese lawyers and activists) as the actors who produce testimonies that carry both legal and historical authority. 6pan journal of open humanities data doi: 10.5334/johd.76 can visualize simple graphs with edge table inputs (palladio app, n.d.). gephi, one of the most frequently used instruments in the field, comes with a more powerful toolkit for visualization options and statistical analysis (gephi.org, n.d.). the rapidly growing field of social network analysis, especially for research done on social media, has spurred the development of a wide range of tools, such as netlytic and socioviz, that are designed for such purposes. for advanced users, packages and integrations in the statistical computing environment r and programing language python, such as igraph, can enable more complex analysis, although knowledge about these programming and query languages is usually required. while a plethora of tools for visualization and analysis for network research exists, most studies in the field use simple edge lists, a type of relational database structure, as their basic data structure. few studies in digital humanities have focused on graph database creation and management as a foundation for continuing research and teaching based on network analysis and visualization. compared to relational databases, which implicitly represent the relationships (or edges) between nodes (or entities) in a network, graph databases treat relationships as first class citizens that can have labels and properties, and thus represents them directly as objects. for example, when combined together, the “node table” and the “edge table” that the software gephi usually requires constitute a relational database. programs like gephi can use the “id” column (“primary key” in database terminology) to generate the edges (or relationships) when relevant values are referenced in the “edge table” in the “source” and “target” columns. 
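the node table/edge table structure described above can be produced with a few lines of python. this is a toy illustration with placeholder names, not data from the project; the Id/Label and Source/Target headers follow gephi’s usual spreadsheet-import conventions.

```python
import csv

nodes = [("1", "lawyer A"), ("2", "lawyer B"), ("3", "lawyer C")]   # placeholder names
edges = [("1", "2"), ("2", "3"), ("1", "3")]                        # collaborations

with open("nodes.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["Id", "Label"])
    writer.writerows(nodes)

with open("edges.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["Source", "Target"])    # edges reference the node ids
    writer.writerows(edges)
```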
in contrast, a graph database eliminates this extra layer of implicit referencing (“indexing” in traditional relational databases, which becomes resource expensive when data volumes and queries increase) (lazarevic, 2021). figure 1 illustrates this difference. this has significant implications for network analysis in digital humanities research and teaching. most directly, this property graph model in a graph database expands the dimension of information the edges can convey in a network. in relational databases, edges usually only convey directions and at most labels (categories), but they can carry easily expandable and modifiable properties in graph databases. this means that for long term projects such as this one (which, because of the current incompleteness of the source data, calls for continuing addition of data), graph database allows for more possibilities in terms of efficient and versatile querying and expansion. for example, while one would have to change several rows or even key tables in a relational database if new nodes need to be added into existing relationships or relationships need to be modified, one could simply annotate and change single entities in a graph database to achieve the same effect. since the data collection of this project is still ongoing, this project adopts a graph database model and uses a graph database management system to house the data. figure 1 an example of a relational database versus a graph database structure. on the top, the node table on the left and the edge table on the right represent that the three lawyers are connected to each other when combined together in a relational database. in contrast, the graph database structure on the bottom directly represents the nodes and edges (with properties). 7pan journal of open humanities data doi: 10.5334/johd.76 another reason why this project uses a graph database structure is the property graph’s strengths of representing edges (or relationships). in a property graph, edges (or relationships) in the database can have not only directions, but also properties and labels. for example, while representing a lawsuit that was filed and received rulings from a number of courts at different times and with different results (which are properties of the edges or relationships) can be complicated and even unattainable in a relational database, one can easily add simple “filed_to” and “ruled” edges (which are first class citizens in the graph database) with expandable properties like “date” and “result” in a graph database. the creation of a graph database can be realized in open-source versions of software geared towards commercial use, such as graphql and neo4j. this project utilizes neo4j and its query language cypher to build and manage the database. neo4j is a graph database management system developed by neo4j, inc., and this project utilizes its gpl3-licensed open-source “community edition,” which is free to the public (neo4j inc., n.d.). neo4j is also available in an “enterprise edition” under a closed-source commercial license and is used mainly commercially in fields like fraud detection, supply chain management, and identity and access management. due to its versatile functions and a myriad of add-ons, neo4j is very suitable for network analysis research in digital humanities. in fact, this project is not the first digital humanities initiative to utilize neo4j. 
most related to the subject of this project, the graphing pow camp mistreatment in wartime japan portion of the war crimes documentation initiative (wcdi) at the university of hawai’i at mānoa also uses neo4j and the query language cypher to “illuminate the connections and relationships among seemingly discrete war crimes trials that the allied authorities held in asia and the pacific following the end of world war ii” (war crimes documentation initiative, n.d.). in addition, historian javier cha also uses neo4j to study the network of yangban aristocracy in medieval korea (cha, 2019, 2021/2022). all such previous scholarship chose the graph database structure and the neo4j platform partly because of their strengths in data expandability and relationship representation, the same reasons why this project chose these methods, which will be discussed later in the paper. in neo4j and other graph database management systems, the database is organized according to a schema, namely, what the nodes are and how they are connected (by what relationships with what properties). for example, the wcdi graph database uses a pole (persons, objects, locations, events) principle to guide the development of its schema, in which “person” nodes, which are affiliated with “location,” are victims or perpetrators of “crime” who are charged in “trial,” among other relationships (war crimes documentation initiative, 2021). the graph database for this project uses a primary schema that can be modified when the data are processed in other tools such as graphxr (kineviz inc., n.d., discussed later).11 the primary schema consists of the node classes “lawsuit,” “court,” “lawyer,” and “defendant.” in the primary schema, plaintiffs are recorded as properties of the “lawsuit,” and geographical coordinates are properties of the “court,” which can be mapped using visualization tools like graphxr. the schema is the foundation of a graph database, and thus its modeling entails significant interpretation of the data. however, since the schema can be modified through extractions of properties from both nodes and relationships in graphxr without affecting the database itself, the structure of the primary schema here has more flexibility.12 in terms of the relationships, “lawsuit” is fought “against” “defendant,” and is “filed_to” the “court,” which “ruled” on the case; “lawsuit” is also “fought” by “lawyer,” who “collab” with each other. figure 2 illustrates this schema along with some other metrics of the database (such as number of nodes and relationships). 11 for example, in some analyses with tools like graphxr, “lawsuit” nodes are eliminated and deprecated as a property of the “collab” relationship between “lawyer” to facilitate the execution of a community detection algorithm on “lawyer.” 12 for example, one might question why “lawsuit,” a type of event, is a node here instead of set of properties or relationships. in fact, in graphxr, one can easily produce both of such forms in different network manipulations. this can be achieved using the transform functions in graphxr (how to graphxr, n.d.). 8pan journal of open humanities data doi: 10.5334/johd.76 creating the graph database in neo4j comes with several strengths. first, one can easily query the database with the cypher language for different combinations of information needed from the database. 
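before turning to the querying examples below, a minimal sketch of how one fragment of this schema could be built and then read back with cypher through the official neo4j python driver may be useful. the node classes and relationship names follow the primary schema just described ("lawsuit," "court," "lawyer," "defendant"; "against," "filed_to," "ruled," "fought"), but the connection details, label and property spellings, relationship directions, and example values are illustrative assumptions (loosely based on the liu lianren case discussed earlier), not the project's code.

```python
from neo4j import GraphDatabase

# Connection details are placeholders; adjust them for your own Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# One fragment of the primary schema: a lawsuit fought against a defendant,
# filed to and ruled on by a court, and fought by a lawyer. Filing and ruling
# dates/results live as properties on the relationships themselves.
build_fragment = """
MERGE (s:Lawsuit {title: $title, origin: $origin})
MERGE (d:Defendant {name: $defendant})
MERGE (c:Court {name: $court})
MERGE (l:Lawyer {name: $lawyer})
MERGE (s)-[:AGAINST]->(d)
MERGE (l)-[:FOUGHT]->(s)
MERGE (s)-[:FILED_TO {date: $filed}]->(c)
MERGE (c)-[:RULED {date: $ruled, result: $result}]->(s)
"""

list_rulings = """
MATCH (c:Court)-[r:RULED]->(s:Lawsuit)
RETURN s.title AS lawsuit, c.name AS court, r.date AS date, r.result AS result
"""

with driver.session() as session:
    session.run(
        build_fragment,
        title="Liu Lianren forced-labor lawsuit",
        origin="Chinese",
        defendant="Japanese state (kuni)",
        court="Supreme Court of Japan",  # placeholder: only the final ruling court is named in the text
        lawyer="Takahashi Toru",
        filed="1996", ruled="2005", result="dismissed",
    )
    for record in session.run(list_rulings):
        print(record["lawsuit"], record["court"], record["date"], record["result"])

driver.close()
```

merge rather than create is used so that re-running the script does not duplicate nodes or relationships when the database is expanded by hand over time.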
for example, if one wants a table for the title and the first filing date of all the lawsuits originating from mainland china, one can simply write the query shown in figure 3 (namely, matching nodes with the value “chinese” as their property “origin” and returning the info needed) and export a table (as csv or in another format) from either the neo4j sandbox (a free browser version of neo4j), neo4j desktop, or other free neo4j software. second, the graph database is freely modifiable and expandable. as a relatively small database (currently with 519 nodes and 4270 relationships) whose sources are mainly gathered as sporadic leads instead of automatically generated data, the expansion of the database will most probably be done through human input and will contain possible human errors. with neo4j, one can easily add new nodes, relationships, and properties (including those connecting to existing nodes), modify them, and examine possible errors with cypher query codes. thirdly, the neo4j browser (the querying interface in the neo4j software) also comes with simple data visualization tools; if one wants to quickly visualize and examine parts of the database, one can do so with a simple line of query as shown in figure 4. with these features, we can tabulate some obvious statistics for the movement. for example, we can easily export all 259 filings and rulings in the movement by extracting such information from the properties of the “lawsuit” nodes and “filed_to” and “ruled” relationships, and we can see that the majority (200) of the filings were dismissed in different forms, and only 21 filings (only 2 of which were at supreme court) were completely upheld.13 figure 5 is a summary of such results. this merely confirms the general wisdom that the litigation activism in the reparation movement served more to give the plaintiffs a platform to articulate their own 13 there are a variety of terms for dismissing a filing in japanese courts, including 棄却 (kikyaku), 却下 (kyakka), and 取下 (torisage). they have nuanced (but important, depending on the context) legal differences, but for the plaintiffs, they were all essentially “dismissals” of their cases by the japanese courts. figure 2 schema and data summary of the graph database. figure 3 query result in the neo4j browser of lawsuits by chinese plaintiffs and their filing dates at local courts. 9pan journal of open humanities data doi: 10.5334/johd.76 experiences as legal (as well as historical and personal) testimonies, rather than to substantially achieve financial remittances. however, putting information about the movement in a graph database makes exporting, tabulating, and visualizing such general observations (through simple querying) scalable and modifiable, and thus easy to achieve. in the following sections, this paper will examine other capabilities of the graph database on top of these functions. (4) applications: visualization and teaching potentials: a geo-spatial analysis of the progression of the reparation movement a large and significant advantage of network analysis in digital humanities teaching and research is the techniques to visualize and intuitively present complex relationships among entities that traditional text descriptions simply cannot convey. putting the data of the reparation movement figure 4 simple visualization of one lawsuit and connected nodes in neo4j browser. figure 5 excerpt and pie chart from the lawsuit filings tabulation exported from the database. 
10pan journal of open humanities data doi: 10.5334/johd.76 into a graph database greatly strengthens this forte of network analysis. the management system of neo4j is equipped with a wide variety of graph apps for such purposes, and one of them, graphxr, is particularly suited for presenting our data. in itself, graphxr is a browserbased visual analytics platform compatible with a variety of database systems including neo4j, for which its mother company kineviz built the native graph app (kineviz inc., n.d.). like neo4j, it also has a free open-source “explorer” version that allows users to store up to three projects with less than 1000 nodes, which is perfect for our database at this stage. another advantage of graphxr over other visual analytics tools is that it can visualize networks in both 2d and 3d space and is compatible with vr devices, which opens up a wide variety of teaching and research possibilities. in this section, i will discuss two examples of visual presentation with our graph database in graphxr, before going, in the next section, into the visual analytics that use the centrality and community detection algorithms in graphxr and neo4j. as mentioned, data can be loaded into graphxr in different ways. when connected to a neo4j database, nodes and relationships can be pulled onto the scene using either the search function or with cypher or other kinds of query codes.14 hence, we can choose to load only part of the database. the first visualization example consists of a time-lapse playback of the lawsuits filed over time and the filing and ruling courts. to achieve this, we can use the cypher code to only load the “lawsuit” nodes and “court” nodes. using the “layout,” “filter,” and “map” functions of graphxr, we can create a video-like time-lapse view of the lawsuits by their filing and ruling dates, in which we arrange the “lawsuit” nodes into a circle for visibility, filter by the ruling and filing dates properties recorded in the database, and pin the “court” nodes onto the map using their latitude and longitude properties recorded in the database. in the actual graphxr interface, one can zoom into the view and double-click on specific nodes to examine detailed information. figure 6 is a snapshot of this view, with the blue edges being the “filed_to” relationships and the red ones the “ruled” relationships. one can use the link in the caption to access the time-lapse view in a shareable graphxr instance. 14 for a tutorial to graphxr, see: (how to graphxr, n.d.). figure 6 end-point snapshot of time-lapse of lawsuits filed in japanese courts from 1970s to 2010s: a time-lapse view can be accessed at https://graphxr. kineviz.com/ share/610cd4dd 7218aa003c1fcfb0/ lawsuit/6114a2507218aa00 3c268e32 (with the “filter” function on the left of the graphxr interface). https://graphxr.kineviz.com/share/610cd4dd7218aa003c1fcfb0/lawsuit/6114a2507218aa003c268e32 https://graphxr.kineviz.com/share/610cd4dd7218aa003c1fcfb0/lawsuit/6114a2507218aa003c268e32 https://graphxr.kineviz.com/share/610cd4dd7218aa003c1fcfb0/lawsuit/6114a2507218aa003c268e32 https://graphxr.kineviz.com/share/610cd4dd7218aa003c1fcfb0/lawsuit/6114a2507218aa003c268e32 https://graphxr.kineviz.com/share/610cd4dd7218aa003c1fcfb0/lawsuit/6114a2507218aa003c268e32 11pan journal of open humanities data doi: 10.5334/johd.76 to take this visualization a step further, one can also create new nodes representing the locale of origin of the plaintiffs in the lawsuits and overlay them onto this view (figure 7). 
this can be achieved by extracting the “origin” property of the “lawsuit” nodes into another category of nodes using the “transform” function in graphxr and adding geographic coordinate properties to them. to enhance visibility, one can also tilt the map and arrange the nodes in 3d space and delete or hide the “court” nodes. the time-lapses intuitively demonstrate that although taiwanese and zainichi (people of korean origin residing in japan) plaintiffs launched the first several lawsuits in the 1970s (which were mainly about their unequal treatment by the japanese government compared to their japanese counterparts who served the empire in similar roles), lawsuits from south korea and mainland china quickly became the majority beginning in the late 1980s. most of these suits originated in the tokyo local court, especially those against the japanese state (kuni) as the defendant, indicating tokyo as the central stage for the national and international political aspects of litigation activism in the movement. a significant minority of the lawsuits were filed in locales where the atrocity on trial occurred (such as sites of forced labor). these were often supported by local networks of lawyers and activists who were not as connected with those based in tokyo, and this will be discussed in later sections. although queries in neo4j can also reveal these findings, being able to directly visualize and observe them definitely carries great value in itself, especially for teaching and presentation. conclusion using the reparation movement for japanese colonial and wartime atrocities as a case study, this paper has demonstrated the applicability of graph database management systems in the research and teaching of network analysis and other cultural analytics fields. compared to using a relational database structure, which is currently utilized by most network analysis research, graph databases using the property graph model, such as those used by neo4j, can accommodate more layers of expandable network data and more efficiently and versatilely query them for analysis by treating not only nodes but also edges in the database as first class citizens with easily modifiable properties. storing data about the reparation movement thus figure 7 end-point snapshot of time-lapse of lawsuits by origin of the plaintiffs visualized in 2d and 3d space: a time-lapse view can be accessed at https://graphxr. kineviz.com/share/610cd4 dd7218aa003c1fcfb0/ lawsuit/615def8e782c 180052a81724. https://graphxr.kineviz.com/share/610cd4dd7218aa003c1fcfb0/lawsuit/615def8e782c180052a81724 https://graphxr.kineviz.com/share/610cd4dd7218aa003c1fcfb0/lawsuit/615def8e782c180052a81724 https://graphxr.kineviz.com/share/610cd4dd7218aa003c1fcfb0/lawsuit/615def8e782c180052a81724 https://graphxr.kineviz.com/share/610cd4dd7218aa003c1fcfb0/lawsuit/615def8e782c180052a81724 https://graphxr.kineviz.com/share/610cd4dd7218aa003c1fcfb0/lawsuit/615def8e782c180052a81724 12pan journal of open humanities data doi: 10.5334/johd.76 enables a variety of approaches to visualize and analyze the movement. for example, in addition to being able to export all the iterations of information in relational data forms (as different kinds of edge and node lists, for example), the graph database of the reparation movement can also achieve time-series and geospatial visualizations and analyses of nodes and edges. 
this paper uses the tool graphxr to create time-lapse views of where the lawsuits came from, were filed, and were ruled on throughout the last four decades. from such visualizations, one can intuitively observe how the reparation movement has grown dramatically since the late 1980s and how countries like china and south korea have come to the forefront of the movement. this technique with graphxr can also be used for any other graph and relational database with time stamp properties on nodes or edges, greatly broadening the horizon for pedagogy and research with network data.15 as mentioned above, the data size of this project is still comparatively small and incomplete at the moment, so the next steps would be a) continuing to amass source data and make addition and modification to the current database, and b) conducting statistical analysis, such as centrality and closeness tests with nodes, with data in the database, to better understand networks in the reparation movement. additional file the additional file for this article can be found as follows: • pan, keyao. (2022). “networking for historical justice: the application of graph database management systems to network analysis projects and the case study of the reparation movement for japanese colonial and wartime atrocities”, https://doi.org/10.7910/dvn/ cz4pbo, harvard dataverse, v1, unf:6:p8zad6owocdzt0iiu8lr5w== [fileunf]. acknowledgements i express gratitude to lawyer ichinose keiichirō for providing legal advice regarding the privacy and ethics of the data collected for this paper. competing interests the author has no competing interests to declare. author affiliations keyao pan orcid.org/0000-0001-6756-823x department of history, florida international university, miami, usa references 731部隊細菌戦国家賠償請求訴訟. (n.d.). [the lawsuit for compensation for unit 731 germ warfare]. retrieved september 5, 2021, from http://www.anti731saikinsen.net/saiban/1shin/index.html azurebamboo. (2021). azurebamboo/graph-database-for-reparation-lawsuits-againstjapan-for-colonial-and-wartime-atrocities-. https://github.com/azurebamboo/ graph-database-for-reparation-lawsuits-against-japan-for-colonial-and-wartime-atrocities (original work published 2021). cha, j. (2019). to build a centralizing regime: yangban aristocracy and medieval patrimonialism. seoul journal of korean studies, 32(1), 35–80. doi: https://doi.org/10.1353/seo.2019.0003 cha, j. (2022). medieval yangban [jupyter notebook]. https://github.com/javiercha/medieval_yangban (original work published 2021). gephi.org. (n.d.). gephi—the open graph viz platform. retrieved september 8, 2021, from https://gephi. org/ gordon, a. (ed.). (1993). postwar japan as history. university of california press. doi: https://doi. org/10.1525/9780520911444 hein, l. (2003). “war compensation claims against the japanese government and japanese corporations for war crimes.” in torpey, j. (ed.). politics and the past: on repairing historical injustices. rowman & littlefield publishers. pp. 127–147. 15 graphxr has a forum that showcases other such use cases, such as tracking sars-cov-2 genetic drift across migration patterns (law, 2022). 
to cite this article: pan, k. (2022). networking for historical justice: the application of graph database management systems to network analysis projects and the case study of the reparation movement for japanese colonial and wartime atrocities. journal of open humanities data, 8: 11, pp. 1–13. doi: https://doi.org/10.5334/johd.76

published: 25 april 2022

copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press.

how to graphxr. (n.d.). retrieved september 16, 2021, from https://helpcenter.kineviz.com/learning-center/hc/how-to-graphxr.820445303.html
kineviz, inc. (n.d.). kineviz graphxr: visual analytics, graph bi, and more... retrieved september 10, 2021, from https://www.kineviz.com
koga, y. (2016). inheritance of loss: china, japan, and the political economy of redemption after empire. the university of chicago press. http://pi.lib.uchicago.edu/1001/cat/bib/11267502. doi: https://doi.org/10.7208/chicago/9780226412276.001.0001
law, a. (2022, january 21). "tracking sars-cov-2 genetic drift across migration patterns—gallery." kineviz community. https://community.kineviz.com/t/tracking-sars-cov-2-genetic-drift-across-migration-patterns/159
lazarevic, l. (2021). "graph databases vs relational databases – learn how a graph database works." freecodecamp.org. https://www.freecodecamp.org/news/graph-database-vs-relational-database/
neo4j inc. (n.d.). neo4j graph platform – the leader in graph databases. retrieved september 9, 2021, from https://neo4j.com/
palladio app. (n.d.). about. retrieved september 8, 2021, from http://hdlab.stanford.edu/palladio/about/
pan, k. (2020). "networking for war criminal amnesty: the establishment of japan's war convicted benefit society." the asia-pacific journal: japan focus, april 1, 2020, volume 18, issue 7, number 1, article id 5383. https://apjjf.org/2020/7/pan.html
pan, k. (2021). beyond postwar, beyond nation: "human rights" and the "history problem" in modern japan and asia. phd dissertation, the university of chicago. doi: https://doi.org/10.6082/uchicago.2975
seraphim, f. (2006). war memory and social politics in japan, 1945–2005. harvard university asia center. doi: https://doi.org/10.1163/9781684174478
war crimes documentation initiative. (2021). wcdi data and visualization. arcgis storymaps. https://storymaps.arcgis.com/stories/63a5d9da553145479f7a09dec21246fb [last accessed 12 april 2022].
war crimes documentation initiative. (n.d.). wcdi graphing pow camp mistreatment in wartime japan. retrieved september 10, 2021, from https://manoa.hawaii.edu/wcdi/projects/pow-camp-mistreatment/
wdci. (2021, april 21). wcdi data and visualization illuminating the scale and patterns of japanese war crimes in asia and the pacific. retrieved april 12, 2022, from https://storymaps.arcgis.com/stories/63a5d9da553145479f7a09dec21246fb
yamamoto, s. (n.d.). 日本戦後補償裁判総覧 [overview of postwar reparation lawsuits]. accessed may 3, 2021, http://justice.skr.jp/souran-jp-intro.html
すおぺいネット [suopei.net]. (n.d.). retrieved september 6, 2021, from https://www.ne.jp/asahi/suopei/net/3_saiban/5_renko/saiban_renko.htm
経済企画庁. (1976). 復刻経済白書. 日本経済評論社. [economic planning bureau of japan (1976). economic whitepapers reprint. nihon keizai hyoronsha ltd.] http://id.ndl.go.jp/bib/000011289346
訴訟記録などの資料の匿名処理について [about the anonymity in our records]. (n.d.). retrieved september 7, 2021, from http://justice.skr.jp/anonymous.html
編<ハンドブック戦後補償>編集委員会. (1992). ハンドブック戦後補償, シリーズ・問われる戦後補償; 別冊. 梨の木舎. http://id.ndl.go.jp/bib/000002303114 [postwar reparation handbook edition committee. handbook of postwar reparation. nashinokisha, 1992].
編戦後補償問題研究会. (1990). 戦後補償問題資料集, 第1集. 戦後補償問題研究会. http://id.ndl.go.jp/bib/000002232620 [committee of postwar reparation study. compilation of postwar reparation problems, volume 1. committee of postwar reparation study, 1990].

from static to interactive maps: drawing diachronic maps of (latin) modality with pygmalion research paper corresponding author: paola marongiu institut des sciences du langage, university of neuchâtel, neuchâtel, switzerland paola.marongiu@unine.ch keywords: visualisation; modality; diachrony; latin language; diachronic map; pygmalion tool to cite this article: marongiu, p., &
dell’oro, f. (2022). from static to interactive maps: drawing diachronic maps of (latin) modality with pygmalion. journal of open humanities data, 8: 2, pp. 1–14. doi: https://doi.org/10.5334/johd.58 from static to interactive maps: drawing diachronic maps of (latin) modality with pygmalion paola marongiu francesca dell’oro *author affiliations can be found in the back matter of this article abstract in this paper, we present the diachronic maps of a selection of 75 latin modal markers designed through the tool pygmalion. both the maps and pygmalion were conceived in the framework of the woposs project, which aims at analysing the diachronic pathways of modality in latin. while the description of the tool and its functionalities is beyond the scope of this paper, we focus here on the description of our diachronic modal maps. using visualisations to represent semantic shifts is a well-known practice in some linguistic fields such as typology and lexicography, and they have already been applied to modality. though the situation is rapidly evolving, typological semantic maps as well as lexicographic maps are still for the most part static and usually not-interactive visualisations. our modal maps stand out not only for their interactivity, but also for the richness of the information conveyed: chronology, etymology, semantics, syntax, first attestation and diachronic relationships between the meanings. after presenting our conceptual framework for modality, we illustrate the process of conceptualisation and development of our diachronic maps of modality. more specifically, we explain how we gathered and organised the data in order to transpose it into a visual representation. we then showcase the map of possum as an example of our results. subsequently, we discuss the results with respect to previous literature concerning both visualisation of modal evolution from a general point of view and the investigation of modality in latin. finally, we outline possible applications within and beyond the woposs project. mailto:paola.marongiu@unine.ch https://doi.org/10.5334/johd.58 https://orcid.org/0000-0002-5060-3307 https://orcid.org/0000-0002-8343-356x 2marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 (1) context and motivation in this paper we show how modality and some related semantic, syntactic and textual aspects can be visualised along a diachronic axis by using digital, interactive maps. we focus on the case of modal meanings in the latin language and we showcase how we transposed the gathered data from a discursive to a visual form. in this introductory section, after having defined modality, we briefly present the research project from which the latin maps stem. finally, we outline the types and subtypes of modality that the user will find in our diachronic maps. modality is a fascinating, though sometimes elusive, concept. indeed, in the relevant literature, the term is used to refer to different notions. in our work (cf. dell’oro, 2019 drawing on nuyts, 2016) as well as in this paper we intend it as the association of the notional domains of necessity, possibility and volition to any state of affairs (soa): an example is the sentence i must go now, in which must is the modal marker of necessity and i … go now is the modalised soa. the diachronic maps of modality presented here were mainly conceived as a synthesis of lexicographic works on latin modal markers to be variously used in the framework of the woposs1 project (cf. 
dell’oro, 2020; dell’oro, bermúdez sabel & marongiu, 2020). woposs aims at reconstructing the pathways of modality in the diachrony of the latin language on a corpusbased approach. in order to achieve this goal we are currently setting up and annotating a corpus of latin texts spanning from the 3rd-century bce to the 7th-century ce. in order to obtain visualisations of the continuity or of the changes of the modal meanings along the dimension of time, we set up pygmalion (2020b, cf. also 2020a and 2021), a user-friendly tool to create highly informative, interactive, diachronic visualisations of words, functions or concepts and their relationships (cf. dell’oro, bermúdez sabel & marongiu, forthcoming). though initially conceived to describe latin lemmas and modal concepts, pygmalion works for any language, with any script and can be applied to represent any notion in its diachronic evolution. thanks to this tool we designed diachronic visualisations for each of the 75 latin modal markers— lemmas or constructions—selected to be analysed in the context of the woposs project. these markers were selected as they were considered to be the most representative for the expression of modality in latin. note that we did not limit our choice to verbs, but we also included other parts of speech. though we focus on latin, our methodology can be easily applied to other languages. for the annotation of the woposs corpus we adopt a fine-grained analysis (see dell’oro, 2019) based on 23 modal subtypes2 and on the possibility of multiple annotations of the same modal passage. however, in the case of the maps we decided to simplify the annotation schema, while preserving the main types and subtypes of modality (see figure 1; in the examples, modal markers are in bold). we distinguished two basic types for dynamic modality—necessity or possibility3—, five subtypes for deontic modality—i.e., permission and obligation (both depending on the presence of a source of authority), acceptability (evaluative modality), volition and intention4—and one basic type for epistemic modality. the modal maps are mainly based on the thesaurus linguae latinae (1900–)—from now on thll5—, complemented with other reference works and selected literature (see 2.1 below). the resulting descriptions are adapted and enriched according to the woposs theoretical framework that we have outlined above. the paper is organised as follows. in the next section we outline the steps followed to draw the interactive diachronic maps (see 2 below). then we show the main results that we obtained, taking the map of possum as a reference example and highlighting our contribution to the previous discussion (see 3 below). finally, we present some implications and applications of the maps, from research to didactics (see 4 below). 1 https://woposs.unine.ch/about.php (last accessed: 10 december 2021). 2 the woposs theoretical framework is mainly based on the systematisation by nuyts (2016). 3 for the subtypes of dynamic modality i.e., participant-inherent, participant-imposed and situational see nuyts (2016, pp. 34–35). 4 the inclusion of volition and intention in the domain of deontic modality is a matter of debate, see nuyts (2016, p. 37). 5 the thll is available in open access at https://thesaurus.badw.de/en/tll-digital/tll-open-access.html. 
https://doi.org/10.5334/johd.58 https://woposs.unine.ch/about.php https://thesaurus.badw.de/en/tll-digital/tll-open-access.html 3marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 (2) method we usually drew a single map for each lemma annotated in the woposs project, though sometimes different lemmas (e.g., an adjective and the corresponding adverbs) are outlined in the same map. the choice of drawing one or more maps was determined by the relative complexity of the description, as we did not want to charge a map with an excessive amount of information, though technically this would have been possible. each lemma is described under different aspects: etymology, semantics and syntax (in particular collocations) with respect to their diachrony. our reference model is a map that is able to convey knowledge about the following aspects: 1. a precise chronology (within the limits imposed by data) 2. the etymology 3. the (dis-)continuity of a meaning over time 4. the relations between the meanings 5. the directionality of the relations 6. some examples and relevant textual references figure 1 main types and subtypes of modality used in the diachronic maps of modality. https://doi.org/10.5334/johd.58 4marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 7. the presence of constructions, including collocations 8. the presence of groups of meanings 9. the presence of incertitude in defining some of the previous aspects 10. the presence of analytical categories, such as modal types most of such issues (1 to 4, 6, 7 and 10) were already solved in our prototype map of the modal lemma potestas ‘power, possibility’ (bermúdez sabel, dell’oro, & marongiu, 2020), in which modal types were color-coded. in the following we will focus on the process underlying the development of a diachronic modal map. (2.1) gathering and organising the information the first type of information displayed in the maps is the etymology. we mainly relied on de vaan (2008) and ernout and meillet (2001), complementing them with meiser (1998), walde and hofmann (1938–1956), and meyer-lübke (1935). the core of the maps is the list of meanings (or functions), together with their first attestations, positioned with respect to a chronological axis. the description of the semantics of each modal marker is based on the thll, a monolingual monumental dictionary of latin, written in latin. the thll ideally reconstructs all the meanings attested for a lemma, listing the majority of the attestations of each meaning and organising them in chronological order. however, while the thll is the best and most complete dictionary of latin (though the project is still ongoing), its description of modal markers is not compliant with any specific framework of a theory of modality. therefore, building on the information given by the thll, we add our semantic analysis, which is oriented at identifying the modal meanings of each marker in a more finegrained way. when relevant, we also add premodal and postmodal meanings, as defined in typological studies, in particular the one by van der auwera and plungian (1998). sometimes the analysis of the attestations brings to light meanings, or, more often, specific nuances that were not mentioned by the thll. when this is the case, we include them in the description. for instance, one of the meanings illustrated by the thll for debeo is “to be obliged by necessity”. 
however, a closer analysis of the attestations revealed that this meaning could have three different modal readings depending on the context: dynamic necessity, deontic authority and deontic acceptability. we decided to represent in the map three instances of this meaning, one for each modal reading.6 at times, in particular in the case of collocations, we also needed to add one or more meanings to the ones already suggested by the thll. an example is the collocation necessitatem habeo in the map of necessitas ‘necessity’. the thll specifies all possible constructions (i.e., with the accusative necessitatem standalone or followed by a gerund(ive) in genitive or by an infinitive), but it does not associate it with any specific meaning. building on the semantic analysis of the attestations, we assigned necessitatem habeo two meanings: ‘to show/imply a state of necessity’ and ‘to be under the necessity of’. the former is not modal, while the latter shows a basic value of necessity further specified into dynamic or deontic modality depending on the context.7 the changes or adjustments that we made to the semantic pieces of information given by the thll mainly resulted from a deeper analysis of the modal meanings of the relevant lemma. we rarely added or modified meanings that did not show any possible modal reading, unless it seemed necessary. as an example, we can mention the case of possum ‘to be strong, to be able’. for this verb, the thll presents a specific context of use, in which possum can be referred to words, money and similar. in this context, possum indicates the value of something: the economic value of money, or the semantic value of a word, i.e., its meaning. however, the thll does not provide a specific meaning description for these cases. therefore, we decided to introduce the meaning ‘having some meaning or value’, as we did not want to discard this use of the verb, although not modal.8 6 map of debeo: https://woposs.unine.ch/maps/map-debeo.php. 7 map of necessitas: https://woposs.unine.ch/maps/map-necessitas.php. 8 map of possum: https://woposs.unine.ch/maps/map-possum.php. https://doi.org/10.5334/johd.58 https://woposs.unine.ch/maps/map-debeo.php https://woposs.unine.ch/maps/map-necessitas.php https://woposs.unine.ch/maps/map-possum.php 5marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 the thll offers extremely detailed descriptions that are often organised in different sections depending on various criteria, such as the animacy of the participants, the context of use, or specific constructions of the lemma (e.g., with dative, with gerund(ive) in genitive or with the subordinating conjunction ut). as the maps need to be informative and legible at the same time, we partially discarded this type of information. however, we made some exceptions, based on our interest in the study of modality. for the modal marker licet ‘it is allowed’ the thll lists, among the others, the meaning ‘it is necessary’. this is in contrast with the basic value of possibility expressed by the marker, and with the other listed modal meanings (e.g., ‘it is allowed’, ‘it is permitted’, ‘to have the chance/possibility/opportunity of doing/undergoing something’). in fact, the thll adds that this meaning is attested in plautus with an ironic nuance, where it is used as a mockery. 
in this case, the indication “ironically, hapax in plautus” appears in the map next to the meaning ‘it is necessary’ in order to avoid possible confusion on the part of the user.9 as mentioned above, the maps also inform about syntactic structures or collocations in which the modal marker participates and that convey a specific meaning altogether. for instance, the lemma certus ‘certain’ can be used in collocations such as certum habeo ‘i consider it certain’, certo certius ‘certainly’, certior sum ‘to be informed’.10 therefore, in the map of certus the user will also find these three collocations and their respective meanings. in some cases, the thll organises the materials according to a semantic criterion, which we tried to preserve by gathering the meanings around the suggested groups. an example is the lemma incertus ‘uncertain’, whose description is divided into two sections: active sense, describing someone who is uncertain, and passive sense, relative to something doubtful, of which one should doubt. in the digital map, the two sections are represented as semantic groups, and each meaning is assigned to the relevant one. for instance, the meaning ‘lacking knowledge, unaware’ refers to the semantic group labelled ‘who is in doubt’, whereas ‘not determined, delimited, defined’ is assigned to the semantic group ‘doubtful’. for some lemmas a description of the thll is not available yet. therefore, we decided to rely on the oxford latin dictionary—old (1968). this is the case for nolo ‘to want not’, (ne)queo ‘(not) to be able’, usus ‘use’ (with the modal meaning of ‘need’ arising in collocations such as usus venit and usus est), valeo ‘to have strength, to be able’, volo ‘to want’ and voluntas ‘will’. although the old is undoubtedly less rich in information, especially about semantic groups and contexts, we applied the same procedure, by extracting the list of meanings with their first attestations and providing additional modal analysis. specific constructions indicated by the dictionary were also listed as constructions/collocations, when pertinent. once all relevant pieces of information are gathered, the meanings are organised in space according to a chronological criterion. in this phase we provide an additional layer of analysis that aims at individuating the pathways of the semantic evolution of the lemma throughout time. this is done by detecting the relations between the different meanings of a lemma, based on their semantics and on previous studies. an important reference for this operation is represented by the models illustrated in van der auwera and plungian (1998), in a cross-linguistic perspective, and magni (2005, 2010), for an application to latin. the former elaborated a ground-breaking model of a cross-linguistic modal map, illustrating semantic relations involving premodal, modal and postmodal concepts in both synchrony and diachrony.11 building on this model, magni (2005, 2010) developed a comprehensive map specific to latin. her map is based on the analyses of the modal pathways of debeo, licet, necesse (est) ‘(it is) necessary’, oportet ‘it is right, proper’, and possum (for a further discussion of the interplay between our maps and previous studies, see 3.2 below). (2.2) from the draft to the digital version the digital map is created by means of the tool pygmalion. the information is encoded by filling out two forms. 
in the first form the user enters the lemma for which the map will be created and the etymological, chronological and semantic information, together with the attestations. 9 map of licet: https://woposs.unine.ch/maps/map-licet.php. 10 map of certus: https://woposs.unine.ch/maps/map-certus.php. 11 the model is based on the single pathways for possibility and necessity elaborated by bybee, perkins and pagliuca (1994). https://doi.org/10.5334/johd.58 https://woposs.unine.ch/maps/map-licet.php https://woposs.unine.ch/maps/map-certus.php 6marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 the second form is dedicated to determining the relations between the meanings and their directions. in the first form, the number of etymological steps can be set by the user. for the modal maps we encoded the proto-indo-european root and the proto-italic form that gave rise to the relevant latin lemma. the user can also choose the width of the chronological subdivision to appear in the bar above the map: it can be set by centuries, decades or specific years. our diachronic modal maps are based on centuries. then, the meanings selected as pertinent are encoded one by one. for each meaning of the lemma we indicate: if it is conveyed by a collocation and which one (as for certus); if it is part of a larger semantic group (as for incertus); if it is modal (and what type of modality); the passage in which it is first attested; the century of the first attestation. in the second form, the relations are drawn by selecting the two semantically related meanings and the direction of the link between them. it is also possible to leave the direction of the relation unspecified. we chose this last option for cases of two related meanings that are both first attested in the same century or, more generally speaking, when the directionality of the relation could not be established for some reason. the etymological steps, the modal readings, or the relations between meanings can all be flagged as uncertain in the pygmalion forms. once the two forms have been filled out, the map is automatically saved and can be downloaded as a json file. it can be visualised by uploading it on the pygmalion platform at https://woposs. unine.ch/map.php. (2.3) supplementary material: location and format of the data the supplementary material accompanying this article contains the data (i.e., the maps) and additional instructions on how to use them. the maps are stored in a repository in zenodo.12 the dataset contains: a readme file which provides information about the contents of the dataset and the context in which the dataset was produced, and instructions on how to use the data; a json file for each modal map. the reason for using json format is that the tool employs the javascript library d3 for rendering the information in a visual form. figure 2 shows how information about the meaning ‘ought, should (for logical or similar reasons)’ of the lemma debeo is encoded in a json file, and figure 3 shows how it is rendered visually. the structure of a json file consists of pairs of name and corresponding value. in this case, the meaning 12 doi: 10.5281/zenodo.5735624 for the last version. figure 2 section of the json file for the map of debeo. figure 3 map of debeo, with a focus on the meaning ‘ought, should (for logical or similar reasons)’. 
https://doi.org/10.5334/johd.58 https://woposs.unine.ch/map.php https://woposs.unine.ch/map.php https://doi.org/10.5281/zenodo.5735624 7marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 ‘ought, should (for logical or similar reasons)’ is the value associated with the name ‘definition’. constructions (including collocations) or semantic groups are specified by the two corresponding names, respectively ‘construct’ (an abbreviation for ‘construction’) and ‘group’. as in this case no constructions or semantic groups were associated with the meaning, their value is null, represented here with ‘_’. as more than one type of analysis is possible for a meaning, the information about semantic analysis is encoded in the array of objects ‘analysis’. within this array, the name ‘category’ illustrates the modal (or not modal) reading (‘modal: epistemic’), while the other names are associated with the date of emergence and disappearance of the meaning,13 the first attestation, the certainty of the modal analysis (defined by the boolean values ‘true’ or ‘false’) and the relationships between the concerned meaning and the other meanings in the map. the information about the relationships is stored in three different arrays depending on their type: ‘origins’ for directed relationships (ingoing arrow), ‘destinations’ for directed relationships (outgoing arrow), and ‘unspecified’ for undirected relationships (simple line). each meaning (or semantic analysis, in case there are more modal readings for the same meaning) is assigned a random id (in this case ‘mkny’). the ids are then used as values to create the relations between meanings. (3) results and discussion (3.1) main results: an overview of an interactive map and its functionalities in this section we will give an overview of an interactive map, taking possum as a reference example (figure 4). the etymology of the lemma is displayed in the arrow above the map, which is divided in as many sections as the number of etymological steps set by the user. in the case of possum, the arrow shows first the indo-european root *pót-i-, followed by the proto-italic forms *potiand *pot-ēand their meanings, respectively ‘master, in control of’ and ‘to be master’, ending with the latin paradigm possum, potuı̄ , posse. the chronological bar below, set by centuries, indicates the time span in which the attested meanings of the lemma appear. all the meanings are displayed inside arrows. each arrow begins in correspondence with the century in which the meaning was first attested. as the thll does not usually provide information about the date of (possible) disappearance of a meaning, the arrows continue until the end of the chronological bar. one exception is represented by the already mentioned meaning 13 the date of disappearance is never specified (“disparition”: “none”), as the lexicographic resources we used do not provide this type of information. figure 4 visualisation of the meanings of possum. https://doi.org/10.5334/johd.58 8marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 ‘it is necessary’ for licet (see 2.1 above). as the thll specifies that it is only attested in the aulularia by plautus, the arrow starts and ends in the 2nd-century bce. the additional semantic analysis is color-coded: a legend at the beginning of the map explains the correspondences. 
for what concerns the modal readings, we have found attestations for the expression of dynamic possibility (dark blue), e.g., ‘to be able’; deontic authority (light blue), e.g., ‘it is allowed, fair, appropriate’; dynamic necessity (purple), ‘~ to have to’, where the symbol ‘~’ indicates an approximation of the use of possum in that context to the meaning ‘to have to’; and epistemic modality (green), ‘it is possible’. the meaning ‘it is possible’ with a deontic authority value is enclosed in dotted lines, which indicate that this type of reading is doubtful. indeed, the deontic authority value seems to be attested only in this specific fragmentary passage. since we do not have access to the context of this fragment, we cannot be sure about the modal reading. non-modal meanings (grey) are ‘to have some meaning or value’ and ‘to be very powerful, to prevail’. we could not find any postmodal value for possum. on the other hand, the meaning ‘to be strong, powerful’ was analysed as premodal, leading to dynamic possibility values.14 one innovative feature of the maps is interactivity: the users can set the visualisation according to their needs. the first attestations are not displayed by default, as this would weigh down the visualisation. however, granting access to the first attestations is important for two reasons: each meaning is located on the chronological axis based on this information; moreover, the semantic context in which the marker appears is fundamental for determining a modal reading. in addition, in some cases different arrows can display meanings with the same wording but different modal readings. as shown in figure 4, in the map of possum the meaning ‘it is possible’ can have a dynamic possibility, an epistemic or a (somewhat doubtful) deontic authority modal reading. the user can retrieve the different contexts by hovering over the arrow containing the selected meaning: the first attestation appears showing details about the author, the work and the line or verse corresponding to the attestation. we kept the thll abbreviations, so that the user can directly refer back to the index available on the thll website.15 another interactive function is the visualisation of the semantic links between the meanings (figure 5). if the directionality of the relation is specified, the link will appear as a line beginning or ending with an arrow. if the direction is not specified, the two meanings will be linked by a simple line. the user can display the relations by clicking on the arrow containing the selected meaning. this will show all the meanings that were at its origin (outcoming arrow) or were derived from it (incoming arrow). in order to visualise all the meanings again, the user just needs to double click on one of them. the user can also choose the type of visualisation according to three sorting methods: chronological order, semantic groups and constructions/collocations. if the user selects the first option, the meanings are displayed from the most ancient to the most recent (figure 4). the presence of semantic groups or constructions/collocations, if any, is not displayed with this option. the method ‘groups’ shows the meanings gathered by a criterion defined by the user in the first pygmalion form. the map of possum does not have any groups, but that of incertus, as mentioned in (2.1), shows two groups, labelled ‘doubtful’ and ‘someone who doubts’. 
the method ‘constructions/collocations’ offers a global visualisation of the meanings where those associated with a construction or collocation appear above (figure 6). 14 as mentioned above, we use “premodal” and “postmodal” in the sense of van der auwera and plungian (1998). 15 https://www.thesaurus.badw.de/tll-digital/index/a.html (last accessed: 10 december 2021). figure 5 visualisation of the links between some of the meanings of possum. https://doi.org/10.5334/johd.58 https://www.thesaurus.badw.de/tll-digital/index/a.html 9marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 possum has two collocations: fieri potest ut ‘it can happen, be that’ (dynamic possibility) and potestur with passive infinitive ‘it is possible’, which receives two modal analyses (dynamic possibility and deontic authority). the option ‘constructions/collocations’ is also used for forms of the same lemma associated with a specific meaning. we exploited this method to illustrate the latin developments of proto-italic *lic-e-: alongside the impersonal modal verb licet, we find the active form liceo ‘to be for sale, salable’, ‘to have for sale’ or ‘to have the possibility/chance/opportunity of doing/ undergoing something’ (the latter with a dynamic modal reading), and the middle liceor ‘to bid, make an offer’ or ‘to bargain, negotiate’. each lemma is also illustrated via a network graph shown below the map, where the nodes are the meanings and the edges are the semantic relations between the meanings (figure 7). figure 6 visualisation of the collocations of possum. figure 7 network graph of possum. https://doi.org/10.5334/johd.58 10marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 the edges can be directed or undirected, depending on whether we were able to establish the directionality of the relation. the network graph does not give information about attestations, semantic groups, or collocations. however, the chronological information is color-coded: the most ancient meanings are represented by darker blue nodes, and the shade of blue becomes lighter as the meanings become more recent. the network graph is meant to offer a synthetic overview of the meanings and their relations. (3.2) further results: the maps as a reference tool for studying modal markers the digital modal maps were initially conceived for being used within the project in two main respects. first of all, they represent a support tool for the annotation task. they provide the annotator with a concise and intuitive overview of each modal marker to be annotated in the woposs corpus. at the same time, the maps are also used as a reference tool for our study of modal markers. at this stage, as explained in (2), they result from the organisation of lexicographic information in the theoretical and more abstract schema of the semantic maps of modality devised by van der auwera and plungian (1998) and magni (2005, 2010). however, on a more advanced stage of the project, this first version of the maps will be compared against the evidence obtained from the annotated woposs corpus. the aim is to complement or modify the maps basing on the annotated data: this could concern a more ancient attestation for a modal marker or a modal meaning, a semantic nuance that was not captured on the base of the thll or the old, a new collocation, a change in the relations between the meanings. our maps also raise some interesting issues that we will illustrate by means of the case of licet. 
magni's work already showed that an attentive review of the latin attestations can be revealing. for instance, she points out that possum is not a case of demodalisation, as suggested by van der auwera and plungian (1998, p. 106), since the premodal meaning is attested early, already in plautus (magni, 2010, p. 217). building on this, our maps contribute to the discussion in two main respects. first of all, thanks to a clear chronological visualisation, they make it evident when the direction of a diachronic change cannot be established on the basis of the available attestations, but needs to be reconstructed on the basis of comprehensive, cross-linguistic studies. while semantic maps in the traditional sense offer a linear visualisation of the semantic changes, our maps, by integrating a philological and lexicographic approach, represent more clearly the actual situation we have to deal with, given the available evidence. for instance, the map of licet shows a very early postmodal meaning conveyed by the construction 'licet + subjunctive', where licet has a concessive value. notice that this meaning is attested as early as the deontic and dynamic ones.16 our maps also allow us to highlight a caveat in the use of lexicographic sources for the study and representation of semantic changes, when these are not complemented with other studies. although the thll is extremely rich in attestations, by relying entirely on this type of resource we risk neglecting rare (modal) meanings or specific semantic shifts. this is even more true for the old, which provides a smaller number of attestations. the case of licet provides a good example. magni, building on núñez (1991, pp. 77–81), classifies licet as a modal verb, as it can express both dynamic/deontic modality and epistemic modality. however, during our analysis of the attestations provided by lexicographic sources, we did not encounter any instance of epistemic modality for licet. the two authors do, however, present some evidence for this type of reading. the most striking example is given by núñez (1991, p. 185): 1. lucr. 5, 597 illud item non est mirandum, qua ratione tantulus ille queat tantum sol mittere lumen, quod maria ac terras omnis caelumque rigando compleat et calido perfundat cuncta vapore. nam licet hinc mundi patefactum totius unum largifluum fontem scatere atque erumpere lumen, ex omni mundo quia sic elementa vaporis undique conveniunt et sic coniectus eorum confluit, ex uno capite hic ut profluat ardor. 16 without a further corpus-based analysis, it is not possible to date the grammaticalisation of licet as the concessive conjunction 'although'. the main issue is that it is often hard to determine whether licet is used as a verb with a concessive value or as a conjunction. "another thing also need not excite wonder, how it can be that so small a sun emits so much light, enough to fill with its flood seas and all lands and the heavens, and to suffuse all with warm heat.
for it is possible that from this place is opened one single fountain of the whole world, to splash its generous flood and to fling forth light, because the elements of heat gather together from all parts of the world in such a manner, and their assemblage flows together in such a manner, that the heat flows out here from one single source.”17 in the text that follows this passage, not included for reasons of space, lucretius offers two other hypotheses that could explain the phenomenon under discussion, i.e., the ability of the sun to emit so much light. therefore, we can agree with núñez on a predominant epistemic reading of licet in this passage. it is worth mentioning that the attestations offered by the thll did not include the passages cited by magni and núñez; moreover, both authors (magni, 2010, p. 228; núñez, 1991, p. 184) underline that epistemic readings of licet are rare. building on this, we decided to add to the map of licet the epistemic reading, attested by this passage in lucretius’ de rerum natura. this example proves that sometimes relying on lexicographic sources is not enough to provide a representation of the diachronic evolution of a modal marker as complete as possible. however, it also allows us to show the flexibility of our maps, which can be enriched with new information gathered from our sources, such as other researchers’ work (as in this case) or more extensive corpus data. in fact, a more comprehensive, corpus-based semantic analysis of the attestations could capture instances of relatively rare modal readings, as it was the case for licet, providing new and different insights on the evolution of modal markers in latin. as already mentioned, we will be able to provide this type of analysis in the next stage of the project, by means of our annotated corpus. (4) implications and applications (4.1) implications for the practice thanks to the progress of computer science the field of data visualisation is rapidly evolving. therefore, it is important to highlight possible implications of our visualisations in designing modality and modality-related issues, as well as their role as linguistic tools of synthesis and discovery. our maps represent the digital transposition of traditional non-interactive drawings which are found in many dictionaries and other types of publications. by exploiting computer technologies and in particular the tool pygmalion, we combined this transposition not only with interactivity, but also with the possibility to add layers of information without overloading the design of the map. from the point of view of more traditional linguistic approaches, our map model sits between that of lexicography—we will use the term ‘lexicographic maps’— and that of typology—their maps are usually called ‘semantic maps’ (for an application of semantic maps to the domain of modality, see van der auwera & plungian, 1998).18 it is beyond the scope of this paper to detail similarities and differences between diachronic lexicographic maps, traditional diachronic semantic maps and our diachronic maps of modality. moreover, it must be underlined that the domain is rapidly evolving, as shown by the “lexical diachronic semantic maps” and the state of the art outlined by georgakopoulos and polis (2021). notably, their model is based on network graphs and quantitative methods. 
considering these recent developments, we can say that, while traditional lexicographic maps represent the etymological history of a lemma by focusing on formal and semantic changes and their relationships, traditional diachronic semantic maps—which often rely on cross-linguistic data—represent the organisation of semantic space with reference to meanings or functions and their relationships. our diachronic maps of modality combine some of the features of the two traditional types of maps. as diachronic lexicographic maps and diachronic semantic maps, our modal maps present the diachronic feature too, but they offer a more precise anchoring to the temporal dimension thanks to the chronological bar. this represents a drastic difference in particular with typological 17 translation by rouse and smith (1924). 18 one of their maps is reproduced in dell’oro (2019, p. 14). other important references for the semantic map model are croft (2001) and haspelmath (2003). https://doi.org/10.5334/johd.58 12marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 maps: the latter are usually meant to be universally valid generalisations, and therefore are not tied to a specific temporal dimension. moreover, our maps contain much more data about the represented items, as illustrated in (2) and (3). another similarity shared by all the aforementioned types of maps is the representation of the relationships between the meanings or functions (or more generally the represented items). however, in our maps, thanks to their interactivity, the relationships between the meanings only appear when clicking on an arrow. this way, the visualisation is not weighed down by an excessive amount of information, and the user can select specific items and focus on the relations of each one of them individually. to conclude, while lexicographic and semantic maps are intended to offer synthetic overviews, our maps are rather conceived as tools to interactively explore various types of information. in fact, the major contribution of our map model to the practice of designing maps is probably the fact of allowing the user to explore a big amount of data at the same time. (4.2) applications our maps have a wide range of applications that go beyond the scope of the woposs project itself. they could complement the available lexicographic resources for these lemmas, providing a finer analysis of their modal readings. as shown above, e.g., for possum, this type of analysis is not present in the thll nor the old, even when they mention the basic modal meaning(s) for a lemma. we also find that the maps can represent a benchmark for scholars interested in different topics, such as modality and evolution of modal pathways, semantic change, history of the latin language, latin linguistics. this is especially due to the fact that they summarise in an interactive image a considerable amount of information that is gathered from many different sources. the consistent description provided for each lemma can orient scholars in their research, saving them a time-consuming task, selecting the most relevant information and providing additional analysis on modality that cannot be found in any of these sources. if necessary, the files of our maps can be downloaded and changed according to the requirements of various circumstances. another field of application for the maps could be the classroom. 
the benefits of involving visualisation techniques in the teaching and learning process are known in literature (arcavi, 2003; jessop, 2008; rattya, 2013; klerkx, verbert & duval, 2014 among others). the existing maps can be used to teach about modal categories and subcategories and their use in context, but also about more general (historical, semantic and partially syntactic, via the collocations) aspects of the latin language. for instance, they could be used to introduce pupils and students to the use of impersonal verbs. our set of modal maps features some impersonal verbs or constructions, e.g., respectively decet, licet, oportet and aequus est, necesse est, meum est among others. thanks to the option ‘constructions/collocations’ and to the attestations associated with each meaning, the learners can access the syntactic aspects of these verbs and constructions. moreover, the interactivity of the maps makes them suitable tools for individual learning: the levels and/or types of information presented in the map (etymology, semantics, chronology, constructions and collocations) are well separated, and learners can easily adapt the visualisation to their learning stage and objectives. acknowledgements we wish to thank helena bermúdez sabel for her contribution to the development of the model on which the maps are based and for her technical assistance in publishing the modal maps on the website of the woposs project. funding information this project has received funding from the swiss national science foundation (snsf project n° 176778) and it is hosted at the institut des sciences du langage (university of neuchâtel). https://doi.org/10.5334/johd.58 http://www.snf.ch/fr/pages/default.aspx https://www.unine.ch/islc/home.html https://www.unine.ch/ 13marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 competing interests the authors have no competing interests to declare. author contributions paola marongiu: conceptualisation, data curation, investigation, methodology, visualisation, writing – original draft, writing – review & editing francesca dell’oro: conceptualisation, funding acquisition, investigation, methodology, project administration, supervision, visualisation, writing – original draft, writing – review & editing author affiliations paola marongiu orcid.org/0000-0002-5060-3307 institut des sciences du langage, university of neuchâtel, neuchâtel, switzerland francesca dell’oro orcid.org/0000-0002-8343-356x institut des sciences du langage, university of neuchâtel, neuchâtel, switzerland references arcavi, a. (2003). the role of visual representations in the learning of mathematics. educational studies in mathematics, 52(3), 215–241. doi: https://doi.org/10.1023/a:1024312321077 van der auwera, j., & plungian, v. a. (1998). modality’s semantic map. linguistic typology 2(1), 79–124. doi: https://doi.org/10.1515/lity.1998.2.1.79 bermúdez sabel, h., dell’oro, f., & marongiu, p. (2020, july 22–24). visualisation of semantic shifts: the case of modal markers [poster session]. dh2020, ottawa, canada. doi: https://doi.org/10.17613/ scy4-br70 bybee, j. l., perkins, r., & pagliuca, w. (1994). the evolution of grammar: tense, aspect and modality in the languages of the world. chicago: university of chicago press. croft, w. (2001). radical construction grammar: syntactic theory in typological perspective. oxford: oxford university press. doi: https://doi.org/10.1093/acprof:oso/9780198299554.001.0001 dell’oro, f. (2019). woposs guidelines for annotation. 
université de lausanne. doi: https://doi. org/10.5281/zenodo.3560951 dell’oro, f. (2020, july 22–24). woposs – a world of possibilities. an open access annotated corpus to study the diachrony of modality in the latin language [conference presentation]. dh2020, ottawa, canada. doi: https://doi.org/10.17613/pp6j-0p86 dell’oro, f., bermúdez sabel, h., & marongiu, p. (2020, december 5–6). implemented to be shared: the woposs annotation of semantic modality in a latin diachronic corpus [conference paper]. sharing the experience: workflows for the digital humanities. proceedings of the dariah-ch workshop 2019. neuchâtel, switzerland. doi: https://doi.org/10.5281/zenodo.3739440 dell’oro, f., bermúdez sabel, h., & marongiu, p. (forthcoming). pygmalion in the classroom: a tool to draw lexicographic diachronic maps and their application to didactics. in m. márquez cruz & v. ferreira martins (eds.), la lexicografía didáctica: recursos para el aprendizaje de lenguas. madrid: guillermo escolar editor. ernout, a., & meillet, a. (2001). dictionnaire étymologique de la langue latine: histoire des mots (reprint of the 4th ed. with additions and corrections by jacques andré). paris: klincksieck. georgakopoulos, a., & polis, s. (2021). lexical diachronic semantic maps. mapping the evolution of time-related lexemes. journal of historical linguistics, 11(3), 367–420. doi: https://doi.org/10.1075/ jhl.19018.geo haspelmath, m. (2003). the geometry of grammatical meaning: semantic maps and cross-linguistic comparison. new psychology of language, 2, 211–242. doi: https://doi.org/10.5281/zenodo.831410 jessop, m. (2008). digital visualization as a scholarly activity. literary and linguistic computing, 23(3), 281–93. doi: https://doi.org/10.1093/llc/fqn016 klerkx, j., verbert, k., & duval, e. (2014). enhancing learning with visualization techniques. in j. m. spector, m. d. merrill, j. elen, & m. j. bishop (eds.) handbook of research on educational communications and technology, (pp. 791–807). new york, ny: springer. doi: https://doi. org/10.1007/978-1-4614-3185-5_64 magni, e. (2005). modality’s semantic maps. an investigation of some latin modal forms. journal of latin linguistics, 9(1), 325–336. doi: https://doi.org/10.1515/joll.2005.9.1.325 magni, e. (2010). mood and modality. in p. baldi & p. cuzzolin (eds.), new perspectives on historical latin syntax. constituent syntax: adverbial phrases, adverbs, mood, tense, 2, (pp. 193–275). berlin: mouton de gruyter. doi: https://doi.org/10.1515/9783110215458.193 https://doi.org/10.5334/johd.58 https://orcid.org/0000-0002-5060-3307 http://orcid.org/0000-0002-5060-3307 https://orcid.org/0000-0002-8343-356x http://orcid.org/0000-0002-8343-356x https://doi.org/10.1023/a:1024312321077 https://doi.org/10.1515/lity.1998.2.1.79 https://doi.org/10.17613/scy4-br70 https://doi.org/10.17613/scy4-br70 https://doi.org/10.1093/acprof:oso/9780198299554.001.0001 https://doi.org/10.5281/zenodo.3560951 https://doi.org/10.5281/zenodo.3560951 https://doi.org/10.17613/pp6j-0p86 https://doi.org/10.5281/zenodo.3739440 https://doi.org/10.1075/jhl.19018.geo https://doi.org/10.1075/jhl.19018.geo https://doi.org/10.5281/zenodo.831410 https://doi.org/10.1093/llc/fqn016 https://doi.org/10.1007/978-1-4614-3185-5_64 https://doi.org/10.1007/978-1-4614-3185-5_64 https://doi.org/10.1515/joll.2005.9.1.325 https://doi.org/10.1515/9783110215458.193 14marongiu & dell’oro journal of open humanities data doi: 10.5334/johd.58 to cite this article: marongiu, p., & dell’oro, f. (2022). 
from static to interactive maps: drawing diachronic maps of (latin) modality with pygmalion. journal of open humanities data, 8: 2, pp. 1–14. doi: https://doi.org/10.5334/johd.58 published: 12 january 2022 copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. meiser, g. (1998). historische lautund formenlehre der lateinischen sprache. darmstadt: wissenschaftliche buchgesellschaft. meyer-lübke, w. (1935). romanisches etymologisches wörterbuch (3rd ed.). heidelberg: c. winter. núñez, s. (1991). semántica de la modalidad en latín. granada: universidad de granada. nuyts, j. (2016). analyses of the modal meanings. in j. nuyts & j. van der auwera (eds.), the oxford handbook of modality and mood. oxford: oxford university press. doi: https://doi.org/10.1093/ oxfordhb/9780199591435.013.1 old = glare, p. g. w. (ed.) (1968). oxford latin dictionary. oxford: oxford university press/clarendon press. pygmalion. (2020a). dell’oro, f., rimaz, l. bermúdez sabel, h., & marongiu, p. (2020a). pygmalion-simple 1.0. a tool to draw interactive and diachronic semantic maps. woposs. a world of possibilities. swiss national science foundation. pygmalion. (2020b). dell’oro, f., rimaz, l., bermúdez sabel, h., & marongiu, p. (2020b). pygmalionmodal 1.0. a tool to draw interactive and diachronic semantic maps of modality. woposs. a world of possibilities. swiss national science foundation. pygmalion. (2021). dell’oro, f., rimaz, l., bermúdez sabel, h., & marongiu, p. (2021). pygmalion-colour 1.0. a tool to draw interactive and diachronic semantic maps of modality. woposs. a world of possibilities. swiss national science foundation. rattya, k. (2013). languaging and visualisation method for grammar teaching: a conceptual change theory perspective. english teaching: practice and critique, 12(3), 87–101. rouse, w. h. d., & smith, m. f. (eds.). (1924). lucretius. on the nature of things. cambridge, ma: harvard university press. thll = thesaurusbüro münchen internationale thesaurus-kommission (ed.) (1900–). thesaurus linguae latinae. berlin: de gruyter. de vaan, m. (2008). etymological dictionary of latin. leiden: brill. walde, a., & hoffmann, j. b. (1938–1956). lateinisches etymologisches wörterbuch (3rd ed., 3 vols.). heidelberg: c. winter. https://doi.org/10.5334/johd.58 https://doi.org/10.5334/johd.58 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ https://doi.org/10.1093/oxfordhb/9780199591435.013.1 https://doi.org/10.1093/oxfordhb/9780199591435.013.1 teaching linked open data using bibliographic metadata research paper corresponding author: terhi nurmikko-fuller centre for digital humanities research, australian national university, canberra, australia terhi.nurmikko-fuller@anu. edu.au keywords: linked open data; bibliographic metadata; pedagogy; participant evaluations to cite this article: nurmikko-fuller, t. (2022). teaching linked open data using bibliographic metadata. journal of open humanities data, 8: 6, pp. 1–11. 
doi: https://doi.org/10.5334/johd.60 teaching linked open data using bibliographic metadata terhi nurmikko-fuller abstract this paper describes ld4dh, the linked data for digital humanities: publishing, querying, and linking on the semantic web workshop at the digital humanities oxford summer school. it includes a description of the general structure of the workshop, how it has changed over the course of the last seven years, between 2015 and 2021, and evaluates the differences between in-person delivery in 2018–2019 and the online mode in 2020– 2021. discussion is centred on the description of the data as well as the illustration of the processes, methods, and software used throughout the workshop. the paper concludes with a summary of participant evaluation, and reflects on the opportunities and challenges of teaching linked open data to a mixed cohort of predominantly humanities researchers and professionals from the cultural heritage sector. mailto:terhi.nurmikko-fuller@anu.edu.au mailto:terhi.nurmikko-fuller@anu.edu.au https://doi.org/10.5334/johd.60 https://orcid.org/0000-0002-0688-3006 2nurmikko-fuller journal of open humanities data doi: 10.5334/johd.60 1 introduction the linked data for digital humanities: publishing, querying, and linking on the semantic web (henceforth, ld4dh1) workshop has formed part of the proceedings of the digital humanities oxford summer school (dhoxss) since 2012. i am an alumna of the workshop myself, having attended it as a participant in its first iteration. at that time in my academic career, ld4dh was a space to acquire essential practical skills for implementing linked open data (lod). having since become the convener and tutor of the same workshop, it has now become an annual highlight, an opportunity to fully immerse myself in the methodology, to discuss research, and to engage with diverse groups of researchers, academics, and glam (galleries, libraries, archives, and museums) sector professionals. the workshop aims to provide participants with an understanding of the theories behind lod as an information publication paradigm, and then build on that foundation with practical and hands-on activities. this deliberate pedagogical structure is designed to reflect the insight that neither the use and implementation of digital methods, nor the critical evaluation of the projects and platforms that have been developed using those methods can be taught exclusively in abstract terms (brier, 2012). in recognition of the role of collaboration and coauthoring in digital humanities (dh) research (needham & haas, 2019), workshop participants are encouraged to work together and communicate openly as a group. since 2015, i have taught ld4dh with john pybus and graham klyne, both from the oxford e-research centre at the university of oxford. the secret to our successful delivery of the workshop has not rested only on our friendship, or our common interest in lod, but also because of our differences in interests, expertise, and academic backgrounds. this diversity within the tutor group enables us to discuss each topic from different perspectives. there is no guarantee of unanimous agreement and that gives the learners access to a greater diversity of ideas. we can thus more confidently cater for the needs and intellectual preferences of diverse cohorts. 
in recognition of the challenges of the course, and the role that a pleasant and supportive learning environment can play in successful information and skills acquisition (imlawi & gregg, 2014), there has been a deliberate attempt to create a jovial and friendly atmosphere. humour is used to promote openness between the teachers and the learners, and to make acronyms and concepts more memorable. examples of this include theme-specific clothing, such as a pair of golden trousers worn in homage to the query language sparql2 (pronounced "sparkle") and a skirt with owls, in reference to the web ontology language (owl).3 2 structure of the week ld4dh is a five-day workshop. each day follows the same three-session pattern, with a different topic: a 90-minute theory session followed by a two-hour practical one. the final hour of the day is a lecture by a guest speaker discussing the use of lod in their area of research, often their own project (table 1). although there has been some flexibility in the list of speakers, in most years the topics have included numismatics (prof andrew meadows, university of oxford, speaking about nomisma.org4), digital musicology (dr kevin page, university of oxford, reporting on a range of projects), and digital libraries (prof stephen downie, university of illinois urbana-champaign, summarising the work of the hathitrust research center5), as well as geolocation and digital mapping (dr valeria vitale, now at the alan turing institute in london, and chair of the pelagios network6). other colleagues who have contributed to ld4dh include dr daniel bangert (talking about the jazzcats project7), dr paula granados garcía (from the open university, who gave a summary of the experience of an alumna of the workshop), as well as dr athanasios velios and prof donna kurtz, who have spoken on the topic of oxlod8 and its precursor projects, such as claros,9 respectively. 1 the acronym is the name of the slack channel (https://ld4dh-dhoxss.slack.com) and the twitter hashtag (#ld4dh) for this workshop. 2 https://www.w3.org/tr/rdf-sparql-query/. 3 https://www.w3.org/owl/. 4 http://nomisma.org/. 5 https://www.hathitrust.org/htrc. 6 https://pelagios.org/. 7 http://jazzcats.cdhr.anu.edu.au/. the common thread throughout all these diverse projects is the illustration of the practical use of lod in the humanities. all speakers favour open licensing for data and software, promote collaboration, and have developed tools that enable users to engage with lod without the need to learn programming. these talks strongly support the philosophy of ld4dh, and serve to provide a complementary and enriching context for the learners. over the years, ld4dh has undergone several tweaks, rearrangements, and changes. the most recent version has been a response to the covid-19 crisis, and the move to an entirely online delivery. the workshop became an hour-long lecture on the fundamentals of lod for a general audience, followed by an afternoon hands-on session for those who had opted to enrol in it.
the latter is delivered by dominic oldman and diana tanase, both of the british museum, using researchspace.10 at the time of writing, expectations are high for an in-person event in 2022, which would see a return to the pre-covid-19 mode of delivery of the ld4dh workshop. 3 data since 2016, the workshop has centred on the data of the elephãt11 (early english print in hathitrust, linked semantic worksets prototype) project.12 this prototype (which was funded through the andrew w. mellon foundation workset creation for scholarship analysis project award) combines bibliographic metadata from two very different types of collections: the behemoth hathitrust digital library (htdl), and the rather more boutique early english books online text creation partnership (eebo-tcp13). the aim of the elephãt project was to see whether two digital library collections (which at a distance appeared to share similarities, but on closer inspection had many idiosyncratic features) could be bridged at the metadata level (page, nurmikko-fuller, cole, & downie, 2017). both the htdl and eebo-tcp are aggregators: the eebo-tcp contains information from some 150 sources – the number of institutions (each of which contains a multitude of collections and sources) that form the hathitrust is closer to 250. the considerable variation both within and between these two large projects is evident from the metadata. although the data for the elephãt project thus consisted of two aggregated datasets, for the purposes of the ld4dh workshop the focus has been exclusively on the data from the eebo-tcp. the reasons for this are two-fold: first, throughout the project, the htdl data was modelled and provided by the hathitrust research centre (htrc) team, whilst the eebo-tcp data was worked on by scholars at oxford, meaning the team at oxford had the opportunity to gain familiarity with that dataset, thus making it easier to work with in the context of the workshop. 8 https://www.glam.ox.ac.uk/oxford-linked-open-data-pilot. 9 https://eng.ox.ac.uk/claros/. 10 http://researchspace.org/. 11 the project workset viewer is available at https://eeboo.oerc.ox.ac.uk/. 12 the project has been reported on by page, k. and willcox, p. in the 2015 project report, available from https://www.ideals.illinois.edu/bitstream/handle/2142/79017/eleph%c3%a3t%20final%20report-with_appendix-20150615.pdf. 13 https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/.
monday: 9:00–10:00 registration; 10:30–12:00 introduction to ld4dh; lunch; 13:30–15:30 introduction to lod; 16:00–17:00 guest speaker (numismatics)
tuesday: 9:00–10:30 ontologies (theory); 11:00–13:00 ontologies (praxis); lunch; 14:30–15:30 guest speaker (musicology)
wednesday: 9:00–10:30 producing rdf (theory); 11:00–13:00 producing rdf (praxis); lunch; 14:30–15:30 guest speaker (libraries)
thursday: 9:00–10:30 sparql (theory); 11:00–13:00 sparql (praxis); lunch; 14:30–15:30 guest speaker (alumna)
friday: 9:00–10:30 british museum (intro, theory); 11:00–13:00 british museum (praxis); lunch; 14:30–15:30 guest speaker (museums)
table 1 table illustrating the daily structure of the workshop. each day consists of a theory session, a hands-on session (praxis), and a talk by a guest speaker.
second, of the htdl data, 66% remains subject to copyright restrictions, limiting access and use (jett, nurmikko-fuller, cole, page, & downie, 2016). the eebo-tcp data, on the other hand, consisted of 25,000 records which became publicly available in 2015. the eebo-tcp data has a number of idiosyncrasies resulting from the combination of historical data and the processes of aggregation. this manifests in the dataset containing several categories for the same concept, e.g. discrete id numbers and titles. as the data is derived from historical sources, it contains a significant quantity of complex and messy details that do not sit comfortably with modern metadata categories (see for example the column "publisher" in table 2, which provides a sample of the dataset). rather than provide the learners with the entire, rather complex tei p5 xml14 files, a simplified .csv version of the data has been used for the hands-on activities at ld4dh. these tabular datasets were initially generated as part of the workflow for the elephãt project using a set of python scripts to pull out the data for author, publication place, publisher, date, six distinct id numbers, and three separate titles. the various .csv files, the python scripts and the custom-built project ontology eeboo are all available from the project github page.15 it is worth noting that the data wrangling at the stage of generating the .csv files did not involve any semantics. this is significant, as the match between the modern metadata category and the data contained within it is not always exact. for example, the "imprint" category (displayed in table 2 as "publisher") contains a large amount of additional information about the historical printing process, its funding model, and even geographical location, as these details were recorded in the original historical text. an example of this comes from the record a00648/stc 10783/estc s114801. the imprint data contains information about the individual carrying out the printing ("g.eld"); the individual commissioning the print ("roger barnes"); the location of the shop that sold it ("s. dunstans church-yard"); and the name of the street of said shop ("fleet street"). indeed, so rich is this information that in 2016 we carried out an investigation into the extraction of specific details from this data category using natural language processing (khan, nurmikko-fuller, & page, 2016). due to the dataset's internal richness and diversity, many learners opt to engage in some degree of data wrangling themselves, although it is possible to complete the workflow process without an additional step of data tidying (beyond the minting of uris). 14 that is to say, information that was captured as xml, in adherence to the text encoding initiative's (or tei's) p5 guidelines. for more information about the tei's p5 guidelines, see https://tei-c.org/guidelines/p5/. 15 https://github.com/oerc-elephat/preprocessed-elephant.
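the kind of light-weight tidying referred to above — splitting the author column of the .csv into a name and a date range, and stripping the uncertainty markers — can be sketched in a few lines of python. the file name, the column headings and the splitting heuristic below are illustrative assumptions that mirror the table 2 sample; this is not the project's own script.

```python
# illustrative only: split the eebo-tcp "author" field into name and dates and
# strip the uncertainty markers ([ ], ?) that can trip up later rdf tooling.
import csv
import re

def tidy_author(raw):
    # e.g. "bacon, francis, 1561-1626" -> ("bacon, francis", "1561-1626")
    raw = re.sub(r"[\[\]?]", "", raw).strip()              # drop uncertainty markers
    match = re.search(r"(\d{3,4})\D+(\d{3,4})\s*$", raw)   # trailing year range, if any
    if match:
        dates = f"{match.group(1)}-{match.group(2)}"
        name = raw[: match.start()].rstrip(" ,")
        return name, dates
    return raw, ""

# hypothetical file and column names, modelled on the table 2 sample
with open("eebo_tcp_sample.csv", newline="", encoding="utf-8") as handle:
    for row in csv.DictReader(handle):
        name, dates = tidy_author(row["author"])
        print(row["id0"], "|", name, "|", dates)
```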
author | pubplace | publisher | date | id0 | id1 | id2
fennor, william | london : | barnes, and are sold at his shop in s. dunstans churchyard in fleetstreet | 1615 | a00648 | stc 10783 | estc s114801
bacon, francis, 1561–1626 | london : | printed [by richard field] for felix norton and are to be sold in pauls church-yard at the signe of the parrot | 1604 | a01003 | stc 1111 | estc s104433
forser, edward, 1553?–1630 | london : | printed by b. a[lsop] for nathaniel butter, and are to be sold at his shop, at the pyed bull, neere saint austens gate | 1624 | a01075 | stc 11189 | estc s119405
bacon, francis, 1561–1626 | london : | printed by i. okes, for humphrey mosley, at the princes armes in pauls church-yard | 1638 | a01446 | stc 1157 | estc s100504
anonymous | londini : | in officina iohannis haviland | 1626 | a01639 | stc 1177 | estc s115271
godwin, francis, 1562–1633 | in vtopia [i.e. london?] | [j.bill] | 1629 | a01809 | stc 11944 | estc s118694
godwin, francis, 1562–1633 | [oxford] : | j. barnes | 1603] | a01812 | stc 11948 | estc s118380
table 2 a sample of the project data showing the information categories captured in the .csv file. the punctuation marks serve to capture uncertainty about the date; the "publisher" column cells each contain several data points.
many participants also opt to spend some of their free time engaged in additional research around the subject, merging their aim of technological up-skilling (during the workshop) with their existing sense of needing or desiring to have an understanding of the data prior to working with it. many workshop participants opt to separate out surname and first name, or at the very least, the date and the name (see the column "author" in table 2). although the punctuation marks (such as the colon and square brackets) are meaningful as indicators of ambiguity for the library professionals who created and painstakingly curated the original eebo-tcp data, these additional characters can be problematic in later stages of the rdf production workflow. for this reason, many learners choose to edit these characters out, essentially applying a reductionist approach to simplify the messy historical data into the categories of modern information representation systems. 4 theory each ld4dh workshop combines theory and praxis. the former includes aspects and considerations that are relevant to the entire dh community, far beyond the scope of the niche group of researchers who choose to use the lod methodology in their research. a major component of this theoretical part of the workshop is to equip the learners with enough information regarding the pragmatics, the challenges, and the opportunities presented by lod so that they can critically evaluate the method for its strengths and weaknesses. this in turn enables the participants to make an informed decision as to whether or not to engage with it in their research beyond dhoxss. it is not always the right tool. there are no silver bullets. another aspect of the theoretical component is jargon-busting. these sessions are crucial in establishing a shared vocabulary, to facilitate communication, and to support engagement with the material. these sessions also enable participants to engage in meaningful conversations with other members of the community of lod practitioners.
they also help to boost confidence in terms of using the appropriate terminology when discussing their project, and their technical needs, with their colleagues and the it service provision at their home institutions. we introduce core concepts such as the five star linked open data standard;16 the idea of knowledge graphs;17 and the rdf triple.18 all the examples of data as rdf that the participants encounter in the workshop are expressed in one specific syntax, .ttl (pronounced “turtle”, and one of several possible options, the most common alternatives currently being json-ld and rdf-xml)19 to provide learners with a sense of consistency between examples but specific activities during the week also enable them to learn about the possibility of using different syntaxes for representing rdf. among the participants at each iteration of ld4dh, there has been a small minority of those who attend for reasons other than wanting to learn how to use lod in their research. these include industry representatives, and the occasional “scout” – those who had been sent by their superiors to find out what lod is all about. to address their needs (as well as help those participants interested primarily in the the research potential of this method), the ld4dh lesson materials contain information to help them engage with the it support services at their own institutions. these cover the practical considerations of setting up a lod project that go beyond issues like researcher aims and institutional policies (both of which i discuss at length in nurmikko-fuller, in press), such as the need for a server (and a person to manage that server!); and the process by which to decide which triplestore is best for the project, and so on. we also compare lod to markup languages (such as xml) and standard relational databases. these discussions can help those who have prior experience of either of the two alternatives to quickly visualise the differences between them and rdf. the theoretical component of the workshop discusses vexed ethical issues associated with the use of this digital methodology. one of these (and arguably the one that is easiest for all of to relate to at a personal level), is the enormous potential it has to invade individual privacy. at the core of the linked data paradigm is a potential privacy crisis. it is the promise that this method can 16 https://www.w3.org/2011/gld/wiki/5_star_linked_data. 17 https://www.ontotext.com/knowledgehub/fundamentals/what-is-a-knowledge-graph/. 18 https://www.w3.org/tr/rdf11-primer. 19 https://www.w3.org/tr/turtle/. https://doi.org/10.5334/johd.60 https://www.w3.org/2011/gld/wiki/5_star_linked_data https://www.ontotext.com/knowledgehub/fundamentals/what-is-a-knowledge-graph/ https://www.w3.org/tr/rdf11-primer https://www.w3.org/tr/turtle/ 6nurmikko-fuller journal of open humanities data doi: 10.5334/johd.60 unlock knowledge by bringing information from disparate but complementary datasets together. when dealing with data such as historical records, this is of an immense benefit: scholars are able to create much more comprehensive pictures of the past by bringing together information from several different sources. but as i argue in my forthcoming book (nurmikko-fuller, in press), what if this technology is uncritically applied to us? many of us categorise information in specific places, and recoil from the idea that a third party would have access to all that information simultaneously. 
imagine finding out someone else was accessing our financial information, health records, employment history, and social media habits. even if information is held in separate databases with different details removed to anonymise the data, linked data has the potential to bring all these fragments together, thus effectively removing any and all anonymisation. the discussion of the theoretical foundations is supported, enriched, and diversified with daily hands-on activities, exposing the participants to both theory and praxis. the activities constitute the structural backbone of the workshop, and are arranged to follow the order of an rdf production workflow. 5 activities there are two types of workflow that structure ld4dh. at the macro-level, there is the rdf production process, which gives the workshop its cohesion. at the micro-level, each hands-on session has its own specific workflow, with task-appropriate software and session-specific learning objectives. throughout the week, participants focus on the same dataset, but move from familiarising themselves with the data (as illustrated in table 2 above) to modelling the content (converting .csv to .ttl). towards the end of the week, they progress from rdf production to writing sparql queries. this workflow has been reported on in the context of specific projects (nurmikko-fuller, bangert, & abdul-rahman, 2017; nurmikko-fuller, bangert, dix, weigl, & page, 2018), so i will limit the discussion to an outline of the activities to illustrate the learning objectives that bring the workshop together. there are four discrete tasks, as summarised in table 3. the first task requires participants to engage with data represented as rdf through a non-sparql endpoint (the so-called follow-your-nose approach to information discovery). in the past, this activity has focused on the use of the pubby ui;20 however, from 2022 onwards the plan is to use the four different uis available for dbpedia:21 dbpedia's own resource page,22 the openlink faceted browser,23 the openlink structured data editor,24 and the lodlive browser.25 this activity is most useful to those participants who have prior knowledge of database design and management, as it helps them to start thinking about and understanding the notion of information captured in rdf as an interconnected graph. as part of this session, participants also practise converting between different syntaxes of rdf such as turtle, rdf-xml, and json-ld using easyrdf.26 20 https://github.com/cygri/pubby. 21 the homepage for dbpedia is at https://dbpedia.org, but the dropdown list for the three uis is best accessed through a page for a resource. an example of such a page might be https://dbpedia.org/page/oxford. 22 as above, for oxford. 23 for example https://dbpedia.org/describe/?uri=http%3a%2f%2fdbpedia.org%2fresource%2foxford. 24 https://osde.demo.openlinksw.com/#/editor?uri=http:%2f%2fdbpedia.org%2fdata%2foxford.ttl&view=statements. 25 http://en.lodlive.it/?http%3a%2f%2fdbpedia.org%2fresource%2foxford. 26 https://www.easyrdf.org/converter.
day | task | software
monday | follow-your-nose approach | dbpedia interfaces, easyrdf
tuesday | design and implement ontologies | pen and paper, protégé
wednesday | producing instance-level rdf | web-karma, blazegraph
thursday | using triplestores and sparql | sparql playground, blazegraph
friday | exploring the british museum's collections | researchspace
table 3 table of the assigned tasks for each day of the five days of ld4dh.
in the middle are the tasks for each day; the right-hand column lists the specific software used for each task. https://doi.org/10.5334/johd.60 https://github.com/cygri/pubby https://dbpedia.org https://dbpedia.org/page/oxford https://dbpedia.org/describe/?uri=http%3a%2f%2fdbpedia.org%2fresource%2foxford https://osde.demo.openlinksw.com/#/editor?uri=http:%2f%2fdbpedia.org%2fdata%2foxford.ttl&view=statements https://osde.demo.openlinksw.com/#/editor?uri=http:%2f%2fdbpedia.org%2fdata%2foxford.ttl&view=statements http://en.lodlive.it/?http%3a%2f%2fdbpedia.org%2fresource%2foxford https://www.easyrdf.org/converter 7nurmikko-fuller journal of open humanities data doi: 10.5334/johd.60 the second task is to develop an ontological structure. as part of this process, participants must first familiarise themselves with the data (as illustrated in table 2). they are also asked to formulate the research questions their ontological structure would be able to answer, and spend some time reading over the scope notes or specifications of other existing ontologies, so as to establish an overview as to how this type of data has been represented by others. in the case of the elephãt project data, learners focus almost exclusively on bibliographic metadata ontologies such as bibframe,27 fabio,28 frbroo,29 mods/rdf,30 mads/rdf,31 and schema.org.32 they can also opt to incorporate aspects of these ontologies into their own ontological model, and to examine other vocabularies, schemas, and ontologies such as foaf.33 the activity includes a “show and tell” session where learners take turns to present their ontological models for peer-review by the others in the class. most participants are eager to engage with the praxis part of each day. engaging with software and producing results or finding answers creates a sense of doing dh. for the ontology development stage however, the “software” of choice is pen and paper. drawing and redrawing concepts is a lengthy, iterative process, and throughout the session the learners discuss and change aspects such as type of information category. some examples might include: is “person” sufficiently detailed? do we need “author” and “publisher” as different types of people? but what about when either is an institution? should the data be modelled as authors and publishers as types of “agent”? the aim is to create a schema-level representation of the data – to define the possible information categories and the relationships between those categories that are present in the dataset. it is only once a consensus of sorts has been reached – normally under increasing time pressure as the end of the workshop approaches – that the participants progress to the implementation stage. this is done using protégé.34 it was chosen for two reasons: first, it is a popular tool, used across various different disciplines beyond humanities, arts, and social sciences; and, the point-and-click ui means that users do not need to acquire additional (potentially distracting) programming skills to complete this stage of the workflow. once complete, the ontological model is exported from protégé as a .ttl file. this syntax of rdf is selected because it is most suitable for use in the next stage of the workflow. the third step is to combine the original dataset (available as a .csv and illustrated in table 2) and the ontology (exported as a .ttl) to produce instance-level rdf. 
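to make the schema-to-instance step concrete, the sketch below expresses one row of the table 2 sample as instance-level rdf with rdflib and serialises it as turtle. it is an illustration only: it stands in for the web-karma mapping described next, and the ex: namespace, class names and property names are made up for the example rather than taken from the eeboo ontology.

```python
# illustrative only: one row of the table 2 sample as instance-level rdf.
# the namespace, class and property names are placeholders, not the eeboo ontology.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/eebo/")   # made-up namespace for the example

g = Graph()
g.bind("ex", EX)

work = EX.a00648              # minted uri for the record a00648
author = EX.fennor_william    # minted uri for the author of that record

g.add((work, RDF.type, EX.Work))
g.add((author, RDF.type, EX.Person))
g.add((work, EX.hasAuthor, author))
g.add((work, EX.publicationPlace, Literal("london")))
g.add((work, EX.publicationDate, Literal("1615")))
g.add((author, EX.personName, Literal("fennor, william")))

# the same graph can be written out in different rdf syntaxes
print(g.serialize(format="turtle"))   # format="xml" or "json-ld" also work in rdflib 6+
```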
the tool of choice at dhoxss for this part of the process has been web-karma,35 a free and open software from the university of southern california. like protégé, this tool also has a point-and-click ui, which makes the mapping between the .csv and the .ttl file possible without the need for programming. it provides a visualisation of the resulting graph, which is a clear and convenient way to check the progress and accuracy of the data mapping, and fix any possible errors. the fourth and final stage of the process is two-fold. first, participants upload their rdf into a triplestore, and second, they learn to explore the new knowledge graph using the sparql protocol and rdf query language (sparql).36 a conscious decision has been made not to explore the protocol aspect, focusing exclusively on the query language. our decision reflects existing advice (ducharme, 2013) that describes the protocol as “rules for how a client program and a sparql processing server exchange sparql queries and results. these rules are … mostly 27 https://www.loc.gov/bibframe/. 28 https://sparontologies.github.io/fabio/current/fabio.html. 29 http://www.cidoc-crm.org/frbroo/home-0. 30 https://www.loc.gov/standards/mods/modsrdf/. 31 https://www.loc.gov/standards/mads/rdf/. 32 https://schema.org/. 33 http://xmlns.com/foaf/spec/. 34 https://protege.stanford.edu/. 35 https://usc-isi-i2.github.io/karma/. 36 sparql is a recursive acronmy, meaning that the ‘s’ in sparql stands for “sparql”! it combines a protocol and a query language for rdf, but for the purposes of ld4dh, we have chosen to focus on the query language aspect exclusively. https://doi.org/10.5334/johd.60 http://schema.org https://www.loc.gov/bibframe/ https://sparontologies.github.io/fabio/current/fabio.html http://www.cidoc-crm.org/frbroo/home-0 https://www.loc.gov/standards/mods/modsrdf/ https://www.loc.gov/standards/mads/rdf/ https://schema.org/ http://xmlns.com/foaf/spec/ https://protege.stanford.edu/ https://usc-isi-i2.github.io/karma/ 8nurmikko-fuller journal of open humanities data doi: 10.5334/johd.60 an issue for sparql processor developers”. given that the ld4dh workshops are not for sparql processor developers, the protocol is not covered. anecdotally, most participants have appeared to exhibit the most uncertainty and lack of confidence when asked to engage with sparql. they seemed to regard it as the most technical of the tasks. undoubtedly this was due in part to a lack of readily available wysiwyg (or what you see is what you get) uis or graphical user-interfaces (guis). these would make the task less daunting by hiding the code behind a more familiar search box. the solution for this was the introduction of the sparql playground37 into the curriculum – this provided participants with simple and easy-to-read examples that allowed them to build up their familiarity with sparql in a step-by-step process. for a number of years, the triplestore of choice was virtuoso38 based on two factors: first, it, like the other tools encountered in the context of ld4dh, was at one time a free tool (in the sense of both gratis and libre39); second, it was the triplestore of choice for the original elephãt project. this benefited the workshop as the teaching staff were familiar with the triplestore and its idiosyncrasies. in 2018, the decision was made to switch to blazegraph.40 it emerged as the triplestore of choice because it remains free and open, as well as being relatively intuitive to manage. 
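to give a flavour of this final step, the following self-contained sketch runs a simple sparql query with rdflib's in-memory engine as a stand-in for blazegraph; the toy data reuses the made-up ex: namespace from the previous sketch, so every name in it is illustrative rather than the workshop's actual material.

```python
# illustrative only: a sparql query of the kind learners write on the thursday,
# listing works with their authors and publication dates from a toy graph.
from rdflib import Graph

data = """
@prefix ex: <http://example.org/eebo/> .

ex:a00648 a ex:Work ;
    ex:hasAuthor ex:fennor_william ;
    ex:publicationDate "1615" .

ex:fennor_william ex:personName "fennor, william" .
"""

g = Graph()
g.parse(data=data, format="turtle")

query = """
PREFIX ex: <http://example.org/eebo/>
SELECT ?work ?name ?date
WHERE {
  ?work a ex:Work ;
        ex:hasAuthor ?author ;
        ex:publicationDate ?date .
  ?author ex:personName ?name .
}
ORDER BY ?date
"""

for row in g.query(query):
    print(row.work, row.name, row.date)
```

against a running triplestore such as blazegraph, the same query text would instead be submitted to the store's sparql endpoint rather than run in memory.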
by the end of thursday afternoon, the learners have completed an entire workflow for converting tabular data into a knowledge graph. they have familiarised themselves with the data; produced an ontological model to enable them to represent the information within that dataset in a meaningful way, and to answer their desired research questions; they have produced instance-level rdf; and, successfully uploaded those triples into a triplestore, and completed sparql queries over them. in many ways, thursday represents the pinnacle of the dhoxss experience: it is the day where the challenges are the greatest, the frustrations the deepest, the euphoria of success the highest. the week culminates in a day-long exploration of british museum data. friday’s activities are primarily focused on applying the skills that have been acquired throughout the week, as opposed to up-skilling in a new area or technical ability. in 2018 and 2019, the ld4dh workshop concluded with a hands-on practical session (a mini-workshop, of types) exploring the british museum’s collection using the researchspace tool.41 the day is largely run by dominic oldman and diana tanase of the british museum, and presents the learners with an opportunity to apply their knowledge to a genuine, real-world lod project, and to see how much they have learnt in the course of the week. they are also able to assess which skills they find most useful, relevant, and worth developing further. we also provide links and suggestions for additional tools (such as open refine42 for tidying data) and publications (ducharme, 2013; wood, zaidman, ruth, & hausenblas, 2013; van hooland & verborgh, 2014), as well as solving idiosyncratic problems (usually connected to projects the learners are working on outside of the ld4dh workshop). the afternoon trip to the royal oak43 is, of course, purely optional, but well-attended. 6 evaluations there are two things that most learners at the ld4dh have in common: they have identified lod as a methodology they are interested in or that might have value for them. at the same time, it is very rare to have participants who have prior experience of the methodology (see 37 https://sparql-playground.sib.swiss/. 38 https://virtuoso.openlinksw.com/. 39 for those interested in the topics of open access and open source, the wikipedia article on gratis and libre provides an succinct and easy-to-read summary of the differences between the two, and their application to intellectual property, computer code, and other relavant outputs” https://en.wikipedia.org/wiki/gratis_ versus_libre. 40 https://blazegraph.com/. 41 note that this refers specifically to the researchspace tool (https://researchspace.org/), and not to the british museum’s defunct orginal sparql endpoint (which at one time was available from http://collection. britishmuseum.org/). at the time of writing, the latter has been inaccessible for at least half a decade, and has never been used for the exercises at ld4dh. 42 https://openrefine.org/. 43 the pub on woodstock road, which, according to signage within the building was in the 1770s “a desolate spot”. 
https://doi.org/10.5334/johd.60 https://sparql-playground.sib.swiss/ https://virtuoso.openlinksw.com/ https://en.wikipedia.org/wiki/gratis_versus_libre https://en.wikipedia.org/wiki/gratis_versus_libre https://blazegraph.com/ https://researchspace.org/ http://collection.britishmuseum.org/ http://collection.britishmuseum.org/ https://openrefine.org/ 9nurmikko-fuller journal of open humanities data doi: 10.5334/johd.60 for example, the independent reports from 201744 and 201845). the workshop presents an opportunity for up-skilling, but it is also an intense experience. participant feedback was available for four years, 2018–2021 inclusive. this provides a balanced set of two years when the workshop was taught in person in oxford, and two of which were online. the number of respondents was small, but increased annually: only three of the 20 participants filled in the feedback form in 2018. this increased to six in 2019, eleven in 2020, and twelve in 2021, resulting in just 32 responses across four years (the size of the group is largely capped at 20 due to the limitations of available teaching space, but some additional students attend each year). the questions also changed with the move to the online medium: in 2018 and 2019 there is no data regarding the country of origin of the participants, but the corresponding information from 2020–2021 (“the covid-years”) shows a spread of eleven different countries (the netherlands, uk, italy, india, canada, portugal, germany, china, switzerland, mexico, and spain). participants from earlier years (although this is not captured in the survey data explicitly) are known to include those from at least france, austria, norway, and sweden, as well as the uk. none of the feedback forms collected demographic details of participants such as age or gender, focusing instead on levels of professional development and domain. with just one exception (a software researcher), all participants across all years were either academics (including students, early career, mid-career, and late career) or glam-sector professionals. of the 32 respondents, nine described themselves as falling into more than one category, most frequently as being both researchers and practitioners – this is not surprising given the nature of dh more generally and the interest and uptake of lod specifically. with such a low response rate (32 attendees) it is difficult to draw conclusions of any statistical significance. having said that, the respondents were a heterogeneous group and, albeit selfselected, in some respects they could be seen as a maximum diversity sample. at the very least, they offer impressions worth noting. the feedback was overwhelmingly positive, with most critical feedback reflecting the challenges of having to move to the online delivery method with short notice in 2020. throughout the four years, almost all aspects of the linked data workshop were categorised as either “good” or “excellent” (and only two as “satisfactory” and none as “poor”). qualitatively, the comments from participants include phrases such as “inspiring” (2018), “brilliant”, “excellent”, “great job” (2019), “successful event” (2020), and “extremely well moderated and extraordinarily well organised”, “…just great. it is very helpful…”, “very interesting talks and very good overall experience”, and “excellent organisation. wonderful you get to have the presentations” (2021). 
in 2018, the benefit of having several tutors was particularly highlighted: "the workshop was inspiring and challenging (in the best way). terhi, john, and graham were so generous with their time and knowledge. i enjoyed having their different points of view. i have already recommended the summer school to several colleagues". two aspects of the workshop received criticisms in subsequent years. the first highlights the importance of expectation management. unfortunately, the survey data does not cover those iterations of the workshop which took place in the facilities at the university of oxford: in those years, the software was pre-installed on desktop machines in a small computer lab. in 2018 and 2019, participants were asked to arrive at the summer school with the software pre-installed on their personal devices: not all participants complied with this requirement in either year. this represented a major challenge for the organisers: at least one tutor, and often more than one, had to shift their focus from teaching content or explaining tasks to the whole group to focus on problems arising from an individual machine. in some cases participants with institutional laptops had limited access rights, preventing software installation and/or the downloading of prerequisite libraries. other participants attended with a tablet rather than a laptop; others refused to power up their machines with an alternative operating system from the usb-stick they had been given. participant expectation can also present some challenges: there is an underlying assumption that the tutors of the workshop are also experts at installing the necessary software on any and all machines, regardless of operating system, prerequisites, or administrative restrictions. feedback from 2018 illustrates this point: "the software tools did not work and i felt this could have been sorted out in advance".46 the main difference between 2018–2019 and the covid-years was the move to online delivery. in an era of zoom-fatigue and relentless online delivery, we may be quick to categorise the latter as less preferable. the participant feedback paints a more complex picture, however. in 2018–2019, there was negative feedback about the physical room and conditions in which the workshop took place, an aspect of in-person teaching which we may be prone to forget: "room was physically uncomfortable and layout did not suit the style of workshop, hence the low score for learning environment" (2018); "the teaching was excellent but the size of the room for the number of attendants and facilitators was not appropriate. it was very difficult to move around the room, see presentations from various angles of the room, for facilitators to communicate to attendees in small groups/individually and for us to break out to do group work.
the unexpected heat wave made this even more unbearable" (2019); and "our room was way too small for the amount of people, at times it was very loud" (2019). the sudden move to online delivery for ld4dh in 2020 elicited negative feedback on some aspects of the workshop, in particular the hands-on element. it was inevitably more difficult to provide a seamless experience without the necessary time to develop the appropriate mode of delivery: "the linked data workshop felt like another lecture and was not really hands-on" and "the interactive workshop i attended seemed completely unprepared for teaching in an online environment" (2020). having learnt our lesson, and perhaps reflecting evolving attitudes as to the benefits of online learning, the feedback in 2021 was very positive: "the theoretical part of the morning sessions connected perfectly to better understand the practical part of the afternoon workshops. congratulations!" (2021). a very welcome result of the move to the online medium was that it opened the workshop up to an international audience (as illustrated by the inclusion of participants, for the first time, from china and india), and made it more accessible for at least one neurodivergent attendee: "…format worked well. as someone who is autistic, aspects of this worked better than in person. it would be great if there was a way to make your next in person event more accessible to neurodivergent participants by including some hybrid elements from the online event. you might see if a few neurodivergent people in dh could make specific suggestions to help" (2020). future iterations of ld4dh will seek to find ways to replicate some of these successes and affordances, and to continue to cater for the needs of diverse cohorts.

7 conclusion

thinking back to my experience as a participant of the lod workshop in 2012 has provided an opportunity to stand back and evaluate how it has evolved during my time as a lecturer and how it meets the needs of those who participate today. none of the lecturers wore sparkly trousers, for one. software develops and changes, new projects emerge, and some of the conceptual and philosophical debates remain the same. what else is different? how has our pedagogy evolved? has the market shifted, or has the typical participant changed? i believe that the feedback from the participants, with all its caveats, shows that the approaches we have applied to ld4dh have been successful in meeting and even exceeding the expectations of our diverse cohort of students. but the workshop is only one part of the much greater experience of dhoxss itself. the unique and undoubtedly strongest asset of the summer school is that it brings together some of the very best of the dh community. it creates an international, open, and dynamic learning environment for participants, providing ample opportunities for up-skilling, knowledge transfer, and networking. all these aspects have contributed to its success. and so, as is so often the case with examples of successes in dh, at the core here too is the most important thing that makes the summer school what it is: the people.

46 please note that in 2018 the participants were asked to arrive at the workshop with the prerequisite software already installed.

to cite this article: nurmikko-fuller, t. (2022). teaching linked open data using bibliographic metadata. journal of open humanities data, 8: 6, pp. 1–11.
doi: https://doi.org/10.5334/johd.60

published: 10 march 2022

copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press.

competing interests

the author has no competing interests to declare.

author contributions

conceptualization, formal analysis, investigation, methodology, supervision, writing – original draft, writing – review & editing

author affiliation

terhi nurmikko-fuller orcid.org/0000-0002-0688-3006 centre for digital humanities research, australian national university, canberra, australia

references

brier, s. (2012). where's the pedagogy? in m. k. gold (ed.), the role of teaching and learning in the digital humanities, 1, 390–401. minneapolis, mn: university of minnesota press. doi: https://doi.org/10.5749/minnesota/9780816677948.003.0038
ducharme, b. (2013). learning sparql: querying and updating with sparql 1.1. sebastopol, ca: o'reilly media.
imlawi, j., & gregg, d. (2014). engagement in online social networks: the impact of self-disclosure and humor. international journal of human-computer interaction, 30(2), 106–125. doi: https://doi.org/10.1080/10447318.2013.839901
jett, j., nurmikko-fuller, t., cole, t. w., page, k., & downie, j. s. (2016). enhancing scholarly use of digital libraries: a comparative survey and review of bibliographic metadata ontologies. in proceedings of the 16th acm/ieee joint conference on digital libraries (pp. 35–44). newark, nj: acm. doi: https://doi.org/10.1145/2910896.2910903
khan, n. j., nurmikko-fuller, t., & page, k. (2016). baby elephãt: building an analytical bibliography for a prosopography in early english imprint data. in iconference 2016 proceedings. urbana, il: ischools. doi: https://doi.org/10.9776/16588
needham, j., & haas, j. c. (2019). collaboration adventures with primary sources: exploring creative and digital outputs. the journal of interactive technology and pedagogy, 14(9). retrieved from https://jitp.commons.gc.cuny.edu/collaboration-adventures-with-primary-sources-exploring-creative-and-digital-outputs/
nurmikko-fuller, t. (in press). linked data for digital humanities. oxford, uk: routledge.
nurmikko-fuller, t., bangert, d., & abdul-rahman, a. (2017). all the things you are: accessing an enriched musicological prosopography through jazzcats. in proceedings of the international conference of digital humanities (pp. 554–556). montreal, canada: alliance of digital humanities organizations. retrieved from https://dh2017.adho.org/abstracts/305/305.pdf
nurmikko-fuller, t., bangert, d., dix, a., weigl, d., & page, k. (2018). building prototypes aggregating musicological datasets on the semantic web. bibliothek forschung und praxis, 42(2), 206–221. doi: https://doi.org/10.1515/bfp-2018-0025
page, k., nurmikko-fuller, t., cole, t. w., & downie, j. s. (2017). building worksets for scholarship by linking complementary corpora. in proceedings of the international conference of digital humanities (pp. 319–321). montreal, canada: alliance of digital humanities organizations. retrieved from https://dh2017.adho.org/abstracts/606/606.pdf
van hooland, s., & verborgh, r. (2014). linked data for libraries, archives and museums: how to clean, link and publish your metadata. london, uk: facet.
wood, d., zaidman, m., ruth, l., & hausenblas, m. (2013). linked data: structured data on the web. new york, ny: manning publications.

social reform international congresses and organizations (1846–1914): from sources to data

research paper

corresponding author: christophe verbruggen, department of history – ghentcdh, ghent university, ghent, be. christophe.verbruggen@ugent.be

keywords: social reform; transnational history; network analysis; social internationalism; collective action

to cite this article: verbruggen, c., deroo, f., blomme, h., d'haeninck, t., thiry, a., van diem, l., vandersmissen, j., mestdagh, e., billiet, b., wolff, j., chambers, s., de potter, p. d., carlier, j., van praet, c., leonards, c., & randeraad, n. (2022). social reform international congresses and organizations (1846–1914): from sources to data. journal of open humanities data, 8: 13, pp. 1–13. doi: https://doi.org/10.5334/johd.69

christophe verbruggen, florian deroo, hans blomme, thomas d'haeninck, amandine thiry, lisa van diem, jan vandersmissen, eline mestdagh, bo billiet, judith wolff, sally chambers, pieterjan de potter, julie carlier, carmen van praet, chris leonards, nico randeraad

*author affiliations can be found in the back matter of this article

abstract

tic-collaborative was a collaborative digital humanities project that focused on transnational intellectual cooperation (tic) in the long nineteenth century, in particular on transnational connections in the field of social reform. the dataset contains information on over 1650 international congresses and 450 organizations and conference series related to the social question. the project focussed on the low countries and a selection of reform areas.
(1) introduction: the foundations of social internationalism

at the international peace congress of 1849, the french writer victor hugo affirmed a hope that animated both the organizers and attendees of this international gathering: "yes, the period of revolutions is drawing to a close; the era of improvements is beginning." (peace congress committee, 1849, p. 13). from the middle of the 19th century onwards, a growing number of intellectuals sought to advance ideas, initiatives and organizational forms that would optimize and elevate their societies. their optimism about the benefits of industrialization and urbanization was beset by fears of instability and decline. the harmful side effects of modern transformations were becoming difficult to ignore. poverty plagued the neighborhoods of laborers living and working in dire circumstances. how the young were to be educated, or what position women were to have in modern society, were disquieting questions yet to be settled. in 1848, a wave of revolutions had swept europe, while the competition between nation states made war a looming threat. various inheritors of the enlightenment's belief in progress turned to a different solution: social reform. for advocates of reform, the improvement of society was to be gradual, guided by science, and advanced through collaboration. congresses were in essence laboratories for the development and diffusion of reformist ideas, and hence represent an outstanding platform for further research into the transnational exchange of reform agendas (rasmussen, 1990; randeraad, 2015). the tic-collaborative project1 focused on this diffuse transnational field of social reform, whose actors and activities tended to merge the political categories that mark standard historiographies of the period (topalov, 1999). by collecting and enriching a dataset of international organizations and congresses associated with social reform, tic sought to map cooperation across national lines and across thematic categories. for social reform, this mapping was intended as a comprehensive effort, both geographically and in terms of content. social reform itself was meant to counter a wide set of tensions and crises that became known as the 'social question', a catch-all term used by scholars as well as contemporaries (randeraad & leonards, 2010). from hygiene and criminality to education and world peace, the social question encompassed a range of issues that needed to be diagnosed, studied and addressed if 'mankind' was to reach its pinnacle of civilization and harmony. whereas most european governments still held onto the non-interventionism of classical liberalism, social reformers recognized that economic and technical innovations by themselves did not inevitably benefit collective well-being. instead, they argued that different aspects of social life could, and should, be actively steered. in this top-down vision of social change, reformers saw a role for themselves as engaged citizens working towards the piecemeal and scalable advancement of local, national and supranational communities. despite the various professional backgrounds of its proponents, social reform was well embedded in bourgeois circles (van praet, 2019; d'haeninck, 2018). the progressive worldview of reformers tended to range from unorthodox liberalism to moderate socialism and drew from ideologies such as pacifism, internationalism and feminism. the cause of social reform thus formed a web of linked and mutually reinforcing commitments.

1 https://www.ghentcdh.ugent.be/projects/tic-collaborative (last accessed 07/03/2022).
especially in its early phase, reformers often had all-embracing and eclectic interests. what they shared, however, was not just a general problem (the social question) but also the ambition to solve it in a gradual, orderly, collaborative and scientifically-sound manner. the accumulation and exchange of knowledge on social issues was thus central to their self-imposed task. before these issues could be brought under public scrutiny and eliminated, the various ills of society had to be identified and studied. urban centers, for example, were thus scrutinized for traces of decay or 'backwards' elements not yet touched by the blessings of modern civilization. to disseminate their diagnoses and remedies, reformers organized congresses that were regularly open to participants from across the globe. originating in progressive liberalism, but soon also embraced by socialists, these congresses were used as part of a repertoire of tools to contest existing power relations and the social and cultural status quo. they were political in the sense that they were often inspired by certain worldviews and aimed at concrete reforms in the form of social legislation. however, before the 20th century, political parties were rarely the main organizational unit (with the exception of socialists in some cases). the international catholic world had its own networks for the exchange of information and community building (liedtke & weber, 2009; viaene, 2015), just like various other christian denominations. freemasons, on the other hand, were ubiquitous in congresses outside their own, but generally chose to remain discreet about their affiliation (jansen, 2015; randeraad & leonards, 2020). the model for the efforts to supplement existing communication flows (such as correspondence or magazines) with structural meetings was modern science, with its systematic method, claims of universality, and ideals of collaborative knowledge building and identity formation. in tandem with the emerging disciplines of the social sciences, reformers were convinced that social and economic life was also subject to certain 'laws' that could be discerned through the observation and comparison of social 'facts'. likewise, the solutions proposed by social reformers were tested in an experimental fashion. plans for educational methods, housing projects or penitentiary reforms were ideally tried out, compared, adjusted and then propagated to be tested in other environments. however, many of the propositions that were diligently worked out, and extensively debated by reformers, were never put into practice. before the late 19th century, their impact on governmental policies and legal frameworks was limited or indirect at best. to increase their leverage, reform-minded intellectuals and activists organized themselves in leagues and associations. these organizations, which could be internationally oriented, sought to share their expertise in order to pressure (local) governments and raise public awareness about the topics that concerned them. if the optimization of social life was the horizon of reform, then collaboration was considered to be a crucial vehicle to achieve this goal. often, it was even strategically necessary to link up with like-minded others across borders.
keck and sikkink (1998) have described this as the 'boomerang pattern.' when it came to social issues, the problem was very much local or national, but the solution lay partly in finding allies beyond one's own state borders. julia moses, for example, has shown that the international congresses on accidents at work held between 1889 and 1914 never pushed for a binding international convention, but "provided a forum that governments could choose to ignore, manipulate or search for new ideas" (moses, 2015, p. 62). randeraad and leonards (2020) have also argued that many delegates used their participation in international congresses not only to promote causes of reform across borders, but also to strengthen their authority in their own country. the first international congresses on topics such as peace and penitentiary reform were held in the 1840s. the earliest congress in our selection is the first penitentiary congress, which took place in frankfurt am main in 1846. yet their number and importance grew rapidly (figure 1). no fewer than 1,655 congresses dealing with social change were organized before 1914. depending on the selection criteria of what qualifies as 'social reform', this number could be even higher. peaks in the organization of congresses often coincided with world expositions, events which similarly trod a fine line between internationalist ideals and national competition. the overwhelming majority of these congresses were held in europe, with paris, london and brussels as the major hubs of the congress scene. the prominence of brussels is hardly a surprise: the new belgian state actively developed and supported the phenomenon to gain international influence (aubin, 2020; verbruggen, laqua & deneckere, 2012). at these events, experts, activists, intellectuals and officials gathered to exchange knowledge and propose guidelines and action plans around a certain topic. these topics were as varied as the social problem that the congresses were meant to pinpoint and counteract. during the 1910 world exposition in brussels, for example, congresses were held on food hygiene, vegetarianism, mathematics education, tuberculosis, cremation, family education, feminism, occupational diseases and many other topics and areas of social reform. after a congress, the different contributions and the ensuing discussions were often published in the form of a congress report. as part of the tic project, we assembled and made accessible a digital collection of over 1000 congress reports, with extensive metadata, that gives insight into the proceedings of these events. local situations and numbers were compared, and the effectiveness and (international) suitability of different measures were discussed. this could lead to serious debates among the predominantly white male reformers who frequented the congresses (women only started attending in the late 19th century). nonetheless, the ideal of an international consensus reached through open debate and rational dialogue was the driving force behind the expanding circuit of congresses. collaboration, across borders and in the name of science and universal welfare, was itself also a rationale that these events were meant to embody. to institutionalize these collaborative bonds, many congress series transformed into international organizations from the 1880s onwards.
international congresses also became increasingly specialized and professionalized by the belle epoque. the communities of experts formed through the congresses now began to professionalize their activities, a process which concurred with and stimulated the maturation of new knowledge fields into separate disciplines (schroeder-gudehus, 1990).

figure 1: social reform international congresses and organizations, 1846–1914.

(2) sources and data enrichment

a thorough assessment of these early sites of international sociability can not only help to reexamine the welfare arrangements before the welfare state from a transnational perspective, but can also shed light on the ascendancy of scientific internationalism and international organizations. focussing on initiatives related to social reform, the tic project therefore aimed to assemble a database containing systematic information on the phenomena of international congresses and organizations from the 1840s until the beginning of the first world war. we focussed on the attendance of reformers originating from the low countries as well as on in-depth thematic sub-projects (see below); however, for the overall mapping of conferences related to social reform and the compilation of bibliographic metadata, the aim from the outset was to be as exhaustive as possible. the foundational sources of our dataset were the publications of the union of international associations (uia) and its 'complete' list of conferences from 1681 onwards (uia, 1957; uia, 1960; uia, 1964). in 1907, henri la fontaine, paul otlet and cyrille van overbergh set up the office central des associations internationales. besides the development of a centre for collections about internationalism in brussels (including what la fontaine called a 'documentary repertory of internationalism'), one of the main goals was the foundation of the union of international associations (uia), an 'association of associations' that would coordinate the relations and interests between international organizations throughout the world. in a short period of time the uia became a direct forerunner of the league of nations' committee on intellectual cooperation. it issued a periodical, la vie internationale, and published, in collaboration with the pacifist alfred fried, two yearbooks (1905; 1906; 1907; 1910–1911): the annuaire de la vie internationale, which gave an overview of the various international associations. the yearbook and the uia still exist today, and many studies on international non-governmental organizations rely on the uia data and the typologies drawn up by them (see laqua, van acker & verbruggen, 2019 for an overview).2 data collections with a global or even universal scope will always remain incomplete or questionable (bloodgood & schmitz, 2013). in the case of the uia, this is partly caused by the fact that from its inception in 1907, most of the data was provided by the organizations themselves, as acknowledged by the uia (uia, 2015). moreover, the categorization of 'genuinely' international organizations itself "has been, and remains, a political project that embodies and shapes worldviews" (saunier, 2019). a well-known example is the lack of religious organizations and fraternities engaged in transnational advocacy networks, health, education and many other fields related to social reform (bush, 2007).
this being said, the data offers a representative data source for the study of various long-term developments (e.g. boli & thomas, 1999; grandjean & van leeuwen, 2019), especially with regard to europe and north america, which can be further enriched with original archival research and/or used as a starting point for further research. additionally, because greater emphasis was placed on the listing of organizations rather than congresses, we had no choice but to set up a new database, powered by nodegoat (see below).3 aware of both the potential and the caveats of uia-related published sources and datasets, we consulted the original sources produced by the organizations (congress proceedings, reports of assemblies, charters of foundation, etc.) and other contemporary published reference works (army medical library, 1938; baldwin, 1907a, 1907b; bishop, 1958; coorman, 1968; doré, 1923; eijkman, 1910, 1911; faries, 1915; gregory, 1939; judge, 1978; lyons, 1963; matthews, 1940; wallace & singer, 1970). this enabled us to correct mistakes and to go beyond the lacunae and the contradictions between the reference books (e.g. event dates, alternative names, etc.). an important added value of our list of congresses, compared to previously published datasets, is the systematic addition and verification of bibliographic data related to published congress reports. all of the publications are provided with detailed metadata, provenance and their worldcat permalink. a second major added value is that the data has been enriched with thematic categories, allowing improved (faceted) search and other queries. over 600 categories and themes have been attributed to the organizations and congresses (see figure 2).

figure 2: most prominent tagged areas of social reform, main causes or main ideological background of social reform international organizations and conference series, 1846–1914.

for the categorisation and assignment of thematic keywords reflecting the areas of social reform (e.g. labour legislation), main causes (e.g. anti-alcoholism, pacifism or feminism) or main ideological background (e.g. freemasonry), we used previous work done by contemporaries such as winifred gregory and pieter hendrik eijkman. their approach is in line with the use of as many self-definitions as possible, as reflected, for instance, in official conference titles. in gregory's international congresses and conferences, a thematic index was published (gregory, 1938) as a guide to the events. gregory acknowledges that, because some of the indexing was done from titles only, several errors, omissions and inconsistencies are present. we have made several corrections and have added more detailed themes to the broad categories. bearing in mind that two sources are never alike and that some of our choices are subject to discussion, our rule of thumb was to attribute a theme to a congress when a session of the congress was organized on, or a debate was held about, the particular theme. a final major enrichment that contributes to the understanding of social internationalism is the addition of information regarding the organizations related to the congresses, resulting in a subset that can be queried and analysed independently.

2 digitally available on http://www.uia.org (last accessed 07/03/2022). 3 https://nodegoat.net (last accessed 07/03/2022).
the difference between an 'international conference series' and an 'international organization' is not always clear-cut. congresses played an important role in the establishment of governmental and non-governmental international organizations. after one or more meetings around a certain topic or point of action, it was not uncommon to set up an organization with a fixed structure, a permanent office and communication outlets. series of international congresses can thus be considered to be precursors of both the governmental and non-governmental international organizations that would come to dominate internationalism in the 20th century. the first international sanitary conference (1851) can, for instance, be seen as a direct forerunner of the world health organization (who). in assembling our dataset, the qualification 'international organization' was used if there was a standing committee for the organization of the conference, a permanent secretary or a bureau for the exchange of information (for instance, the 'commission internationale des congrès d'éducation familiale'). again, international religious orders, fraternities and secular institutes have not been included in a systematic way, as they are outside the research scope. it should also be noted that the distinction between governmental and non-governmental organizations is difficult to make for events and organizations in the 19th century. unless explicitly stated in the names or mission statement of congresses, no systematic distinction has been made between the two. some local and national governments (especially those of smaller countries and france) were actively involved in many initiatives. after the first world war, when internationalism further institutionalised and international relations theory developed, the distinction between organs of inter-governmental organizations (igos) and international non-governmental organizations (ingos), as well as their self-definitions, became clearer (davies, 2019; about secretariats: reinalda, 2019). in lonsea, the league of nations search engine (directed by madeleine herren), these distinctions are made more systematically.4 lonsea is complementary to our project because it is mainly based on the handbook of international organisations, which was edited by the league of nations from 1921 to 1938.

(3) in search of long-term patterns and dynamics: following the actors

the biographical data, memberships and congress attendance in our dataset are less comprehensive or complete, as not every reform area has been studied to the same level of detail. the dataset brings together the results of several sub-studies and phd research projects, in each of which reasoned choices were made. three different thematic fields of reform were researched in the sub-projects of the phd students: (1) the first sub-project (amandine thiry) challenged the predominant narratives of belgian prison reform. (2) the second sub-project (thomas d'haeninck) applied social network analysis (sna) techniques to a large sample of attendees (> 7,200 people) who attended international congresses concerned with the moral improvement of society. (3) a third project on colonial hygiene also resulted in a significant amount of information that has been added to the collective dataset (a discussion of the topic in: vandersmissen & van diem, 2021).

4 http://www.lonsea.de (last accessed 07/03/2022).
in addition to this research, we integrated previous research on belgian feminists and their 'coalition partners' among, for instance, pacifists and the temperance movement (carlier, 2010). most importantly, we integrated the datasets of nico randeraad and chris leonards on a wide range of reform themes, such as penitentiary reform, public and private charity, and social and moral statistics (leonards, 2015; leonards & randeraad, 2015). by bringing together all this data from various sub-studies, new possibilities emerge. clusters of co-memberships at congresses give us a lot of insight into the different 'causes' that actors were likely to share (for this approach and a 'genealogy of causes', see: verbruggen, blomme & d'haeninck, 2020). we looked, for example, at educational internationalism at the intersection of the conferences related to moral reform, education, women's rights and prison reform (see figure 3, discussed in thiry et al., 2018).

figure 3: co-membership network of 283 congresses related to education (1846–1914). a. congrès international pour l'étude des questions relatives au patronage des détenus et la protection des enfants moralement abandonnées (1890, antwerp). b. congrès international de pédologie (1911, brussels). c. congrès international de l'enseignement (1880, brussels). d. congrès international de l'éducation populaire (1910, brussels). e. congrès international pour l'étude des questions relatives au patronage des condamnés, des enfants moralement abandonnés, des vagabonds et des aliénés (1905, liège). f. congrès international du droit des femmes (1889, paris). cluster 1: white nodes (including b, c and d) groups congresses mainly related to education and freemasonry. cluster 2: black nodes (including a and e) groups congresses held on penitentiary reform, charity and social welfare, patronage, and child protection. cluster 3: grey nodes (including f) groups congresses related to women's rights, women's protection and feminism.

tic-collaborative used nodegoat as a data management system to collaborate on and enrich datasets. nodegoat (developed by lab1100) is a "web-based database management, analysis and visualization platform." we experimented with harvesting meaningful information from (semi-)structured documents, but most of the data was collected via traditional full-text searching and labour-intensive manual (meta)data creation. in the early data collection phase of the project, nodegoat had no api or linked data ingestion module, so, in collaboration with lab1100, preparations were made to improve the capabilities to query sparql endpoints from nodegoat. this eventually resulted in a linked data module.5 still, most of the disambiguation of persons and the linking of person objects to external identifiers had to be done manually. an interesting proof of concept of the added value of linking data was offered in a clariah-nl pilot project (2tbi).6 the main objective was to link a research database of persons who were internationally active in the 19th and early 20th centuries with the person entity repository of biographical resources currently available online in the netherlands. the result was a group of 1,115 dutch social reformers, whom we can now easily trace in various datasets and link to existing biographical resources (randeraad, 2018). in order to achieve this, we set up a resourcesync connection between the nodegoat software and anansi, one of the data hubs of the dutch clariah infrastructure.

5 https://nodegoat.net/blog.s/12/linked-data-vs-curation-island (last accessed 07/03/2022).
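the paper does not reproduce the queries behind this kind of endpoint lookup and person disambiguation, but a minimal sketch may make the step more concrete. the snippet below assumes the python SPARQLWrapper library and the public wikidata endpoint, neither of which is prescribed by the project, and uses henri la fontaine (mentioned above) purely as an illustrative name; within nodegoat itself, the linked data module referred to above serves a comparable purpose through its own configuration.

```python
# illustrative sketch only: SPARQLWrapper and the wikidata endpoint are assumptions
# made for this example, not the project's documented tooling.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?person ?personLabel ?viaf WHERE {
  ?person wdt:P31 wd:Q5 ;                      # instance of: human
          rdfs:label "Henri La Fontaine"@en .  # candidate name to disambiguate
  OPTIONAL { ?person wdt:P214 ?viaf . }        # VIAF identifier, if present
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

def candidates():
    # wikidata asks clients to identify themselves with a descriptive user agent
    endpoint = SPARQLWrapper("https://query.wikidata.org/sparql", agent="tic-example/0.1")
    endpoint.setQuery(QUERY)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()
    for row in results["results"]["bindings"]:
        yield (row["person"]["value"],
               row["personLabel"]["value"],
               row.get("viaf", {}).get("value"))

if __name__ == "__main__":
    for uri, label, viaf in candidates():
        print(uri, label, viaf)
```

candidate matches returned by a query of this shape would still be reviewed by hand before an external identifier (here a viaf id) is attached to a person record, mirroring the largely manual disambiguation described above.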
(4) from sources to data and data archiving

in addition to the sources that were already available via digital libraries and archives, the researchers have identified complementary sources in belgian archives and libraries, notably proceedings preserved in kbr, royal library of belgium, uclouvain library, ghent university library and the mundaneum (mons). many of these sources were digitized by the project and its partners to produce high-quality ocr scans of the documents. mass digitization as such was not the project's central aim; rather, we wanted to create a virtual infrastructure which offers access to digitized versions of previously dispersed material via a central observatory. with this objective in mind, a detailed analysis was undertaken of the then available 'text platforms'. a range of existing platforms and technologies (including islandora, veridian, textgrid, the text-image linking environment (tile), transcribe bentham, shared canvas, annotated books online and trove) were evaluated against a range of criteria, including organizational aspects, data aspects (e.g. import and export formats and metadata capabilities), user management capabilities and annotation features (including ocr correction). as a result of this evaluation, islandora, an open-source software framework for the collaborative management and discovery of digital assets, was selected for the tic platform. islandora is an open source digital repository system based on fedora commons, drupal and a host of additional applications. however, during the development and implementation process, a number of issues began to arise, which became more pronounced when using the platform: shortcomings and bugs (e.g. in the import process), the performance and scalability of the ingestion process (e.g. the ocr process and the uploading of text often took many hours, even days) and the management and customisation of the platform. from-scratch development of a new platform was considered unnecessary, as many existing solutions already partially fulfilled the requirements of the corpus management platform. we therefore looked for promising existing solutions and reached out to their developers with the tic requirements as a guideline. in collaboration with the ghent centre for digital humanities and in the framework of dariah-be and clariah-vl, a survey was conducted at ghent university to gain insight into the current data collection and digitisation processes of researchers involved in text analysis. after additional discussions with the involved stakeholders (the ghent centre for digital humanities and the ghent university library), it was decided to continue with the implementation of the international image interoperability framework (iiif), an open framework and standardised method to publish image-based resources so that they can be viewed, cited and annotated by any iiif-compatible image-viewing application (overall architecture in figure 4).7 notwithstanding some limitations (for example its focus, by definition, on images), it is the most advanced and widely supported framework that allows researchers to build corpora across institutional borders and collections.
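as an illustration of what iiif compatibility means in practice, the sketch below retrieves a presentation manifest and lists its canvases. the manifest url is a placeholder rather than an actual shared canvas or madoc address, and the handling of both the 2.x and 3.0 layouts of the iiif presentation api is a general assumption, not a statement about the project's own manifests.

```python
# illustrative sketch only: lists the canvases of a iiif presentation manifest.
# the manifest url below is hypothetical.
import requests

MANIFEST_URL = "https://example.org/iiif/some-congress-report/manifest.json"  # hypothetical

def list_canvases(manifest_url):
    manifest = requests.get(manifest_url, timeout=30).json()
    # presentation api 2.x nests canvases under sequences; 3.0 uses a flat "items" list.
    if "sequences" in manifest:                      # iiif presentation 2.x
        canvases = manifest["sequences"][0].get("canvases", [])
    else:                                            # iiif presentation 3.0
        canvases = manifest.get("items", [])
    for canvas in canvases:
        print(canvas.get("label"), canvas.get("@id") or canvas.get("id"))

if __name__ == "__main__":
    list_canvases(MANIFEST_URL)
```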
this valuable corpus is made accessible via iiif and is preserved by the ghent university library. as a collection it is accessible on request, downloadable and full-text searchable via shared canvas.8 shared canvas is a iiif-enabled service provided by ghent university library for heritage institutions, museums and libraries that want to make their images widely available in a controlled and standardized manner. it is powered by the institutional repository system librecat. in addition to this, the tic-collection has been used to test the new iiif corpus management platform (cmp) that has been co-developed by clariah-vl.9 the platform, now named madoc, incorporates a range of open source services, technologies, and recommendations including iiif, the w3c web annotation data model and related protocols, and linked open data.10 notwithstanding the fact that part of the biographical data had already been successfully published in anansi and linked to other datasets, we opted for a different lod platform with more possibilities.11 druid is a scalable triplestore by triply, based on hdt (header dictionary triples) technology (meroño-peñuela et al., 2020). the platform was also developed in the context of clariah-nl and is currently maintained by the international institute for social history (iish) and triply, a clariah spin-off company.12 in terms of content, the iish is an obvious partner for making the tic-dataset available for further use or integration with other datasets. to allow additional verification and processing of the data, the data was exported from nodegoat as csv files and converted to a turtle file to be uploaded into druid using a custom python script based on rdflib.13 congresses, congress series, organisations, publications and places were mapped to the classes event, eventseries, organization, creativework and city of the schema.org vocabulary. tags were mapped to skos concepts14 and linked to entities using the dublin core metadata initiative (dcmi) metadata terms subject property.15

6 https://www.clariah.nl/projects/2tbi-towards-an-international-biographical-infrastructure (last accessed 07/03/2022). 7 https://iiif.io (last accessed 07/03/2022). 8 https://sharedcanvas.be (last accessed 07/03/2022). 9 https://www.ghentcdh.ugent.be/projects/madoc-iiif-annotation-and-crowdsource-platform (last accessed 07/03/2022). 10 https://madoc.netlify.app (last accessed 07/03/2022). 11 https://druid.datalegend.net/clariah-vl/tic-collaborative (last accessed 07/03/2022). 12 https://triply.cc (last accessed 07/03/2022). 13 https://github.com/rdflib/rdflib (last accessed 07/03/2022). 14 http://www.w3.org/2004/02/skos/core#concept (last accessed 07/03/2022). 15 http://purl.org/dc/terms/subject (last accessed 07/03/2022).
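the custom export script itself is not reproduced in the paper; the following minimal sketch, written against rdflib and assuming hypothetical column names ("id", "title", "year", "tags") in the nodegoat csv export as well as a placeholder uri scheme, illustrates the kind of mapping just described (a schema.org class for congresses, skos concepts for tags, and dcterms:subject for the link between them).

```python
# minimal sketch of the csv-to-turtle conversion described above.
# the column names and the uri scheme are hypothetical; the actual nodegoat
# export format and the project's identifiers are not specified in the paper.
import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF, SKOS

SDO = Namespace("https://schema.org/")          # schema.org vocabulary
BASE = Namespace("https://example.org/tic/")    # placeholder uri scheme

g = Graph()
g.bind("sdo", SDO)
g.bind("skos", SKOS)
g.bind("dcterms", DCTERMS)

with open("congresses.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        congress = BASE[f"congress/{row['id']}"]
        g.add((congress, RDF.type, SDO.Event))
        g.add((congress, SDO.name, Literal(row["title"])))
        g.add((congress, SDO.startDate, Literal(row["year"])))
        for tag in row["tags"].split(";"):          # e.g. "pacifism;education"
            concept = BASE[f"theme/{tag.strip()}"]
            g.add((concept, RDF.type, SKOS.Concept))
            g.add((concept, SKOS.prefLabel, Literal(tag.strip())))
            g.add((congress, DCTERMS.subject, concept))

g.serialize(destination="tic.ttl", format="turtle")
```

the resulting turtle file can then be loaded into a triplestore such as druid for querying and publication.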
last but not least, and as an illustration of the potential of the data, we staged an online exhibition on social reformers powered by omeka-s. the interactive exhibition on couples for progress has been built around the maps, graphs and other visualisations made by the researchers connected to the project, in collaboration with our belgian partners amsab institute for social history and liberas. the exhibition makes use of the entire iiif and linked open data ecosystem, finally tying up all ends. via iframes, the exhibition also refers to visualizations in tableau and to a public interface, including sample graphs, in the project's nodegoat installation, which will remain open to new research in the coming years.16 however, even without this software, the data will remain findable, accessible, interoperable and reusable (fair) through the iish dataverse repository.17

16 https://nodegoat.ugent.be/viewer.p/1/300/types/all/list/ (last accessed 07/03/2022).

figure 4: tic data architecture and archiving scheme. components shown: nodegoat (collaborative database, lod ingestion and visualisation platform); shared canvas (iiif repository, ghent university) and other iiif-compliant image repositories, alongside islandora (text corpus management platform, discontinued and replaced by madoc) for corpus building, annotation and data collection; druid (linked open data publishing); omeka s (exhibition platform) and tableau for data storytelling and virtual exhibits; and dataverse for repositories and data archiving.

acknowledgements

we would like to thank pim van bree and geert kessels (lab1100) for their willingness to develop nodegoat in accordance with the needs of the project. in addition, we thank frederick lamsens (ghentcdh) for the development of the omeka-s site. finally, we thank dries moreels, hendrik defoort and nicolas franck (university library ghent) for their support.

funding information

the tic collaborative project was funded by belspo (research project br/121/a3/tic-belgium) and supported by dariah-vl: virtual research environment service infrastructure (hercules foundation/research foundation flanders (fwo): g0h3717n) and clariah-vl: open humanities service infrastructure (fwo international research infrastructures (iri) programme: i001419n & i000921n).

competing interests

the authors have no competing interests to declare.
author contributions

christophe verbruggen: conceptualization, funding acquisition, formal analysis, investigation, project administration, resources, supervision, data curation, writing – original draft, writing – review & editing
florian deroo: investigation, writing – original draft, writing – review & editing
hans blomme: conceptualization, formal analysis, data curation
thomas d'haeninck: formal analysis, investigation, data curation
amandine thiry: data curation and investigation
lisa van diem: data curation and investigation
jan vandersmissen: data curation and investigation
eline mestdagh: data curation
bo billiet: data curation
judith wolff: data curation
sally chambers: writing – original draft, writing – review & editing
pieterjan de potter: software
julie carlier: investigation
carmen van praet: investigation
chris leonards: conceptualization, formal analysis, investigation, resources, writing – review & editing
nico randeraad: conceptualization, funding acquisition, formal analysis, investigation, project administration, resources, supervision, writing – original draft, writing – review & editing

author affiliations

christophe verbruggen orcid.org/0000-0003-0849-6365 department of history – ghentcdh, ghent university, ghent, be
florian deroo orcid.org/0000-0001-8463-8367 department of history – ghentcdh, ghent university, ghent, be
hans blomme orcid.org/0000-0001-8735-7755 department of history – ghentcdh, ghent university, ghent, be
thomas d'haeninck department of history – ghentcdh, ghent university, ghent, be
amandine thiry orcid.org/0000-0001-7796-7269 department of history – ghentcdh, ghent university, ghent, be; centre d'histoire du droit et de la justice, uclouvain, be
lisa van diem department of history, faculty of arts & social sciences, maastricht university, maastricht, nl
jan vandersmissen orcid.org/0000-0002-7322-2208 department of history – ghentcdh, ghent university, ghent, be
eline mestdagh orcid.org/0000-0003-0003-1360 department of history – ghentcdh, ghent university, ghent, be
bo billiet department of history – ghentcdh, ghent university, ghent, be
judith wolff european university institute, fiesole, it
sally chambers orcid.org/0000-0002-2430-475x department of history – ghentcdh, ghent university, ghent, be
pieterjan de potter orcid.org/0000-0002-1289-6575 department of history – ghentcdh, ghent university, ghent, be
julie carlier orcid.org/0000-0002-8257-7668 department of conflict and development studies, ghent university, ghent, be
carmen van praet department of history – ghentcdh, ghent university, ghent, be
chris leonards department of history, faculty of arts & social sciences, maastricht university, maastricht, nl
nico randeraad orcid.org/0000-0001-8009-7954 department of history, faculty of arts & social sciences, maastricht university, maastricht, nl

17 see https://iisg.amsterdam/en/data/datasets and https://datasets.iisg.amsterdam (last accessed 07/03/2022).

references

army medical library (u. s.). (1938). congresses: tentative chronological and bibliographical reference list of national and international meetings of physicians, scientists, and experts. washington: government printing office. url: http://archive.org/details/58530460r.nlm.nih.gov
aubin, d. (2020). congress mania in brussels, 1846–1856: soft power, transnational experts, and diplomatic practices. historical studies in the natural sciences, 50(4), 340–363. doi: https://doi.org/10.1525/hsns.2020.50.4.340
baldwin, s. e. (1907a). the international congresses and conferences of the last century as forces working toward the solidarity of the world. the american journal of international law, 1(3), 565–578. doi: https://doi.org/10.2307/2186820
baldwin, s. e. (1907b). the international congresses and conferences of the last century as forces working toward the solidarity of the world: appendix. the american journal of international law, 1(3), 808–829. doi: https://doi.org/10.2307/2186845
bishop, w. j. (1958). bibliography of international congresses of medical sciences. bibliographie des congrès internationaux des sciences médicales. springfield: thomas.
boli, j., & thomas, g. m. (1999). constructing world culture: international nongovernmental organizations since 1875. stanford: stanford university press.
bloodgood, e. a., & schmitz, h. p. (2013). the ingo research agenda: a community approach to challenges in method and theory. in reinalda, b. (ed.), routledge handbook of international organization (pp. 93–105). london: routledge. doi: https://doi.org/10.4324/9780203405345
bush, e. l. (2007). measuring religion in global civil society. social forces, 85(4), 1645–1665. doi: https://doi.org/10.1353/sof.2007.0063
carlier, j. (2010). moving beyond boundaries: an entangled history of feminism in belgium, 1890–1914. [phd thesis, ghent university]. url: http://hdl.handle.net/1854/lu-4174736
coorman, d. (1968). proeve van retrospectieve bibliografie van verslagen en mededelingen van congressen in belgië, 1875–1957. brussel: belgische commissie voor bibliografie.
davies, t. (2019). the uia and the development of international relations theory. in d. laqua, w. van acker & c. verbruggen (eds.), international organizations and global civil society: histories of the union of international associations (pp. 155–170). london: bloomsbury academic. doi: https://doi.org/10.5040/9781350055643.0019
d'haeninck, t. (2018). sociale hervormers op zoek naar een moreel reveil: een sociaal-historische en transnationale studie naar de aanwezigheid van belgen op internationale hervormingscongressen, 1850–1914. [phd thesis, ghent university]. url: http://hdl.handle.net/1854/lu-8577125
doré, r. (1923). essai d'une bibliographie des congrès internationaux. paris: é. champion. url: http://archive.org/details/essaidunebibliog00doruoft
h. (1910). l’internationalisme médical. la haye: bureau préliminaire de la fondation pour l’internationalisme. eijkman, p. h. (1911). l’internationalisme scientifique (sciences pures et lettres). la haye: bureau préliminaire de la fondation pour l’internationalisme. url: http://archive.org/details/ linternationali00reingoog faries, j. c. (1915). the rise of internationalism. new york : w.d. gray. url: http://archive.org/details/ cu31924030443083 fried, a. h. (1905). annuaire de la vie internationale. bruxelles: office central des institutions internationales. url: http://gallica.bnf.fr/ark:/12148/bpt6k57277351 fried, a. h. (1906). annuaire de la vie internationale. bruxelles: office central des institutions internationales. url: http://gallica.bnf.fr/ark:/12148/bpt6k5688624m fried, a. h. (1907). annuaire de la vie internationale. bruxelles: office central des institutions internationales. http://gallica.bnf.fr/ark:/12148/bpt6k5652035n fried, a. h. (1911). handbuch der friedensbewegung. new york: garland pub. http://archive.org/details/ handbuchderfrie00friegoog grandjean, m., & van leeuwen, m. h. (2019). mapping internationalism: congresses and organizations in the nineteenth and twentieth centuries. in d. laqua, w. van acker & c. verbruggen (eds.), international organizations and global civil society: histories of the union of international associations (pp. 225–242). london: bloomsbury academic. doi: https://doi.org/10.5040/9781350055643.0022 gregory, w. (1938). international congresses and conferences, 1840-1937; a union list of their publications available in libraries of the united states and canada. new york: the h.w. wilson company. gregory, w. (1939). second catalogue of publications of international congresses and conferences. dordrecht: springer. doi: https://doi.org/10.1007/978-94-015-3592-2 jansen, jan c. (2015). in search of atlantic sociability: freemasons, empires, and atlantic history. bulletin of the german historical institute 57, 75–99. judge, a. (1978). types of international organization: detailed overview. url: https://www. laetusinpraesens.org/docs70s/typeap4.php laqua, d., van acker, w., & verbruggen, c. (2019). introduction : reconstructing the identities of an international non-governmental intelligence agency. in d. laqua, w. van acker, & c. verbruggen (eds.), international organizations and global civil society: histories of the union of international associations (pp. 1–14). london: bloomsbury academic. doi: https://doi. org/10.5040/9781350055643.0008 leonards, c. (2015). visitors to the international penitentiary congresses; cooperation and competition of state and society on a transnational knowledge platform dealing with penitentiary care. österreichische zeitschrift für geschichtswissenschaften, 26(3), 80–101. doi: https://doi. org/10.25365/oezg-2015-26-3-5 leonards, c., & randeraad, n. (2020). circulations charitables: les congrès internationaux de réforme sociale (1876–1913). in c. topalov (ed.), philanthropes en 1900: londres, new york, paris, genève (pp. 435–467, 606–616). saint-étienne: créaphis éditions. leonards, c. g. t. m., & randeraad, n. (2010). transnational experts in social reform, 1840–1880. international review of social history, 55(2), 215–239. doi: https://doi.org/10.1017/s0020859010000179 leonards, c. g. t. m., & randeraad, n. (2015). building a transnational network of social reform in the 19th century. in d. rodogno, b. struck, & j. vogel (eds.), shaping the transnational sphere. 
experts, networks and issues from the 1840s to the 1930s (pp. 111–131). new york: berghahn. liedtke, r. & weber, k. (ed.). (2009). religion und philanthropie in den europäischen zivilgesellschaften: entwicklungen im 19. und 20. jahrhundert. paderborn: schöningh. doi: https://doi. org/10.30965/9783657763849 lyons, f. s. l. (1963). internationalism in europe: 1815–1914. leyden: a. w. sythoff. matthews, m. a. (1940). the peace movement: select list of references on the work of national and international organizations for the advancement of peace; with special attention to the movement in the united states. washington, dc: carnegie endowment for international peace. meroño-peñuela, a., de boer, v., van erp, m., zijdeman, r., mourits, r., melder, w., rijpma, a., & schalk, r. (2020). clariah: enabling interoperability between humanities disciplines with ontologies. in applications and practices in ontology design, extraction, and reasoning (pp. 73–90). doi: https://doi. org/10.3233/ssw200036 moses, j. (2015). policy communities and exchanges across borders. the case of workplace accidents at the turn of the twentieth century. in d. rodogno, b. struck, & j. vogel (eds.), experts, networks and issues from the 1840s to the 1930s (pp. 60–81). new york: berghahn. randeraad, n. (2015). triggers of mobility: international congresses (1840–1914) and their visitors. jahrbuch für europäische geschichte, 16, 63–82. doi: https://doi.org/10.1515/9783110415162-005 randeraad, n. (2018). dutch social reformers in transnational space, 1840–1914: reflections on the clariah research pilot 2tbi. https://www.clariah.nl/projecten/research-pilots/2tbi https://doi.org/10.5334/johd.69 http://archive.org/details/essaidunebibliog00doruoft http://archive.org/details/essaidunebibliog00doruoft http://archive.org/details/linternationali00reingoog http://archive.org/details/linternationali00reingoog http://archive.org/details/cu31924030443083 http://archive.org/details/cu31924030443083 http://gallica.bnf.fr/ark:/12148/bpt6k57277351 http://gallica.bnf.fr/ark:/12148/bpt6k5688624m http://gallica.bnf.fr/ark:/12148/bpt6k5652035n http://archive.org/details/handbuchderfrie00friegoog http://archive.org/details/handbuchderfrie00friegoog https://doi.org/10.5040/9781350055643.0022 https://doi.org/10.1007/978-94-015-3592-2 https://www.laetusinpraesens.org/docs70s/typeap4.php https://www.laetusinpraesens.org/docs70s/typeap4.php https://doi.org/10.5040/9781350055643.0008 https://doi.org/10.5040/9781350055643.0008 https://doi.org/10.25365/oezg-2015-26-3-5 https://doi.org/10.25365/oezg-2015-26-3-5 https://doi.org/10.1017/s0020859010000179 https://doi.org/10.30965/9783657763849 https://doi.org/10.30965/9783657763849 https://doi.org/10.3233/ssw200036 https://doi.org/10.3233/ssw200036 https://doi.org/10.1515/9783110415162-005 https://www.clariah.nl/projecten/research-pilots/2tbi 13verbruggen et al. journal of open humanities data doi: 10.5334/johd.69 to cite this article: verbruggen, c., deroo, f., blomme, h., d’haeninck, t., thiry, a., van diem, l., vandersmissen, j., mestdagh, e., billiet, b., wolff, j., chambers, s., de potter, p. d., carlier, j., van praet, c., leonards, c., & randeraad, n. (2022). social reform international congresses and organizations (1846–1914): from sources to data. journal of open humanities data, 8: 13, pp. 1–13. doi: https://doi. org/10.5334/johd.69 published: 12 may 2022 copyright: © 2022 the author(s). 
this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/ licenses/by/4.0/. journal of open humanities data is a peer-reviewed open access journal published by ubiquity press. rasmussen, a. (1990). jalons pour une histoire des congrès internationaux au xixe siècle: régulation scientifique et propaganda intellectuelle. relations internationales, 62, 115–133. reinalda, b. (2019). looking for information on international secretariats: digging deeper into the yearbook of international organizations. in d. laqua, w. van acker & c. verbruggen (eds.), international organizations and global civil society: histories of the union of international associations (pp. 204–224). london: bloomsbury academic. doi: https://doi.org/10.5040/9781350055643.0021 saunier, p. (2019). everything one wants to know about international organizations? a critical biography of the yearbook of international organizations, 1909–2017. in d. laqua, w. van acker & c. verbruggen (eds.), international organizations and global civil society: histories of the union of international associations (pp. 171–203). london: bloomsbury academic. doi: https://doi. org/10.5040/9781350055643.0020 schroeder-gudehus, b. (1990). les congrès scientifiques internationaux. relations internationales, 62, 111–222. thiry, a., d’haeninck, t., & verbruggen, c. (2018). 8. (re‐)educational internationalism in the low countries, 1850–1914. in de spiegeleer, christoph & liberaal archief (eds.), the civilising offensive: social and educational reform in 19th century belgium (pp. 189–216). berlin, boston: de gruyter oldenbourg. doi: https://doi.org/10.1515/9783110581546-008 topalov, c. (1999). laboratoires du nouveau siècle(la nébuleuse réformatrice et ses réseaux en france, 1880–1914). civilisations et sociétés. uia. (2015). ‘editorial problems and policies’ (appendix 10.3). in yearbook of international organizations, 2015–16, (pp. 15–21). leiden: brill. uia (union des associations internationals). (1957). les 1.978 organisations internationales fondées depuis le congrès de vienne: liste chronologique. bruxelles: union des associations internationales. url: https://archive.org/details/les1978organisat00unio uia (union of international associations). (1960). les congrès internationaux de 1681 à 1899: liste complète. bruxelles: union des associations internationales. url: https://uia.org/sites/uia.org/files/ journals/international_associations_1960_3.pdf uia (union of international associations). (1964). les congrès internationaux de 1900 à 1919: liste complète. bruxelles: union des associations internationales. vandersmissen, j., & van diem, l. (2021). ‘les indes africaines’ versus ‘le congo minotaure’: debatten over klimaat, acclimatisatie, hygiëne en de idealisering van het imperiaal project van leopold ii in congo, 1876–1908. belgisch tijdschrift voor nieuwste geschiedenis, 51(3), 8–36. van praet, c. (2019). bewogen burgers: negentiende-eeuwse liberalen en de sociale kwestie in belgië (1846–1886). gent: liberas. verbruggen, c., laqua, d., & deneckere, g. (2012). belgium on the move: transnational history and the belle epoque. in d. laqua, c. verbruggen, & g. deneckere (eds.), revue belge de philosophie et histoire, 90(4), 1213–1226. 
crosslinguistic semantic textual similarity of buddhist chinese and classical tibetan research paper corresponding author: marieke meelen university of cambridge, gb mm986@cam.ac.uk keywords: cross-linguistic sts; information retrieval; buddhist chinese; classical tibetan; translation studies to cite this article: felbur, r., meelen, m., & vierthaler, p. (2022). crosslinguistic semantic textual similarity of buddhist chinese and classical tibetan. journal of open humanities data, 8(1): 23, pp. 1–14. doi: https://doi.org/10.5334/johd.86 crosslinguistic semantic textual similarity of buddhist chinese and classical tibetan rafal felbur marieke meelen paul vierthaler *author affiliations can be found in the back matter of this article
abstract in this paper we present the first-ever procedure for identifying highly similar sequences of text in chinese and tibetan translations of buddhist sūtra literature. we initially propose this procedure as an aid to scholars engaged in the philological study of buddhist documents. we create a cross-lingual embedding space by taking the cosine similarity of average sequence vectors in order to produce unsupervised similar cross-linguistic parallel alignments at word, sentence, and even paragraph level. initial results show that our method lays a solid foundation for the future development of a fully-fledged information retrieval tool for these (and potentially other) low-resource historical languages.
1 introduction buddhist sūtra texts, which are fundamental sources for understanding the beliefs that once dominated, and largely continue to dominate, asian societies, present formidable challenges to the modern researcher.
like oral literature, the sūtras are authorless and textually fluid and their content is complex and can be rather formulaic (silk, 2020). as a result, it is often impossible to determine the ‘original’ form of a given work. the situation is complicated further by the huge volume of these documents and the linguistic diversity of their extant versions: for most, only fragments survive in the languages of their original composition (i.e. sanskrit or other indic languages) and all we have are their translations, mainly into chinese and tibetan. in this paper we present a novel method1 designed to help researchers tackle these challenges more effectively than has been possible to date. this is a method for automatic detection of cross-linguistic semantic textual similarity (sts) across historical chinese and tibetan buddhist textual materials. it aims to enable philologists to take any passage in a chinese buddhist translation text, and to quickly locate tibetan-language parallels to it anywhere in the tibetan buddhist canon. the novelty of our contribution is its cross-linguistic capability for historical, low-resource and under-researched languages. although in both of the languages in question, buddhist chinese and classical tibetan, searching for parallel passages (i.e. monolingual alignment) is possible (klein, dershowitz, wolf, almogi, & wangchuk, 2014; nehrdich, 2020, as well as, in a crude but effective way, through the user interfaces of cb reader, in both its web-based and desktop versions, or the sat daizōkyō text database), cross-linguistic semantic textual similarity and information retrieval (i.e. cross-linguistic ‘alignments’) in buddhist texts have long remained an unsolved task. for a limited number of edited texts in sanskrit and tibetan an attempt at automatic crosslinguistic alignment has recently been made by nehrdich (2020)2 using the yasa sentence aligner.3 however, this method depends on the availability of texts in which words and sentences have been manually pre-segmented, which is not the case for the vast majority of texts we are targeting. furthermore, being designed for sanskrit and tibetan, this method is not currently applicable to our highly specific buddhist chinese. in short, no advanced cross-linguistic information retrieval techniques have yet been developed for any historical languages. both the tibetan and buddhist chinese texts under investigation pose particular challenges because e.g. of their different scripts, the lack of word segmentation and sentence boundaries, as well as due to the highly specific buddhist terms and (often deliberately) obscure double meanings etc. in this paper we build on the extant work on these languages by vierthaler and gelein (2019) and vierthaler (2020) (for alignment and segmentation of buddhist chinese) and meelen and hill (2017), faggionato and meelen (2019) and meelen, roux, and hill (2021) (for segmentation and pos tagging of old and classical tibetan) to develop the first-ever buddhist chinese-tibetan cross-linguistic sts pipeline, creating unsupervised cross-linguistic alignments for words, sentences, and whole paragraphs of these buddhist texts, and potentially of contemporaneous non-buddhist materials as well. our proposed procedure for these highly specific buddhist chinese or tibetan texts will be an important asset for anyone working with under-researched and low-resource historical languages. 
2 method in recent years, large digitisation projects have provided online access to huge buddhist chinese and buddhist tibetan corpora: digitized versions of over 70,000 traditional woodblock print pages in the tibetan case, as well as, on the chinese side, of some 80,000 typeset print pages of the modern taishō canon, in addition to growing quantities of other canonical and extracanonical materials. in this section we show how we developed our procedure step-by-step. figure 1 shows the full pipeline of our proposed procedure, starting with tokenisation of the individual chinese and tibetan corpora and ending with the full output ranked after clustering and optimisation of cosine similarity scores of target outputs. 1 all code available on https://github.com/vierth/buddhist_chinese_classical_tibetan (last accessed: 8 august 2022). 2 https://github.com/sebastian-nehrdich/sanskrit-tibetan-etexts (last accessed: 8 august 2022). 3 http://rali.iro.umontreal.ca/rali/?q=en/yasa (last accessed: 8 august 2022). 2.1 tokenisation while tokenisation and sentence segmentation are not usually significant hurdles when working with documents written in western languages, in which words are delineated by white space, these are not trivial tasks for either premodern chinese, including buddhist chinese, or classical tibetan. neither language uses clear morphological markers or white space to indicate words, and in many cases it is not easy to even divide a text into sentences or utterances. accordingly, before we can develop a model, we must first preprocess our corpora to include token and sentence boundaries. tokenisation is especially challenging on the chinese side. for the chinese, we use chinese buddhist translation texts from the kanseki repository (wittern, 2016).4 these texts are mostly provided with punctuation, which makes sentence level segmentation relatively simple. complications arise, however, when it comes to segmentation on the word level of these materials. while much effort is currently being invested in attempts to develop tools that will segment chinese texts into words (some of them specifically designed to segment buddhist materials, e.g. wang, 2020), these tools remain unusable to us, since the underlying models themselves are often not openly released, and the training data used to create them is often not available. for this reason, we had to devise our own strategy for tokenising the chinese buddhist translation texts. in doing so, we used three different approaches and compared their efficiency: word-based tokenisation, character-based tokenisation, and a hybrid approach. for the first approach, we began by creating word-based embeddings on the basis of two glossaries of buddhist terms (inagaki, 1978; yokoyama & hirosawa, 1996). this allowed us to scan each sentence in our texts for buddhist terms listed in these glossaries, prioritising longer sequences of characters. once the buddhist vocabulary was identified, the remaining sequences not found in the glossaries were parsed into words using a classical chinese tokeniser5 (see qi, zhang, zhang, bolton, & manning, 2020).
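to make the glossary-first strategy concrete, the following is a minimal python sketch of a greedy longest-match scan of this kind; it is an illustrative reconstruction rather than the project's released code, and the names glossary, max_len and fallback are assumptions. in the hybrid variants described next, the fallback would simply split leftover spans into single characters.

# hedged sketch of glossary-first tokenisation for buddhist chinese.
# assumption: `glossary` is a set of multi-character buddhist terms
# (e.g. drawn from the inagaki / yokoyama-hirosawa glossaries).
def glossary_tokenise(sentence, glossary, max_len=8, fallback=None):
    """greedy longest-match over `sentence`; spans not covered by the
    glossary are handed to `fallback` (a tokeniser function) or split
    into single characters (as in the hybrid approaches)."""
    tokens, buffer, i = [], [], 0
    while i < len(sentence):
        match = None
        # prioritise longer glossary entries
        for length in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + length]
            if candidate in glossary:
                match = candidate
                break
        if match:
            if buffer:  # flush the accumulated non-glossary material first
                chunk = "".join(buffer)
                tokens.extend(fallback(chunk) if fallback else list(chunk))
                buffer = []
            tokens.append(match)
            i += len(match)
        else:
            buffer.append(sentence[i])
            i += 1
    if buffer:
        chunk = "".join(buffer)
        tokens.extend(fallback(chunk) if fallback else list(chunk))
    return tokens

# example with hypothetical glossary entries:
# glossary_tokenise("如來說法", {"如來", "說法"}) -> ["如來", "說法"]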
because this word-based tokeniser introduced significant noise into our downstream tasks, we tested two other tokenisation approaches: a character-based approach that treats individual characters as tokens, and a hybrid approach that uses the word-based tokenisation described above, but which parses sequences not found in the glossaries simply as individual characters (i.e. without using the classical chinese tokeniser). we also enhanced the dictionaries, using more advanced glossaries by karashima seishi (karashima, s., 1998, 2001, 2010) for our first test, which we will refer to as 'hybrid 1', and an even further extended dictionary including the da zhidu lun (li, 2011) glossary which we will refer to as 'hybrid 2'. on the tibetan side, tokenisation was converted to a syllable-tagging and recombination task with the actib scripts6 developed by meelen et al. (2021). as for sentence segmentation, we could use the technique developed by meelen and roux (2020) and optimised by faggionato, hill, and meelen (2022) to create sentence boundaries in tibetan, which is good, but not 100% accurate. existing automatic aligners rely on sentence boundaries, so accuracy is of crucial importance. another issue that arises in this context is the difference between the chinese and tibetan texts we focus on specifically, as there are often multiple tibetan sentences corresponding to one sentence in buddhist chinese. for these reasons, our procedure is solely based on semantic textual similarity, thereby bypassing the need for sentence boundaries altogether. 4 kanseki repository http://web.archive.org/web/20210418080358/http://blog.kanripo.org/ (last accessed: 8 august 2022). the texts themselves are hosted on github: https://github.com/kanripo (last accessed: 8 august 2022) and derive from work done by the cbeta project. 5 as distributed through the stanza python library. https://stanfordnlp.github.io/stanza/available_models.html (last accessed: 8 august 2022). 6 https://github.com/lothelanor/actib (last accessed: 8 august 2022). figure 1 pipeline for overall procedure of cross-lingual buddhist chinese & classical tibetan alignment. 2.2 developing embeddings there are many ways to acquire useful vector representations of words, known as word embeddings, which in turn can be used to aid downstream tasks like text classification, stylometric analysis, sentiment analysis, and, crucially for us, information retrieval, and its specific application in automatic textual alignments. these ways range from the straightforward count vector models that simply track word frequency across a corpus, to more advanced algorithms like google's word2vec and facebook's fasttext, which use neural networks to develop models that can predict words based on a set of context words (continuous bag of words, or cbow), or that can predict context words when given an input term (skip-gram). state-of-the-art word representations can be attained using transformer-based algorithms like bert (devlin, chang, lee, & toutanova, 2019) and ernie (zhang et al., 2019), which learn word representations by predicting masked words.
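as a rough illustration of the kind of model just described, the following sketch trains sub-word embeddings with the gensim fasttext implementation on a pre-tokenised corpus; the file name and hyperparameters are illustrative assumptions rather than the settings used in this study, and the choice of algorithm actually adopted is discussed in the next paragraph.

# hedged sketch: training sub-word embeddings with gensim's FastText.
# `chinese_corpus.txt` and the settings below are illustrative assumptions,
# not the project's actual configuration.
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# each line of the file is one pre-tokenised sentence (tokens separated by spaces)
sentences = LineSentence("chinese_corpus.txt")

model = FastText(
    sentences,
    vector_size=100,   # the study works with 100-dimensional spaces
    sg=1,              # skip-gram; cbow would be sg=0
    window=5,
    min_count=3,
    epochs=10,
)
model.save("zh_buddhist.ft")
# the tibetan corpus would be trained separately with the same settings,
# giving two independent monolingual spaces to be projected later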
in our procedure, in order to balance sophistication against complexity, we have elected to use fasttext to create the embeddings that will drive our approach.7 in addition to selecting the most adequate embedding method, it is essential to choose the most appropriate textual corpus as a basis for the embeddings. since our goal was to create an embedding model that will be useful for the specific goal of aligning chinese and tibetan buddhist translation texts, we chose a corpus that contains just the type of language that is specifically used in these texts. this is essential because the idiom and style of buddhist texts is usually markedly different from that used in the broader language as a whole. accordingly, for chinese, we used buddhist texts contained within the kanseki repository, encompassing the taishō edition of the chinese buddhist canon and a variety of supplementary materials, for a total of 4,137 documents containing 174m characters (20,775 unique). for tibetan, we used the sūtra translations in the kangyur (the electronic derge version of the ekangyur collection), as well as electronic versions of commentarial and other texts in the entire etengyur to create a corpus that is large enough to create word embeddings. the ekangyur consists of around 27 m tokens and the etengyur consists of around 58m tokens (see meelen & roux, 2020); these together represent 31k unique tokens. because we are attempting to develop a system that is not dependent on a priori knowledge of which chinese text ‘should’ align with which tibetan text, we trained two separate embeddings, one on the chinese buddhist texts, and one on the tibetan. that is, we took each corpus independently and fed the corpora into the fasttext algorithm with the same settings, creating two independent spaces of 100 dimensions each. we then projected the resulting embeddings into the same space, creating a combined embedding space, discussed in section 2.3. 2.3 combining embeddings for creating the combined embedding space, we adopted the approach of glavaš, francosalvador, ponzetto, and rosso (2018),8 which is in turn an implementation of the linear translation matrix approach suggested by mikolov, le, and sutskever (2013). in effect, our method takes an embedding space for each language and then relies on a bilingual glossary to create a linear projection. this projection casts the two spaces into a shared space, one which preserves internal linguistic similarity while trying to bring the glossary terms as close together as possible.9 using the two embedding spaces created in the previous step, we can then apply the aforementioned yokoyama-hirosawa and inagaki glossaries, which provide chinese and tibetan translation pairs. we then identify every pair for which we have an embedding in both chinese and tibetan and use all these pairs together to create a projection into a shared embedding space. 7 while it might be ideal to use a transformer model, there are no available models trained on buddhist chinese or classical tibetan specifically and existing models for modern chinese or even tibetan are not suitable for the task since the [classical] langauges differ too much compared to the corresponding contemporary varieties. we therefore leave transformers for future research and use fasttext rather than word2vec as it learns sub-word level representations of terms, which in the end creates a slightly more flexible model. 8 following the method they describe in glavaš et al. 
(2018), we adapted their translation matrix code (https://bitbucket.org/gg42554/cl-sts/src/master/code/ [last accessed: 8 august 2022]) for this project. 9 it is possible that orthogonal constraints on the translation matrix and other normalisations could improve the resulting embedding space, as is suggested by xing, wang, liu, and lin (2015). however, this would require extensive refactoring of code and is planned for the future. in cases where the translation glossary includes a multi-character chinese term not found in the embedding space, but where all constituent characters are present, an embedding is derived by averaging the vectors for all the characters within the word. we can glean some insight into the quality of the new shared embedding space by looking at the cosine similarity between known translation pairs from the glossaries, as shown in table 1. the results listed in table 1 show that the different chinese tokenisation approaches used lead to different rates of similarity in the shared embedding space. for word-based embeddings and to a lesser extent 'hybrid 2', these results also indicate that, in general, the larger the tokenisation dictionary, the higher the similarity. although word-based tokenisation performs slightly better at this initial step, it does not work as well as the hybrid approaches for our downstream tasks, as shown in section 3 below. as a further sanity check, we visualised some embeddings to see whether similar words indeed exist in close proximity to each other. the resulting visualisation is presented in figure 2, which demonstrates this for some sample vectors for animals, directions, numbers, and seasons.10 all these categories are nicely clustered together as expected. the only outlier is tibetan nya sha, which was labelled as an animal, but it actually means 'fish (as) meat', i.e. fish that will be eaten. it is therefore not entirely surprising that it would be farther away from the rest of the animal words, which are not used as food. figure 3 is a zoomed-in view of the "animal" cluster from figure 2, with english translations for the vectors. this zoomed-in view shows that tibetan and chinese equivalents are placed relatively close together, as expected. 10 the embeddings exist in 100-dimensional space, and we have used t-sne to reduce the dimensionality in order to visualize the relationships. please note that this preserves local similarity but obscures global differences.
table 1 summary of cosine similarity scores of tibetan-chinese glossary pairs within the new embedding spaces according to chinese tokenisation method. shows the highest scoring pair, lowest scoring pair, and some descriptive statistics. higher scores with lower standard deviation indicate a more accurate embedding space.
chinese embedding type   most similar   least similar   median   mean   std
character                0.9            -0.2            0.66     0.64   0.12
hybrid1                  0.9            0.19            0.66     0.65   0.11
hybrid2                  0.91           0.22            0.66     0.64   0.11
word                     0.92           0.3             0.67     0.67   0.11
figure 2 a sample of embeddings selected from the cross-lingual tibetan-chinese space. this includes a selection of animal, numerical, seasonal, and directional words. [scatter plot of term embeddings in the cross-lingual space, axes dim1/dim2, with clusters for animals, numbers, seasons and directions.]
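the projection described in section 2.3 can be approximated with an ordinary least-squares translation matrix in the spirit of mikolov et al. (2013); the sketch below is a simplified stand-in for the adapted glavaš et al. code (variable names and shapes are assumptions), and it also reproduces the kind of glossary-pair similarity check summarised in table 1.

# hedged sketch of the linear projection into a shared space (mikolov-style
# translation matrix). X holds chinese vectors, Y the tibetan vectors of the
# same glossary pairs; names and shapes are illustrative.
import numpy as np

def fit_translation_matrix(X, Y):
    # least-squares solution W minimising ||X @ W - Y||, one row per glossary pair
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# X, Y: (n_pairs, 100) arrays built from the two monolingual embedding spaces
# W = fit_translation_matrix(X, Y)
# projected = X @ W                      # chinese vectors cast towards the tibetan space
# scores = [cosine(projected[i], Y[i]) for i in range(len(Y))]
# print(np.median(scores), np.mean(scores), np.std(scores))   # cf. table 1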
there is room for improvement in the quality of the shared embedding space, but the real test is the space's utility for the task at hand, which is identifying textual sequences with similar semantic meaning across languages. 2.4 identifying similar sequences with the combined and checked word embeddings in hand, we are ready to apply our procedure to what has been the main goal all along, i.e. searching for sequences of text in both tibetan and chinese that carry similar meanings. in this pilot study we use as our source texts three chinese sūtras from the mahāratnakūṭa (mrk) collection, which have been manually divided into sections.11 we then tokenise each section into either characters, words, or buddhist terms (as in our two hybrid embedding approaches). then we fetch the vector for each token in the section and average the vectors together to create a vector representation of the entire section. we then define tibetan texts parallel to the chinese sūtras as the 'target.' we divided this target text into sections as well: we did this by using a sliding window of text from a tibetan candidate document, the length of which window is based on the length of the chinese section, adjusted by some length factor. we then calculate the cosine similarity between the chinese section in question and all tibetan sections. finally, we have the system rank the suggested results based on highest cosine similarity of the combined embeddings, and report the results. the highest-scoring sections are likely to have similar meaning. 2.5 parameter settings, clustering & optimisation when we looked closely at the generated results, we found that we could improve their quality by optimising the test parameter settings, specifically the length of the tibetan search window. one reason why such optimisation proved advantageous may be the fact that the tibetan text is always more elaborate than the chinese, meaning that for every chinese passage of n tokens, the parallel tibetan will include roughly 50% more tokens. in order to accommodate this difference, we extended the tibetan search window by a fixed rate (proportional rates proved inefficient, hence we rejected them), in order to ensure the results would cover the entire chinese input. significantly shorter chinese input phrases required a different rate still, since they tend to be proportionally even longer in tibetan than are longer chinese phrases. in section 3.3 we discuss the parameter options to optimise results for different input lengths. 11 please see section 3.1 of the alignment scoring manual, handy and meelen (2022): https://zenodo.org/record/6782150#.yu5uicba5pq (last accessed: 8 august 2022).
figure 3 a zoomed-in detail of some of the animal words from the cross-lingual embedding space shown in figure 2, including english translations. [detail of the animal embeddings in the cross-lingual space, axes dim1/dim2, with tibetan and chinese equivalents labelled in english.]
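a minimal sketch of the search step in section 2.4, together with the clustering of nearby windows mentioned in section 2.5, could look as follows; the function names, the fixed step size and the length factor are illustrative assumptions, not values taken from the released pipeline.

# hedged sketch of section 2.4: average-vector windows + cosine ranking.
# `vec` maps a token to its vector in the shared cross-lingual space;
# names, step size and length factor are illustrative assumptions.
import numpy as np

def section_vector(tokens, vec):
    vectors = [vec[t] for t in tokens if t in vec]
    return np.mean(vectors, axis=0) if vectors else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_windows(zh_tokens, bo_tokens, vec, length_factor=1.5, step=5, cluster_dist=20):
    """slide a window (scaled by `length_factor`) over the tibetan tokens,
    score each window against the chinese section, merge windows that start
    close together, and return the candidates ranked by similarity."""
    query = section_vector(zh_tokens, vec)
    window = max(1, int(len(zh_tokens) * length_factor))
    scored = []
    for start in range(0, max(1, len(bo_tokens) - window + 1), step):
        cand = section_vector(bo_tokens[start:start + window], vec)
        if cand is not None:
            scored.append((cosine(query, cand), start))
    scored.sort(reverse=True)
    results = []  # keep only one representative per cluster of nearby windows
    for score, start in scored:
        if all(abs(start - s) > cluster_dist for _, s in results):
            results.append((score, start))
    return results  # top of the list = best candidate tibetan passages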
2.6 sample output figure 4 shows an excerpt of a sample output file with the chinese input (shown in line 1), the tibetan target (shown in line 2), further information on location, ranking, similarity scores, etc., as well as the clustered outputs and information on how well they fit with the target. alignments are identified by their unique alignment codes, e.g. 't2.a1' refers to 'alignment number 1 in text 2'. a complete overview of all manual alignments used for evaluation (see section 2.7) can be found in the supplementary files. 2.7 evaluation method our alignment outputs automatically receive similarity scores, which allows them to be automatically ranked. this in turn is useful to philologists, as it allows for displaying any number of 'top' alignments, depending on the task at hand (e.g. top 5, 10 or 15). in order to evaluate our automatic chinese-tibetan alignment outputs, we compared them to a manually-created gold standard. this gold standard refers to a set of data produced by expert philologists,12 who manually aligned three of our source and target texts and provided alignment scores based on machine translation evaluation techniques. producing these manual alignments was a non-trivial task, for two reasons. first, while nominally speaking the chinese and tibetan texts in question are translations of the same indic buddhist scripture, in no case can we assume that the two were in fact translated from the same original source in sanskrit or another indic source language; indeed the two texts in each pair often differ from each other strikingly, in some cases entirely. second, the very process of manually scoring the proposed alignments, with the aim to identify 'near-perfect' pairs, is also to a considerable degree subjective, so much so that even experienced philologists with excellent knowledge of both languages can differ in judgement. in order to mitigate both of the problems listed above, we created a detailed annotation and scoring guide, with diagnostics and precise decision-making criteria, as well as examples.13 in addition, we had a random selection of alignments double-checked by multiple annotators, in order to check for consistency.14 all in all, the philologists identified 80 near-perfect alignment pairs for three chinese input texts and their corresponding tibetan targets (42 for text 1; 21 for text 2; 17 for text 3). these 80 alignments constituted our gold standard, which we used in testing the effectiveness and accuracy of our procedure. this manually-developed gold standard is available for only three chinese texts and their tibetan counterparts at present, which is why we focus on these three pairs of texts only in the evaluation of this pilot study. the three texts in question are: 1. xulai jing 須賴經 (t329), from the late 3rd–early 4th century, and the des pas zhus pa (d71), from ca. the late 8th century, translations, into chinese and tibetan respectively, of the *sūrata-paripṛcchā (henceforth 'text 1') 12 these philologists were from the erc-funded openphilology project (https://openphilology.eu/team [last accessed: 8 august 2022]). 13 this guide is available on zenodo: https://zenodo.org/record/6782150#.yr3fimbrzpq (last accessed: 8 august 2022), cf. handy and meelen (2022). 14 a comprehensive inter-annotator agreement study could further improve the results, but was beyond the scope of the present pilot study. figure 4 sample output for alignment t2.a1.
2. genghe shang youpoyi hui 恒河上優婆夷會 (t310 [31]), from the early 8th century, and the gang ga'i mchog gis zhus pa (d75), roughly a century later, translations of the *gaṅgottarā-paripṛcchā (henceforth 'text 2') 3. shande tianzi hui 善徳天子會 (t310 [35]), from the early 8th century, and the sangs rgyas kyi yul bsam gyis mi khyab pa bstan pa (d79), roughly a century later, translations of the *acintyabuddhaviṣaya-nirdeśa (henceforth 'text 3') all three texts survive in their entirety only in the chinese and tibetan translations, with no known complete sanskrit or other indic language versions; they also differ in many ways. one of these ways is especially consequential for our results: text 1 is mainly narrative, and consists of stories that illustrate moral points, while the latter two are more abstract-philosophical, and contain a narrower set of more technical metaphysical concepts. we weigh the implications of this difference in section 3.1. for this pilot study, we use the chinese sentence as input and let the system find tibetan equivalents that are semantically as similar as possible, ideally capturing the exact target that the philologists identified in the gold standards. 3 results in this section we present the results of using the different methods of creating buddhist chinese embeddings described above in section 2.2. as these embeddings were not yet optimised, a comparison of the effectiveness of the different methods when applied to each of our three texts can give us further insight into which method is best suited for the task at hand. tibetan word embeddings were already optimised (see meelen, 2022), including the addition of specialist (buddhist) terms. in the remainder of this section, we first present the aggregate results per text, and then zoom in on select 'interesting' results in order to discuss how they may have been affected by the different embedding methods used, as well as by the unique characteristics of the inputs qua vocabulary, style, and grammar. 3.1 results per text table 2 shows what percentage of outputs for each text was ranked first or in the top 5/10/15; a separate listing is given for each of the four chinese embedding methods. ideally, the system would automatically rank the exact tibetan target 'first,' so that philologists can instantly find the tibetan equivalents of the chinese inputs they are looking for. however, since this is not likely to happen always, or even frequently, a dedicated user interface for philologists should display the top 5/10/15 (depending on preference), which the user would then go through by hand. for this reason, we list not only the percentage of target alignments that were automatically ranked first, but also those where the target was found in the top 5/10/15, as well as the average ranking of the target result and the number of cases in which the target alignment in tibetan was not found in the top 15 (i.e. ranked 'zero').15 table 2 shows that the results for text 2 are always better than those for text 1 and text 3: the average rank is better (ranging from 1.24 with character embeddings to 2.48 for word embeddings); there are no zero results with any of the embedding methods used; and it has the highest percentage of perfectly matched target results in the top ranks (with almost all targets found in the top 5 with any embedding method).
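the aggregate figures of the kind reported in table 2 can be recomputed from the per-alignment ranks of the gold target with a few lines of code; the sketch below assumes that a rank of 0 encodes 'target not in the top 15', mirroring the convention used here, and that the average rank is computed over found targets only; it is not the study's actual evaluation script.

# hedged sketch: summarising gold-target ranks as in table 2.
# `ranks` holds, for one text and one embedding type, the rank of the
# gold-standard tibetan target for each chinese input (0 = not in top 15).
def summarise(ranks, cutoffs=(1, 5, 10, 15)):
    total = len(ranks)
    found = [r for r in ranks if r > 0]
    stats = {f"%rank{c}": 100 * sum(1 for r in found if r <= c) / total
             for c in cutoffs}
    stats["av. rank"] = sum(found) / len(found) if found else None
    stats["#zero"] = total - len(found)
    return stats

# e.g. summarise([1, 3, 1, 0, 2]) -> {'%rank1': 40.0, '%rank5': 80.0, ...}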
in practice, this means that philologists inputting chinese passages from text 2 are very likely to be presented with exact tibetan targets (i.e. semantically similar passages or target alignments as identified manually by philologists) when searching the entire text. the results for texts 1 and 3 are not as outstanding, but are still very good, with average rankings between 3.3–4.6 (as against 1.2–2.4 for text 2). still, for both texts 1 and 3, we came across some problematic cases in which the system found no tibetan equivalent in the top 15 of the ranked results, as well as ones in which the character-embedding method yielded zero results. these problematic cases are particularly interesting to us: by looking at what went wrong we may understand how to improve our system. one example of such a problematic case with a 'zero result' is alignment 20 in text 1 (t1.a20), as shown in example 1. the highest-ranked match for this input based on character embeddings is shown in 1c. 15 note that 'zero' could mean, for example, that the target was ranked 16th, which is not such a bad result. however, if a targeted interface for philologists only displays the top 15 results, then anything ranked lower could not be considered. (1) (a) (b) (c) the system ranked 1c first, suggesting that it matches the input very closely, while even a quick look reveals that this is not at all the case. the colour coding in the examples shows, however, that the highest-ranked output contains multiple matches for a number of individual key terms present in the input, such as 'your highness/majesty', 'wealth/precious jewel,' etc. although meaning-wise, the high-ranked output suggested by the system differs from the input, these key terms do occur multiple times in both. this latter fact may have contributed decisively to the relatively high cosine similarity score of 0.90184 (standard deviation of similarity score: 0.02, avg similarity score 0.85 for this alignment). we may add here that this problem seems to persist also with other embedding methods: for instance, for hybrid-2 embeddings, the highest-ranked result is an altogether different passage of the text than the one discussed immediately above, but this one too contains the very same crucial key terms 'your majesty' and 'precious jewel' multiple times. in addition to such individual cases, we also need to account for the differences in the quality of results between our three texts. one of the reasons for these differences may be the fact that texts 1 and 3 are much longer than text 2 (text 1 has 4,463 tibetan tokens; text 2 has 2,484, and text 3 has 10,930), while at the same time the individual chinese inputs for text 2 are much shorter, which is a reflection of both the internal features of the text, and of the personal preferences of the philologist who aligned it. generally, the longer the text, the more difficult it is to rank the target match first (especially when the input passages are short), simply because there are many more competing matches than there are in a shorter text. we will discuss this further in section 3.3 below. another reason might be the subjectivity of the manual alignments, which depend to some extent on the discretion of the philologist, as mentioned before. in addition, currently our only measure for evaluating the accuracy of the results in this pilot study is the ranking of the target tibetan.
this ranking is, however, not always entirely reliable, and it can be easily influenced (or distorted) by a number of factors, e.g. text length, how repetitive/diverse the content is, etc. one important aspect our current evaluation metric disregards is how closely non-target results with high cosine similarity scores reflect the semantic content of the input. although we currently do not have the required data to evaluate this automatically, we will shed some light on this in section 3.5.
table 2 results for all texts with four embedding methods for the chinese input.
text – chi. embedding type   %rank1   %rank5   %rank10   %rank15   av. rank   #zero
text 1 – character           30.95    69.05    78.57     92.86     4.33       2
text 1 – hybrid 1            35.71    69.05    88.1      92.86     3.56       0
text 1 – hybrid 2            40.48    73.81    90.48     95.24     3.4        0
text 1 – word                38.1     61.9     76.19     85.71     3.92       2
text 2 – character           76.19    100      100       100       1.24       0
text 2 – hybrid 1            52.38    100      100       100       2          0
text 2 – hybrid 2            61.9     100      100       100       1.57       0
text 2 – word                42.86    95.24    100       100       2.48       0
text 3 – character           35.29    47.06    52.94     70.59     4.58       1
text 3 – hybrid 1            35.29    64.71    82.35     88.23     3.53       0
text 3 – hybrid 2            35.29    58.82    82.35     82.35     3.36       0
text 3 – word                11.76    52.94    70.59     70.59     3.92       2
3.2 the effect of different chinese embedding methods one variable parameter in our results consists of the different methods of creating chinese embeddings, as described in section 2.2 above. 'hybrid 2' embeddings are essentially 'hybrid 1' embeddings extended with additional buddhist terms from the da zhidu lun glossary. therefore, whenever 'hybrid 2' embeddings yielded better results for certain alignments than did hybrid-1 embeddings, we expect this is because these alignments contain terminology that is only found in the da zhidu lun glossary. one clear example of this is alignment 21 in text 1 (ranked first with hybrid 2, but sixth with hybrid 1). this alignment contains 如來 'tathāgata' which, among the glossaries we used, is only found in the da zhidu lun glossary, and not in the karashima lists upon which the 'hybrid 1' embeddings were based. this example is shown in 2, along with its tibetan target: (2) (a) (b) figure 5 shows the results (up to top-10 ranks) from table 2 in a chart organised by type of chinese embedding.
figure 5 top-ranked results for each chinese embedding method by text.
though this pattern of superiority of hybrid-2 over hybrid-1 embeddings is expected and indeed quite common in our results, we also found one counterexample to it, namely the short alignment 11 in text 2 (shown in 3). in this case, hybrid-1 performed best (target ranked 5th), while hybrid-2 embeddings had the target ranked 11th. this is unexpected, because the input contains 攀縁 'in accordance with conditions,' which is found in the karashima lists, but not in the da zhidu lun glossary. this means that this particular term was included in both hybrid-1 and hybrid-2 embeddings and there must be another, as yet unidentified, reason why the hybrid-1 embeddings yield a better result here. (3) (a) (b) another category of results consists of those in which character embeddings performed best. in these cases we expect to be dealing with inputs that contain few multi-character proper nouns and specialist buddhist terms, which is indeed the usual pattern. nonetheless, we found a number of exceptions, e.g. alignment 12 of text 2. the input here does contain some
technical multi-character terms (世尊 'bhagavān', 能知 'knowable' and 能得 'graspable'). this might lead one to expect that hybrid embeddings would perform best. this, however, is not the case: character embeddings proved superior. the reason for this is not entirely clear, although it may have something to do with the fact that all the terms listed above also make sense if they are split up into single characters ('world-honour,' 'able-know,' 'able-grasp' respectively). a similar explanation can be offered for alignment 33 in text 1 (ranked 14th with char vs 24th/35th with hybrid-1/2 embeddings), so this phenomenon does not appear to be text-specific. other cases of better performance of char embeddings include:
• text 1: alignments 27 (ranked 2nd with char vs 4th/7th with hybrid-1/word) and 32 (ranked 2nd with char vs 7th in hybrid-1 and word);
• text 3: alignments 12 and 15 (both ranked 1st with char vs 3rd/4th with hybrid-1/word), and also 7 (ranked 8th with char vs 19th/36th with hybrid-1/word), 13 (ranked 1st with char vs 3rd/6th with hybrid-1/hybrid-2) and 14 (ranked 1st with char vs 6th/3rd with hybrid/word).
some of these cases are especially difficult to interpret. for instance, alignments 27 and 32 of text 1 contain multi-character proper names, like 波斯匿 'prasenajit.' these are expected to pose difficulties for char embeddings, for, while they can be read as individual characters, this would result in gibberish: 波-斯-匿 is 'wave-this-conceal.' similarly, alignments 12 and 15 of text 3 contain the long phonetic transcription of a sanskrit name, 文殊師利 'mañjuśrī', which, if read as individual characters, would make little sense ('literature-distinct-teacher-benefit'), and which therefore can only be 'misleading' for alignment purposes. as for alignments 7, 13 and 14 of text 3, the fact that char embeddings performed best may be related to the fact that the inputs are extremely short, consisting of at most 7 characters (see section 3.3). these types of unexpected examples form a minority, however, and while further analysis of such cases is a desideratum, it can only be performed at a later stage, using a larger dataset. overall, we can conclude that in the three texts we have investigated for this pilot study, the enhanced hybrid-2 embeddings generally perform better for alignments that contain specialist buddhist terminology, and that in the absence of such terminology, char embeddings perform equally well or better, which is exactly what we expected. 3.3 the effect of input length some texts exhibit a relatively high degree of repetition of short, generic clauses. this presents a challenge for the alignment procedure, as it is unclear which passage is the target identified by philologists if multiple passages with very similar meanings are present in the text. this problem pertains especially to texts 2 and 3, where aligned segments are relatively short. especially in text 3, we have short recurring inputs like 'x said', e.g. alignment 7 with input 諸比丘言 'all the monks said' (ranked 8th) or alignment 11 with input 汝等應知 'you all should know' (partial match ranked 12th, because the tibetan target contains an additional vocative 'friends!'). while short inputs pose challenges to our procedure, very long inputs usually lead to good results. one example of this is alignment 10 in text 3, which contains a very long tantric incantation.
as input length clearly affects our results, we included the option of adjusting several minor parameters in order to improve the results for variable input lengths, as follows:
• the proportion by which to adjust long phrases (as they are generally longer in tibetan than in chinese);
• the proportion by which to adjust short phrases (as short chinese phrases are often significantly longer in tibetan);
• the length threshold for what constitutes a "short phrase";
• how far apart results can be clustered together in the final analysis (results within n words of each other get reported as a single result).
of all these minor parameters, we observed that the greatest impact on the results could be generated by adjusting the parameters for long and short phrases. this is most clearly seen in examples from text 2. text 2 has the longest input alignments in general (with a median length of 21 characters; text 1 has a median of 12.5, and text 3 a median of 10), and alignments 4, 6 and 15 of this text demonstrate the importance of adjustments according to phrase length. with the new settings of a 50% increased adjustment length for short phrases from chinese to tibetan, instead of the much longer 130%/140%/160% options we tested before, the rankings of results improved significantly (ranking improvement of 14 → 3 for alignment 4; 11 → 2 for alignment 6 and 6 → 2 for alignment 15). for some alignments, however, reducing the phrasal length settings resulted not in higher rankings, but in lower ones, although these differences were much smaller than the gains observed for the other alignments (ranking 1 → 3 for alignment 1; 1 → 2 for alignments 10 and 17). our current corpora are too small to justify any generalisations here. however, based on the results of our pilot study we can conclude that it is certainly worthwhile to allow for the adjustment of additional parameters, and that the optimal settings are a function of input length and content (i.e. how common the key terms of the input are and how often they reoccur in the text). 3.4 the effect of manual annotation one limitation of the current pilot study lies in the manual annotation: the alignment scores of each of our texts were added by three different philologists. for text 1, we asked the same annotator to provide scores for his alignments on two different occasions, at least one year apart. we observed that some alignments he had at first identified as perfect equivalents (score 5) were scored 4 in the second round of manual annotation. this shows the important issue of subjectivity in manual scoring. this issue can only be effectively addressed through rigorous and repeated large-scale inter-annotator agreement checks. however, at present such checks are almost impossible for logistical reasons: they require time- and labour-intensive participation of multiple philologists who are experts in both classical languages as well as in the highly complex buddhist content of the texts, and such participation is extremely difficult to secure. in view of this, while in future work we hope to include at least partial inter-annotator agreement scores, in the present pilot study we had to settle for the sub-optimal single-scored method.
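should larger-scale double scoring become feasible, agreement on the 1–5 alignment scores could be quantified with a standard chance-corrected coefficient; the following generic cohen's kappa sketch is offered only as an illustration of such a check and is not part of the current workflow.

# hedged sketch: cohen's kappa for two annotators' alignment scores (1-5).
from collections import Counter

def cohens_kappa(scores_a, scores_b):
    assert len(scores_a) == len(scores_b)
    n = len(scores_a)
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    # chance agreement from each annotator's marginal score distribution
    expected = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# e.g. cohens_kappa([5, 4, 5, 3], [5, 5, 5, 3])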
3.5 measuring the success of actual semantic similarity alignment 20 from text 1, illustrated in example 1 above, already showed that frequently-occurring key terms could have a negative impact on ranking: whenever key terms occur repeatedly, the chances of multiple outputs with high cosine similarity scores increase, and the chances of a high ranking for just one specific output (corresponding to the target) decrease. in this section we briefly demonstrate that although lower rankings may initially indicate a bad result, this does not necessarily mean that our system is performing badly: high-ranked outputs may not be the exact target (as identified by expert philologists in our gold standard), but they could still convey the same or a very similar meaning. we can see this in particular for alignments where the average cosine similarity results are low. consider, for example, alignment 8 from text 1: (4) (a) (b) (5) the average cosine similarity of this alignment with the hybrid-2 embeddings is only 0.80 (standard deviation of 0.02). the target is ranked 2nd with a cosine similarity of 0.85807, but the highest-ranked output shown in (5) scored 0.88005. the colour coding shows that this output contains two of the key terms present in the chinese input. since the chinese input is relatively short, overlap in two such highly specific terms can yield relatively high similarity and thus lead to a highly-ranked result. 4 conclusion in this paper we presented the first-ever procedure for identifying highly similar sequences of text in chinese and tibetan translations of buddhist sūtra literature. our pilot study is based on creating a cross-lingual embedding space by taking the cosine similarity of average sequence vectors in order to produce unsupervised similar cross-linguistic parallel alignments at word, sentence, and even paragraph level. we evaluate the results of the pilot study by comparing three buddhist texts that were manually aligned by expert philologists. initial results show that our method lays a solid foundation for the future development of a fully-fledged information retrieval tool for these (and potentially other) low-resource, historical languages. we will address questions of scalability and of further philological use cases in future research. supplementary files supplementary materials are deposited on zenodo:
• alignment scoring manual (handy & meelen, 2022): https://doi.org/10.5281/zenodo.6782150
• buddhist chinese embeddings (vierthaler, 2022): https://doi.org/10.5281/zenodo.6782932
• classical tibetan embeddings (meelen, 2022): https://doi.org/10.5281/zenodo.6782247
acknowledgements thanks to the british academy and to the european research council (erc) for financial support, as well as to gregory forgues & jonathan a. silk for manual alignments. funding information this work was supported by the european research council (erc) under the horizon 2020 program (advanced grant agreement no 741884). competing interests the authors have no competing interests to declare. author affiliations rafal felbur orcid.org/0000-0002-0555-9992 leiden university, nl marieke meelen orcid.org/0000-0003-0395-8372 university of cambridge, gb paul vierthaler orcid.org/0000-0002-2135-9499 college of william and mary, us references devlin, j., chang, m.-w., lee, k., & toutanova, k. (2019, june). bert: pre-training of deep bidirectional transformers for language understanding.
in proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) (pp. 4171–4186). minneapolis, minnesota: association for computational linguistics. retrieved from https://aclanthology.org/n19-1423 (last accessed: 8 august 2022). doi: https://doi.org/10.18653/v1/n19-1423 faggionato, c., hill, n., & meelen, m. (2022, june). nlp pipeline for annotating (endangered) tibetan and newar varieties. in proceedings of the workshop on resources and technologies for indigenous, endangered and lesser-resourced languages in eurasia within the 13th language resources and evaluation conference (pp. 1–6). marseille, france: european language resources association. faggionato, c., & meelen, m. (2019). developing the old tibetan treebank. in n. t. angelova mitkov (ed.), proceedings of recent advances in natural language processing (pp. 304–312). varna: incoma. doi: https://doi.org/10.26615/978-954-452-056-4_035 glavaš, g., franco-salvador, m., ponzetto, s. p., & rosso, p. (2018). a resource-light method for cross-lingual semantic textual similarity. knowledge-based systems, 143, 1–9. doi: https://doi.org/10.1016/j.knosys.2017.11.041 handy, c., & meelen, m. (2022, june). mrk alignment scoring guidelines. zenodo. retrieved from https://doi.org/10.5281/zenodo.6782150 (last accessed: 8 august 2022). inagaki, h. (1978). index to the larger sukhāvatīvyūha-sūtra. a tibetan glossary with sanskrit and tibetan equivalents. tokyo: nagata bunshudo. karashima, s. (1998). a glossary of dharmarakṣa's translation of the lotus sutra: zheng fahua jing ci dian. tokyo: the international research institute for advanced buddhology, soka university. karashima, s. (2001). a glossary of kumārajīva's translation of the lotus sutra: myōhō rengekyō shiten. tokyo: the international research institute for advanced buddhology, soka university. karashima, s. (2010). a glossary of lokakṣema's translation of the aṣṭasāhasrikā prajñāpāramitā. tokyo: the international research institute for advanced buddhology, soka university.
klein, b. e., dershowitz, n., wolf, l., almogi, o., & wangchuk, d. (2014). finding inexact quotations within a tibetan buddhist corpus. in 9th annual international conference of the alliance of digital humanities organizations, dh 2014, lausanne, switzerland, 8–12 july 2014, conference abstracts.
li, q. (2011). da zhidu lun cidian 大智度論辭典. electronic resource. retrieved from https://www.dropbox.com/s/ocsagb529k3e70v/dzdl.bgl?dl=0 (last accessed: 1 june 2021).
meelen, m. (2022). tibetan language models: from distributional semantics to facilitating tibetan nlp. accepted submission to iats 2022.
meelen, m., & hill, n. (2017). segmenting and pos tagging classical tibetan using a memory-based tagger. himalayan linguistics, 16(2). doi: https://doi.org/10.5070/h916234501
meelen, m., & roux, é. (2020). meta-dating the parsed corpus of tibetan (pactib). in proceedings of the 19th workshop on treebanks and linguistic theories (pp. 31–42). doi: https://doi.org/10.18653/v1/2020.tlt-1.3
meelen, m., roux, é., & hill, n. (2021). optimisation of the largest annotated tibetan corpus combining rule-based, memory-based, and deep-learning methods. acm transactions on asian and low-resource language information processing (tallip), 20(1), 1–11. doi: https://doi.org/10.1145/3409488
mikolov, t., le, q. v., & sutskever, i. (2013). exploiting similarities among languages for machine translation. corr, abs/1309.4168. retrieved from http://arxiv.org/abs/1309.4168 (last accessed: 8 august 2022).
nehrdich, s. (2020). a method for the calculation of parallel passages for buddhist chinese sources based on million-scale nearest neighbor search. journal of the japanese association for digital humanities, 5(2), 132–153. doi: https://doi.org/10.17928/jjadh.5.2_132
qi, p., zhang, y., zhang, y., bolton, j., & manning, c. d. (2020). stanza: a python natural language processing toolkit for many human languages. arxiv preprint arxiv:2003.07082. doi: https://doi.org/10.18653/v1/2020.acl-demos.14
silk, j. a. (2020). tekisuto sokei no nai kōtei: bukkyō kyōten to yudayakyō rabi bunken kenkyū ni okeru honbun hihan, soshite 'hirakareta bunkengaku' dejitaru hyūmanitīzu purojekuto テキスト祖型のない校訂: 佛敎經典とユダヤ敎ラビ文獻硏究における本文批評、そして「開かれた文獻學」デジタルヒューマニティーズプロジェクト [editing without an ur-text: buddhist sūtras, rabbinic text criticism, and the open philology digital humanities project]. tōyō no shisō to shūkyō 東洋の思想と宗敎, 37, 22–58.
vierthaler, p. (2020). a simple dictionary-based tokenizer for classical chinese text. retrieved from https://github.com/vierth/dictionary_parser (last accessed: 8 august 2022).
vierthaler, p. (2022, june). buddhist chinese word embeddings. zenodo. retrieved from https://doi.org/10.5281/zenodo.6782932 (last accessed: 8 august 2022).
vierthaler, p., & gelein, m. (2019, march 22). a blast-based, language-agnostic text reuse algorithm with a markus implementation and sequence alignment optimized for large chinese corpora. journal of cultural analytics, 4(2). doi: https://doi.org/10.22148/16.034
wang, y.-c. (2020). word segmentation for classical chinese buddhist literature. journal of the japanese association for digital humanities, 5(2), 154–172. doi: https://doi.org/10.17928/jjadh.5.2_154
wittern, c. (2016). the kanseki repository: a new online resource for chinese textual studies. digital scholarship in history and the humanities.
xing, c., wang, d., liu, c., & lin, y. (2015, may–june). normalized word embedding and orthogonal transform for bilingual word translation.
in proceedings of the 2015 conference of the north american chapter of the association for computational linguistics: human language technologies (pp. 1006–1011). denver, colorado: association for computational linguistics. retrieved from https://aclanthology.org/n15-1104 (last accessed: 8 august 2022).
yokoyama, k., & hirosawa, t. (1996). index to the yogācārabhūmi, chinese-sanskrit-tibetan: 漢梵蔵対照瑜伽師地論総索引. tokyo: sankibō busshorin.

mining an english-chinese parallel dataset of financial news
research paper

corresponding author: nicolas turenne, bnu-hkbu united international college, uic, division of science and technology, zhuhai guangdong, china. nicolas.turenne@univ-eiffel.fr

keywords: english-chinese; text mining; clustering; classification; patterns

to cite this article: turenne, n., chen, z., fan, g., li, j., li, y., wang, s., & zhou, j. (2022). mining an english-chinese parallel dataset of financial news. journal of open humanities data, 8: 9, pp. 1–12. doi: https://doi.org/10.5334/johd.62

nicolas turenne, ziwei chen, guitao fan, jianlong li, yiwen li, siyuan wang, jiaqi zhou

abstract
parallel text datasets are valuable for educational purposes, machine translation, and cross-language information retrieval, but few are domain-oriented. we have created a chinese–english parallel dataset in the domain of finance technology, using the financial times website, from which we grabbed 60,473 news items published between 2007 and 2021. this dataset is a bilingual chinese–english parallel dataset of news in the domain of finance. it is open access in its original state, without transformation, and has been built not for machine translation, as such datasets usually are, but for intelligent mining, for which we conducted many experiments using up-to-date text mining techniques: clustering (topic modeling, community detection, k-means), topic prediction (naive bayes, svm, lstm, bert), and pattern discovery (dictionary based, time series). we present the usage of these techniques as a framework for other studies, not only as an application but with an interpretation.

*author affiliations can be found in the back matter of this article

1 introduction
the investigation of classical and new text mining methods using a bilingual dataset can enhance the meaningfulness of comparisons of these techniques.
the original way to use a parallel text dataset is to benefit from its construction, by which the texts are supposed to be strictly similar, leading us to expect that exploratory results from text mining will be similar too. we decided to explore a parallel dataset from a single domain in order to extract knowledge from a technical area (e.g., finance). the choice of the chinese–english pair has several motivations. firstly, the data is more easily available. secondly, there is a demand for english and chinese tools and datasets: english is already the lingua franca in many areas (political, economic, cultural, and scientific), and we also see an increasing interest in chinese, which is now being taught at schools in western countries. one can keep in mind that 1.41 billion people speak chinese as their first or second language, while the figure is 1.35 billion for english (the overlap is no more than 20%). thirdly, china and the usa, as the homes of the native speakers, are drivers for the world economy. the language of business and finance has always attracted interest, since the movement of stock indexes can be an indicator, a 'barometer', of the general trend in the economy.

when we look at the availability of domain-specific parallel corpora, the majority of them are constructed around the following drivers: biomedicine (neves, yepes, & névéol, 2016), digital humanities/culture (christodoulopoulos & steedman, 2014), city, transport (lefever, macken, & hoste, 2009), food, the environment (xiong, 2013), ict (labaka, alegria, & sarasola, 2016), and digital humanities/law and governance (steinberger et al., 2006). concerning chinese–english, chang (2004) from peking university made one of the first large-scale chinese–english parallel corpora from html files, with alignments at the paragraph and sentence levels, leading to a size of 10 million chinese characters across different genres (news, technical articles, subtitles). concerning the domain of finance, there are some small corpora for different pairs of languages, but not chinese–english (arcan, thomas, de brandt, & buitelaar, 2013; bick & barreiro, 2015; smirnova & rackevičienė, 2020; tiedemann, 2012; volk, amrhein, aepli, müller, & ströbel, 2016). the largest one is the sedar dataset,1 containing 8.6 million french–english sentence pairs in the finance domain from pdf files of the regulations of the province of quebec (ghaddar & langlais, 2020). to our knowledge, the dataset discussed in our article represents new available material for the community.

the question we address is to consider the state-of-the-art techniques and the main contemporary approaches to text mining, and to see what we can finally extract from a dataset of news in a specialized domain such as fintech. knowing that each news item contains the same content in chinese and english, another question to explore is the following: "are the efficiency and the extraction exactly the same, or do some cultural aspects influence the translation and thus the lexical and semantic content?" in this way, the general dataset we present in this article can be seen as a gold standard for the output of calibrated measures for all kinds of techniques. in general, studies use a text collection within the framework of a specific method, such as disinformation analysis (turenne, 2018) or the development of medical drugs (kolchinsky, lourenco, wu, & rocha, 2015), or for a specific task such as part of speech (pos) tagging (akbik, blythe, & vollgraf, 2018) or named entity extraction (chiu & nichols, 2016).
in this article, we also take a domain dataset (namely, fintech) and a specific genre of document (news), but we do not have a specific task to improve. we try tasks that are intuitive and directly usable on such a dataset: clustering (named entity and word), classification (topic and sentiment), and pattern extraction (word life and citation). we made the dataset using the financial times website, from which we grabbed 60,473 news items published between 2007 and 2021, each containing a version in english and its translation into chinese. we focus on three families of techniques within the text-mining framework: (i) pre-processing techniques; (ii) supervised approaches involving deep learning techniques such as lstm, bert and cnn, as well as svm, naïve bayes, and random forest; and (iii) unsupervised techniques involving k-means, community detection, biclustering, co-cord analysis, and topic modeling (turenne et al., 2020). this paper is divided into the following sections: we discuss the dataset and its sub-datasets, describe the state-of-the-art research based on bilingual corpora, machine learning, and natural language processing, and then present the results of our experiments.

1 https://github.com/autorite/sedar-bitext (last accessed: 01.03.2022).

2 related work
2.1 parallel language dataset building
zhao and vogel (2002) is probably one of the pioneering studies about combining a parallel chinese–english dataset and a mining approach. they used 10 years of the xinhua bilingual news collection, but that collection is not available. koehn (2005) is a large-scale multilingual and parallel document dataset containing ∼60 million words on average per language for 21 european languages, but nothing in chinese. in the same way, we find a topic detection and tracking repository.2 it contains 30k in chinese and english, but not in parallel. christodoulopoulos and steedman (2014) and sturgeon (2021) are open data repositories and digital humanities projects. they contain books with english–chinese versions, but their content is closely related to philosophy, religion, and difficult-to-understand contemporary thinking: for example, manual annotation for classification is not easy. the uci machine learning repository (dua & graff, 2017) and kaggle3 are repositories of datasets, many of which are used for the evaluation of algorithms; there are no english–chinese parallel corpora. zhai, liu, zhong, illouz, and vilnat (2020) made a dataset considering 11 genres (constructed based on existing work: art, literature, law, material for education, microblogs, news, official documents, spoken, subtitles, science, and scientific articles) and made a parallel english–chinese dataset with 2,200 sentences to test the translation of literals. tian et al. (2014) present um-corpus,4 designed for statistical machine translation (smt) research. it contains 15 million english–chinese parallel sentences and treats eight genres: news, spoken, laws, theses, educational materials, science, speech/subtitles, and microblog. globally, the dataset contains 2.2 million sentences in both languages (450,000 for news alone). this dataset is freely available, but named entities are anonymized.

2.2 building domain-specific parallel datasets
in this section we present an extensive literature review of domain-specific datasets, their language pairs, and topics.
we have observed an increasing interest in domain-specific parallel datasets in recent years. from a computational point of view, the main use of such material is to build a specialized training dataset to improve a statistical machine translation system and to do cross-lingual information retrieval (mcenery & xiao, 2007); from a linguistic point of view, it is to extract automatically or semi-automatically a specialized lexicon in different languages (rosemeyer & enrique-arias, 2016). in the following review, we consider as domain-specific a dataset focused on all aspects of one topic. a text genre, such as news or technical publications, is considered as a domain.

2.2.1 digital humanities: culture
in this domain we have found 20 datasets, of which the large pair datasets are as follows. in the area of religious studies, christodoulopoulos and steedman (2014) is about the bible in 100 languages. we also find the chinese–english (sturgeon, 2021) and the arabic–english (hamoud & atwell, 2017) pairs, a presentation of the same ancient religious texts in different germanic dialects (dipper & schultz-balluff, 2013), and a parallel dataset of english and persian religious texts (beikian & borzoufard, 2016). in literary studies, fraisse, tran, jenn, paroubek, and fishkin (2018) created a massively parallel dataset of translated american literary texts, with 23 languages. altammami, atwell, and alsalka (2020) present a bilingual parallel english–arabic dataset of narratives reporting different aspects of muhammad's life. in the domain of tourism and traveling, espla-gomis et al. (2014) built a domain-specific english–croatian parallel dataset from different websites, ponay and cheng (2015) made an english–tagalog dataset, bureros, tabaranza, and roxas (2015) created an english–cebuano dataset, woldeyohannis, besacier, and meshesha (2018) made an amharic–english dataset, srivastava and sanyal (2015) made a small parallel english–hindi dataset, and boldrini and ferrández (2009) got 4,500 questions/answers from customers about tourism in spanish translated into english. about literary texts, rovenchak (2021) published a bamana–french analysis concerning bamana tales, kenny (1999) describes gepcolt, an electronic collection of some fourteen works of contemporary german-language fiction alongside their translations into english, giouli, glaros, simov, and osenova (2009) made a greek–bulgarian dataset about cultural, literary and folk texts, kashefi (2020) made a persian–english dataset with masterpieces of literature, frankenberg-garcia (2009) built a parallel dataset of english and portuguese literary texts, miletic, stosic, and marjanović (2017) made parcolab, a dataset of english, french and serbian literary books, and guzman (2013) describes a dataset of literary texts with versions in spanish, french, german, and catalan.

2 http://projects.ldc.upenn.edu/tdt3-tdt4 (last accessed: 01.03.2022).
3 https://www.kaggle.com/datasets (last accessed: 01.03.2022).
4 http://nlp2ct.cis.umac.mo/um-corpus/index.html (last accessed: 01.03.2022).

2.2.2 finance
d.-y. lee (2011) used an interesting approach, for korean and english, to improve financial phrase translation, but the corpora are comparable without being truly parallel.
there are some parallel corpora about finance, but of limited size. smirnova and rackevičienė (2020) made a dataset of european documents in english translated into french and lithuanian related to finance, but the size is relatively small, consisting of 154 documents from 2010 to 2014. bick and barreiro (2015) made a portuguese–english parallel dataset of about 40,000 sentences in the legal-financial domain, coming from a company translation memory. we will next mention four notable parallel corpora about finance, for which we give details below: the ecb dataset,5 the dbpedia-linguee dataset, the csb dataset,6 and the sedar dataset.1 all of them have been made for automatic translation and cross-lingual information retrieval purposes. in the opus project (tiedemann, 2012), we can find the ecb dataset, covering 19 european languages and concerning financial and legal newsletters from the european central bank; as an example, it contains 113,000 english–german pairs of sentences. arcan et al. (2013) used dbpedia datasets to extract the titles of relevant wikipedia articles, and the linguee database, obtaining 193,000 aligned sentences (english–german, english–french, and english–spanish) to find translations of financial terms. the credit suisse bulletin dataset (csb) is based on the world's oldest banking magazine, published by credit suisse since 1895 in both german and french (volk et al., 2016). the sedar dataset (i.e., the system for electronic document analysis and retrieval) contains 8.6 million french–english sentence pairs in the finance domain from pdf files of regulations of the province of quebec (ghaddar & langlais, 2020). however, all these datasets concern pairs of european languages. guo (2016) describes how a domain-specific chinese–english parallel dataset in the financial service domain could feasibly be made, but it is restricted to giving guidelines about which tools to use to obtain raw data and how to use a parallel dataset, together with the description and availability of the dataset. we have seen in this review that, firstly, domain-specific datasets address different topics of societal challenges, and secondly, although the finance domain is not lacking in datasets, english–chinese is not yet covered.

2.3 parallel language dataset exploration
parallel corpora have been investigated to make alignments between sentences. wu and xia (1994) is a pioneering work using parallel sentences in the framework of automatic translation. they used literal translations of sentences from the parliamentary proceedings of the hong kong legislative council, with five million words, to predict the chinese translation of each english entry. in yang and li (2003), an alignment method is presented at different levels (title, word, and character) based on dynamic programming (dp). lu, tsou, jiang, kwong, and zhu (2010) used a non-open dataset of 157,000 files, with both chinese and english versions. more recently, schwenk, chaudhary, sun, gong, and guzmán (2021) have carried out an alignment process over 85 languages and 135 million sentences from wikipedia (available as open data), but they found only 790 sentences for english–chinese, which is very few for a text mining workflow. li, wang, huang, and zhao (2011) used a linear combination and a minimum sample risk (msr) algorithm to match named entities (person, organization) and obtained an f-score of 84%. a pioneering work in text mining on english–chinese texts is probably c.-h.
lee and yang (2000), who used a neural network clustering method called self-organizing maps to extract clusters from an english–chinese parallel dataset (made from sinorama magazine articles, with 50,000 sentences),7 but their conclusion only reveals the potential of the approach. lan and huang (2017) construct a bilingual english–chinese latent semantic space and also select k-means initial cluster centers, but the interpretation of the clustering is not very clear.

5 https://www.ecb.europa.eu/press/key/html/downloads.en.html (last accessed: 01.03.2022).
6 http://csb.access.ch (last accessed: 01.03.2022).
7 https://www.taiwan-panorama.com/en/home/about (last accessed: 01.03.2022).

3 the dataset
3.1 data collection
we extracted news from the financial times and ft chinese, both freely available at the financial times websites.8,9,10 the news was collected for the period from 2007 to 2021. after collating the links, the pages were downloaded with 'wget' and stripped of html. the encoding of the files was normalized to utf-8 (r package 'httr'). cloud computing under the slurm framework was used to parallelize the nlp preprocessing. in all, we got an uncleaned raw text dataset of 90,003 documents.

8 https://www.ft.com/ (last accessed: 01.03.2022).
9 https://www.ftchinese.com/ (last accessed: 01.03.2022).
10 this is an example of a parallel archived news link: http://www.ftchinese.com/story/001015037/ce?archive (last accessed: 01.03.2022).

3.2 data preprocessing
we carried out sentence segmentation, word splitting, and named entity extraction. for linguistic preprocessing, we used regular expressions for field extraction and for sentence and paragraph splitting. we used the jieba and spacy tools for tokenization and tagging, and the stanford ner framework for named entity extraction. the html markup was helpful for automatically extracting from each news item its timestamp, title (in both languages), text body (in both languages), and topic tags. in some cases a translation was not available, and we took the item as is. we tried to carry out a paragraph alignment between the two equivalent documents in chinese and english; splitting into paragraphs is quite easy using line break markers, but in some cases the number of paragraphs does not match, and we did not pursue this alignment because of the cost of human validation. we then cleaned the documents using two rules: (1) each document had to have both an english and a chinese version; (2) only files with a text body containing more than two characters were kept. we obtained a cleaned raw text dataset of 60,473 documents. the dataset is available at https://doi.org/10.5281/zenodo.5591908.

3.3 data statistics
the dataset contains various metadata, such as the title and text body in both english and chinese, the time of publication, and some topic tags. table 1 shows the extraction of elementary linguistic features.

table 1: linguistic features of the text collection ('np' is noun phrases, 'multiwd' is multiwords, 'parag.' is paragraphs, 'sent.' is sentences, 'ne' is named entities, 'hanzi' is chinese characters).
lang.     token       np          multiwd     parag.     sent.     ne          hanzi
english   2,598,309   1,672,577   2,376,424   272,756    597,372   1,190,682   0
chinese   7,480,139   1,491,790   3,466,453   258,213    572,185   1,268,674   21,679,815
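to make the cleaning step of section 3.2 concrete, the following minimal sketch applies the two cleaning rules and segments a chinese text body with jieba, one of the tools named above. it is not the project's actual code: the field names 'body_en' and 'body_zh' and the toy documents are our own assumptions, not the dataset's schema, and the real pipeline additionally used spacy, the stanford ner framework and regular-expression field extraction, parallelized under slurm.

import jieba

def keep_document(doc):
    # rule 1: both the english and the chinese version must exist;
    # rule 2: the text body must contain more than two characters
    en, zh = doc.get("body_en", ""), doc.get("body_zh", "")
    return bool(en) and bool(zh) and len(en) > 2 and len(zh) > 2

def segment_chinese(text):
    # split a chinese text body into words (jieba's default mode)
    return jieba.lcut(text)

raw_docs = [
    {"id": "001", "body_en": "markets rallied today.", "body_zh": "今日股市上涨。"},
    {"id": "002", "body_en": "", "body_zh": "缺少英文版本。"},   # dropped by rule 1
]
clean_docs = [d for d in raw_docs if keep_document(d)]
print(len(clean_docs), segment_chinese(clean_docs[0]["body_zh"]))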
3.4 categories of finance domain
we made different samples for topic prediction using classification methods. the following 10 topic-metadata tags are contained in the documents and are used by the financial times to annotate the area of each news item (a news item can carry several tags): book, business, culture, economy, lifestyle, management, markets, people, politics, and society. there were 57,584 documents containing topic metadata. the following 10 tags from the financial times websites, describing the economic sector, were used for manual annotation: technology, consumer services, health care, consumer goods, basic materials, industrials, financials, oil & gas, utilities, and telecommunications. there are 2,993 documents that were tagged manually.

the top influential media in finance are: 1. the wall street journal; 2. bloomberg; 3. the new york times; 4. the financial times; 5. cnbc; 6. reuters; 7. the economist. five items of the financial times website can be clearly identified as related to the 'economy' (equities, currencies, commodities, bonds, funds & etfs), and the item world market can be associated with 'markets', company with 'business', and director dealings with 'management'. the economy, management, markets, and business are among the tags contained in each document as metadata. however, we also find other tags, such as lifestyle, politics, and people; in fact, many influential people have an impact on the evolution of markets. other items, such as sectors and industrials, can be further split into:

id01 – technology (software & computer services, technology hardware & equipment)
id02 – consumer services (general retailers, travel & leisure, food & drug retailers, media)
id03 – health care (health care equipment & services, pharmaceuticals & biotechnology)
id04 – consumer goods (automobiles & parts, leisure goods, personal goods, food producers, household goods, tobacco, beverages)
id05 – basic materials (industrial metals, mining, chemicals)
id06 – industrials (support services, electronic & electrical equipment, industrial transportation, aerospace & defense, construction)
id07 – financials (real estate investment & services, financial services, general financial, life insurance, banks, nonlife insurance)
id08 – oil & gas (alternative energy, oil & gas producers, oil equipment, services & distribution)
id09 – utilities (gas, water & multi-utilities, electricity)
id10 – telecommunications (fixed line telecommunications, mobile telecommunications)

sectors, in finance, act both as a guide for making promising investments in the right places and as a representation of areas of activity. topics id01 to id10 are used for manual annotation, so their representation is smaller than that of the topics inserted into each document as metadata. from the manual annotation, the most frequent topics are: financials, consumer goods, consumer services, and technology. from the metadata, the most frequent topics are: business, the economy, markets, management, politics, lifestyle, and society.

3.5 manual annotation
to carry out the manual annotation, we made a set of document batches, each one containing 100 distinct documents. a population of 31 students (year-3 level in computer science, with b1 to c1 level of english) received one batch each. multiple annotation was possible, and the format of the annotation was quite elementary: a document id followed by a class id, one annotation per line, e.g.:

1014550; id07
1014871; id11

an extra annotator assessed the annotations by randomly choosing 10 files from each batch. if the annotations of the extra annotator showed more than four differences from those produced by the batch annotator (i.e., >40% disagreement), the batch had to be revised by the annotator; nineteen batches were revised. finally, after the second round, we compiled all the batches together.
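the quality check on the annotated batches can be summarised in a short script. the sketch below is not the project's script: the function names and the toy annotations are ours, and only the annotation format ('document id; class id', one annotation per line), the sample of 10 files per batch and the four-difference (>40%) threshold come from the description above.

def parse_annotations(lines):
    # parse lines of the form 'document id; class id' into a dict of label sets
    labels = {}
    for line in lines:
        if not line.strip():
            continue
        doc_id, class_id = [part.strip() for part in line.split(";")]
        labels.setdefault(doc_id, set()).add(class_id)   # multiple annotation is possible
    return labels

def batch_needs_revision(annotator_lines, extra_annotator_lines, threshold=4):
    # compare the extra annotator's 10-file sample with the batch annotator's labels
    original = parse_annotations(annotator_lines)
    sample = parse_annotations(extra_annotator_lines)
    disagreements = sum(1 for doc_id, classes in sample.items()
                        if original.get(doc_id, set()) != classes)
    return disagreements > threshold

student = ["1014550; id07", "1014871; id07"]
checker = ["1014550; id07", "1014871; id02"]
print(batch_needs_revision(student, checker))   # False: only one disagreement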
3.6 data usage
as mentioned in the previous section on the literature, there are several ways to use a parallel dataset. the same is true for our chinese–english parallel dataset for the domain of finance. here are five main possible kinds of usage:

• the influence of the language on knowledge discovery. we present the results of different clusterings for topic discovery and of classification for topic detection. here, the algorithm is not supposed to take into account specificities of the language (i.e., it is to be language-independent). this dataset can be useful to study how a language-dependent algorithm could be more efficient.
• keyword in context. concordances of a word in the domain of finance can be extracted. in such a usage case, different contexts make it possible to study the meaning of a phrase and its variation.
• automatic translation. a classical usage case is to exploit such a dataset to make automatic translations of documents in the domain of finance, using this dataset as a training set for a statistical machine translation (smt) system.
• neologism translation. translation is always a challenge, especially for new words. a usage case of the dataset is the study of neologisms, for example to find the chinese equivalent of a new named entity in english (company name, person name).
• time series of a domain-specific word. the last case is the study of the distribution of words or phrases over time, to see their popularity.

4 discovery of some frequent interesting terms
in this section, we search for some interesting words or phrases in the dataset and count their frequency of occurrence, which is conducive to a further understanding of the dataset. this section is divided into three parts, exploring the frequency of english proverbs and chinese idioms, of important finance-related terms, and of globally famous companies in the dataset. we made some experiments about lexical variation over time and proverb analysis (see appendix a for more details).

4.1 discovery of frequent terms of the finance domain
the first step is deciding how to choose some commonly used financial terms. our decision was to use fundera. fundera is an online marketplace that connects small business owners with providers of capital for their businesses. it offers product marketplaces that cover everything from loans to legal services, free financial content, and one-on-one access to experienced lending experts. based on '60 business and finance terms you should definitely know'11 by meredith wood, founding editor and vice president of the fundera ledger, we selected the top 20 financial terms that appear most frequently in the dataset. the results are shown in table 2.

11 https://www.fundera.com/blog/business-finance-terms-and-definitions (last accessed: 01.03.2022).

table 2: the 20 most frequently used financial terms.
capital          9,383     net worth               195
asset            3,086     liability               141
liquidity        1,704     business plan           126
interest rate    1,036     fixed asset             101
bankruptcy         616     debt financing           97
balance sheet      522     working capital          83
principle          382     financial statements     72
collateral         371     equity financing         64
depreciation       368     line of credit           46
cash flow          209     appraisal                42
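the counting behind table 2 can be approximated as follows. this sketch is not the authors' procedure: the term list is a small subset of table 2, the corpus is a toy example, and counting whole-phrase occurrences in the english text bodies is an assumption about how the frequencies were obtained.

import re
from collections import Counter

terms = ["capital", "asset", "liquidity", "interest rate", "cash flow"]

def count_terms(documents, terms):
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        for term in terms:
            # whole-word / whole-phrase matches only ('asset' does not match inside 'assets')
            counts[term] += len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
    return counts

corpus = [
    "the interest rate decision weighed on capital markets.",
    "cash flow improved while the asset base shrank.",
]
print(count_terms(corpus, terms).most_common())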
next, we imitated the method used above to detect the most frequent idioms and proverbs, extracting the statements in the dataset and calculating the frequency of occurrence of each financial term (see appendix a).

4.2 discovery of frequent company names
we used the same method to collect statistics on the frequency of occurrence of company names in the dataset. among them, we find the chinese company huawei, which shows that, with the increase of china's international influence, chinese technology enterprises are increasingly favored by global business people.

5 text-mining approaches and the domain of finance
the first point of interest for people working in finance or natural language processing is that we provide a full analysis of the dataset taking into account state-of-the-art text-mining technology. these experiments were of three kinds (see appendix b and appendix c for technical details): (1) lexical extraction (words, noun phrases, names of people, names of companies); (2) classification (supervised learning); (3) clustering (unsupervised learning).

as we showed in the section on the discovery of lexical items, this dataset is useful for identifying the important concepts and actors of the domain. these concepts are not new for an expert working in finance every day, but the dataset can be used as an educational tool for students at school or college to understand what finance is through real-life events and practical information. a list of frequent noun phrases (such as 'asset' and 'interest rate'), a list of famous and influential people (such as elon musk and xi jinping), and a list of names of famous organizations (such as the imf and the fed) were extracted, and one hundred frequent items for these three categories can easily serve as a basic framework of concepts for educational purposes. we also studied and compared the properties of the english and chinese languages through the use of proverbs, which is one of the high-level linguistic patterns of any language. we discovered that in the domain of finance, which is highly related to technology and also to society, people used proverbs freely in chinese but not at all in english. we do not have an explanation for this except that it may reflect an important cultural difference in how people use language to disseminate information (even in a technological area). we have also shown that, using the classification technique, potential readers could process new documents (unseen in the dataset) which may be interesting for them, according to the ontology of 20 topics described in section 3.4.

clustering, by definition, relies mainly on organizing knowledge about a set of unstructured data. we have carried out several experiments, and clustering has revealed some classical topics of finance, such as business or markets, but also surprising topics in the finance domain, such as lifestyle, art and life, politics, and british education, which seem to play a big role. this shows that finance is not just an activity in society, like sports for example, but also seems to be an ideological model. secondly, the clusters show that even if finance is globalized, a polarity around the specific relationship between china and the us appears to emerge as more important than all others.
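as an illustration of the unsupervised side of this framework, the sketch below runs k-means over tf-idf vectors of a toy english corpus with scikit-learn. it is not the experimental setup used in appendix c: the number of clusters, the parameters and the corpus are arbitrary assumptions, and the actual experiments also used community detection, biclustering and topic modeling.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpus = [
    "central bank raises interest rates to curb inflation",
    "equity markets rally as earnings beat expectations",
    "luxury travel and lifestyle spending rebounds in china",
    "regulators scrutinise bank balance sheets after the crisis",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)

# the most heavily weighted terms per cluster centre give a rough topic label
terms = vectorizer.get_feature_names_out()
for c in range(2):
    centre = kmeans.cluster_centers_[c]
    top = centre.argsort()[::-1][:3]
    print(c, [terms[i] for i in top])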
6 conclusion
chinese and english form an interesting combination of languages for testing algorithms and mining, and finance is a hot area of activity in our contemporary world. we made a text dataset using the financial times website, from which we grabbed 60,473 news items published between 2007 and 2021. this dataset is a bilingual chinese–english parallel dataset of news in the domain of finance, and it is open access. we used a text mining analytical framework to explore it. as a future perspective, our dataset can be used to infer the translation of new terms from english to chinese (e.g., company names), to extract the distribution of occurrences of new concepts for time series analysis (e.g., neologisms), or to apply a more innovative clustering approach to discover new concepts (e.g., ontology learning).

additional files
the additional files for this article can be found as follows:
• appendix a. discovery of some frequent interesting terms. doi: https://doi.org/10.5334/johd.62.s1
• appendix b. classification. doi: https://doi.org/10.5334/johd.62.s2
• appendix c. clustering. doi: https://doi.org/10.5334/johd.62.s3

competing interests
the authors have no competing interests to declare.

author contributions
nicolas turenne: conceptualisation and writing original draft
ziwei chen: methodology, classification section
jianlong li: methodology, classification section
guitao fan: methodology, lexical analysis and pattern section
jiaqi zhou: methodology, lexical analysis and pattern section
siyuan wang: methodology, clustering section
yiwen li: methodology, clustering section

author affiliations
nicolas turenne (orcid.org/0000-0003-1229-5590), ziwei chen, guitao fan, jianlong li, yiwen li, siyuan wang, jiaqi zhou: bnu-hkbu united international college, uic, division of science and technology, zhuhai guangdong, china

references
akbik, a., blythe, d., & vollgraf, r. (2018). contextual string embeddings for sequence labeling. in proceedings of the 27th international conference on computational linguistics (coling) (pp. 1638–1649). santa fe, new mexico: association for computational linguistics. retrieved from https://aclanthology.org/c18-1139
altammami, s., atwell, e., & alsalka, a. (2020). the arabic-english parallel corpus of authentic hadith.
international journal on islamic applications in computer science and technology, 8(2). retrieved from http://www.sign-ific-ance.co.uk/index.php/ijasat/article/view/2199
arcan, m., thomas, s. m., de brandt, d., & buitelaar, p. (2013). translating the finrep taxonomy using a domain-specific corpus. in proceedings of machine translation summit xiv. nice, france. retrieved from https://aclanthology.org/2013.mtsummit-posters.1.pdf
beikian, a., & borzoufard, m. (2016). mizan: a large persian-english parallel corpus. retrieved from https://cdn.ketabchi.com/products/175402/pdfs/ketab-general-book-sample-wybml.pdf
bick, e., & barreiro, a. (2015). automatic anonymisation of a new portuguese-english parallel corpus in the legal-financial domain. oslo studies in language, 7(1), 101–124. retrieved from https://journals.uio.no/index.php/osla/article/view/1460/1357. doi: https://doi.org/10.5617/osla.1460
boldrini, e., & ferrández, s. (2009, march 1–7). a parallel corpus labeled using open and restricted domain ontologies. in proceedings of the 10th international conference cicling. mexico city, mexico. doi: https://doi.org/10.1007/978-3-642-00382-0_28
bureros, l. l., tabaranza, z. l. b., & roxas, r. r. (2015). building an english-cebuano tourism parallel corpus and a named-entity list from the web. in proceedings of the workshop on computation: theory and practice (pp. 158–169). doi: https://doi.org/10.1142/9789813202818_0012
chang, b. (2004). chinese-english parallel corpus construction and its application. in proceedings of the paclic (pp. 201–204). tokyo: waseda university, dec. 8–10. retrieved from https://aclanthology.org/y04-1030.pdf
chiu, j. p. c., & nichols, e. (2016). named entity recognition with bidirectional lstm-cnns. transactions of the association for computational linguistics, 4, 357–370. doi: https://doi.org/10.1162/tacl_a_00104
christodoulopoulos, c., & steedman, m. (2014). the bible in 100 languages. retrieved from https://github.com/christos-c/bible-corpus
dipper, s., & schultz-balluff, s. (2013). the anselm corpus: methods and perspectives of a parallel aligned corpus. in proceedings of the workshop on computational historical linguistics at nodalida. nealt (pp. 27–42). retrieved from https://ep.liu.se/ecp/087/ecp13087.pdf#page=35
dua, d., & graff, c. (2017). uci machine learning repository. retrieved from http://archive.ics.uci.edu/ml
espla-gomis, m., klubička, f., ljubešić, n., ortiz-rojas, s., papavassiliou, v., & prokopidis, p. (2014). comparing two acquisition systems for automatically building an english-croatian parallel corpus from multilingual websites. in proceedings of the ninth international conference on language resources and evaluation (pp. 1252–1258). european language resources association (elra).
retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/529_paper.pdf
fraisse, a., tran, q.-t., jenn, r., paroubek, p., & fishkin, s. (2018, may). translitex: a parallel corpus of translated literary texts. in proceedings of the eleventh international conference on language resources and evaluation (pp. 201–204). miyazaki, japan: european language resources association (elra). retrieved from https://hal.archives-ouvertes.fr/hal-01827884/file/11 w34.pdf
frankenberg-garcia, a. (2009). compiling and using a parallel corpus for research in translation. babel: international journal of translation, 21(1), 57–71. retrieved from https://openresearch.surrey.ac.uk/esploro/outputs/journalarticle/compiling-and-using-a-parallel-corpus-for-research-in-translation/99516816302346#file-0
ghaddar, a., & langlais, p. (2020). sedar: a large scale french-english financial domain parallel corpus. in proceedings of the language resources and evaluation conference (pp. 3595–3602). marseille, france: european language resources association. retrieved from https://aclanthology.org/2020.lrec-1.442
giouli, v., glaros, n., simov, k., & osenova, p. (2009). a web-enabled and speech-enhanced parallel corpus of greek-bulgarian cultural texts. in proceedings of the eacl workshop on language technology and resources for cultural heritage, social sciences, humanities, and education (pp. 35–42). athens, greece: association for computational linguistics. retrieved from https://aclanthology.org/w09-0305.pdf. doi: https://doi.org/10.3115/1642049.1642054
guo, x. (2016, november 17–18). drawing a route map of making a small domain-specific parallel corpus for translators and beyond. in proceedings of translating and the computer (pp. 88–99). london, uk. retrieved from https://aclanthology.org/2016.tc-1.9.pdf
guzman, j. r. (2013). el corpus covalt i l'eina d'alineament de frases alfra-covalt. in l. bracho lapiedra (ed.), el corpus covalt: un observatori de fraseologia traduïda (pp. 49–60). aachen: shaker.
hamoud, b., & atwell, e. (2017). evaluation corpus for restricted-domain question-answering systems for the holy quran. international journal of science and research, 6(8), 1133–1138. retrieved from https://eprints.whiterose.ac.uk/125920/
kashefi, o. (2020). mizan: a large persian-english parallel corpus. retrieved from https://arxiv.org/pdf/1801.02107v3.pdf
kenny, d. (1999). the german-english parallel corpus of literary texts (gepcolt): a resource for translation scholars. teanga, 1, 25–42.
koehn, p. (2005). europarl. retrieved from http://www.statmt.org/europarl/
kolchinsky, a., lourenco, a., wu, h.-y., & rocha, l. m. (2015). extraction of pharmacokinetic evidence of drug-drug interactions from the literature. plos one. doi: https://doi.org/10.1371/journal.pone.0122199
labaka, g., alegria, i., & sarasola, k. (2016). domain adaptation in mt using wikipedia as a parallel corpus: resources and evaluation. in proceedings of the tenth international conference on language resources and evaluation (pp. 2209–2213). portoroz, slovenia: european language resources association (elra).
lan, h., & huang, j. (2017, february). chinese-english cross-language text clustering algorithm based on latent semantic analysis. in proceedings of information science and cloud computing (pp. 1–7). retrieved from https://pos.sissa.it/300/007/pdf
lee, c.-h., & yang, h.-c. (2000). towards multilingual information discovery through a som based text mining approach. in pricai workshop on text and web mining (pp. 80–87). melbourne, australia. retrieved from https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.8800&rep=rep1&type=pdf
lee, d.-y. (2011). a corpus-based translation of korean financial reports into english. journal of universal language, 12(1), 75–94. doi: https://doi.org/10.22425/jul.2011.12.1.75
lefever, e., macken, l., & hoste, v. (2009, 30 march – 3 april). language-independent bilingual terminology extraction from a multilingual parallel corpus. in proceedings of the 12th conference of the european chapter of the acl (pp. 1746–1751). athens, greece. retrieved from https://aclanthology.org/e09-1057.pdf. doi: https://doi.org/10.3115/1609067.1609122
li, l., wang, p., huang, d., & zhao, l. (2011). mining english-chinese named entity pairs from comparable corpora. acm transactions on asian language information processing, 10, 1–19. doi: https://doi.org/10.1145/2025384.2025387
lu, b., tsou, b. k., jiang, t., kwong, o. y., & zhu, j. (2010). mining large-scale parallel corpora from multilingual patents: an english-chinese example and its application to smt. in proceedings of the 1st cips-sighan joint conference on chinese language processing (pp. 79–86). beijing. retrieved from https://aclanthology.org/w10-4110.pdf
mcenery, t., & xiao, z. (2007). parallel and comparable corpora – the state of play. in y. kawaguchi, t. takagaki & y.
tsuruga (eds.), proceedings of the international conference on asian language processing (pp. 131–146). amsterdam: benjamin. doi: https://doi.org/10.1075/ubli.6.11mce
miletic, a., stosic, d., & marjanović, d. (2017). parcolab: a parallel corpus for serbian, french and english. in k. ekštein & v. matoušek (eds.), text, speech, and dialogue. tsd 2017. lecture notes in computer science, 10415, 201–204. berlin: springer-verlag. doi: https://doi.org/10.1007/978-3-319-64206-2
neves, m., yepes, a. j., & névéol, a. (2016). the scielo corpus: a parallel corpus of scientific publications for biomedicine. in proceedings of the 15th international conference on language resources and evaluation. european language resources association. retrieved from https://aclanthology.org/l16-1470
ponay, c. s., & cheng, c. k. (2015). building an english-filipino tourism corpus and lexicon for an asean language translation system. in proceedings of the international conference asialex (pp. 201–204). hong kong: polytechnic university.
rosemeyer, m., & enrique-arias, a. (2016). a match made in heaven: using parallel corpora and multinomial logistic regression to analyze the expression of possession in old spanish. language variation and change, 28(03), 307–334. doi: https://doi.org/10.1017/s0954394516000120
rovenchak, a. (2021). bamana tales recorded by umaru nanankr jara: a comparative study based on a bamana-french parallel corpus. mandenkan, 64, 81–104. doi: https://doi.org/10.4000/mandenkan.2471
schwenk, h., chaudhary, v., sun, s., gong, h., & guzmán, f. (2021, april). wikimatrix: mining 135m parallel sentences in 1620 language pairs from wikipedia. in proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume (pp. 1351–1361). online: association for computational linguistics. retrieved from https://www.aclweb.org/anthology/2021.eacl-main.115. doi: https://doi.org/10.18653/v1/2021.eacl-main.115
smirnova, o., & rackevičienė, s. (2020). english-french-lithuanian parallel corpus of eu financial documents. retrieved from http://hdl.handle.net/20.500.11821/35
srivastava, j., & sanyal, s. (2015). pos-based word alignment for small corpus. in proceedings of the international conference on asian language processing (pp. 37–40). doi: https://doi.org/10.1109/ialp.2015.7451526
steinberger, r., pouliquen, b., widiger, a., ignat, c., erjavec, t., tufis, d., & varga, d. (2006, 24–26 may). the jrc-acquis: a multilingual aligned parallel corpus with 20+ languages. in proceedings of the 5th international conference on language resources and evaluation (pp. 2142–2147). genoa, italy. retrieved from https://arxiv.org/abs/cs/0609058
sturgeon, d. (ed.). (2021). ancient chinese books datasets (chinese text project). retrieved from https://ctext.org/daoism
tian, l., wong, d. f., chao, l. s., quaresma, p., oliveira, f., & yi, l. (2014). um-corpus: a large english-chinese parallel corpus for statistical machine translation. in lrec. reykjavik, iceland: european language resources association (elra).
retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/774paper.pdf
tiedemann, j. (2012, may). parallel data, tools and interfaces in opus. in n. calzolari et al. (eds.), proceedings of the eighth international conference on language resources and evaluation (pp. 2214–2218). istanbul, turkey: european language resources association (elra).
retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.673.2874&rep=rep1&type=pdf
turenne, n. (2018, january). the rumour spectrum. plos one, 13(1), 1–27. doi: https://doi.org/10.1371/journal.pone.0189080
turenne, n., xu, b., li, x., xu, x., liu, h., & zhu, x. (2020). exploration of a balanced reference corpus with a wide variety of text mining tools. in proceedings of acai 2020: 2020 3rd international conference on algorithms, computing and artificial intelligence (pp. 1–9). new mexico, usa: acm digital library. doi: https://doi.org/10.1145/3446132.3446192
volk, m., amrhein, c., aepli, n., müller, m., & ströbel, p. (2016). building a parallel corpus on the world's oldest banking magazine. in proceedings of the 13th conference on natural language processing (konvens) (pp. 288–296). doi: https://doi.org/10.5167/uzh-125746
woldeyohannis, m. m., besacier, l., & meshesha, m. (2018). a corpus for amharic-english speech translation: the case of tourism domain. in f. mekuria, e. nigussie, w. dargie, m. edward & t. tegegne (eds.), proceedings of information and communication technology for development for africa. ict4da 2017. lecture notes of the institute for computer sciences, social informatics and telecommunications engineering (vol. 244). doi: https://doi.org/10.1007/978-3-319-95153-9
wu, e., & xia, x. (1994). learning an english-chinese lexicon from a parallel corpus. in proceedings of the first conference of the association for machine translation in the americas (pp. 206–213). retrieved from https://aclanthology.org/1994.amta-1.26.pdf
xiong, w. (2013). the development of the malaysian hansard corpus: a corpus of parliamentary debates 1959–2020. new technology of library and information service, (6), 36–41. doi: https://doi.org/10.11925/infotech.1003-3513.2013.06.06
yang, c. c., & li, k. w. (2003). automatic construction of english/chinese parallel corpora. j. am. soc. inf. sci. technol., 54, 730–742. retrieved from https://aclanthology.org/a00-1004.pdf. doi: https://doi.org/10.1002/asi.10261
zhai, y., liu, l., zhong, x., illouz, g., & vilnat, a. (2020, may). building an english-chinese parallel corpus annotated with sub-sentential translation techniques. in proceedings of the 12th language resources and evaluation conference (pp. 4024–4033). marseille, france: european language resources association. retrieved from https://www.aclweb.org/anthology/2020.lrec-1.496
zhao, b., & vogel, s. (2002). adaptive parallel sentences mining from web bilingual news collection. in proceedings of the ieee international conference on data mining (pp. 745–748). beijing.
teaching the text encoding initiative: context, community and collaboration
research paper
yasmin faghihi, matthew holford, huw jones
corresponding author: huw jones, university library, cambridge, uk (hej23@cam.ac.uk)
keywords: text encoding; tei; pedagogy; xml; manuscripts
to cite this article: faghihi, y., holford, m., & jones, h. (2022). teaching the text encoding initiative: context, community and collaboration. journal of open humanities data, 8: 15, pp. 1–14. doi: https://doi.org/10.5334/johd.72
*author affiliations can be found in the back matter of this article
abstract
in common with many technical aspects of digital humanities, the tei has a reputation for being difficult to teach and difficult to learn, with potential practitioners put off by the large and (at first sight) intimidating set of guidelines, the seemingly complex hierarchical structure and the profusion of angle brackets. one-to-one or small group teaching in the context of a specific project is often the preferred method, where the short but steep learning curve required to engage with the tei can be addressed in a way which is relevant to the aims and experience of the learner. this, however, is not a particularly efficient way of teaching. in this article, the authors discuss their experience of teaching (and learning) the tei, and how lessons learned in contexts relating to specific projects might feed into the teaching of tei in a more general setting – the digital humanities at oxford summer school being the prime example.
introduction
the text encoding initiative (tei) is, according to its own homepage, 'a consortium which collectively develops and maintains a standard for the representation of texts in digital form'. it has been a major driver for and influence on text-based digital humanities for over thirty years, and is 'widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation' (ibid.). in common with many technical aspects of digital humanities, the tei has a reputation for being difficult to teach and difficult to learn, with potential practitioners put off by the large and (at first sight) intimidating set of guidelines, the seemingly complex hierarchical structure and the profusion of angle brackets.
one-to-one or small group teaching in the context of a specific project is often the preferred method, where the short but steep learning curve required to engage with the tei can be addressed in a way which is relevant to the aims and experience of the learner. this, however, is not a particularly efficient way of teaching. the most visible part of the tei are the guidelines but, as james cummings points out, the tei is at heart ‘a community of users and volunteers that produces a freely available manual of regularly maintained and updated recommendations for encoding digital text’ (cummings, 2019a). even at a comparatively early stage in its history, elli mylonas and allen renear argued that the tei’s dual functions as an interchange format and data description language ‘pale before’ its role as a research community – whose subject matter is no less than what is stated in the following sentence: ‘textual communication, with the principal goal of improving our general theoretical understanding of textual representation, and the auxiliary practical goal of using that improved understanding to develop methods, tools, and techniques that will be valuable to other fields and will support practical applications in publishing, archives, and libraries’ (mylonas & renear, 1999). to this, we might add that tei is ‘an analytical framework for deep reading’. certainly, our experience of teaching the tei is that it is something to engage with rather than ‘learn’ – and that the ‘something’ is not just the practice of encoding with tei, or even the current community of users of and contributors to the tei, but also those that have gone before them. cummings again: ‘it formalizes a history of the community’s concerns for textual distinctions and exemplifies understandings of how to encode them and how these have developed over its existence; it acts as a slowly developing consensus-based method of structuring those distinctions’ (cummings, 2019a). if you were to write a history of text-based digital humanities in the late 20th and early 21st centuries, the tei guidelines would be a good place to start. twelve years ago, an article was published which highlighted some of the problems encountered in teaching the tei (terras et al., 2009) – the extensive and general nature of the guidelines, the broad range of uses and users, and the sometimes difficult connection between the theory of tei (and text encoding) and how it operates in a project or institution. the authors proposed a series of online learning resources that would teach the tei by example – in the context of specific fields or activities.1 susanna allés-torrent and gimena del rio riande have gone on to point out that in addition to the specific requirements of projects and materials, tei teaching and learning is also affected by different cultural and linguistic contexts (allés-torrent & riande, 2019). it seems that when teaching the tei, context is everything, which poses particular challenges for the ‘general’ tei course, online or in person. the writers of this present article have been teaching the tei in a variety of contexts for the past 12 years – both in relation to specific projects or sets of material (the fihrist catalogue of islamicate manuscripts, medieval manuscripts in oxford libraries, cambridge digital library) and in a more general setting – the digital humanities at oxford summer school being the prime example. in this article, we will use four case studies to talk about our experience of teaching the tei. 
we will then outline some general conclusions on what it means to teach the tei and discuss how some aspects of teaching in a specific learning environment might feed into the teaching of the tei in a more general context.
1 https://teibyexample.org/.
case studies
1. manuscript description and fihrist
fihrist is a union catalogue incorporating descriptive manuscript data from nineteen institutions in the uk and ireland.2 it uses tei as its data format, with descriptions created by contributing institutions either as part of projects or on an ongoing basis. having begun in 2009 as the oxford and cambridge online islamic manuscript catalogue, an online catalogue for arabic and persian manuscripts held at the bodleian and cambridge university library, it secured further funding in 2011 to broaden its scope and become a union catalogue for manuscripts from the islamicate world. the name 'fihrist' simply means 'catalogue' in arabic, persian and turkish and was inspired by the famous 10th-century work by ibn al-nadim, the kitab al-fihrist. countless specialised studies have used the fihrist as a source of data. because it includes the titles of a large number of works that are no longer extant, as well as biographical information on little-known early authors, it throws light on otherwise obscure facets of mediaeval islamic intellectual history in many fields (stewart, 2007). moreover, al-nadim's fihrist presents a complex structure of organising information, which requires a non-linear (i.e. not exclusively chronological) reading of the work. by embedding gaps, he introduces the idea of an eternal work in progress. our aspiration when creating fihrist was to include descriptions of all manuscripts held in the british isles belonging to the widely defined near and middle east, africa and the various parts of asia which relate to islamic and eastern christian cultures by language and script. by implementing a detailed data model and encoding standard, we envisaged the potential of working with the data on various levels beyond discovery and access. the technical infrastructure is maintained by the bodleian libraries in oxford, and the editor, yasmin faghihi, at cambridge university library, directs work on the content. oversight is provided by a board whose remit includes advice and training on the creation of tei descriptions. a fundamental aspect of teaching the tei for fihrist is the fact that it is a work in progress, both in terms of the scope of the content and the nature of the descriptions themselves, which evolve in response to new research approaches and new materials. the decision to use tei as a data format was based on a number of factors. initially, our prime goal was to work towards a sustainable solution for creating descriptive data for manuscripts that not only facilitated access to collections, but also provided the groundwork for a digital infrastructure which would generate further research. manuscript descriptions from this (loosely defined) corpus had until then been locked in library card catalogues and/or out-of-print monographs, which had been surpassed by evolving metadata standards for cataloguing other materials such as printed books and journals. the complexity of manuscripts as research objects demanded a more flexible and extensive metadata format.
the focus of most online public access catalogues has been on creating machine-readable cataloguing records for standardised bibliographic descriptions including basic physical features in accordance with established cataloguing rules. traditional manuscript catalogues on the other hand often consist of a series of paragraphs of prose, which include historical information (dates, relationships), analytical information (contents, identities of people, script styles) and structural information (such as size and material), often in no prescriptive order. while largely consistent within single catalogues, the organisation and rendering of this information varies significantly across catalogues. the inclusion of manuscript collections into digital library catalogues had fallen into the gap that divided the text-based format of humanities research from the highly structural approach of library and information science. the evolution of digital humanities and the tei, especially the introduction of the manuscript description module,3 presented a new opportunity. as a community-based and text-focused standard, tei was the most cutting-edge metadata format for transforming primary research on manuscripts and manuscript collections into computable data without losing the integrity of its textual features. the manuscript description module (called msdesc) offers a wide range of relevant descriptor codes, which facilitate a structural approach to describing the text-bearing object in detail. these include top-level elements containing content-related information, physical description, history, additional information and more, with a range of sub-elements available within each. our decision to embed the msdesc element into the header rather than the body of the tei document also opened up the inclusion of transcriptions at a later stage. the starting point for the fihrist data model was a schema for the description of islamic manuscripts which had been customised from the enrich4 project for describing western medieval manuscripts (pierazzo, 2010). amongst the challenges of teaching text encoding for fihrist is a constant engagement with the tei on three conceptual levels: data modelling to maximise the encoding of research data, the integration of existing library standards for data compatibility and export, and developing best practice for training, reuse and documentation. to meet these challenges, all manuscript descriptions in fihrist are created in the raw tei/xml using the oxygen xml editor, which allows for direct engagement with the process of creating and modelling data. data modelling and the development of best practice were extremely important in the initial experimentation phase, and these processes feed directly into the teaching and learning of tei for contributors to fihrist. the teaching of the tei for fihrist fell into four broad stages. in the initial phase, core project members learned the tei principles and practices at the same time as developing the data model, an iterative process which is fundamental to many tei-using projects. at the start of the project, no examples of descriptions for islamic manuscripts in tei existed, so the process of learning went hand-in-hand with the process of modelling.
2 https://www.fihrist.org.uk/. 3 https://tei-c.org/release/doc/tei-p5-doc/en/html/ms.html.
following a two-day general introduction to the tei, learning was mainly autodidactic with project members exploring the guidelines alongside the manuscript data. the initial learning stage of working with xml, the oxygen editor and the tei guidelines, without concrete examples for best practice, was a steep learning curve. a year later, more expertise was gained when faghihi attended the tei strand of the digital humanities at oxford summer school, which we now convene. as fihrist expanded to become a union catalogue, the demand for tei training for contributing institutions grew. teaching mainly occurred in one-to-one or small group settings, dealing directly with the materials at hand, and learning through project work rather than through exercises. teaching and learning continued to go hand-in-hand with data modelling, with new materials and approaches requiring adjustments to the schema and to documentation. while time-consuming, this approach proved very effective as a learning method, allowing us to teach in a real-world context, and cementing the communal and collaborative network on which fihrist relies. a third phase followed a major data-consolidation project at oxford, when the wider infrastructure for hosting various other manuscript catalogues (including the medieval manuscript catalogue) was established and improved by a generic schema for all tei catalogues, along with a more complex and consistent approach to authority files. data and code were now hosted on github,5 and the editing supported by use of the github client. this new phase of development was preceded by a major data clean-up, correcting the many idiosyncrasies and inconsistencies generated in the initial phase and setting new parameters for improved practices. as there were now a significant number of contributors to fihrist, a new approach to teaching and learning was required. training was delivered in a series of structured workshops where the creation of tei descriptions, with a particular focus on use of the authority files (lists of standard forms for certain entities in the data such as names and works), was embedded in a complete workflow involving collaborative working with github. the use of the github client along with the oxygen xml editor for managing and committing files, and the use of github functionality for raising issues and resolving conflicts were included, ensuring better transparency and improved communication. efforts were made to consolidate training documentation and to encourage the sharing of knowledge across the network. with most contributors now either competent tei users or with access to support in their own institutions, current tei support for fihrist mainly occurs in a more ad hoc and collaborative form – through github itself and through an email list. issues arising from new materials or new approaches to description are raised and discussed, sometimes leading to alterations to the data model and documentation and/or further enquiry on the main tei discussion forums.
4 https://digital.humanities.ox.ac.uk/project/enrich; http://www.manuscriptorium.com/en/tei-p5-enrich-schema-en. 5 https://github.com/fihristorg/fihrist-mss.
in this sense, the constant interplay between the process of learning and data modelling in response to real-world challenges arising from both materials and research continues to enrich our approach to manuscript description. the results of these reiterative processes generate further content for training new contributors, not least by creating explicit examples. interpersonal teaching, however, remains key to ensuring new users acquire the conceptual skills of converting traditional descriptions (or approaches to descriptions) into computer-readable (tei) data, in accordance with an approved best practice. moreover, the growing community of experienced users generates increased capacity for training, reflecting the collaborative and reiterative potential of tei. our experience in teaching the tei for fihrist has produced a number of general conclusions about learning tei for manuscript description. while the tei has a reputation for being difficult to learn, most elements and attributes (computer tokens) include natural language components – for example decodesc for the description of decoration, or history for the history of the object – and are intuitively understood or easily memorable. certainly, in an english-speaking context, this proved to be a major advantage, especially when teaching students unfamiliar with coding or other mark-up languages. it was nevertheless essential to begin with an introduction to the rules governing xml, such as well-formedness, nesting, the role of elements, attributes and values, and the functionality of the xml editor. however, one of the complications of teaching tei for manuscript description is that technical skills and a knowledge of xml are not all that is required. a combination of technical, language and subject skills is required to create descriptions, especially if creating descriptions from the object in hand, when expertise in manuscript studies including codicology together with knowledge of the language is essential. a background knowledge of skills and standards more commonly associated with libraries and archives is also important. one of the challenges of creating descriptive data for items held in research library catalogues, regardless of the prime objective to build an independent union catalogue, is to ensure data compatibility. in other words, we must bridge the gap between applied library standards for key components such as names, titles, transliteration and subject keywords and the idiosyncrasies of extant legacy data. one of the great benefits of the tei is its inclusiveness, not only in scope through the extension of existing modules and adoption of new ones, but also in the capacity to incorporate existing standards. thus, we were able to adhere to and include library standards for names, subjects, languages, and the other key components, and include identifiers and links to external authority files. as a result, an introduction to relevant standards is vital before proceeding to show how some of these are encoded within each manuscript record. an example familiar to most users was the application of the library of congress subject headings commonly used in uk and us libraries and beyond. adding provenance information to each record is a vital component of data credibility.
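before turning to how provenance itself is recorded, a minimal sketch may help to make the above concrete. it is an invented fragment rather than an extract from a fihrist record (the shelfmark, wording and authority identifier are placeholders), but it shows how element names such as decodesc and history read almost as natural language, and how a key attribute can point from local data to an entry in an external authority file:

   <msDesc xml:lang="en">
      <msIdentifier>
         <settlement>Cambridge</settlement>
         <repository>Cambridge University Library</repository>
         <idno>MS Example 1</idno> <!-- placeholder shelfmark -->
      </msIdentifier>
      <msContents>
         <msItem>
            <!-- @key ties the name to an entry in a shared authority file -->
            <author key="person_example">...</author>
            <title xml:lang="ar-Latn-x-lc">...</title>
         </msItem>
      </msContents>
      <physDesc>
         <decoDesc>
            <decoNote>Ruled frames in red on the opening folios.</decoNote>
         </decoDesc>
      </physDesc>
      <history>
         <provenance>Acquired in 1911; earlier owners unrecorded.</provenance>
      </history>
   </msDesc>

subject keywords drawn from standards such as the library of congress subject headings are handled in much the same way, by recording the term together with an identifier that links it to the external scheme.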
as contributors are instructed to encode institutional information in the description of the electronic file nested under filedesc, a mandatory component of the tei header, each individual is encouraged to encode their name and details of their contribution in a separate element in the revision description revisiondesc. in cases where the manuscript description is born digital or substantially enriched from legacy data, the contributors are taught to include their names in the source element of the manuscript description msdesc, where the intellectual content is directly credited to the person conducting the research. in our teaching, we point out the significance of data provenance in relation to data reuse and that by observing this practice, the name of the record creator will become an integral part of the provenance of both research and data. an integral part of the new set-up was the visibility and accessibility of all relevant data in github. cataloguers were now able to edit their data and contribute directly to the repository. with the transfer of data and the new practice of communication, a new workflow was introduced which required additional training. to compensate for the lack of a data-inputting tool, an xslt conversion template was produced, which generated an html preview of the tei document and proved very helpful for instant proofreading. as a result, teaching included not only the theory and practice of tei and how to encode new aspects of manuscript descriptions, but also required training in using the various tools (github client, xslt conversion) in order to ensure that a consistent workflow was observed. as communication was also transferred to github, reducing the traffic on the mailing list, engagement with the tools became a core part of contributing to fihrist. in conclusion, the tei training for fihrist was and remains a hybrid approach. as with many digital humanities methods, a multitude of skills and experience is required to arrive at the successful creation of a manuscript description, and these must be taught in tandem with the technique of encoding itself. teaching manuscript description in tei is most effective if set in the context of a use case and adjusted to the skill set of the user. when working from legacy data with some prior experience in encoding, cataloguers often manage to become self-sufficient learners after an introductory session, relying on documentation and working by example with occasional support. however, when creating tei records as the direct result of research conducted on the manuscript, training is an ongoing and iterative exercise. this approach of examining the manuscript in parallel with testing the boundaries of the tei to encode the results has been a learning experience for both the teacher and the student, with benefits for the wider enterprise and for the tei community as a whole.
2. medieval manuscripts
an important arena for teaching tei in the bodleian libraries is as part of ongoing work on the manuscript catalogue, medieval manuscripts in oxford libraries.6 this is a union catalogue of western medieval manuscripts in the bodleian and a number of oxford college libraries. it consists of manuscript descriptions and authority files encoded in tei according to a project-specific schema. as with fihrist, the catalogue data is stored in github,7 which also provides a relatively simple process for data submission and review.
tei editing is carried out using the oxygen editor,8 for which the university of oxford has an institutional licence. training typically takes place as part of specific projects for catalogue enhancement hosted both by the bodleian and by individual oxford colleges. these have focussed on retroconversion of existing printed catalogues and on the enhancement of catalogue records for manuscripts about to be digitised. retroconversion projects have been led by merton, exeter and st john's colleges; the main recent digitization project has been manuscripts from german-speaking lands.9 the project officers are typically archivists or librarians at an early stage in their careers who have domain expertise in medieval manuscripts and cataloguing but do not have extensive pre-existing experience with tei or markup. training is provided by staff at the bodleian and aims principally to enable the project officers to fulfil the requirements of the particular project. it is, nevertheless, the aim of the bodleian libraries and colleges that project officers should receive a good grounding in xml and tei as a transferable skill and part of their professional development. most project officers are not familiar with oxygen or github and require initial training in the use of these interfaces, which is provided via a combination of written documentation and one-to-one training. initial training in tei encoding itself covers the essentials of xml (elements, attributes, well-formedness, the schema) and the broad structure of the tei msdesc module. this is again provided on a one-to-one basis with intentionally high-level documentation.10 the full set of project encoding guidelines is made available for reference but no attempt is made in training to go through every eventuality it covers. officers are usually then ready to begin independent encoding, usually working from a template pre-populated with basic metadata such as institution and collection details, bibliographical information on the printed catalogue in the case of retroconversion, and with key elements of the msdesc module in place (two similar templates are available in the msdesc repository, one for manuscripts of a single codicological unit and one for multi-unit manuscripts). they are encouraged to raise questions via email and their work is reviewed particularly thoroughly for the first few weeks. after review (usually via github) the contributions are merged into the master branch and indexed for display on the project website. the web interface is updated weekly and officers are able to explore their records in the context of the whole catalogue. it is difficult to generalise about the aspects of encoding that new officers find challenging. one area that often requires particular attention, however, is the transcription of text from manuscripts ('representation of primary sources' in the tei guidelines).
6 https://medieval.bodleian.ox.ac.uk/. 7 https://github.com/bodleian/medieval-mss. 8 https://www.oxygenxml.com/. 9 https://hab.bodleian.ox.ac.uk/. 10 https://github.com/msdesc/medieval/blob/master/quick-start.md.
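a small invented fragment (not drawn from any of the catalogued manuscripts) may help to show the kind of markup meant here: an abbreviation recorded together with its expansion, a stretch of text lost to damage, and a reading supplied by the cataloguer:

   <p>
      <!-- abbreviation and its expansion, with the expanded letters marked -->
      <choice>
         <abbr>dns</abbr>
         <expan>d<ex>omi</ex>n<ex>u</ex>s</expan>
      </choice>
      <!-- text lost to damage, with an estimate of its extent -->
      <damage agent="water">
         <gap quantity="2" unit="word" reason="illegible"/>
      </damage>
      <!-- text absent from the source but restored by the cataloguer -->
      <supplied reason="omitted">et</supplied>
   </p>

the attribute values shown (reason, unit, agent) are illustrative choices only; in practice a project schema would constrain which values are allowed.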
this is an area where the original tei guidelines can seem particularly complex, for example in the variety of potential solutions for encoding abbreviations, and where care needs to be taken in the use of distinct but related elements such as damage, gap and supplied to correctly encode missing or damaged text. the somewhat simplified project guidelines can still be confusing at first. it must be emphasised, however, despite the existence of detailed project guidelines, that project officers are not simply passive encoders. due to the variations in scholarly practice in the description of manuscripts, the encoding of a manuscript description or the retroconversion of a printed manuscript catalogue will often raise issues that have not yet been addressed in the project guidelines, or to which better solutions may emerge. the project guidelines are regularly updated, and officers thereby gain a sense of the tei as an evolving community of practice rather than a fossilised collection of rules. teaching tei in the context of a single active project has several advantages. it undoubtedly simplifies the training process. half a day’s one-on-one training is typically sufficient for a project officer to begin independent encoding. contributing to an active project is, in addition, motivating. seeing their work publicly available, and seeing how their encoding enables search and browse functionality, is a powerful incentive for the effort spent in encoding. to a significant extent officers learn by doing, but they do so in the context of a detailed project schema which provides immediate feedback for most encoding errors. they are also shown how to use an xslt transformation within oxygen to generate an html preview of a catalogue entry. experience shows that this is essential for proofreading and for understanding the relationship between markup and the final display of the record. teaching tei in this way does mean that project officers become familiar in detail with only a part of the tei; indeed, they are working with reference to project guidelines rather than the tei p5 guidelines. are they indeed ‘learning tei’ in the fullest sense? we would argue that they are. ‘learning tei’ rarely, if ever, means learning all the tei; more usually it means learning what is relevant in a particular context, while, ideally, gaining a sense of other contexts in which tei is used, ideally, learning that the tei can (and should) be customised, and learning that the guidelines change and evolve. in addition to the manuscript description module, officers do also become familiar with several other aspects of tei: the header; the modules for names, dates, people and places; certainty, precision and responsibility; and with the representation of primary sources (since almost all cataloguing involves transcriptions of text from manuscripts). through a close focus on one particular aspect of tei officers leave with a solid grounding in ‘the tei’ more generally which can be taken on to other projects. 3. david jones in june and july 2021, jones and faghihi co-convened two week-long workshops which taught text encoding and the tei through the work of the poet and artist david jones (1895–1974). both were hybrid workshops with the instructors physically present (with access to the material archive), and participants joining online. 
both were co-taught by dh practitioners and literary scholars – faghihi and jones from a dh background and laura mckormick kilbride, tom berenato and anna svendsen on the literary studies side. each was successful in ways that were interesting and new to all involved – participants and teachers, literary scholars and dh practitioners. in this section, we will describe the organisation of the workshops, and what they taught us about learning and applying text encoding and the tei in a tightly-focussed research context based on a limited corpus of material. the first workshop, sponsored by the cambridge humanities research grants scheme and organised through cambridge digital humanities, was held in cambridge and concentrated on jones's correspondence with jim ede, the owner and curator of the cambridge art gallery called kettle's yard. there were 20 participants from a variety of backgrounds (students, researchers, librarians, archivists), and while there was no requirement for any knowledge or experience of text encoding or the tei, there was an approximate 50-50 split between those from a dh background and those who approached the workshop through an interest in david jones. in preparation for the workshop, jones's correspondence with ede had been digitised, high resolution images made available on cambridge digital library, and course materials uploaded to a cdh moodle site. the week-long workshop began on monday morning with a virtual tour of kettle's yard, followed by a session on jones and his work. in the afternoon, we embarked on an intense three-hour introduction to text encoding and the tei which left participants and teachers alike a little exhausted. then each participant was assigned two letters, given outline tei records to work with, and unleashed on the business of text encoding. for the following three days, participants were expected to work semi-independently. there were two hour-long drop-in sessions each day (one technical, one on jones and his work – though in practice these boundaries blurred), and participants had access to an online forum to discuss issues and to access course materials for reference and support. completed work was submitted on thursday afternoon and uploaded to the digital library test site for a public event on the friday afternoon. this broader event showcased jones and his work, the physical archive, the workshop itself, and the published outputs. the second workshop followed a very similar format, though we adjusted the schedule to spread the intense tei learning over the first two days. this was followed by independent working and drop-in sessions, and finished with a public event and the publication of the workshop's outputs. for this workshop, the focus was on a single draft of an unpublished poem, the book of balaam's ass, from the collections of the national library of wales – an abandoned manuscript, fragments of which appear in jones's final book, the sleeping lord (1974) (berenato, 2021). the draft was divided up into manageable chunks, and each participant was allocated five pages to encode. the participants for this workshop were heavily skewed towards literary scholarship, being mainly researchers with existing expertise in jones and his work. again, the physical archive was an important part of the introductory sessions and the public event, with jones's cara wallia derelicta serving as the backdrop on both days.
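for readers unfamiliar with this kind of work, a minimal sketch of the sort of markup the participants were producing may be useful. the wording below is invented rather than transcribed from jones's draft; it simply shows how a revision – a deletion paired with the addition that replaces it – can be recorded so that separate acts of writing remain visible in the data:

   <l>
      <!-- a revision: the cancelled reading and its replacement, kept together -->
      <subst>
         <del rend="strikethrough">quiet</del>
         <add place="above">broken</add>
      </subst>
      hills under rain
   </l>

markup of this kind is what allows different layers of writing to be discussed, and later queried, rather than silently resolved into a single reading text.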
the workshops were very successful, not just in terms of the learning and material outputs but also in the creation of a sense of community and a strong group dynamic – something which we did not necessarily expect in the hybrid format. we learned some significant things about text encoding, the tei, and what it means to teach them. the first point that emerged was the role of text encoding not just as a route to publication or analysis, but as a form of close reading. as kate singer points out '… encoding—and teaching encoding—might be a valuable pedagogical tool, to enhance 'close reading' and, additionally, to refocus reading as method of evaluative labeling, categorization, and selection of discrete bits of text' (singer, 2013). the focus on a discrete set of materials, the co-teaching of the workshops by experts in the material alongside dh practitioners, and the sense that the workshop would produce real published outputs (rather than just exercises) meant that we were teaching text encoding more as a framework for the exploration of the text than as a technical standard. the application of the tei brought about new insights into jones's work and his working process – for instance in the classification of his editorial symbols, in the attempt to identify and order different acts of writing, and in the movement from prose to poetry in the text. participants and teachers alike were surprised at the extent to which text encoding generated new perspectives on material which some of them had been studying for many years. the second point was the success of the hybrid format in generating the collaborative community approach that is fundamental to the tei. this was facilitated by the mixture of literary scholars and dh practitioners in the teaching team – providing an immediate example of interdisciplinary collaboration. also significant was the idea that the workshop was not a series of exercises, but a process which would produce useful and published outputs which would be of concrete benefit to jones studies. working collaboratively on something real fostered a strong communal and collaborative spirit. the emphasis on drop-in sessions rather than formal taught sessions also helped to generate a sense of everyone working together – in particular, the very lively sessions where the group attempted to decipher jones's handwriting. the comparatively low contact time did not seem to hinder the participants, who made effective use of the moodle forum to work collaboratively in a hybrid space. we learned from
when planning the workshops we felt it was important to put forward the idea of publishing the outputs as a challenge, but were concerned that we were being too ambitious. as fukushima and bourrier point out ‘… it is much easier to give a student a b+ on a paper that shows effort and promise than it is to send back multiple tei letters that show effort and promise but must be perfect for publication’ (fukushima and bourrier, 2019). we are not entirely sure why the participants were able to pick the tei up so quickly, but we think it has something to do with directly engaging with the texts and materials at a very early stage in the learning process, with the emphasis on independent working in a collaborative framework, and with the fact that real work was being produced rather than merely practice exercises. concentrating on specific texts and materials allowed us to focus on a relevant subset of the tei in a way which effectively reproduced use of the tei in real project workflows. equally encouraging was the number of participants who expressed an interest in continuing to work on the materials after the workshops had finished. apostolo, börner and hechtl (2019) write persuasively about the challenges of teaching using real materials rather than practice examples and the need for specialist knowledge and the sometimes daunting and complex nature of archival materials. however, in our experience, the use of real material in the context of a live project, working in a collaborative environment and with access to subject specialists, is the fastest route to arriving at their concept of the aim of a genetic edition, which is ‘… understand[ing] what might have happened in the author’s mind while he or she was writing’ (apostolo, börner, & hechtl, 2019). it was this feeling of close engagement with the author and the writing process which provided the main inspiration for fast and effective learning of the tei. 4. digital humanities at oxford summer school tei strand all the authors of this article have a long and varied history of engagement with the tei strand of the digital humanities at oxford summer school (for a comparative case study of the development of the summer school see cummings, 2019b). jones attended as a learner in 2009, when it was still the tei summer school, and in the same year, faghihi presented fihrist (then in its infancy) as a test case. faghihi herself attended as a learner in 2011, and jones and holford have each at different times run the manuscript description session of the strand. in 2019 the three of us took over as co-convenors of the strand, running one in-person summer school in 2019 over five days before covid brought two cut-down online versions in 2020 (one hour) and 2021 (three hours). the main challenge of the summer school lies in the cohort, which consists of around thirty participants from a variety of backgrounds (research, libraries and archives being the most common), a variety of disciplines, and with varying levels of technical experience. some come with a general interest in the tei, and others have very specific projects or activities in mind. the course is currently filled on a first-come-first-served basis with no prerequisites or selection criteria, meaning that we have to assume that everything needs to be taught from scratch. furthermore, to provide a useful foundation the course needs to cover not only encoding but also analysis, publication and schema consolidation. 
in this context, even the full week of the in-person event does not seem like a very long time to get to grips with the tei. our first consideration when we took over the running of the strand was to establish what we could realistically hope to achieve given the nature of the cohort and the timeframe. we wanted to make a distinction between text encoding as an activity and the tei as a way of doing text 10faghihi et al. journal of open humanities data doi: 10.5334/johd.72 encoding. we wanted to concentrate on outcomes that made sense to learners: description, transcription, publication and analysis. we wanted to give them enough technical background to make them self-sufficient learners of the tei without getting too bogged down in the code. we wanted to introduce the tei as a living community of practice and practitioners of text encoding. in addition, more than anything else, we wanted to show how text encoding and the tei could be relevant (and transformational) to the ideas and projects of the participants. starting with the in-person event in 2019, first we asked group members about their interests and ambitions for the course in order to illustrate the diversity of the field, and to help us to make the content more relevant. our first exercise was to encode physically a passage of text using marker pens. our aim here was to engage with text encoding as an activity outside of any particular technology or standard – a process victor del hierro takes to its logical conclusion in writing out tei tags in biro onto a physical copy of a poem (ives et al., 2013). we also wanted to reassure the less technically experienced members of the group that the main principles of the course – the point of text encoding – would be still relevant to them even if they struggled to get up to speed with some of the encoding itself. we then embarked on a crash course in xml and an introduction to the tei guidelines, emphasising how the modular nature of the tei reflects the many different activities for which it has been used. for the practical elements of the sessions, we recommended the oxygen xml editor, which has specific support for tei and is the most commonly used editor for text encoding. oxygen is licensed software, and while many participants took advantage of the 30day free trial for the duration of the school, those who wished to pursue text encoding would have to pay or take advantage of institutional licences. participants were free to use opensource alternatives, such as the one explored by mike hawkins in his text and pointy brackets blog (hawkins, 2020), but with the proviso that we would not be able to provide technical support. the middle three days of the course concentrated on core parts of the tei: description, transcription and the use of the correspdesc module. as well as a workshop covering practical aspects, on each day we invited a guest speaker who was actively involved in a relevant project to give a concrete idea of what it is like to work with the tei in a real-life environment and to convey the sense of the tei as a community of practitioners. given our own research backgrounds, the course was inevitably slightly skewed towards manuscript studies. on the final day, we covered schemas and the customisation of the tei for the needs of specific projects, and touched on xslt and on routes to publication. the two online workshops in 2020 and 2021 were more limited in their ambitions. 
here we could only really hope to give a sense of what the tei is and why people might want to use it. we altered our approach from 2020 where we gave three one-hour workshops to 2021 where we gave a larger, single, three-hour workshop which seemed to work much more effectively and allowed participants to complete exercises and gave time for discussion of their experience. the major piece of feedback from these sessions was that even online it was much easier to engage with concrete examples than with theory. what can we learn from these experiences? first, that teaching the tei in a general context with a mixed cohort is much more challenging than teaching in a specific research context. people learn best when presented with concrete examples which are relevant to them, and it is difficult to do this with a large group with varied interests. concentrating on three core areas seemed an effective approach, but it might be that a wider variety of examples would be helpful for future sessions. varying levels of technical expertise were somewhat challenging, but the group quickly became quite self-supporting, with more experienced members helping those who were struggling. perhaps the most successful aspect of the in-person course was the presentation of the tei as a community, and the development of a community in the cohort itself. this was greatly helped by the presence of the invited speakers, many of whom stayed beyond their sessions to join in and help out. there was a real sense of communal learning, with participants working together and discussing approaches and ideas. we even had people dropping in from other strands of the summer school, leading to interesting interdisciplinary discussions and crossconnections with other areas of dh. our efforts to relate the tei to the ideas and ambitions of 11faghihi et al. journal of open humanities data doi: 10.5334/johd.72 the group seemed successful, and a number of participants contacted us after the course for help with setting up projects (the david jones digital archive, the mary hamilton papers and the correspondence of giacomo leopardi are three examples of this). the most challenging areas were the sessions on customising the tei through the schema and routes to publication. each of these areas would probably have benefited from being introduced earlier in the week, and particularly the session on schema customisation, which is essential to the use of the tei in specific contexts. publication and analysis might have been better presented as a theme throughout the week rather than a separate session, covering some commonly used tools and scripts in the context of specific use cases. conclusion what can we conclude about good ways of teaching the tei in a general context? this is a question that will come into focus very soon in our own institutions with both cambridge and oxford launching digital humanities masters courses in 2022/23, in which text encoding and the tei will be key components. the most common profile of an attendee at a tei course or workshop is someone who has heard of the tei in some kind of general sense, and who thinks it may be useful to them either in their specific institutional or research context or as a valuable skill. when they actually encounter tei, they often have that feeling peculiar to technical training that everything should be possible on a conceptual level, but is completely impossible on a practical basis. 
what can we do to allow them to understand what the tei is, what it does, and also give them a route to using it? an important starting point is to distinguish between text encoding as an activity, xml as a data standard and the tei as a standard for text encoding; these are three things that are often merged in an unhelpful way in the minds of learners. by looking at the practice of text encoding as distinct from its expression in xml and/or tei, we can talk more clearly about how and why it is used – as a route to publication, as the basis for analysis, and as a framework for engaging with the text – before embarking on the more technical, and possibly more challenging, parts of the training. a basic understanding of xml is certainly fundamental to getting going with the tei – and it is helpful if it is introduced generally as a data format in which text encoding can be done (rather than only in the context of its use by the tei). going through the building blocks of xml – elements, attributes, comments, namespaces, hierarchy and schemas – grounds learners in the actual work of text encoding, and reduces the uncertainty which some experience when confronted with a forest of angle brackets. just as important is the concept of xml files as text files like any other, readable both by machines and by humans, which reside in directories (also known as folder) on your computer and do not rely on any particular software to edit them or maintain them. this addresses an unhelpful merging between the concepts of file, editor and interface. we often come across the misapprehension that the xml file is somehow ‘in’ the editing software or ‘in’ the platform. in this context, a general introduction to the basic concepts of directories and files, and how they interact with software would be a useful addition to the preliminary materials. in addition to emphasising the sustainability and portability of xml files, one of the most important points for learners is that the process of text encoding happens immediately in front of them and in a space they can understand and control rather than in some other space such as the web, the cloud, a piece of software, or a database. it is very helpful for learners to conceive of their outputs as a distinct dataset with multiple uses, separate from the interfaces which present it, the tools which analyse it, or the software which edits it. the privileging of display over other uses is a particular problem, as turska, cumming and rahtz point out – ‘in digital editions the encoded texts themselves are the most important long-term outcome of the project, while their initial presentation within a particular application should be considered only a single perspective on the data’ (turska, cummings, & rahtz, 2017). focussing on outputs as a dataset rather than simply a source of presentation addresses one of the major general obstacles in teaching digital technologies, which is that learners can see the point and grasp the principles, but find it difficult to grasp ‘where’ it is all happening. 12faghihi et al. journal of open humanities data doi: 10.5334/johd.72 in this context, the tei can be introduced as a framework or standard for text encoding, which (currently) is expressed in xml, and which has outputs which might be used in multiple ways. 
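a minimal, invented document can make the building blocks described above concrete – the declaration, a comment, elements nested in a hierarchy, an attribute with its value, and the tei namespace – all held in a plain text file that any editor can open and that depends on no particular software:

   <?xml version="1.0" encoding="UTF-8"?>
   <!-- a comment: notes for human readers, ignored by software -->
   <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
         <fileDesc>
            <titleStmt>
               <title>a small teaching example</title>
            </titleStmt>
            <publicationStmt>
               <p>unpublished exercise</p>
            </publicationStmt>
            <sourceDesc>
               <p>born digital</p>
            </sourceDesc>
         </fileDesc>
      </teiHeader>
      <text>
         <body>
            <!-- the rend attribute carries the value "italic" -->
            <p>one element nested inside <hi rend="italic">another</hi>.</p>
         </body>
      </text>
   </TEI>

nothing about this file belongs to any particular piece of software: it can be versioned, validated against a schema, transformed for display, or simply read.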
on a practical level, the large scope of tei is best engaged with through the specific activities it has been used for – firstly through the modules, which give a good general overview of the current coverage of the tei, and secondly through existing projects. sessions led by tei practitioners who are currently engaged in work covering core areas of tei (such as digital editions, correspondence, and manuscript description) are a very good way for learners to get a feel for what is possible, and also to engage with the tei as both an implementable standard and as a community of practice. the guest speaker sessions at the 2019 summer school were particularly well received, with students able to relate their own ideas and ambitions to concrete examples. here, issues such as schemas and the customisation of the tei for the specific needs of projects, and the benefits of working directly with the xml against the efficiencies of editing tools arise naturally in a real-world context which makes sense to the learner. our experience with the summer school tells us that a focus on the concrete outputs of text encoding – sustainable and interoperable datasets for analysis and publication – needs to be present throughout the learning process in order to give context to the sometimes detailed and repetitive activities needed to get to grips with the tei. many of the responses to stella dee’s survey of teachers and learners of the tei can be summed up as ‘now i have my tei document, what can i do with it?’ (dee 2014), with the primary assumed purpose being publication on the web. this naturally leads to the question ‘why go through all these pains, when you can save your text as html or pdf and put them on the internet and google will find them?’. in order to make sense of the tei we need to be constantly referring back to the why as well as the how, with an emphasis on tei files as a dataset as well as a source for publication. an emphasis on outputs also puts some key features and challenges of the tei in context. one of the most common complaints about the tei is that there are multiple ways of doing the same thing. this can be particularly confusing for learners from a library or archives background who are used to highly prescriptive standards such as the aacr2 cataloguing rules and encoded archival description (ead) – though it should be pointed out that that even here there are considerable variations in practice. learners should be aware that the multiple approaches enabled by tei reflect multiple needs, and that the subjective nature of text encoding makes sense in the context of a strong (and iterative) tie between encoding methods and research questions and their outputs. this helps learners to make sense of encoding with the tei as a developing framework for exploration rather than the implementation of a set standard. on a more conceptual level, a fundamental point to communicate is the tei as a community of practice and practitioners. our experience is that this can happen at a micro level within the group, where participants start to communally explore the possibilities offered by the tei in response to research questions, and at a macro level as participants realise that the tei is something they themselves can engage with and contribute to. 
one example is the long history of the tei trying to reflect their community’s views on sex, gender, and gender identity.11 the concept of the tei as a repository and record of practice in text-based digital humanities work is something which makes particular sense to researchers, who see an opportunity to make a methodological as well as a scholarly contribution to the field. one of the most important conclusions to emerge from our experience in teaching the tei is that the actual process of text encoding is in itself probably the most important output. the realisation that the act of encoding text has value in itself – as a method of deep reading, as a framework for engaging with the text, as a methodology for experimentation and interrogation – is the biggest ‘eureka moment’ that we see in learners. the tei invites you to explore texts in the context of the approaches, methods, questions and idiosyncrasies of other scholars, and to contribute your own. finally, a large factor in the success of workshops and training that take place in a specific research context is being able to do real work in the learning environment, rather than asking participants to complete practice exercises. the idea that the outputs of the learning process will be real and useful contributions to the field or project seems to be the major factor in people engaging quickly and effectively with the tei in the learning environment, and persevering with 11 see issues 367, 426, 2189 and 2190 on the tei github site https://github.com/teic/tei/issues. https://github.com/teic/tei/issues 13faghihi et al. journal of open humanities data doi: 10.5334/johd.72 the tei after the end of the training. as julia flanders and colleagues noted, ‘the ability to see their work realized as a readable edition was a crucial motivator for students in pushing through the process of learning and debugging their tei/xml encoding’ while at the same time being careful to ‘to avoid creating orthodoxies in tei encoding arising from display outcomes that appear authoritative’ (flanders et al., 2020). this approach generates a high level of motivation and cohesion within the group, with everyone pulling together to try to complete a real-world task within a time limit, and it provides a good example of what it’s really like to work on a project. tackling the kinds of problems thrown up by real-life materials demonstrates how text encoding can act as both a framework for discussion and exploration as well as a method for generating outputs. this fundamental sense of working on something real is very difficult to replicate in a general context like the summer school. it would be possible to ask people to come with examples from their own work, but then you would miss the communal and collaborative aspects of the work upon which all can engage at once. one solution would be to reframe the summer school as a series of workshops which gave a general introduction to the tei through a particular set of sample materials, which if carefully selected would allow us to cover the major themes and concepts while replicating some of the sense of deep engagement and excitement which comes from tackling real world problems. we look forward to discussing how this might work in our planning for future training in the tei. competing interests the authors have no competing interests to declare. 
author information
yasmin faghihi: conceptualization; investigation; methodology; writing – original draft; writing – review & editing
matthew holford: conceptualization; investigation; methodology; writing – original draft; writing – review & editing
huw jones: conceptualization; investigation; methodology; writing – original draft; writing – review & editing

author affiliations
yasmin faghihi (orcid: 0000-0001-5556-168x), university library, cambridge, uk
matthew holford, bodleian libraries, oxford, uk
huw jones (orcid: 0000-0002-8533-9083), university library, cambridge, uk

references
allés-torrent, s., & riande, g. d. (2019). the switchover: teaching and learning the text encoding initiative in spanish. journal of the text encoding initiative, 12. doi: https://doi.org/10.4000/jtei.2994
apostolo, s., börner, i., & hechtl, a. (2019). collaborative encoding of text genesis: a pedagogical approach for teaching genetic encoding with the tei. journal of the text encoding initiative, 12. doi: https://doi.org/10.4000/jtei.2926
berenato, t. (2021). david jones's 'balaam business': the poetics of forgiveness after passchendaele. in feldman, m., svendsen, a., & tonning, e. (eds.), historicizing modernists: approaches to 'archivalism' (pp. 153–172). london: bloomsbury academic. doi: https://doi.org/10.5040/9781350215078.ch-8
cummings, j. (2019a). a world of difference: myths and misconceptions about the tei. digital scholarship in the humanities, 34(supplement 1), 58–79. doi: https://doi.org/10.1093/llc/fqy071
cummings, j. (2019b). building dh training events. in crompton et al. (eds.), doing more digital humanities. london: routledge. doi: https://doi.org/10.4324/9780429353048-18
dee, s. (2014). learning the tei in a digital environment. journal of the text encoding initiative, 7. doi: https://doi.org/10.4000/jtei.968
fihrist union catalogue of manuscripts from the islamicate world. retrieved from https://www.fihrist.org.uk/ (last accessed: 24th december 2021)
flanders, j., bauman, s., clark, a., doyle, b., hamlin, s., & quinn, w. (2019). tei pedagogy and tapas classroom. journal of the text encoding initiative, 12. doi: https://doi.org/10.4000/jtei.2144
fukushima, k., & bourrier, k. (2019). inside digital dinah craik: feminist pedagogy, cognitive apprenticeship, and the tei. journal of the text encoding initiative, 12. doi: https://doi.org/10.4000/jtei.2185
hawkins, m. (2020). installing your xml editor. text and pointy brackets. retrieved from https://www.textandpointybrackets.com/foundational-skills-and-knowledge/installing-your-xml-editor (last accessed: 24th december 2021)
ives, m., del hierro, v., kelsey, b., smith, l. c., & sumners, c. (2013). encoding the discipline: english graduate student reflections on working with tei. journal of the text encoding initiative, 6. doi: https://doi.org/10.4000/jtei.882
medieval manuscripts in oxford libraries. retrieved from https://medieval.bodleian.ox.ac.uk/ (last accessed: 24th december 2021)
mylonas, e., & renear, a. (1999). the text encoding initiative at 10: not just an interchange format anymore – but a new research community. computers and the humanities, 33, 1–9. doi: https://doi.org/10.1023/a:1001832310939
pierazzo, e. (2010). elena pierazzo on the arabic enrich schema. retrieved from http://sabrinamessenger.blogspot.com/2010/08/guest-post-elena-pierazzo-on-arabic.html (last accessed: 24th december 2021)
singer, k. (2013). digital close reading: tei for teaching poetic vocabularies. the journal of interactive technology and pedagogy, 3. https://jitp.commons.gc.cuny.edu/digital-close-reading-tei-for-teaching-poetic-vocabularies/
stewart, d. (2007). the structure of the fihrist: ibn al-nadim as historian of islamic legal and theological schools. international journal of middle east studies, 39(3), 369. doi: https://doi.org/10.1017/s0020743807070511
terras, m., branden, r., & vanhoutte, e. (2009). teaching tei: the need for tei by example. literary and linguistic computing, 24, 297–306. doi: https://doi.org/10.1093/llc/fqp018
turska, m., cummings, j., & rahtz, s. (2017). challenging the myth of presentation in digital editions. journal of the text encoding initiative, 7. doi: https://doi.org/10.4000/jtei.1453

to cite this article: faghihi, y., holford, m., & jones, h. (2022). teaching the text encoding initiative: context, community and collaboration. journal of open humanities data, 8: 15, pp. 1–14. doi: https://doi.org/10.5334/johd.72
published: 24 may 2022
copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/licenses/by/4.0/.
journal of open humanities data is a peer-reviewed open access journal published by ubiquity press.

accessibility, discoverability, and functionality: an audit of and recommendations for digital language archives
research paper
corresponding author: irene yi, linguistics department, yale university, new haven, ct, us. irene.yi@yale.edu
keywords: language archives; documentation; accessibility; discoverability; functionality; linguistics; endangered languages; metadata
to cite this article: yi, i., lake, a., kim, j., haakman, k., jewell, j., babinski, s., & bowern, c. (2022). accessibility, discoverability, and functionality: an audit of and recommendations for digital language archives. journal of open humanities data, 8: 10, pp. 1–19. doi: https://doi.org/10.5334/johd.59
accessibility, discoverability, and functionality: an audit of and recommendations for digital language archives
irene yi, amelia lake, juhyae kim, kassandra haakman, jeremiah jewell, sarah babinski, claire bowern
*author affiliations can be found in the back matter of this article

abstract
while digital archiving has long been standard for linguistics, archives themselves are heterogeneous (aznar & seifart 2020), and archived linguistic material is important for researchers and communities, particularly for language reclamation (cf. baldwin & olds 2007; whalen et al. 2016; hinton 2003, 2018; kung et al. 2020). the format and usability of scholarly archival collections are shaped by the management practices of the stewarding institution, making an appreciation of the range of access services provided by such institutions relevant to the evaluation of individual collections. here we report on a review of 41 digital language archives. three factors are examined: 1) accessibility, including metadata and site navigation; 2) discoverability, or searchability and internal navigation; and 3) functionality, the overall ease of data retrieval and use. we recognize that the decisions made by both stewards and depositors can greatly impact the accessibility of archived materials; to that end, we present recommendations for how archives might increase the utility of their holdings for their users. we emphasize that our intention is not to dissuade linguists from using archives because of these issues, and we recognize the tremendous amount of work that goes into the upkeep of digital infrastructure, often with very limited institutional support. implementing such recommendations at an institutional level can establish a fairer peer-review process of archival collections. by delineating precisely what standards fall under the archive management level and what procedures individual depositors are responsible for, the roles of "archivist" and "depositor" become clearer.

1 introduction
it is estimated that 32% of living languages are currently in some state of loss (simons & lewis 2013:10); some estimates place the figure at closer to 50% (campbell et al. 2013; campbell & belew 2018). documentation of endangered languages is vital for preserving them (berez 2013), whether for study, language reclamation, preventing the irreversible loss of intangible cultural heritage, or any other purpose that language serves. digital archiving has been standard for linguistics for at least 15 years, but the extent to which this material can be accessed and used for research, education, and activism varies (cf. evans & sasse 2004; kaplan & lemov 2019; paterson 2021). language archives utilize a number of different content management systems and do not provide uniform functionality (aznar & seifart 2020). some language reclamation projects have found success working exclusively with archival sources (hinton 2003; ana 2006; baldwin & olds 2007; whalen et al. 2016, 2018);1
such work is a partnership-at-a-distance between the institutions that store and curate the materials, the researchers who deposit them, and the users of the materials. archival materials are decontextualized (schwartz & dobrin 2016; gaby & woods 2020:e273; dobrin & schwartz 2021) from their original utterance, and depositors and archives both can do much to ensure that language collections are as robust and as useful as possible.

in this paper, we report on the results of a review of language archives, with a concentration on sites and organizations with substantial holdings of digital data in and about endangered languages. we discuss the accessibility, discoverability, and functionality of archival resources, focusing on features of web portals and the special needs of linguistic collections. that is, the focus is on the needs that depositors and language resource users have, and how such needs are or are not met by current practices at the stewardship institutions that manage archives. finally, we provide recommendations and suggest changes to the access services provided by stewardship institutions. it is our hope that these recommendations will serve as a foundation for future guidelines in the creation, curation, and maintenance of web portals, the gateways to language resources at "language archives".

1.1 digital language archives
a "language archive" is defined here as a repository of language data (broadly construed), such as audio and/or video recordings, transcriptions, and translations, whether in physical or digital format, created with the purpose of preserving and disseminating those materials (kung et al. 2020; burke et al. 2021; austin 2021; paterson 2021). there is substantial variation among repositories that contain linguistic data (cf. vann 2006)—in scope, functionality, infrastructure, the number of languages or regions covered, and the extent to which they function as research tools or simply data repositories, to name just a few. we follow austin (2021) in considering the role of archives to be appraising materials (that is, collecting selectively based on a stated goal), preserving those materials, "mak[ing] known their existence", and facilitating their appropriate distribution. for our purposes, we include sites that appear to have these aims. we treat the process of archiving as one in which someone places language resources in one of these repositories, as opposed to interacting with an archive or linguistic data in other ways. for this reason, we exclude from our definition of "archives" sites such as olac, which do not collect materials themselves but rather act as a directory for other archives.

throughout this paper, we refer to "items", "collections", and "archives", where items are the linguistic materials that are deposited; they are grouped into "collections", and those collections are housed by archives. archives are repositories that are owned and managed by people, who are employed by institutions. thus, to talk about access to archives, we need to think about the web portals, the choices of individuals, their employee obligations at their institution(s), and the infrastructure that underlies the repository and its data services, among other topics. while the advancement of technology has allowed linguists to digitize corpora that were once only available in physical media, digitization and online archiving have problems of their own.
the long-term accessibility of digital material is dependent on the continuing availability of compatible hardware and software. necessary equipment may become obsolete and/or fall out of production (e.g., computers are no longer produced with built-in optical drives, making it difficult to access information stored on cds). storage media have a limited lifespan. software, too, can rapidly become obsolete. while linguists have mostly heeded bird and simons' (2003) call to use open source software wherever possible, documentation projects can become tied to a specific software platform and version (cf. bird & simons 2003). such issues affect both depositors and archives: while depositors should ensure they archive materials in the most endurable formats possible, digital archives are also subject to these constraints, such as the lifespan of servers and backups.

1 we recognize that language revitalization and reclamation are complex topics, far beyond the scope of what we can cover in this paper. archived language data is inevitably an incomplete portrayal of languages and their communities.

while the internet has greatly improved the availability of research materials to far-flung audiences, it is far from an equalizer. access to a reliable internet connection is not universal, particularly in remote communities where broadband has yet to be fully implemented. for example, wasson et al. (2016) note that dinjii zhuh k'yaa (gwich'in language archive and language revitalization center), a community archive of the gwich'in people in fort yukon, alaska, is mostly available only by physically visiting the center where the archive is located, because internet access is uneven within the language community. by some estimates, roughly 40% of the global population are not internet users. even in the united states, 21 million people lack access to broadband internet according to the fcc, and critics believe the fcc's data "radically overstate" the number of people who have reliable connections (sonnemaker 2020).

despite the decades-long prevalence of digital archiving in the field, no two archives are alike, some having features that are tailored to the languages of focus. the digital archive of scottish gaelic,2 for example, offers a search feature to filter for lenited words, an option especially useful for researchers working with goidelic languages. but because archives are so decentralized, there is currently no set of protocols or standards for digital language preservation (aznar & seifart 2020).3

1.2 rationale for the current project
previous research (among others, bird & simons 2003; vann 2006; berez 2013; sullivant 2020; burke et al. 2021) has discussed aspects of linguistic archiving, including the importance of metadata, consistent approaches to creating language materials, and the current state of language archiving. the current paper covers a wider scope of contemporary language collections, as well as contributing to the discussion of how to improve archival practice in order to help communities and researchers more easily use and reuse these archives. a review of digital archival practices, such as the decisions made in designing websites and displaying content, will provide insight into how archives are and are not meeting the needs of end users—and the steps they can take to rectify these issues. this paper reports on the results of a review conducted of online digital archives in june–august 2021.
the "audit" was conducted with the aim of investigating the utility of online archives and their accessibility for retrieval of materials. concomitantly, we investigated a sample of individual collections in a subset of archives for ease of completing certain standard investigations, such as testing whether or not materials could be easily aligned using the montreal forced aligner (a process widely used in the creation of corpus materials for phonological research; mcauliffe et al. 2017). this paper reports on findings that relate specifically to archives; a companion paper (babinski et al. 2022) details the phonological/typological findings. the remainder of this paper describes the methods (section 2), results of the archive audit (section 3), and conclusions (section 4), focusing on topics ranging from accessibility and discoverability to actual functionality (i.e., use and reuse of archival materials). at the end of each subsection in section 3, we present suggestions for changes in practice.

in engaging with these questions and making suggestions for changes in practice, we do not wish to downplay the efforts and skills of professional archivists, or dissuade researchers from depositing their materials in these archives. we recognize that there are innumerable tradeoffs in all aspects of language documentation and archiving, and that any safeguarding is preferable to none. however, we also consider it appropriate to evaluate the extent to which archival practices—that is, those practices that are primarily controlled by archives and their management—are serving the aims of those using archives. to this end, we are not yet at the stage where we can present a full set of recommendations for archival practice. rather, we raise the issues we have found across archives so that those in the field, including archivists, can consider them in future archive development and management.

2 https://dasg.ac.uk/en.
3 the challenges around digital archives are not unique, as issues such as the longevity of software and hardware, internet accessibility, and the like, are common across many digital media repositories. however, because of the complexity of language archive collections (their many filetypes and heterogeneity of construction and resulting metadata, to name just a few factors), they are probably a good illustration of a very broad array of challenges.

region-focused archives, such as the alaska native language archive (anla)4 and the survey of california and other indian languages/california language archive (cla),5 draw an audience of language communities who access materials for the purposes of cultural, historical, and language learning. it is believed that the usage of language archives by indigenous communities is underestimated (cf. austin 2011; holton 2012; woodbury 2014), as a single representative may bring resources back to a community that are then more widely disseminated and used by many more individuals. in discussing issues with language archives, we wish to emphasize that roadblocks created by archives will also greatly affect language communities, and, to best suit the needs of their audiences, it may be critical for archives to be accessible and interpretable to users without specialized linguistic training or extensive technical knowledge.
holton (2012) and woodbury (2014) discuss the different audiences and users of language archives, drawing particular attention to the fact that non-indigenous linguists are not the only audiences of archives, and that both linguist and non-linguist members of indigenous communities are using archives (e.g., the dobes6 archive, the archive of indigenous languages of latin america [ailla],7 and anla)8 for community-oriented purposes like language revitalization. in discussing the role of archived collections in promotion or hiring, therefore, it is also important to recognize that academics are not the only users of this material.

additionally, implementing such recommendations at an archive level (i.e., having inter-archive standards maintained by those who manage archives) can help establish a fairer peer-review process of archival collections. by delineating precisely what standards fall under the archive management level and what procedures individual depositors are responsible for, the roles of "archivist" and "depositor" become clearer. thus, in reviewing depositors' archival collections, we avoid evaluating the individual for aspects of archiving which are outside their control. having standardization on the side of archives will create more equitable standards by which individuals are reviewed.

2 methods
an archive review was conducted between june and august 2021 by the authors of this paper. our audit focused on archival usability as a whole, as well as two aspects of collections: files suitable for phonetic and phonological analysis, and textual archives/archives not exclusively maintained for linguistic research. the general archive audit included 41 archives (as listed in the supplementary materials).9 the archive list was compiled from olac's list of participating archives10 as well as digital endangered languages and musics archives network (delaman) members and associate members.11

4 https://www.uaf.edu/anla/.
5 https://cla.berkeley.edu/.
6 https://archive.mpi.nl/tla/.
7 http://ailla.utexas.org.
8 https://www.uaf.edu/anla/.
9 the supplement is available from https://osf.io/daksh/. anonymous reviewers of this submission had differing opinions on the extent to which this choice of archives was appropriate. one reviewer suggested that the sample should be expanded, while another felt that it was too broad, including too many archives of different types (and that it was inappropriate to generalize across archives with very different levels of institutional support and access to funding). it was unclear from our survey how many of the archives in the olac and delaman lists are actively maintained, what their support is, and how they backup and preserve their holdings. this is itself an important issue which should be investigated further. for our purposes, rather than restrict the focus to archives that are clearly actively maintained, we preferred to cast a wider net and examine as many digital archives as possible (with caveats further discussed in section 3.1.3 below).
10 http://www.language-archives.org/archives.
11 https://www.delaman.org/members/.
for this reason, the archives examined are heavily skewed towards english-language based collections, though (as discussed in section 3.1.2 below) we actively attempted to address this bias (unfortunately without much success).12 we compiled general information on metalanguages, search and retrieval functions, corpus structure, access condition options, and types of materials archived.

12 while there are other archives (such as kielipankki, the language bank of finland; https://www.kielipankki.fi), restricting the sample to olac/delaman archives provided some principle for inclusion in the survey. we acknowledge that it is unclear at this point how representative or comprehensive this list is. organizations differ considerably in the extent to which they focus on preservation or access to files, or serving as research resources or content delivery platforms, making a clear definition of "language archive" difficult. there is, to our knowledge, no global list of language archives. the closest are the delaman and olac compilations.

prior to the audit, we created a questionnaire that probed various aspects of archives and collections that could prove problematic in linguistic research. this questionnaire was used to systematically document information regarding the archives' content, accessibility restrictions, search functions, metadata, download options, and file manipulation necessary for analysis (see babinski et al. in prep for a larger summary of findings). members of the team examined archives individually; the results were discussed as a group, and cb & iy spot-checked data coded by other authors. we found a very high degree of inter-rater consistency, with the exception of problems arising from web browsers and access to sites which were blocked from yale's campus internet.13

13 we were unable to diagnose why some sites loaded and others did not, based on ip addresses. we noted issues when they arose, since, if they arose during this sampling process, they will likely arise for other users too. a reviewer asked why we do not exhaustively list, enumerate, and quantify all points made in this paper. we argue that doing so would give rise to misleading precision. as discussed in footnotes 9 and 12, it is impossible to know how representative this sample is. it would therefore be misleading to draw detailed conclusions about small differences in prevalence. instead, we concentrate on reporting common trends in this set of data. this allows us to evaluate recurring issues among commonly used language archives without being unduly focused on small differences.

we focus on the following points in this paper (a minimal sketch of how such observations might be recorded per archive follows this list):
• accessibility
  ° which language(s) must a user know in order to navigate sites and collections?
  ° is the site accessible to users of screen readers?
  ° are there aspects of the site design that impede or promote accessibility?
• restrictions
  ° how available is material in collections?
  ° if restrictions are placed on access, what is needed to access collections?
  ° what types of controls are in place, and for what reason?
• finding information
  ° how easy is it to find information on the site?
• file manipulation
  ° how usable are the collection materials? are there aspects of the site and archive design that promote or impede the use of materials?
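purely as an illustration of how answers to questions like these can be kept comparable across archives, a per-archive record might look as follows (the field names are ours, not the wording of the actual questionnaire):

from dataclasses import dataclass, field

@dataclass
class ArchiveAudit:
    # illustrative field names only; not the authors' actual instrument
    name: str
    interface_languages: list = field(default_factory=list)   # accessibility
    screen_reader_friendly: bool = False                       # accessibility
    registration_required: bool = False                        # restrictions
    access_tiers: list = field(default_factory=list)           # restrictions
    has_search_filters: bool = False                           # finding information
    filter_by_file_type: bool = False                          # finding information
    bulk_download: bool = False                                 # file manipulation
    notes: str = ""

audits = [ArchiveAudit(name="example archive", interface_languages=["en", "fr"],
                       registration_required=True, has_search_filters=True)]
# structured records make cross-archive tallies trivial:
print(sum(a.bulk_download for a in audits), "of", len(audits), "archives offer bulk download")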
another set of possible metrics is the fair principles.14 fair data is findable, accessible, interoperable, and reusable. our points overlap with fair in a number of respects, but the fair framework was unsuitable for our evaluation for two reasons. firstly, the findability criterion focuses exclusively on metadata structure, whereas we consider issues of "findability" to be much broader, as discussed further below. secondly, the fair criteria mostly apply to collections, rather than to the overall structure of the archive qua repository. there are a range of reasons why an archive might have a particular property, ranging from constraints introduced by the content management system (cms), to decisions made in light of the amount of funding or staffing, to philosophical decisions about the appropriate structure of an archive. therefore, rather than focus on the particular properties of individual archives, we instead focus on implications of current design for what end users can accomplish. we do list selected examples to illustrate and explain findings, however. while our findings are therefore perhaps not fully reproducible (cf. berez-kroeker et al. 2018), we have endeavored to make the findings replicable by including information in the supplementary materials. this represents a snapshot of archival issues as of august 2021, which will no doubt evolve as sites are updated.

14 https://www.go-fair.org/fair-principles/.

there were some points which we wished to investigate but were unable to include. the extent of institutional support may be a critical component of an archive's longevity, but such information was typically unavailable. other points relating to archival infrastructure, such as long-term plans, backup procedures, storage procedures, type of content management system, and staffing, will also have a major impact on what the archive can deliver. because we are evaluating archives in terms of their usefulness to end users and not in terms of institutional and financial barriers they must overcome, we do not consider these points in our analysis, though we recognize that archives vary greatly in this dimension.

3 results
3.1 accessibility
the contents of an archive are only useful as far as they are findable and accessible. accessibility can be impacted by a number of factors, both on the user end and through archive design choices. "web accessibility" is generally understood to refer to compatibility with assistive technology. we discuss accessibility in this narrow sense in section 3.1.3. however, we also discuss registration and account-creation requirements and procedures, display language, and site navigation. these are also points which may facilitate or impede a user's access to the archive contents.

3.1.1 accounts and registration
the majority of language archives we surveyed have materials that are available for download for free and with minimal registration requirements. many archives appropriately have access restrictions for collections to respect the wishes of language communities and researchers (nathan 2010). five archives, including the repository and workspace for austroasiatic intangible heritage (rwaai),15 required registration to access any materials at all, including the catalog; other archives had a public-facing catalog, even if registration was required for download.
four archives, including elar and the dobes archive, had multiple tiers of access, where some tiers required registration and/or permission of the depositor for listening or download, while other tiers were unrestricted. one archive, the cla, has materials that are closed-access, in that they are not available online and must be accessed in person. two archives, kaipuleohone16 and lia sápmi (sami speech corpus),17 restrict all or most of their contents specifically to academic institutions and institution-affiliated researchers, a limitation that may exclude members of language communities. others restrict only parts of their materials to those affiliated with academic institutions. the childes data repository18 includes password-protected collections available only to faculty members, and the clarin slovenian repository19 requires that users access certain materials restricted for "academic use" through their institutional emails.

while there are good reasons why collections may not be freely available, some of the convoluted, unclear, or heavily outdated procedures for requesting permission could be fixed. archives that do not streamline the permission forms, or that provide unclear contact information, could be updated. for example, account registrations requiring manual approvals, or emailing specific individuals, should be automated.

while the majority of archives are entirely free to use, and we did not encounter any archives20 requiring payment of fees to access collections during our audit, we acknowledge that for some researchers, particularly those who do not have institutional membership with archives or sufficient funding, the cost of accessing an archive may be prohibitive. the linguistic data consortium (ldc)21 and the european language resources association (elra)22 are two examples of archives requiring payment in return for access to materials; these fees can range into tens of thousands of dollars (vann 2006). endangered language archives have tended towards a model where the archive is supported through institutional or grant funds, with costs supplemented by fees from depositors (similar to "gold" open access models for academic publication). some archives have recently requested that depositors include archiving charges in research grant applications. clearly the funding model for ongoing support for endangered language archives needs to be investigated in more detail.

15 https://projekt.ht.lu.se/rwaai.
16 http://ling.hawaii.edu/kaipuleohone-language-archive/.
17 https://tekstlab.uio.no/glossa2/saami.
18 https://childes.talkbank.org/access/.
19 http://www.clarin.si/info/about.
20 we did not include predominantly physical archives that also have digital materials. this excluded archives such as aiatsis (https://mura.aiatsis.gov.au), which requires the purchase of physical media for accessing digital collection items.

3.1.2 display language
the language(s) used to display metadata and to navigate the site may also limit the accessibility of the materials.
bias towards english-language users and a lack of built-in site translations disadvantage researchers whose primary language is not english and may prevent community members from accessing documentation of their own languages or other languages which they regularly use.23 while the arrival of digital media devices and technologies can facilitate the creation of a "social network of digital exchange" of cultural heritage for indigenous communities (mansfield 2014:66), the unavailability of these resources in endangered languages further entrenches generational and educational divides in language communities where acquisition of literacy, particularly in english, is not widespread. a number of linguists and indigenous community members have expressed concern that "the majority of digital resources available to indigenous users are in english, even though english is not a first language for many" (carew et al. 2015:310).

only 14 of the archives we examined provide more than one language interface, and not all of these had fully functional language options. we point to paradisec24 (see figure 1) as an example of an archive providing information through 7 regional languages (though unfortunately not on the mobile site). archives that focused on languages of a particular region often provided interfaces relevant to their users. for example, ailla has interface options in spanish, and pangloss is fully implemented in both english and french.25 elar and cla have interfaces only in english, though at the collection level, elar allows filenames and metadata to be in other languages and scripts, which helps users if they know of the collection. we applied google translate to the exclusively english archives (testing languages such as korean, uzbek, kyrgyz, and french). translations were inconsistent, incomplete, and sometimes misleading. some localizations translated only parts of the site text, leaving others, such as an embedded map and filenames, in english (see figures 2, 3, and 4). therefore, using google translate as a workaround for untranslated sites is not a straightforward alternative.

21 https://www.ldc.upenn.edu.
22 http://www.elra.info/en.
23 the finding that there is a lack of archives with a primary interface in other languages may be, in part, due to our own biases as all english-dominant researchers in the usa. however, we made a substantial effort to search out archives written in other languages (e.g., spanish, russian, french), but they were largely difficult to search for because of internet search engine rankings, which return results based on language and geographic region. this should be noted as an issue for linguistics that leads to a substantial reduction in findability of materials, though one beyond the control of individuals.
24 https://www.paradisec.org.au.
25 however, the translations caused issues with file matching, where .mp3 files were labeled in french but the transcripts were auto-generated and downloaded by the site with english filenames.

figure 1 paradisec's informational language options (top right corner of banner).
figure 2 elar interface in uzbek with google translate overlay. names on the side are not consistently translated or transliterated: "hilda lopez" is not transliterated, but "james woodward" becomes "jeyms vudvord". selective transliteration breaks links elsewhere in the collection.
figure 3 elar interface in kyrgyz (google translate overlay). the "view 16 more" button on the menu (in orange text) no longer works with google translate as an overlay.
figure 4 paradisec items where the term "elicitation" is translated into kyrgyz once, but not other times. further, green "open" buttons are not translatable as the text is part of the icon.

it is also worth noting that when site translations are available, the options are predominantly languages of european origin. this is especially striking given the scope of archived languages, most of which are indigenous to africa, asia, and the americas. lack of translations into major regional languages limits the abilities of scholars to use these archives, creating a bias in the demographics of researchers and restricting potential scholarly innovation. for non-indo-european languages whose structures differ greatly from those of languages like english, french, and spanish, automatic translation programs like google translate and yandex are especially prone to offering confusing and poor-quality translations. we recognize that this is a much bigger problem than what individual archives can solve. for example, search engines filter out search queries in languages other than the query language (which made it almost impossible for us to search for archives outside the anglosphere internet).26 however, at the collection level, depositors should be encouraged to provide materials in languages that will be most usable for community members, and the substantial additional time costs for doing so should be recognized explicitly.

26 https://developers.google.com/search/blog/2010/03/working-with-multilingual-websites provides some information about how google determines relevance for multilingual sites; this includes the domain name suffix and ip address of the server, as well as language identification for monolingual web pages. it does not include html language attribute tags or georeferencing in html.

3.1.3 disability accommodations
we acknowledge that disability accommodation remains a critical, and often-overlooked, element of archive accessibility, and indeed of the accessibility of any digital material. with regard to the structure of websites and the storage of data such as text files, it is essential that web content – including archives – is presented in a way that is accessible for visually-impaired researchers.27 it is generally agreed as a principle of accessible web design not to make different elements of a site distinguishable only by their color (campbell 2018). in order to assess color blindness accessibility, we put each archive through a filter28 mimicking how each site would look to users with 3 of the most common kinds of color blindness. sites were subjectively reviewed by a member of the team who is colorblind. the archives we surveyed largely performed well in this regard. the main issue raised by our survey was the low contrast between font and background colors, which may compromise readability for users with certain kinds of color blindness and other visual impairments; it may also inconvenience users with certain color and brightness settings on their computers and browsers. the websites for sil's international language & culture archives and the rosetta project revealed such problems.

27 for further information see https://www.w3.org/wai/standards-guidelines/wcag/ (wcag 2018).
28 https://www.toptal.com/designers/colorfilter/.
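low contrast of this kind is also easy to screen for programmatically. the following is a minimal sketch (ours, not a tool used in the audit) of the wcag 2.x contrast-ratio calculation that underlies most automated checkers; the guideline thresholds are 4.5:1 for normal text and 3:1 for large text (wcag 2018):

def _linear(channel: int) -> float:
    # convert an 8-bit srgb channel to linear light, per the wcag definition
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb) -> float:
    r, g, b = (_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg) -> float:
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

# e.g. light grey text on a white background falls well below the 4.5:1 threshold
print(round(contrast_ratio((170, 170, 170), (255, 255, 255)), 2))  # ~2.32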
3.1.4 recommendations
the restrictions archives place on access to language data are there for a reason; however, it is important that these restrictions do not place too much of a burden on researchers and language communities looking to access their contents. therefore, we suggest archives streamline the process of requesting access permission. more specifically, we recommend that request forms be built into the site itself, with additional capacities for automated password retrieval. this is especially critical for those archives (such as elar) whose code is not built for long-term accessibility, as passwords cannot be reset and the application permission form is built on google surveys. for archives not already implemented in multiple languages, we strongly suggest expanding the scope of display languages offered, especially those languages which may be relevant to language communities and local researchers. furthermore, we recommend against applications that must be physically posted to the archive, given their inefficiency and potential to disadvantage researchers in areas underserved by the postal system.29 following the principles of accessible web design will make great strides in overcoming barriers for researchers who require assistive technology. even in the absence of laws like the americans with disabilities act (1990) (or varying legal requirements across different countries), it is important, and not too difficult, to improve what is already there.

29 one assumes that for a digital archive, users who will access the materials are also able to access a digital registration form.

3.2 discoverability
collections need to be discoverable; that is, users should be able to navigate the site to find what they need. discoverability encompasses both the ability of users to find archives through search engines or aggregation portals (such as olac) and their ability to perform searches within those archives. the former point is essential for the use and reuse of an archive in general, while the latter sheds important light on the internal organization/description of material.

3.2.1 search functions and mislabeling
search functions are vital for navigating large collections, but they can be made frustratingly slow and even useless depending on their available options. six archives offer a map search function, which allows users to browse collections by location. this function is especially useful for more casual users or for those who are not searching for specific data, but it presents its own challenges. archives like elar and the california language archive use google earth, and many others use similar platforms. the california language archive does not outwardly indicate whether each collection on the map is available online. this makes it initially seem as if there are more resources readily available to users than there actually are. while these issues are a cause for frustration, they are not necessarily debilitating, and the map function tends to be a useful visual aid for users.
we point to pangloss as an example of the map function at its most useful; its map function is easy to navigate, contains information on the title, researchers, and types of resources available for each collection (as well as a link to each collection), and can be filtered by the criteria "with annotation" and "with video". in contrast, ailla's map function is non-interactive. that said, some location information within pangloss was inaccurate.

lack of transparency about the contents of collections was observed in numerous archives. users may have little information about what a deposit contains before accessing its contents. researchers often have specific criteria in their search for language materials—for example, linguists looking to perform certain kinds of phonological analysis may have a preference for collections whose video and audio recordings are all fully transcribed, and rule out collections with too few hours of recorded material or those that consist only of written materials. other criteria may include the specific dialect(s) documented, date of creation, file type, number, age and location of speakers, or specific individuals. some of these categories can be aggregated automatically, while others require manual labeling. while the lack of some of this information is due to incomplete metadata provided by depositors, we encourage archives to make such information easy to find. elar, for example, includes a collection landing page consisting of sections for "summary of the deposit", "groups represented", "language information", "special characteristics", and "deposit content", though the quality and specificity of the information in these descriptions varied greatly between collections. this could be a point of evaluation for individual collections. in most collections, the metadata about the holdings is a file within the general collection. it is not consistently named, and where collections have many files it is difficult to find. archives could assist the retrieval of such information by flagging such metadata files directly or including an explicit link to the metadata file(s) within the collection overview.

most archive portals include search bars, but these have varying degrees of usability. one important feature is a filter function, especially for larger archives. all but seven of the archives we investigated have some kind of search filter function. some filtering options include language, speaker, depositor, file type, topic, and country, among others. however, the availability and usability of the filter function was inconsistent. elar's search filter options vary by collection, while the african language materials archive,30 digital himalaya,31 and ailla all lack a search filter function entirely, making large collections more difficult to search. paradisec had a flexible search and filtering interface, at the item or collection level.

30 http://alma.matrix.msu.edu.
31 http://www.digitalhimalaya.com.

a useful search feature available in some archives is the ability to search within collections. this feature is especially useful, almost necessary, for archives that contain large collections. however, despite its importance, we only found the feature in four of the archives we examined. such free-text search increases finding options for collections extensively, allowing more refined searches than filters alone.
for example, a filter may exist to restrict files to .xml, but a free-text search makes it easier to distinguish between flextext transcripts, elan transcripts, and .xml-format metadata. these are all .xml-format files but have very different functions. at the collection level, searches were hampered by missing metadata, incorrect tags, case-sensitive searches and inconsistent metadata (e.g., searches returning either "audio" or "Audio" as the filetype), empty folders, and broken urls within collections. correcting these small but time-consuming errors would improve intra-archive searches.

two of the most useful search filter categories are media type and file type (see figures 5 and 6). many researchers using these digital archives can only use files of a specific media type (e.g., videos or sound recordings), or, in cases where they plan to use certain software in their research, certain file types (e.g., .pdf or .wav files). file type and media type filters greatly reduce the time a researcher must spend browsing files to find what they need. despite this importance, only five of the archives we looked at offer the option to filter by file type, and one of these archives, the language commons,32 returns files that aren't bundled (with related files of different file types) when the file type filter is employed, causing users to miss potentially useful materials. similarly, only four of the archives we looked at offer the option to filter by media type. even fewer allowed users to filter by specific file extensions (such as .mp3 or .wav), and, when offered, the archive often displayed results with mislabeled extensions (.xml for .eaf, for example).

figure 5 cla materials with media type specified next to item name.
figure 6 anla materials searchable/filterable by media type.

mislabeled file types are another issue we encountered. elar and ailla, for example, rename elan33 .eaf files and flex34 .flextext files as .xml (see figure 7). while these are underlyingly xml files and alternate extensions are visible upon downloading the files, one needs to know how to change the file extensions in order to open the files with the appropriate applications. it is also difficult to differentiate elan audio transcripts from flex dictionary or interlinearized texts, which are both listed as .xml files but have different underlying data structures.

figure 7 on the left is a screenshot of a sample ailla document, in .eaf format (displayed as such). however, when the file is downloaded (at right), the file receives a .eaf.xml extension, which must be manually removed before it is readable by elan.

32 https://archive.org/details/languagecommons?tab=about.
33 https://archive.mpi.nl/tla/elan.
34 https://software.sil.org/fieldworks.
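because the files' internal structure is unambiguous even when the served extension is not, users (or archives) can recover working extensions automatically. a minimal sketch, assuming the usual root elements of elan and flex exports and a hypothetical folder of downloaded files:

import xml.etree.ElementTree as ET
from pathlib import Path

# assumed root element names for typical elan (.eaf) and flex (.flextext) exports
ROOTS = {"ANNOTATION_DOCUMENT": "elan transcript (.eaf)",
         "document": "flex interlinear export (.flextext)"}

def sniff(path: Path) -> str:
    root_tag = ET.parse(path).getroot().tag.split("}")[-1]  # drop any xml namespace
    return ROOTS.get(root_tag, "other .xml (metadata, etc.)")

# e.g. a file served as 'story.eaf.xml' can be identified here and renamed to 'story.eaf'
for f in Path("downloads").glob("*.xml"):   # hypothetical folder of downloaded files
    print(f.name, "->", sniff(f))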
3.2.2 metadata
we noted considerable inconsistency in what type of metadata was available, across archives and collections. it is easy for relevant files to be lost in a search because they do not have the type of metadata used in the search. another issue we discovered was the use of different layers of metadata. in many cases, important metadata was hidden inside the folders of a collection, making it difficult for a user to find the specific information they need. ailla, for example, has three layers of metadata: one layer is for the whole collection, another layer is found within each individual folder within each collection, and the final layer is attached to the individual files themselves. such layering, combined with the frequent gaps in available metadata, makes it extremely difficult to find desired information and reduces the accuracy of the search function.

sullivant (2020) provides detailed recommendations for collection metadata, breaking these recommendations down into categories based on importance. we point to the california language archive and paradisec as two archives that do a good job of including "first priority collection metadata". finally, it is important to note that, while many of the archives we examined do include the most important information in their collection metadata, almost none include the information in sullivant's next two tiers of recommended metadata. while archives are reliant on the metadata provided by depositors, this only reinforces the points made by sullivant (2020) and burke and zavalina (2020) that metadata is crucial to the usability of a collection. the dacs35 standards may also be useful for both depositors and archives to introduce and maintain consistency.

35 https://github.com/saa-ts-dacs/dacs.
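as a purely illustrative picture of what keeping the essential information in one flat, predictable record can look like, the example below uses generic dublin core style fields; it is not sullivant's (2020) priority list, and the values are placeholders:

# illustrative only: a flat, collection-level record using generic dublin core
# style fields; not sullivant's (2020) actual priority tiers.
collection_record = {
    "dc:title": "example collection of narratives",
    "dc:creator": "depositor name",
    "dc:contributor": ["speaker names or pseudonyms"],
    "dc:language": ["iso 639-3 code(s)"],
    "dc:date": "2019-2021",
    "dc:description": "contents, transcription status, hours of recorded material",
    "dc:rights": "access conditions and licence",
    "dc:identifier": "stable collection url or handle",
}

a single record like this, consistently named and linked from the collection overview, would address both the layering and the findability problems discussed above.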
3.2.3 site maintenance
other issues impeded discoverability, with archives being incompatible with specific browsers, requiring defunct software, or failing to load entirely. this occasionally varied depending on the individual user in ways we were unable to solve. for example, three of the team members found that the aps digital library36 would not open for them unless they accessed it via yale university's vpn, while the remaining team members could access the site with no difficulties from off campus, all using recent versions of chrome on macos 11.6 or windows 10. six of the 41 archives gave web access errors or were unreachable.37 some, such as alora,38 could only be accessed with the wayback machine.39 while these workarounds do allow users to access materials, users who are unfamiliar with the wayback machine would be deterred from retrieving relevant information. moreover, the wayback machine may provide access to the catalog, but not the files in the collection itself. links provided within archives often faced the same issues, defeating the purpose of being an archive that safeguards data.40 furthermore, at least two archives41 still required some use of adobe flash player (see figure 8), which was phased out by december 2020.

figure 8 adobe flash player required to access materials on the repository and workspace for austroasiatic intangible heritage.

many archives contained broken links, though this differed in extent and severity. the problems related to both site-internal and site-external links, and arose due to internal site reconfigurations (such as those of the british library's endangered archives collections)42 as well as the removal of individual pages. it would be ideal for archives not to rely on external links, but when they are necessary, regularly checking for outdated links is crucial. broken links not only hinder the usability of archival materials from an end-user perspective, they also hinder the discoverability of such webpages. search engines penalize broken links43 in search results, thus making archive sites with such links less findable. as language resources are entrusted to archive sites' stewardship, it is important that they remain discoverable by those who wish to access these language materials. corrado and sandy (2017) draw attention to the lifecycle of a project, as defined by the life cycle information for e-literature.44 they argue that "institutional commitment…ensuring that enough financial resources are available to sustain the initiative" is necessary for digital preservation to be successful (corrado & sandy 2017:11). in order for stewardship organizations to faithfully fulfill their fiduciary duties as language resource stewards, website maintenance must receive ongoing support to keep up with rapidly-changing software and security compatibility requirements.

36 https://diglib.amphilsoc.org.
37 academia sinica english corpora (http://www.ling.sinica.edu.tw/en/announcements/resources); alora (https://alora.cerdotola.com); multimodal learning corpus exchange (http://mulce.org); standing rock sioux tribe language and culture institute (http://wooyake.org); american philosophical society digital library (https://diglib.amphilsoc.org); world oral literature project (http://www.oralliterature.org).
38 https://alora.cerdotola.com.
39 https://web.archive.org/web/20190208220853/https://alora.cerdotola.com.
40 collections within the endangered archives programme, british museum (https://eap.bl.uk/project/eap347); online database of interlinear text (https://odin.linguistlist.org); elar (https://www.elararchive.org/dk0611).
41 the repository and workspace for austroasiatic intangible heritage (https://projekt.ht.lu.se/rwaai); yami corpus (http://yamiproject.cs.pu.edu.tw/yami/en_index_flash.htm).
42 https://eap.bl.uk. for approximately 8 months, every collection-level link from the main site catalog was broken. however, as of december 15, 2021, this has been fixed.
43 see https://devrix.com/tutorial/crucial-google-penalties/ for more about search engine penalties.
44 http://www.life.ac.uk/glossary.

3.2.4 recommendations
offering more detailed descriptions of a collection's contents, specifically media types (video, audio, text, etc.), the completion state of any transcriptions or translations, and the number of hours of recorded material, would help researchers evaluate the utility of a collection for a particular purpose, and give community members a sense of what is in the collection. allowing searches by file type would allow researchers to further refine their queries and determine the usability of a given collection for their research purposes; we also recommend that archives correctly label file types and remove filetype capitalization dependencies on searching.45 we also suggest that archives make it clear to depositors what types of information are indexed for searching, and how researchers can structure their collections to make them usable. to make archives more easily discoverable, we recommend archive managers use the sitemaps46 protocol to provide site-internal content information to search engines. finally, we suggest that depositors consider how they use links to external sites in their deposits, archiving copies where appropriate (or pointing links to the internet archive). we suggest that archives regularly check for link breaks (e.g., by using automated checking tools that generate reports, such as the broken link checker plugin),47 particularly to archive-internal pages.

45 to be clear, the issue we are discussing here is where a search returns both "audio" and "Audio" as filetypes (for example) and treats them as distinct filetypes. this is a claim about variable capitalization in standardized vocabularies, not a point about case sensitivity in searches more generally.
46 https://www.sitemaps.org. we thank the anonymous reviewer who pointed us to sitemaps.
47 https://www.outlookstudios.com/tools-to-find-broken-links-on-your-website/#broken-link-checker.
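such a periodic check needs very little tooling. the sketch below uses only python's standard library and placeholder urls (in practice the list would be crawled from the archive's own pages or sitemap), and reports any link that no longer resolves:

import urllib.request, urllib.error

urls = ["https://example.org/collection/1", "https://example.org/collection/2"]  # placeholders

def check(url: str) -> str:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return f"{resp.status} OK"
    except urllib.error.HTTPError as e:        # reachable server, but e.g. 404/410
        return f"BROKEN ({e.code})"
    except urllib.error.URLError as e:         # dns failure, timeout, etc.
        return f"UNREACHABLE ({e.reason})"

for u in urls:
    print(u, "->", check(u))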
figure 8: adobe flash player required to access materials on the repository and workspace for austroasiatic intangible heritage.

3.3 functionality

the primary function of an archive is to store and safeguard materials, so it is essential that both depositing and retrieving data be straightforward; after all, material is safeguarded for a purpose, not simply to have an unused record of languages. this section discusses the functionality of data retrieval and use. section 3.3.1 focuses on the structure and content of various archives, as well as issues surrounding downloads. section 3.3.2 lists our recommendations for archive functionality.

3.3.1 site content, structure, and downloads

the available content and structure of archive sites posed the first issue with functionality. we note that some of the following concerns are affected by an individual archive's choice of cms. we attempted to track cms use across archives, such as whether the archive used a common cms like mukurtu48 or dspace,49 or a bespoke platform. however, information about the cmss underlying the archives in our audit was not easily accessible; an overwhelming majority of archives had no publicly available information at all about their cms. half of the archives mentioned the institutions that supported the development of the archive, or external servers where related language corpora were hosted, but infrastructure information was not available for enough archives for us to track it. we acknowledge, however, that site structure and content capabilities are closely linked to the choice of cms. the sites examined here vary extensively in their holdings and scope. some sites labeled as archives hosted only one or two resources (magoria books carib & romani archive),50 which sometimes required purchase, while others hosted none at all (multimodal learning corpus exchange).51 others, such as the sil international language & culture archives,52 appeared to function more as directories, with both links to external resources and hosted materials; they were not "archives" in the sense of storing and safeguarding materials. this is in contrast to archives such as the elar archive, which has full hosting and offers (per re3data.org) more than 462,048 results.53 the most prevalent issue impeding archives' functionality was the lack of a bulk download option. the vast majority (34/41) had no bulk download option for either text or audio/video.
two54 had bulk download options for text files only, and five55 provided download links for zip files containing all or a selection of the files in the corpus. requiring users to download files individually not only results in lost time, but also renders some collections (e.g., those with 15,000 audio files) virtually inaccessible because of the sheer number of clicks, ranging from 1 to 7 per file, required to download their contents. further, when individual downloads are the only option, users would benefit from knowing exactly how many files are in each collection, allowing them to assess their own storage capacity before attempting to download a corpus. another concern that results from downloading files individually is the loss of the arrangement of items within a collection. for example, nested files lose their relationships to each other and must be manually re-sorted when downloaded onto a drive. this assumes that the archive site has not already collapsed structures that existed when researchers originally deposited their files. when this happens, crucial information can be lost for collections that depend on file structure to match transcripts and metadata files to audio and video files (for further discussion of arrangement, see paterson, 2021: §6.3.1). we do recognize that there are non-trivial issues concerning bandwidth, web server traffic, and validation of large files that limit download capabilities and may require additional funding to resolve. still, since these issues directly affect archives' functionality, they should be addressed sooner rather than later.

47 https://www.outlookstudios.com/tools-to-find-broken-links-on-your-website/#broken-link-checker.
48 https://mukurtu.org.
49 https://duraspace.org/dspace.
50 http://archive.magoriabooks.com.
51 http://lrl-diffusion.univ-bpclermont.fr/mulce2/accescorpus/accescorpusmulce.php.
52 https://www.sil.org/resources/language-culture-archives.
53 https://www.re3data.org/repository/r3d100013583.
54 lia sápmi sami speech corpus (http://tekstlab.uio.no/lia/samisk/index.html); childes data repository (https://childes.talkbank.org/access).
55 dobes the language archive (https://archive.mpi.nl/tla); the language commons (https://archive.org/details/languagecommons?tab=about); slovenian language resource repository (http://www.clarin.si/info/about/); eurac research clarin centre (https://clarin.eurac.edu/index.html); open resources and tools for language (ortolang) (https://www.ortolang.fr).
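where only item-by-item downloads are offered, a user can at least script the process and preserve the collection's arrangement locally, addressing the loss-of-arrangement problem discussed above. the sketch below is a minimal illustration, assuming python with the requests library and a hand-compiled manifest; the csv columns url and relative_path are our own hypothetical convention, not any archive's supported api.

import csv
from pathlib import Path

import requests

def mirror_collection(manifest_csv, dest="collection"):
    # download every file listed in a user-built manifest (columns: url, relative_path),
    # recreating the collection's original folder arrangement under `dest`
    with open(manifest_csv, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            target = Path(dest) / row["relative_path"]
            target.parent.mkdir(parents=True, exist_ok=True)
            resp = requests.get(row["url"], timeout=60)
            resp.raise_for_status()
            target.write_bytes(resp.content)

# mirror_collection("manifest.csv")  # manifest.csv is an assumed, hand-compiled index

building such a manifest is itself tedious, which is why archive-side bulk download options remain the better solution.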
even if downloads must be done individually, solutions exist, such as putting all of a collection's download links on a single page rather than requiring users to enter individual folders to download. we draw attention to the dobes archive for providing an effortless method of downloading files in bulk: their "basket" system allows users to select and bundle individual files or entire collections; after an amount of time proportional to the number of files requested, a link to a zip file is emailed directly to them. other issues surrounding downloads included non-functioning download buttons and downloads that resulted in unreadable data. ailla's download links were blocked by the chrome and firefox browsers due to security settings, and could only be accessed by switching web browsers. the hindu-kush areal typology,56 while not strictly an archive, had a bulk download option for wordlists; however, users had to ensure that they opened the utf-8 encoded csv file correctly in order to read the data without broken text. while workarounds like these exist, they may deter users who are less familiar with technology from using such archives effectively.

3.3.2 recommendations

firstly, and perhaps most critically, we suggest adding the option to download files in bulk, including an option for the entire corpus and for each folder in it, while preserving the original arrangement. we recognize that this may be a complex request, given how file storage may work for the archive, but it is a necessary part of making files accessible. a 15,000-item collection with no bulk download option is neither accessible nor realistically usable. furthermore, we suggest archives either allow depositors to preserve the original file structure of their collections upon deposit, or develop tools to help them better structure collections once archived on the site, for example through tags. it is vital that archives provide layout guides and naming conventions for depositors, so that users may quickly locate corresponding files and recreate file structures in the event that they are lost. care should also be taken when depositing collections to make sure that vital information about metadata and collection structure is not lost.

4 conclusions

digital archives, even when poorly maintained, may offer protection to language data that might otherwise have been lost, forgotten, or destroyed. we recognize that decisions made by both archives and depositors can greatly impact the accessibility of archived materials. we further recognize that there are tradeoffs in the creation of archives, and that some decisions made long ago continue to affect our methods, procedures, and choices. the power that both archivists and depositors have over these materials carries a responsibility to ensure that the materials can be used and reused into the future. to that end, these findings and recommendations can help set procedural standards that greatly benefit those who access archives. we recognize that additional resources are necessary for this to succeed. one incentive for depositors to increase the usability of their collections is for that work to be included in evaluations for promotion. by setting out how archives vary, and how that variation can affect the utility of collections and the user experience, we help clarify the scope of possible review.
individuals should not be evaluated for aspects of archiving that are outside their control; and if archives are to feature in hiring and/or promotion reviews, they may need to provide more explicit information about the scope and limitations of their services.

56 https://hindukush.clld.org/.

appendix

information about the archive review:
• archive name: the name of the archive.
• site link: the url of the web portal for the archive.
• metalanguage(s): the primary language used to deliver the records and to navigate the site.
• broken links: a qualitative assessment of the number of broken links encountered.
• types of materials available: a broad description of the filetypes available for download from the web portal.
• access restrictions: the types of access restrictions found across the site (or as described in the archive meta-information).
• search function: information about how searches can be conducted on the site and the types of materials returned.
• filter by: discussion of how search results may be filtered.
• bulk download: whether collection items must be downloaded individually (e.g., using the "save as" command in a web browser) or whether there are options for downloading multiple files at once.
• number of clicks to download: how many steps it takes to get from a collection item's information to downloading it.
• metadata location: where the metadata for a collection is accessed.

additional file

the additional file for this article can be found as follows:
• supplementary files 1. archive audit spreadsheet. summarizes findings and comments. doi: https://doi.org/10.5334/johd.59.s1

acknowledgements

we are grateful to the fieldwork reading group at yale university for their valuable feedback and discussion throughout this project, which have led to considerable improvements. we would also like to thank the audiences of the 7th international conference on language documentation & conservation (icldc2021), paradisec@100, and the uc berkeley language revitalization working group for their insightful and useful feedback.

competing interests

the authors have no competing interests to declare.

author contributions

iy: data curation, formal analysis, investigation, validation, writing—original draft, writing—review & editing
al: data curation, formal analysis, investigation, validation, writing—original draft, writing—review & editing
jk: conceptualization, data curation, formal analysis, investigation, methodology, validation, writing—review & editing
kh: data curation, formal analysis, investigation, validation, writing—original draft, writing—review & editing
jj: data curation, formal analysis, investigation, validation, writing—review & editing
sb: conceptualization, data curation, formal analysis, investigation, methodology, validation, writing—review & editing
cb: conceptualization, data curation, formal analysis, investigation, methodology, validation, writing—review & editing
author affiliations

irene yi (orcid.org/0000-0001-9255-4235), linguistics department, yale university, new haven, ct, us
amelia lake, linguistics department, yale university, new haven, ct, us
juhyae kim, linguistics department, cornell university, ithaca, ny, us
kassandra haakman, linguistics department, yale university, new haven, ct, us
jeremiah jewell, linguistics department, yale university, new haven, ct, us
sarah babinski (orcid.org/0000-0001-7764-5876), linguistics department, yale university, new haven, ct, us
claire bowern (orcid.org/0000-0002-9512-4393), linguistics department, yale university, new haven, ct, us

references

administration for native americans (ana). (2006). native language preservation: a reference guide for establishing archives and repositories. http://www.aihec.org/our-stories/docs/nativelanguagepreservationreferenceguide.pdf
americans with disabilities act of 1990, pub. l. no. 101–336, 104 stat. 328 (1990).
austin, p. (2011). "who uses digital language archives?" paradisec blog. https://www.paradisec.org.au/blog/2011/04/who-uses-digital-language-archives/ (last accessed 27 september 2021).
austin, p. (2021). "corpora and archiving in language documentation, description, and revitalization." presented at fieldling seminar 2021, paris. http://www.peterkaustin.com/docs/teaching/2021-09-09_fieldling.pdf
aznar, j., & seifart, f. (2020). refco: an initiative to develop a set of quality criteria for fieldwork corpora. 2èmes journées scientifiques du groupement de recherche linguistique informatique formelle et de terrain (lift), 95–101. https://hal.archives-ouvertes.fr/hal-03066031/file/lift.pdf#page=100 (last accessed 27 january 2022).
babinski, s., jewell, j., kim, j., haakman, k., lake, a., yi, i., & bowern, c. (2022). "how usable are digital collections for endangered languages? a review." proceedings of the linguistic society of america (plsa), 7(1), 5219.
baldwin, d., & olds, j. (2007). miami indian language and cultural research at miami university. in d. cobb & l. fowler (eds.), beyond red power: american indian politics and activism since 1900, 280–290. santa fe: sar press.
berez, a. l. (2013). the digital archiving of endangered language oral traditions: kaipuleohone at the university of hawai'i and c'ek'aedi hwnax in alaska. oral tradition, 28(2), 261–270. doi: https://doi.org/10.1353/ort.2013.0010
berez-kroeker, a., gawne, l., kung, s., kelly, b., heston, t., holton, g., pulsifer, p., beaver, d., chelliah, s., dubinsky, s., meier, r., thieberger, n., rice, k., & woodbury, a. (2018). reproducible research in linguistics: a position statement on data citation and attribution in our field. linguistics, 56(1), 1–18. doi: https://doi.org/10.1515/ling-2017-0032
bird, s., & simons, g. (2003). seven dimensions of portability for language documentation and description. language, 79(3), 557–582. doi: https://doi.org/10.1353/lan.2003.0149
burke, m., & zavalina, o. l. (2020). descriptive richness of free-text metadata: a comparative analysis of three language archives. proceedings of the association for information science and technology, 57(1), e429. doi: https://doi.org/10.1002/pra2.429
burke, m., zavalina, o. l., phillips, m. e., & chelliah, s. (2021). organization of knowledge and information in digital archives of language materials. journal of library metadata, 20(4), 185–217. doi: https://doi.org/10.1080/19386389.2020.1908651
campbell, l., & belew, a. (2018). introduction: why catalogue endangered languages? in l. campbell & a. belew (eds.), cataloguing the world's endangered languages, 1–14. london: routledge. doi: https://doi.org/10.4324/9781315686028
campbell, l., lee, n. h., okura, e., simpson, s., & ueki, k. (2013). new knowledge: findings from the catalogue of endangered languages ("elcat"). 3rd international conference on language documentation & conservation. https://scholarspace.manoa.hawaii.edu/bitstream/10125/26145/2/26145.pdf
campbell, m. h. (2018). accessibility of archives' digital resources for users with hearing and visual impairments. master's thesis, university of north carolina at chapel hill. https://doi.org/10.17615/c11t-gs09
carew, m., green, j., kral, i., nordlinger, r., & singer, r. (2015). getting in touch: language and digital inclusion in australian indigenous communities. language documentation & conservation, 9, 307–323. http://hdl.handle.net/11343/57354
corrado, e. m., & sandy, h. m. (2017). digital preservation for libraries, archives, and museums. lanham, md: rowman & littlefield.
dobrin, l., & schwartz, s. (2021). the social lives of linguistic field materials. language documentation and description, 21. http://www.elpublishing.org/docs/1/21/ldd21_01.pdf
evans, n., & sasse, h.-j. (2004). searching for meaning in the library of babel: field semantics and problems of digital archiving. in l. barwick, a. marett, j. simpson & a. harris (eds.), researchers, communities, institutions, sound recordings, 1–42. sydney: university of sydney. http://hdl.handle.net/2123/1509
gaby, a., & woods, l. (2020). towards linguistic justice for indigenous people: a response to charity hudley, mallinson, and bucholtz. language, 96(4), e268–e280. doi: https://doi.org/10.1353/lan.2020.0078
hinton, l. (2003). how to teach when the teacher isn't fluent. in j. reyhner, o. v. trujillo, r. l. carrasco, & l. lockard (eds.), nurturing native languages, 79–92. flagstaff, az: northern arizona university. https://jan.ucc.nau.edu/~jar/nnl/nnl_6.pdf
hinton, l. (2018). approaches to and strategies for language revitalization. in k. l. rehg & l. campbell (eds.), the oxford handbook of endangered languages, 442–465. oxford university press. doi: https://doi.org/10.1093/oxfordhb/9780190610029.013.22
holton, g. (2012). language archives: they're not just for linguists any more. in f. seifart, g. haig, n. p. himmelmann, d. jung, a. margetts, & p. trilsbeek (eds.), potentials of language documentation: methods, analyses, and utilization, 111–117. honolulu: university of hawai'i press. http://hdl.handle.net/10125/4523
kaplan, j., & lemov, r. (2019). archiving endangerment, endangered archives: journeys through the sound archives of americanist anthropology and linguistics, 1911–2016. technology and culture, 60(2), s161–s187. doi: https://doi.org/10.1353/tech.2019.0067
kung, s. s., sullivant, r., pojman, e., & niwagaba, a. (2020). archiving for the future: simple steps for archiving language documentation collections. new york, ny: teach online with teachable. https://archivingforthefuture.teachable.com
mansfield, j. (2014). polysynthetic sociolinguistics: the language and culture of murrinh patha youth. phd dissertation, australian national university. https://doi.org/10.25911/5d723cd88582b
mcauliffe, m., socolof, m., mihuc, s., wagner, m., & sonderegger, m. (2017). montreal forced aligner: trainable text-speech alignment using kaldi. proceedings of the 18th conference of the international speech communication association. https://montrealcorpustools.github.io/montreal-forced-aligner/. doi: https://doi.org/10.21437/interspeech.2017-1386
nathan, d. (2010). archives 2.0 for endangered languages: from disk space to myspace. international journal of humanities and arts computing, 4(1–2), 111–124. https://doi.org/10/c7ct5f. doi: https://doi.org/10.3366/ijhac.2011.0011
paterson, h. j., iii. (2021). "where have all the collections gone?" poster presented at the 15th annual society of american archivists research forum. https://hughandbecky.us/hugh-cv/publication/2021-where-have-all-the-collections-gone/where-have-all-the-collections-gone.pdf
schwartz, s., & dobrin, l. (2016). the cultures of native north american language documentation and revitalization. reviews in anthropology, 45, 88–123. doi: https://doi.org/10.1080/00938157.2016.1179522
simons, g. f., & lewis, m. p. (2013). the world's languages in crisis: a 20-year update. in e. mihas, b. perley, g. rei-doval, & k. wheatley (eds.), responses to language endangerment. in honor of mickey noonan, 3–19. amsterdam: john benjamins. doi: https://doi.org/10.1075/slcs.142.01sim
sonnemaker, t. (2020). "the number of americans without reliable internet access may be way higher than the government's estimate, and that could cause major problems in 2020." https://www.businessinsider.com/americans-lack-of-internet-access-likely-underestimated-by-government-2020-3 (last accessed 20 september 2021).
sullivant, r. (2020). archival description for language documentation collections. language documentation & conservation, 14, 520–578. http://hdl.handle.net/10125/24949
vann, r. e. (2006). frustrations of the documentary linguist: the state of the art in digital language archiving and the archive that wasn't. department of spanish research, 1. western michigan university. https://scholarworks.wmich.edu/spanish_research/1
wasson, c., holton, g., & roth, h. s. (2016). bringing user-centered design to the field of language archives. language documentation & conservation, 10, 641–681. http://hdl.handle.net/10125/24721
web content accessibility guidelines (wcag). (2018). web accessibility initiative. wcag 2.1. https://www.w3.org/wai/standards-guidelines/wcag/
whalen, d. h., dicanio, c., & dockum, r. (2018). phonetic documentation in the literature: coverage rates for topics and languages. the journal of the acoustical society of america, 144(3), 1936–1936. doi: https://doi.org/10.1121/1.5068471
whalen, d. h., moss, m., & baldwin, d. (2016). healing through language: positive physical health effects of indigenous language use. f1000research, 5. doi: https://doi.org/10.12688/f1000research.8656.1
woodbury, a. c. (2014). archives and audiences: toward making endangered language documentations people can read, use, understand, and admire. in d. nathan & p. k. austin (eds.), language documentation and description: special issue on language documentation and archiving, 12, 19–36. london: soas. http://www.elpublishing.org/pid/135

to cite this article: yi, i., lake, a., kim, j., haakman, k., jewell, j., babinski, s., & bowern, c. (2022). accessibility, discoverability, and functionality: an audit of and recommendations for digital language archives. journal of open humanities data, 8: 10, pp. 1–19. doi: https://doi.org/10.5334/johd.59

published: 24 march 2022

copyright: © 2022 the author(s). this is an open-access article distributed under the terms of the creative commons attribution 4.0 international license (cc-by 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. see http://creativecommons.org/licenses/by/4.0/.

journal of open humanities data is a peer-reviewed open access journal published by ubiquity press.