The Code4Lib Journal – Converting the Bliss Bibliographic Classification to SKOS RDF using Python RDFLib
Issue 59, 2024-10-07
Converting the Bliss Bibliographic Classification to SKOS RDF using Python RDFLib
This article discusses the project undertaken by the library of Queens’ College, Cambridge, to migrate its classification system to RDF applying the SKOS data model using Python. Queens’ uses the Bliss Bibliographic Classification alongside 18 other UK libraries, most of which are small libraries of the colleges at the Universities of Oxford and Cambridge. Though a flexible and universal faceted classification system, Bliss faces challenges due to its unfinished state, leading to the evolution in many Bliss libraries of divergent, in-house adaptations of the system to fill in its gaps. For most of the official, published parts of Bliss, a uniquely formatted source code used to generate a typeset version is available online. This project focused on converting this source code into a SKOS RDF linked-data format using Python: first by parsing the source code, then using RDFLib to write the concepts, notation, relationships, and notes in RDF. This article suggests that the RDF version has the potential to prevent further divergence and unify the various Bliss adaptations and reflects on the limitations of SKOS when applied to complex, faceted systems.
by Harry Bartholomew
Background of the Bliss Bibliographic Classification
The second edition of the Bliss Bibliographic Classification (BC2), the first schedules of which were published in 1977, is a universal, fully faceted, highly synthetic classification system used in 19 English libraries. Though microscopic beside the Dewey Decimal (DDC) and Library of Congress (LCC) classification systems in terms of the number of libraries using the scheme, 11 of the Bliss-classified libraries are found within the collegiate universities of Oxford and Cambridge, making BC2 a competitive alternative to DDC and LCC in these academic library communities. Often chosen in the 1980s and 1990s to replace simple enumerative in-house classification systems at Oxbridge college libraries, Bliss’s synthesisability and granularity answered the obstacles to browsing that those systems imposed in the pre-computerised and early computerised periods of library catalogues, and it also served as a source of subject terminology in bibliographic records ([1] Sargent 1990) ([2] Watson 1997).
Bliss allows for the synthesis of classmarks from multiple facets through a process it terms retroactive notation, explained thus on the official website:
The classifier must first analyse the subject of the work, then arrange the components, or facets, of the subject into the reverse order of the BC2 schedule, going from the most specific to the most general facet. The classmarks for each facet are then combined by dropping the repeated initial class letter from all but the first. This technique is called retroactive notation. As BC2’s citation order is inverted, the facet cited first actually comes last in the schedules—( [3] Bliss Classification Association : BC2 : Using the scheme)
For example, in the Philosophy & Logic schedule, ACO B is the class for subjectivism and AHK for ethics, so ethical subjectivism would have the synthesised classmark AHK COB. The idea here is that the schedule need not repeat itself by enumerating every possible classmark; instead, a more specific class can always be subdivided by the more general aspects preceding it, thus allowing an incomprehensibly vast number of potential notation and subject combinations from a reasonably sized list of concepts.
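The synthesis in this example can be sketched in a few lines of Python. This is an illustration only, not part of the project code: the function name `synthesise` is hypothetical, the components are assumed to be supplied already in citation order (most specific first) and to belong to the same schedule, and the grouping of the result into blocks of three characters is an assumption inferred from the way ACO B and AHK COB are written above.

```python
def synthesise(components, group=3):
    """Combine classmarks by retroactive notation (illustrative sketch)."""
    # Remove display spaces so that "ACO B" becomes "ACOB".
    compact = [c.replace(" ", "") for c in components if c.strip()]
    if not compact:
        return ""
    first, rest = compact[0], compact[1:]
    class_letter = first[0]
    parts = [first]
    for mark in rest:
        # Retroactive notation only applies within a single schedule.
        if mark[0] != class_letter:
            raise ValueError(f"{mark} is not in class {class_letter}")
        # Drop the repeated initial class letter from all but the first.
        parts.append(mark[1:])
    joined = "".join(parts)
    # Assumption: notation is displayed in groups of three characters.
    return " ".join(joined[i:i + group] for i in range(0, len(joined), group))


print(synthesise(["AHK", "ACO B"]))  # AHK COB
```

The function does not check that the components are in the correct citation order; that judgement remains with the classifier.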
The first volumes of BC2 were published in 1977, but work on the system continues, as many schedules are still only available as unfinished drafts. While efforts of the Bliss Classification Association (BCA), the editorial body comprising Bliss users and supporters, focus on finalising the unpublished schedules, those finished decades ago have a growing need for maintenance. The scheme's incompleteness and partial outdatedness have led to divergence in its application between libraries, with individual libraries having developed significant in-house patches that are inconsistent with those of their fellow Bliss users.
The first published Bliss schedules were manually typed and photographed for publication, though a machine-readable encoded form was later developed alongside software to generate a formatted schedule and index from the encoded data ([4] The way we were: Development of the printed schedules in BC2 2007). Figure 1 shows an extract from the final published version of the Class A – AL: Philosophy & Logic schedule and figure 2 shows the source code from which it was generated. The two digits preceding the node label indicate the indentation on the published page used to show the hierarchy of concepts. The number of indentation bullets in figure 1 does not equal the number in the source code: indentation on the printed page is relative within each column, and a header at the top of each column shows the current location in the concept hierarchy.
Figure 1. section from published schedule Class A – AL: Philosophy & Logic
Figure 2. extract of source file used to generate the formatted version shown in figure 1.
The BC2 source code adheres to an idiosyncratic schema which allows for the generation of a properly typeset and indexed schedule to be used by classifiers. The structure of the source code can be summarised in the following way:
The file must be written in the order of the notation.
Each concept begins a new line, which starts with the concept's notation, or with @ if there is no notation for the concept.
The notation is followed by one or more spaces.
The first 2 digits following the notation and spaces denote the length of the indentation on the page.
The indentation numbers are followed by the labels used for the concept, each separated by a comma.
The labels can be enclosed in a pair of matching brackets, indicating the node’s category:
)inverted parentheses( show a “brought-down” class, meaning that the concept is first defined earlier in the schedule, but reappears at this point subsumed under a specific parent class.
(normal parentheses) indicate that this node represents a facet
((double parentheses)) indicate a node introducing an array of classes
The visibility of a concept in the schedule, thesaurus or index is controlled by a single closing square bracket followed by the initial of each output in which the label is not to be visible.
E.g. in the following snippet, the labels for the alphabetical ranges for classifying named but unlisted 17th-century British philosophers will not appear in the index or the thesaurus:
ADS 0917th century ]IT
ADSE 10)British philosophy(
ADSECY 11((Schools & doctrines))
ADSED 12Cambridge Platonists
@ 11((Individual philosophers))
ADSEF 12A - Bac ]IT
ADSEG 12Bacon F
ADSEH 12Bac - Hob ]IT
ADSEJ 12Hobbes T
ADSEK 12Hob - Loc ]IT
ADSEL 12Locke J
ADSEM 12Loc - Z ]IT
New lines beginning with spaces followed by an asterisk indicate a note; *SN indicates a scope note; ** precedes comments.
Labels and notes can continue over multiple lines.
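The rules above can be illustrated with a minimal line parser. This is a sketch under stated assumptions rather than the parser written for the project: the names `parse_concept_line`, `LINE_RE` and `HIDDEN_RE` are hypothetical, the notation is assumed to contain no internal spaces, and continuation lines are not handled.

```python
import re

# Notation (or @), one or more spaces, two indentation digits, then the labels.
LINE_RE = re.compile(r"^(?P<notation>@|\S+)\s+(?P<indent>\d{2})(?P<rest>.*)$")
# A trailing "]" followed by the initials of the outputs to hide the label in.
HIDDEN_RE = re.compile(r"\s*\](?P<outputs>[A-Z]+)\s*$")


def parse_concept_line(line):
    """Split one concept line into its parts; return None for other lines."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # e.g. a note line, which begins with spaces
    rest = match.group("rest")
    hidden_in = set()
    hidden = HIDDEN_RE.search(rest)
    if hidden:
        hidden_in = set(hidden.group("outputs"))
        rest = rest[:hidden.start()]
    rest = rest.strip()
    # The enclosing brackets give the category of the node.
    if rest.startswith("((") and rest.endswith("))"):
        category, rest = "array", rest[2:-2]
    elif rest.startswith(")") and rest.endswith("("):
        category, rest = "brought-down", rest[1:-1]
    elif rest.startswith("(") and rest.endswith(")"):
        category, rest = "facet", rest[1:-1]
    else:
        category = "class"
    notation = match.group("notation")
    return {
        "notation": None if notation == "@" else notation,
        "indent": int(match.group("indent")),
        "labels": [part.strip() for part in rest.split(",") if part.strip()],
        "category": category,
        "hidden_in": hidden_in,
    }


print(parse_concept_line("ADSEF 12A - Bac ]IT"))
```

Applied to the snippet above, the line for ADSEF yields the single label "A - Bac" at indentation 12, hidden in the outputs I and T.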
One can easily draw parallels between this source code format for classification data and MARC for cataloguing. Roy Tennant, in his influential and well-cited Library Journal denunciation, “MARC must die,” explains that “MARC and AACR2 are largely focused on capturing the paper catalogue card in computer form” and bases his critique on the format’s inconsistent granularity, limited applicability, and relative obsolescence in the face of XML ([5] Tennant 2002). In like fashion, the BC2 source code’s aim to digitally reproduce the printed schedule limits its extensibility, and its flat unitary structure impedes the systematic expression of relationships between concepts. The software developed and used by the BCA to parse the source code is not openly available, and so the focus hitherto on the end product (the published schedule as a physical volume or an inflexible PDF) has withheld useful editorial capabilities from the classifiers at Bliss-using libraries, who, anecdotally, often rely on handwritten annotations to the volumes to record their local applications of and updates to the system.
Aim and objectives
The library of Queens’ College, Cambridge, needed a more robust system than handwritten annotations and printed addenda to manage its classification. Queens’ converted to Bliss in 1988, relying in large part on draft schedules to cover the unpublished subject areas ([6] Sargent 1990: 11). Subsequent librarians have updated and revised the system and logged these changes with varying diligence. Schedule annotations indicated, where Bliss offers alternatives in notation and structure, which variant is locally used. Annotations also revised offensive and prejudicial terminology and hierarchies ([7] Bartholomew 2023). Where sections of the system were deemed too granular for our purposes, ticks and crosses showed which classes were to be used. For subjects still without an official BC2 schedule, printed in-house classification systems were kept in ring binders, and these too were more often revised by hand than by editing and reprinting the documents. Schedule indexes were not consistently updated with revisions, leading classifiers looking up terms in the index to obsolete parts of the schedule. When changes were made to the system, reclassification projects were limited to the open-access collection only, meaning books in closed-access stores are arranged according to an older version of the system. As books are routinely relegated to storage, the closed-access shelves follow a disorganised shelving order according to multiple versions of the system. Further to this, Queens’ had no consistent system to record the components of its synthesised classes, having ceased to use Bliss class labels as subject vocabulary in bibliographic records, and so a revision to a class’s notation was difficult to implement, as there was no complete index to show which books’ classmarks used the old notation in synthesised form.
Lastly, and importantly, Queens’ classifiers cannot be sure whether the classmark assigned at a fellow Bliss library for a particular book is valid for the same book in the local system, as system revisions in one or both libraries could have led to a divergence; therefore, copy-classifying is severely hindered.
The objectives of an encoded classification system for Queens’ College thus were:
Replicate the structure of the source code, avoiding the loss of any information encoded in the original file.
Allow for the generation of a formatted and indexed schedule, both in print and electronic form, from the encoded form.
Use persistent identifiers for concepts in the scheme, so that links can be expressed between: (1) different classification systems and variants; (2) bibliographic metadata and the classification system; (3) a synthesised classmark and its components.
Enable a form of version control logging changes to notations and labels so that the origin of a particular classmark can always be traced.
SKOS-RDF structure
The Simple Knowledge Organisation System (SKOS) is a standard RDF data model for encoding the hierarchical structure of a traditional thesaurus, taxonomy, or classification. A SKOS concept can have a preferred label and alternative labels, a notation, various types of notes, and links to broader, narrower and related concepts.
Development of SKOS began in 2006, and soon its potential use for Bliss was discussed in the Bliss Classification Bulletin. Alistair Miles ( [8] 2006), whilst making the case for SKOS-encoding, pointed out that “currently there is no built-in support for the synthesis of conceptual units to represent compound meanings” (p.11). Leonard Will ( [9] 2008) also observes the limitation of SKOS for encoding Bliss, whose schedules “show examples of pre-coordination” but “users are expected to create others as required. SKOS can not yet encode this type of structure”. This remains true of SKOS in 2024.
Despite this limitation, SKOS was still chosen as the data model for this project. An OWL ontology categorising concepts more granularly and defining more particular potential relationships between concepts could be developed for Bliss, which could then define the node categories indicated by the enclosing parentheses as well as encode visibility of labels in the index and thesauri. However, as the published version of Bliss still adheres to a traditional hierarchical classification structure, it is entirely possible to replicate the hierarchy in SKOS; further to this, polyhierarchical relationships are expressible in SKOS, and so a compound class could be a narrower concept of multiple concepts from different branches of the schedule.
SKOS, owing to its eponymously simple nature, has a small vocabulary, and it was applied to Queens’ classification as follows:
skos:ConceptScheme Each Bliss schedule is considered a separate skos:ConceptScheme, owing to the fact that the principle of retroactive notation applies only within a schedule, and compounding between schedules is only sometimes permitted. Also, Queens’ in-house schedules adhere to separate compounding rules.
skos:Concept Every node in a schedule is a skos:Concept, regardless of whether it has a notation; notationless nodes are necessary to describe the structure of the hierarchy and separate arrays for different facets.
skos:inScheme Every concept is linked to its ConceptScheme.
skos:prefLabel The first label in a list of comma-separated values for each node is treated as the preferred label.
skos:altLabel All subsequent labels are included as alternative labels for the concept.
skos:broader Every child concept has its parent concept as a broader term.
skos:topConceptOf Concepts without a parent concept are linked to the concept scheme as a top concept.
skos:note Basic notes following a label.
skos:editorialNote Used for source code comments.
skos:scopeNote Used for notes defined as scope notes in the source file.
These do not represent the full range of the SKOS vocabulary; some classes and properties were deemed superfluous to requirements; others could not be encoded automatically based on the source file data.
Implicit inversion
Several SKOS properties come in inverse pairs, such as skos:broader and skos:narrower, or skos:topConceptOf and skos:hasTopConcept. If A is broader than B, then it follows that B is narrower than A; if X is the top concept of the Y concept scheme, then Y has X as a top concept. To avoid redundancy and for consistency, the hierarchy is expressed in only one direction throughout, narrowest to broadest.
Transitivity
The SKOS broader/narrower relationships are neither inherently transitive nor intransitive; they merely express immediate hierarchical relationships. While transitivity can be explicitly expressed with the skos:broaderTransitive and skos:narrowerTransitive properties, the potential benefits of this are unclear, and an explicit declaration of transitivity could prove problematic in places.
Collections
SKOS concepts can be collected into arrays using the skos:Collection and skos:OrderedCollection classes, in which the concepts would be members. In some ways, this would usefully reflect the nature of the array nodes enclosed in double parentheses in the source file. However, array nodes are treated similarly to ordinary classes in the source file structure, where they are the narrower term of the parent concept and the broader term of the array of child concepts. As skos:Collection is disjoint with skos:Concept, the children would need to be narrower terms of the concept above the array node, thus disrupting the established hierarchical structure indicated in the source file. Some array and facet nodes also have a notation and could therefore be used as a concept, or in a synthesised class. Therefore, collections were not used in this automatic conversion, but could be useful in future revisions if it is decided that the hierarchical structure ought to be altered.
Notes
The source file distinguishes only three types of note which can be systematically converted, though other subproperties of skos:note would be useful. A skos:historyNote would serve when revising the system, and notes acting as definitions for the class ideally would be separated from notes advising on its usage within the system.
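The three note types that can be told apart in the source file might be mapped to SKOS properties along these lines. This is a sketch only: the names `classify_note`, `NOTE_RE` and `NOTE_PROPERTIES` are hypothetical, the example note texts are invented, and a basic note whose text happens to begin with the letters SN would be misread as a scope note.

```python
import re

# Leading spaces, then "**" (comment), "*SN" (scope note) or "*" (basic note).
NOTE_RE = re.compile(r"^\s+(?P<marker>\*\*|\*SN|\*)\s*(?P<text>.*)$")

NOTE_PROPERTIES = {
    "**": "skos:editorialNote",
    "*SN": "skos:scopeNote",
    "*": "skos:note",
}


def classify_note(line):
    """Return (SKOS property, note text), or None if the line is not a note."""
    match = NOTE_RE.match(line)
    if match is None:
        return None
    return NOTE_PROPERTIES[match.group("marker")], match.group("text").strip()


# Invented example lines, for illustration only.
for line in ["   *SN Example scope note", "   ** Example comment", "   * Example note"]:
    print(classify_note(line))
```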
Parsing the source file with Python and RDFLib
Imports and constants
from rdflib import Graph, RDF, Literal, Namespace
from rdflib.namespace import SKOS

# Placeholder namespace; the trailing slash keeps generated URIs well formed
ex = Namespace("http://example.org/")
g = Graph()
The only imported library is RDFLib, from which five objects are used:
rdflib.Graph is the web of nodes and relationships which is eventually serialised into Turtle format in this project
rdflib.RDF is the RDF vocabulary
rdflib.Literal represents an RDF literal value (e.g. a string or an integer), for use when the datatype or language must be specified
rdflib.Namespace allows for generation of URIs with a common prefix
rdflib.namespace.SKOS is a readily available SKOS namespace
The Queens’ classification is not yet available on the web, and so a placeholder namespace is temporarily used in this project. This still allows the resulting Turtle file to be visualised and to aid local classification.
Creating the concept scheme
def parse_source_file(file_path):
    with open(file_path, 'r') as file:
        g.add((ex[""], RDF.type, SKOS.ConceptScheme))
The first class added is the concept scheme, which currently creates a bare node to which top-level concepts can be linked: