Code4Lib Journal

Lots o' project reporting and community building

This is a reading of Code4Lib Journal. To do this reading, I used wget to cache the whole of the Journal. I then used a script of my own design -- feeds2csv.py -- to loop through all the feed files found in the cache. For each feed file I extracted rudimentary bibliographic data (title, dates, and canonical URLs) for the purposes of creating a metadata file amenable to my Distant Reader system. In the end a Distant Reader study carrel ("data set") was created. All of the analysis was subsequently done against this study carrel. By the way, in the process, the metadata file got saved as the index.csv file.
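The extraction step can be sketched as follows. This is not the feeds2csv.py script itself, which is not shown here, but a minimal illustration that assumes each cached feed file is RSS 2.0 with title, link, and pubDate elements. The sample feed, the function names, and the CSV column names are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real cache would hold many such files.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>An Example Article</title>
      <link>https://example.org/articles/1</link>
      <pubDate>Mon, 03 Dec 2007 00:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Another Example Article</title>
      <link>https://example.org/articles/2</link>
      <pubDate>Tue, 04 Mar 2008 00:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

def feed_to_rows(xml_text):
    """Return one dict of title, date, and url per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        rows.append({
            "title": (item.findtext("title") or "").strip(),
            "date": (item.findtext("pubDate") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
        })
    return rows

def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "date", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = feed_to_rows(SAMPLE_FEED)
print(rows_to_csv(rows))
```

In practice one would loop over every feed file in the cache, concatenate the rows, and write the result to a single file such as index.csv.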

Size and Scope

This data set includes just under 600 items (articles) totaling 2.4 million words. Counts and frequencies of unigrams, bigrams, and computed keywords (sans stop words) are visualized below. Ask yourself, "To what degree is this corpus large, and what is this corpus about?"
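Counting unigrams and bigrams sans stop words can be illustrated in a few lines. This is a minimal sketch, not the Distant Reader's own implementation. The stop word list, the sample sentence, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}

def tokenize(text):
    """Lowercase the text and return alphabetic tokens, minus stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def count_ngrams(tokens, n):
    """Count runs of n adjacent tokens; each n-gram is stored as a tuple."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Hypothetical sample text standing in for the corpus.
text = "The library code and the library data support the library project."
tokens = tokenize(text)
unigrams = count_ngrams(tokens, 1)
bigrams = count_ngrams(tokens, 2)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

Note that this sketch removes stop words before pairing tokens, so a bigram may join two words that were separated by a stop word in the original text. Whether that is desirable is a design choice.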

unigrams

bigrams

keywords

The keywords and their association with items can be modeled as a network graph. More expressive and more nuanced than rudimentary word clouds, the visualization echoes not only the frequency of keywords but also their proximity to other keywords. In other words, when a document was about "web" it was likely to also be about "project". I applied modularity to the network to create "neighborhoods" of keyword nodes. Keywords sharing the same colored edges are neighbors.

network of keywords and articles
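The idea of keyword proximity can be sketched by counting how often two keywords are assigned to the same article. This is a simplification under stated assumptions: the graph described above links keywords to articles, whereas this sketch collapses it into keyword-to-keyword edges, and it does not perform the modularity step, for which a graph library would normally be used. The article identifiers and keyword assignments are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical article-to-keyword assignments, for illustration only.
ARTICLE_KEYWORDS = {
    "article-1": {"web", "project", "metadata"},
    "article-2": {"web", "project"},
    "article-3": {"metadata", "records"},
}

def cooccurrence_edges(article_keywords):
    """Count how many articles each pair of keywords shares."""
    edges = Counter()
    for keywords in article_keywords.values():
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(keywords), 2):
            edges[pair] += 1
    return edges

edges = cooccurrence_edges(ARTICLE_KEYWORDS)
for (left, right), weight in edges.most_common():
    print(left, right, weight)
```

The heavier an edge, the more often the two keywords appear together, which is the sense in which "web" and "project" are close to each other.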

I then searched the underlying relational database for articles whose keywords were "metadata" and "project". The following are the most relevant eight articles returned:

Based on my professional judgement, I topic modeled the collection for a dozen topics. (See the dendrogram.) The results echo the word clouds and network analysis ("Whew!"), but they are not exact, and that is okay because different modeling techniques produce different results, just as different people will summarize the same documents differently. Thus, the collection may be about: project, code, search, etc. But be forewarned. The topic labels (e.g., project) ought to be read like the hyphenated word "project-system-development-work-users-process-time-software-libraries-services-data-systems". Topics are not denoted by a single word but by a list of words.

labels	weights	features
project	0.80364	project system development work users process time software libraries services data systems
code	0.25247	code libraries work community editorial people articles technology source time software authors
search	0.1771	search data results query information users api google discovery database result searches
records	0.17556	records record data marc field script title metadata fields code name process
content	0.14595	content web libraries google site html link code links pages item website
collections	0.09145	collections metadata archival web content archives islandora images objects information description archive
metadata	0.08229	metadata data repository object objects xml name type access fedora files identifier
mobile	0.08018	mobile app students web location reference devices information application map computer system
data	0.07648	data name metadata model marc records work information rdf bibliographic web frbr
text	0.06973	text data word words model analysis results terms models language research dataset
api	0.06763	api data code web request services form information xml link server name
files	0.06299	files video preservation format disk images media storage data audio content version

Since each topic is associated with a weight denoting the size of the topic compared to the whole, the topics can be visualized as a pie chart, and from the result we can see that two of the twelve topics (project and code) make up about half of the total weight. Thus, I assert, a great deal of the Journal is dedicated to reporting on projects. Since each article has been associated with a date, the underlying topic model can be augmented with date values and pivoted accordingly. Once done, a line chart illustrating how the more common topics ebbed and flowed over time can be produced. The result elaborates on the previous observations. Projects are a consistent and dominant theme.
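The claim about the two largest topics can be checked with simple arithmetic on the weights in the table above. The sketch below normalizes each weight by the total and sums the two largest shares. The variable and function names are my own, and the weights are copied from the table.

```python
# Topic weights copied from the table above.
WEIGHTS = {
    "project": 0.80364,
    "code": 0.25247,
    "search": 0.1771,
    "records": 0.17556,
    "content": 0.14595,
    "collections": 0.09145,
    "metadata": 0.08229,
    "mobile": 0.08018,
    "data": 0.07648,
    "text": 0.06973,
    "api": 0.06763,
    "files": 0.06299,
}

def shares(weights):
    """Express each weight as a fraction of the total weight."""
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}

topic_shares = shares(WEIGHTS)
ranked = sorted(topic_shares.items(), key=lambda pair: pair[1], reverse=True)
top_two = sum(share for _, share in ranked[:2])

for label, share in ranked:
    print(f"{label:12s} {share:6.1%}")
print(f"top two topics: {top_two:.1%}")
```

The "project" and "code" topics together account for just over half of the total weight, and these normalized shares are the values a pie chart would plot.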

topics

topics over time
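Pivoting a topic model by date can be sketched as a tally of topics per year. This is a simplified illustration, assuming each article is reduced to a single dominant topic, whereas a fuller model might sum per-article topic weights instead. The years and topic assignments below are hypothetical and are not results from the study carrel.

```python
from collections import defaultdict

# Hypothetical (year, dominant topic) pairs, one per article; not real results.
ARTICLES = [
    (2008, "project"), (2008, "code"), (2008, "project"),
    (2009, "project"), (2009, "search"),
    (2010, "code"), (2010, "project"), (2010, "project"),
]

def pivot_by_year(articles):
    """Return {year: {topic: count}} with years in ascending order."""
    table = defaultdict(lambda: defaultdict(int))
    for year, topic in articles:
        table[year][topic] += 1
    return {year: dict(counts) for year, counts in sorted(table.items())}

pivot = pivot_by_year(ARTICLES)
topics = sorted({topic for _, topic in ARTICLES})

# Print a tab-separated table: one row per year, one column per topic.
print("year", *topics, sep="\t")
for year, counts in pivot.items():
    print(year, *(counts.get(topic, 0) for topic in topics), sep="\t")
```

Each column of the resulting table is one line in a line chart, with the years along the horizontal axis.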

I then reverse-engineered the underlying topic model to list the eight most significant articles associated with the "project" topic. The result is both similar to and different from the result of the keyword extraction process.

Eric Lease Morgan <eric_morgan@infomotions.com>
Infomotions, LLC

November 15, 2025