Blockchain Discussions

Here is the briefest of descriptions outlining the size and scope of this data set called curated-blockchain_discussions-2025.

Data Set Creation

As a part of a study affectionately called Project Human Values, a colleague -- Jarek Nabrzyski -- asked me to collect and curate a set of documents written by people from the Etherium/Blockchain community. He gave me a list of URLs, and then I used an Internet spider program (wget) to locally cache the documents. (See the contents of the ./bin directory for more detail.) I then used a tool of my own design -- The Distant Reader -- to transform the cache into a data set -- a "study carrel".

Size and Scope

The study carrel includes almost 3,400 items for a total of about 5 million words. (The Bible is about 800,000 words long.) These items are browsable from the ./txt directory. All of the analysis is/was derived from the content of the ./txt directory.

Word clouds depicting the frequency of unigrams, bigrams, and computed keywords begin to describe what is discucssed in the corpus:

unigrams

bigrams

computed keywords

Topic modeling -- an unsupervised machine learning process used to cluster documents -- was applied to the corpus. After reducing our corpus to only include nouns, and after topic modeling on eight topics, we might say the corpus is about the following "themes":

labels	weights	features
hyperledger	0.09669	hyperledger project blockchain foundation technology blog community trust
block	0.0946	block proof transaction node chain state validator datum
time	0.08221	block time ethereum builder price user protocol proposer
eip	0.07043	eip ethereum contract transaction gas proposal improvement code
address	0.05173	address contract function erc token uint nft param
peer	0.03694	peer channel chaincode fabric node network organization transaction
hedera	0.03361	hedera network developer ecosystem hbar web service nft
besu	0.02739	besu contributor call meeting agenda release time cookie

This table can be visualized as a pie chart illustrating the degree each topic is a part of the whole. Furthermore, the model can be augmented with author values, pivoted, and visualized as a bar graph thus illustrating the degree each author discussed the computed topics. See below:

Perusing the computed bibliography is another way to learn of the data set's aboutness, and network graphs of the bibliographics highlight some of the characteristics.

keywords and authors

clusters

betweenesses

Summary

A data set of about 3,400 items was created from writtings of the Etherium/Blockchain community. The sum size of the data set is about 5 million words, and it is about things such as "etherium", "protocols", "bitcoin", "proof", "data", and "people". For more detail, see the computed summary page.

Colophon

This data set was created using a tool called the Distant Reader Toolbox, and the whole of the data set ought to be available for downloading at http://carrels.distantreader.org/curated-blockchain_discussions-2025/index.zip. For more information about Distant Reader study carrels, see the readme.txt file.

Eric Lease Morgan <emorgan@nd.edu>
Navari Family Center for Digital Scholarship
University of Notre Dame

May 7, 2025