Blockchain Discussions

Here is the briefest of descriptions outlining the size and scope of this data set called curated-blockchain_discussions-other.

Data Set Creation

As a part of a study affectionately called Project Human Values, a colleague -- Jarek Nabrzyski -- asked me to collect and curate a set of documents written by people from the Ethereum/Blockchain community. He gave me a list of URLs, and then I used an Internet spider program (wget) to locally cache the documents. (See the contents of the ./bin directory for more detail.) I then used a tool of my own design -- The Distant Reader -- to transform the cache into a data set -- a "study carrel".

Size and Scope

The study carrel includes almost 270 items for a total of about 738,000 words. (The Bible is about 800,000 words long.) These items are browsable from the ./txt directory. All of the analysis was derived from the content of the ./txt directory.

Word clouds depicting the frequency of unigrams, bigrams, and computed keywords begin to describe what is discussed in the corpus:

unigrams

bigrams

computed keywords
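The frequencies behind word clouds like these can be sketched in a few lines of Python. This is only an illustration and not necessarily how the Distant Reader computes them: the sample text is made up, the function names are my own, and a real run would read each file in the ./txt directory and would likely remove stop words first.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())

def ngram_counts(tokens, n):
    """Count each run of n adjacent tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Hypothetical sample text; a real run would read each file in ./txt.
sample = "The block is added to the chain. The chain stores each block."

tokens = tokenize(sample)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

# The most frequent items are the ones drawn largest in a word cloud.
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

Without stop word removal, function words such as "the" dominate the counts, which is why they are usually filtered out before a word cloud is drawn.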

Topic modeling -- an unsupervised machine learning process used to cluster documents -- was applied to the corpus, and we might say the corpus is about the following eight "themes":

labels	weights	features
block	0.28172	block protocol proof time chain transaction blocks network bitcoin security blockchain stake
people	0.23938	people world community governance voting market projects system social price funding users
ethereum	0.20849	ethereum development protocol announcements security program research events foundation
data	0.1774	data state key ethereum users proof layer transaction blockchain applications proofs chain
polynomial	0.07069	polynomial points proof values polynomials number algorithm want time file note point
code	0.05785	code contract transaction gas ethereum state value account data block hash execution
die	0.0054	die bir der und von eine uma das zk-evm tip ist daha
que	0.00426	que una para los por che del con per más las jest
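The weights in the table above sum to a bit more than 1 (about 1.045), so to read them as parts of a whole they first need to be normalized. The following sketch does that arithmetic; the variable names are my own, and only the labels and weights come from the table.

```python
# Topic labels and weights copied from the table above.
weights = {
    "block": 0.28172,
    "people": 0.23938,
    "ethereum": 0.20849,
    "data": 0.1774,
    "polynomial": 0.07069,
    "code": 0.05785,
    "die": 0.0054,
    "que": 0.00426,
}

# Divide each weight by the total so the shares add up to 1.
total = sum(weights.values())
shares = {label: weight / total for label, weight in weights.items()}

# List the topics from largest share to smallest.
ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
for label, share in ranked:
    print(f"{label:<12}{share:6.1%}")
```

The resulting shares are the values one would pass to a charting tool in order to draw the slices.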

This table can be visualized as a pie chart illustrating the degree to which each topic is a part of the whole. Furthermore, the model can be augmented with author values, pivoted, and visualized as a bar graph, thus illustrating the degree to which each author discussed the computed topics. For example, Wood almost exclusively discussed "code" while Brody discussed "ethereum" and "people" a lot. See below:

Perusing the computed bibliography is another way to learn of the data set's aboutness, and network graphs of the bibliographics highlight some of the characteristics. For example, we can see: 1) blog and vitalik were the most frequent authors and all of the things they discussed were discussed to a similar degree, 2) they do cluster into a small number of subsets, and 3) after computing betweenness values for each node, we might assert the nodes labeled "ethereum", "block", "protocol", and "proof" are the most significant.

keywords and authors

clusters

betweennesses
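For readers unfamiliar with betweenness, the measure counts how many shortest paths between other nodes pass through a given node. Below is a minimal sketch using Brandes' algorithm on an unweighted, undirected graph. The edges are hypothetical and chosen only for illustration; they are not taken from the study carrel, and the graphs above may well have been computed with a different tool.

```python
from collections import deque

def betweenness(graph):
    """Unnormalized betweenness for an unweighted, undirected graph.

    graph maps each node to the set of its neighbors.
    """
    scores = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths from source.
        order = []
        parents = {node: [] for node in graph}
        paths = {node: 0 for node in graph}
        paths[source] = 1
        depth = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in graph[node]:
                if neighbor not in depth:
                    depth[neighbor] = depth[node] + 1
                    queue.append(neighbor)
                if depth[neighbor] == depth[node] + 1:
                    paths[neighbor] += paths[node]
                    parents[neighbor].append(node)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in graph}
        for node in reversed(order):
            for parent in parents[node]:
                share = paths[parent] / paths[node]
                dependency[parent] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Each undirected path was counted once from each of its endpoints.
    return {node: score / 2 for node, score in scores.items()}

# Hypothetical author-keyword links, for illustration only.
edges = [
    ("vitalik", "ethereum"),
    ("blog", "ethereum"),
    ("blog", "protocol"),
    ("vitalik", "proof"),
]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

scores = betweenness(graph)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, score)
```

In this toy graph the node in the middle of the chain gets the highest score, since the most shortest paths run through it. Nodes with high scores in the real graph play the same bridging role.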

Summary

A data set of about 270 items was created from writings of the Ethereum/Blockchain community. The sum size of the data set is similar to the size of the Bible, and it is about things such as "ethereum", "protocols", "bitcoin", "proof", "data", and "people". For more detail, see the computed summary page.

The whole of this data set ought to be downloadable at http://carrels.distantreader.org/curated-blockchain_discussions-other/index.zip.

Colophon

This data set was created using a tool called the Distant Reader Toolbox.

Eric Lease Morgan <emorgan@nd.edu>
Navari Family Center for Digital Scholarship
University of Notre Dame

November 24, 2024