April 2020 193 C&RL News

A wealth of digital texts and the proliferation of automated research methodologies enable researchers to analyze large sets of data at a speed that would be impossible to achieve through manual review. When researchers use these automated techniques and methods for identifying, extracting, and analyzing patterns, trends, and relationships across large volumes of un- or thinly structured digital content, they are applying a methodology called text data mining or TDM.1 TDM is also referred to, with slightly different emphases, as “computational text analysis” or “content mining.”

The “distant reading” that TDM makes possible supports the discovery of scientific and social insights, such as how gender is depicted in fiction over time or evidence of racial disparity in police camera footage.2 Libraries are eager to provide and expand institutional access to data sets so that scholars can continue exploring unknown connections, yet both scholars and professional staff who support TDM research often run into roadblocks.3

Law and policy questions are paramount and shape not only how TDM scholarship is disseminated, but also the very questions being asked. If researchers are limited to corpora unencumbered by legal restrictions, they risk perpetuating bias in the scholarly record.4 With a basic set of law and policy literacies in hand, libraries can help scholars navigate these issues so that they can confidently use, create, and share a far wider set of corpora and research results.5

Copyright

Imagine that a researcher in the United States wants to analyze a corpus of 20th-century literature to examine the stylistic commonalities among literary prize winners. The researcher digitizes or downloads dozens of prize-winning texts and runs their computational software on the materials. They discover some interesting details and decide to publish their findings along with parts of their corpus for data validation.
But the researcher begins to have doubts about the activity. Since the literature is protected by copyright, was it lawful to digitize or download and run the analyses on the corpus in the first place? Can the researcher share parts of the literary works for purposes of reproducibility, or so that other scholars can query the corpus for other research questions? The answers lie in what TDM researchers and librarians need to understand about copyright law.

Kyle K. Courtney, Rachael Samberg, and Timothy Vollmer
Big data gets big help
Law and policy literacies for text data mining

Kyle K. Courtney is copyright advisor and program manager at Harvard Library, email: kyle_courtney@harvard.edu, Rachael Samberg is scholarly communication officer and program director at the University of California-Berkeley Library, email: schol-comm@berkeley.edu, and Timothy Vollmer is scholarly communication and copyright librarian at the University of California-Berkeley Library, email: schol-comm@berkeley.edu

© 2020 Kyle K. Courtney, Rachael Samberg, and Timothy Vollmer

scholarly communication

By providing rewards for authorship in the form of exclusive rights, copyright law incentivizes the creation and dissemination of knowledge. But copyright law is also intended to benefit the public, and if authors held exclusive rights to their works indefinitely, public access to knowledge would be impeded.

Congress has actively limited these rewards in important ways. Much of the public benefit of copyright is incorporated by the expiration of rights.
When copyright ends, works enter the “public domain.”6 The copyright term has been lengthened significantly,7 but even for works still protected by copyright, Congress built critical exceptions into the Copyright Act to promote the progress of science and art. One of the strongest such exceptions is the right of fair use, codified in 17 U.S.C. § 107, which states that, “notwithstanding” the bundle of rights granted to the copyright owner, “the fair use of a copyrighted work . . . is not an infringement.”8

Courts consider four factors in making a fair use determination: 1) the purpose and character of the use (nonprofit uses and uses that “transform” a work by adding new insights or understanding are more likely to be fair), 2) the nature of the copyrighted work (use of factual works is more likely to be fair than use of works coming closer to the “core of creative expression”), 3) the amount and substantiality of the portion used (amounts appropriate to the new transformative purpose are more likely to be fair), and 4) the effect of the use upon the potential market for or value of the copyrighted work (uses that do not usurp the market for the original are more likely to be fair). Evaluating whether a given use of copyrighted material is “fair” overall requires balancing these four factors on a case-by-case basis.

Courts that have considered computational research have found TDM to be a fair use. For instance, in Authors Guild v. HathiTrust, 755 F.3d 87 (2d Cir. 2014), scanning and creating a database of digitized materials so that users could conduct full-text searching within content, rather than read that content, was highly transformative under factor one and a fair use overall. In that case, a collection of authors and author associations had sued HathiTrust and certain of its member universities for copyright infringement.
The basis of their claims was the fact that, pursuant to a relationship with Google, HathiTrust received digital copies of nearly ten million books, the majority of which were still in copyright. HathiTrust then made these books available for full-text searching, without the researcher being able to read the book. The court found this arrangement to be fair use, notably because the textual analysis that the HathiTrust Digital Library enabled was transformative under the first fair use factor: “[T]he result of a word search is different in purpose, character, expression, meaning, and message from the page (and the book) from which it is drawn.” In Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015), the same court found that Google Books’ creation of a full-text searchable database and “Ngram Viewer” were fair, as was allowing users to view three-line snippets of the underlying works to provide context for where desired phrases appear.9

There is less clarity around how, or if, a researcher may share the underlying corpus in order to enable verification of research results or offer new querying opportunities. For instance, in Fox News Network, LLC v. TVEyes, Inc., No. 15-3885 (2d Cir. Feb. 27, 2018), media aggregator TVEyes was recording commercial news and radio audiovisual content, importing it into a database, and permitting its clients to search for, view, download, and share that content in ten-minute clips. While keyword-enabled searching would be both transformative and a fair use overall, permitting redistribution was not because it made “available to TVEyes’s clients virtually all of Fox’s copyrighted content that the clients wish[ed] to see and hear, and because it deprive[d] Fox of revenue.” Therefore, the key copyright issue scholars will face is typically how much of the corpus they used or created can be shared or republished.

Contracts

Researchers and librarians also need to understand circumstances in which the contracts they have signed or to which they have assented can control, and even supersede, TDM uses that would otherwise have been permitted under copyright law. Even if the act of downloading and sharing copyright-protected materials when conducting TDM may have constituted fair use, some license agreements expressly forbid it.

Shrewd TDM researchers may try their luck compiling a dataset from the “open web” instead, but often encounter confusing hurdles with application programming interface or website terms of service governing how researchers may access, use, and share the content. Some courts find that these website “terms of service” or “terms of use” can constitute an enforceable “browsewrap” agreement (one to which a party assents simply by using the website); others dismiss browsewraps entirely.10 Some courts require browsewrap terms to have certain visual characteristics and cues to be enforceable, or seek proof that a user was actually aware of them.11 This complex landscape makes it confusing for researchers to understand how to proceed, but ignoring the browsewrap is not advisable either, particularly as doing so may also violate a university or library’s Internet policies.

Libraries and researchers can negotiate to expressly retain both fair use rights and the right to conduct TDM.12 In some cases, vendors may charge a hefty (and prohibitive) fee in their licenses to preserve these benefits for researchers. The ability to negotiate favorable license agreements also varies from publisher to publisher, leaving the prospective TDM researcher with a patchwork of differing rules they must follow for each content source they wish to include in their corpus. At other times, vendors might require researchers to ask permission to conduct TDM on a case-by-case basis, which may involve additional obstacles.

Privacy and ethics

Researchers engaging in digital scholarship that incorporates materials stewarded by libraries and archives typically are no strangers to issues of privacy and ethics. Library special collections of personal writings and correspondence, photographs, and audio-visual recordings often contain information protectable under federal statutes (such as financial, medical, or student record data) or state privacy laws (which prohibit actions like disclosure of facts that would not otherwise be made public, or intrusion in places where people have a reasonable expectation of privacy). Researchers working with such materials face these questions regardless of their research methodologies, but TDM transforms these challenges into ones of greater scale and impact. TDM enables the potential review and disclosure of much greater volumes of data, exacerbating the risk that scholars may run afoul of privacy protections and increasing the need for careful data management practices.

Sometimes questions that seem like ones of privacy are ethical issues. For instance, imagine a TDM scholar who wanted to explore “Gamergate,” the harassment of women who spoke out on Twitter and other sites against misogyny within video game development culture.13 As Todd Suomela et al. note, women who shared their views received rape and death threats, but often there was nothing “private” (from a legal perspective) in these messages of hate.14 Yet the mere act of compiling and publishing a corpus containing instances of harassment could amplify the messages or make the published information more readily discoverable, thus exposing the women who had spoken out to additional threats. The researchers needed to consider what ethical standards should be applied to mining and publishing data in these contexts.

Building legal literacies

Researchers may also face specialized questions of cross-boundary collaborations complicated by the inconsistent framework of international copyright and privacy laws. Or perhaps the researchers need to “break” digital rights management protections to access the content they want to mine. How can scholars and librarians acquire an understanding of all relevant concerns?

In a review of digital humanities and information science curricula, professional development training programs, and library guides, we observed few training opportunities or resources that integrate legal literacies into TDM outreach and instruction, particularly in the context of digital humanities. We viewed this as an opportunity for our team of librarians, legal experts, and scholars to build and offer a robust curriculum at a four-day institute at the University of California-Berkeley in June 2020.

“Building Legal Literacies for Text Data Mining,” supported by the National Endowment for the Humanities, will bring together digital humanities researchers and professionals to share and learn together.15 The project team will publish the curriculum as an open educational resource to foster a broader community of practice. The goal is for TDM researchers and professionals to confidently build, mine, and publish corpora with a solid understanding of the legal, ethical, and risk choices they will make along the way.

Notes

1. Marti Hearst, “What Is Text Mining?” SIMS, UC-Berkeley, October 17, 2003, http://people.ischool.berkeley.edu/~hearst/text-mining.html.
2. Ted Underwood, David Bamman, and Sabrina Lee, “The Transformation of Gender in English-Language Fiction,” Journal of Cultural Analytics, 2018, https://doi.org/10.22148/16.019; R. Voigt, et al., “Language from Police Body Camera Footage Shows Racial Disparities in Officer Respect,” Proceedings of the National Academy of Sciences 114, no. 25 (May 2017): 6521–26, https://doi.org/10.1073/pnas.1702413114.
3. Matthew Sag, “The New Legal Landscape for Text Mining and Machine Learning,” SSRN Electronic Journal, 2019, https://doi.org/10.2139/ssrn.3331606.
4. Megan Senseney, Eleanor Dickson, Beth Namachchivaya, and Bertram Ludäscher, “Data Mining Research with In-Copyright and Use-Limited Text Datasets: Preliminary Findings from a Systematic Literature Review and Stakeholder Interviews,” International Journal of Digital Curation 13, no. 1 (2018): 183–94, https://doi.org/10.2218/ijdc.v13i1.620; Amanda Levendowski, “How Copyright Law Can Fix Artificial Intelligence’s Implicit Bias Problem,” UW Law Digital Commons, accessed February 11, 2020, https://digitalcommons.law.uw.edu/wlr/vol93/iss2/2/.
5. Rachael Samberg and Cody Hennesy, “Law and Literacy in Non-Consumptive Text Mining: Guiding Researchers Through the Landscape of Computational Text Analysis,” in Copyright Conversations: Rights Literacy in a Digital World, edited by Sara Benson (Chicago: Association of College and Research Libraries, 2019), https://escholarship.org/uc/item/55j0h74g.
6. Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417, 429 (1984).
7. Eldred v. Ashcroft, 537 U.S. 186 (2003).
8. 17 U.S.C. § 107 (2020).
9. See also A.V. ex rel. Vanderhye v. iParadigms, 562 F.3d 630 (4th Cir. 2009).
10. See Zaltz v. JDATE, 952 F. Supp. 2d 439 (E.D.N.Y. 2013).
11. See Specht v. Netscape Commc’ns Corp., 306 F.3d 17 (2d Cir. 2002).
12. See, e.g., the California Digital Library Model License, available at https://cdlib.org/cdlinfo/2017/01/25/cdl-model-license-revised/.
13. Todd Suomela, Florence Chee, Bettina Berendt, and Geoffrey Rockwell, “Applying an Ethics of Care to Internet Research: Gamergate and Digital Humanities,” Digital Studies/Le Champ Numérique 9, no. 1 (2019), https://doi.org/10.16995/dscn.302.
14. Ibid.
15. “The Institute,” Building LLTDM, accessed February 11, 2020, https://buildinglltdm.org/about/institute-basics/.